Changing dataframe values after regex function problem

Question:

I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is a thing…

Based on real tweets, I made a dataframe with only 3 rows/tweets for experimental purposes.

What I am trying to do:
1: clear all @, ‘, http etc. from the tweet.
2: once that is done, replace the old tweet with the cleaned one.

This works only partially: just a fragment of some tweets comes back in my dataframe. The code does clean up the tweets, but it only writes part of the original tweet back.
I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.

The dataframe contents look like this (only an index and 1 column, Tweet); the tweets are of type string:

Index   Tweet
0       @justanamehere and a sentence here and a link http://www.test.com
1       @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2      @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something

My code:

import re
import string

def strip_links(text):
    # replace every http(s) link with ', '
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    # replace punctuation with spaces, then drop words that start with @ or #
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    row['Tweet'] = ' '.join(words)

    return ' '.join(words)


# The code below is needed because the text in the df is of type str. Convert it to a list.

for index, row in df_tweet.iterrows():
    tweet = list(row['Tweet'].split(","))

    for t in tweet:
        strip_all_entities(strip_links(t))

This produces this:

'and a sentence here and a link' 'are a fraud and farce' '' a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'

But df_tweet shows only this:

    Tweet
0   and a sentence here and a link
1   a lying person together with the fake media So...
2   these rats want to badword

The expected result is:

index   Tweet
0       and a sentence here and a link
1       are a fraud and farce a lying person together with the fake media 
        Something else Personname? suppose you work with her
2       AS lot of bad words here It is a cancelculture these rats want to 
        badword

Thanks for helping me out!! Cheers Jan

Asked By: Janneman


Answers:

try:

df.Tweet = (df.Tweet
    .str.replace(r'[@#]\w*\b', '', regex=True)
    .str.replace(r'https?://\S+', '', regex=True)
    .str.replace(r'\s[#@%/;$()~_?+-=\.&\']+', '', regex=True)
    .str.strip())

Output:

        Tweet
Index   
0       and a sentence here and a link
1       are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2       etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
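As to why the original code only kept part of each tweet: strip_all_entities writes its result into row['Tweet'] once per comma-split fragment, so each fragment overwrites the previous one and at most the last fragment can survive, and rows yielded by iterrows() are not a reliable way to write back into the frame anyway. A minimal sketch of how your own helper functions could be applied and persisted per row (the df_tweet.at write-back and the list comprehension are illustrative, not code from the question):

for index, row in df_tweet.iterrows():
    # clean every comma-separated fragment instead of keeping only the last one
    parts = [strip_all_entities(strip_links(p)) for p in row['Tweet'].split(",")]
    # write the re-joined result back into the frame by label
    df_tweet.at[index, 'Tweet'] = ' '.join(p for p in parts if p)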

To delete only non-western characters from the tweets but keep the tweets:

df.Tweet = (df.Tweet
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))
    .str.replace(r'[@#]\w*\b', '', regex=True)
    .str.replace(r'https?://\S+', '', regex=True)
    .str.replace(r'\s[#@%/;$()~_?+-=\.&\']+', '', regex=True)
    .str.strip())

To delete tweets containing non-western characters:

df.Tweet = (df.Tweet
    .str.replace(r'[@#]\w*\b', '', regex=True)
    .str.replace(r'https?://\S+', '', regex=True)
    .str.replace(r'\s[#@%/;$()~_?+-=\.&\']+', '', regex=True)
    .str.strip())
df = df[df.Tweet.apply(lambda x: x.isascii())]
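Note that the isascii() check is all-or-nothing per tweet, so a single emoji or accented letter drops the whole row. A quick illustration (the sample strings are made up):

print("plain ascii tweet".isascii())     # True  -> row is kept
print("tweet with café and 🙂".isascii())  # False -> row is dropped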
Answered By: 99_m4n

Found a solution for removing Chinese (or similar) characters:

df_tweet.Tweet = (df_tweet.Tweet
    .str.replace(r'[@#]\w*\b', '', regex=True)
    .str.replace(r'https?://\S+', '', regex=True)
    .str.replace(r'\s[#@%/;$()~_?+-=\.&\']+', '', regex=True)
    .str.replace(r'[^\x00-\x7f]', '', regex=True)
    .str.strip())
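Keep in mind that [^\x00-\x7f] removes every non-ASCII character, not just Chinese: accented letters and emoji go too. A quick check with a made-up string:

import re

sample = "café 北京 🙂 ok"
print(re.sub(r'[^\x00-\x7f]', '', sample))  # -> "caf   ok" (accent, CJK and emoji stripped)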
Answered By: Janneman