Text preprocessing function can't seem to remove the full Twitter hashtag

Question

I'm trying to write a function that uses regular expressions to remove elements from a string.

In this example the given text is
‘@twitterusername Crazy wind today no birding #Python’

I want it to look like
‘crazy wind today no birding’

Instead, it still includes the hashtag text:
‘crazy wind today no birding python’

I've tried a few different patterns and can't seem to get it right. Here is the code:

`import re
from nltk.stem import WordNetLemmatizer

def process(text):
    processed_text = []

    wordLemm = WordNetLemmatizer()

    # -- Regex patterns --

    # Remove urls pattern
    url_pattern = r"https?://\S+"

    # Remove usernames pattern
    user_pattern = r'@[A-Za-z0-9_]+'

    # Remove all characters except digits and alphabet pattern
    alpha_pattern = "[^a-zA-Z0-9]"

    # Remove twitter hashtags
    hashtag_pattern = r'#\w+\b'

    for tweet_string in text:

        # Change text to lower case
        tweet_string = tweet_string.lower()

        # Remove urls
        tweet_string = re.sub(url_pattern, '', tweet_string)

        # Remove usernames
        tweet_string = re.sub(user_pattern, '', tweet_string)

        # Remove non alphabet
        tweet_string = re.sub(alpha_pattern, " ", tweet_string)

        # Remove hashtags
        tweet_string = re.sub(hashtag_pattern, " ", tweet_string)

        tweetwords = ''
        for word in tweet_string.split():
            # Checking if the word is a stopword.
            #if word not in stopwordlist:
            if len(word) > 1:
                # Lemmatizing the word.
                word = wordLemm.lemmatize(word)
                tweetwords += (word + ' ')

        processed_text.append(tweetwords)

    return processed_text`

Asked By: jensondavis


Answer 1

The problem is that you strip the non-alphanumeric characters before you remove the hashtags. By the time the hashtag pattern runs, the ‘#’ is no longer in the input string, so the hashtag is not recognized and only its text remains. You should reverse these two steps:

    # Remove hashtags
    tweet_string = re.sub(hashtag_pattern, " ", tweet_string)
    # Remove non alphabet
    tweet_string = re.sub(alpha_pattern, " ", tweet_string)
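
To check the reordering in isolation, here is a minimal sketch of the cleaning steps with the hashtag removal moved ahead of the non-alphanumeric removal. The function name `clean_tweet` and the module-level pattern constants are hypothetical names for illustration, not from the question's code, and lemmatization is left out so the snippet runs without NLTK.

```python
import re

# Patterns taken from the question; the constant names are illustrative.
URL_PATTERN = re.compile(r"https?://\S+")
USER_PATTERN = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_PATTERN = re.compile(r"#\w+\b")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")


def clean_tweet(tweet_string):
    """Lower-case a tweet and strip URLs, usernames, hashtags and symbols."""
    tweet_string = tweet_string.lower()
    tweet_string = URL_PATTERN.sub("", tweet_string)
    tweet_string = USER_PATTERN.sub("", tweet_string)
    # Hashtags must go while the '#' is still present in the string.
    tweet_string = HASHTAG_PATTERN.sub(" ", tweet_string)
    # Only now replace every remaining non-alphanumeric character.
    tweet_string = NON_ALNUM_PATTERN.sub(" ", tweet_string)
    # Drop single-character tokens, as the original loop does.
    return " ".join(word for word in tweet_string.split() if len(word) > 1)


print(clean_tweet("@twitterusername Crazy wind today no birding #Python"))
# prints: crazy wind today no birding
```

The same ordering concern applies to the other steps: URLs and usernames are already removed before the non-alphanumeric pass, which is why they worked in the original code.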

Answered By: Emanuel P
