Text preprocessing function can't seem to remove the full Twitter hashtag
Question:
I'm trying to write a function that uses regular expressions to remove elements from a string.
In this example the given text is
‘@twitterusername Crazy wind today no birding #Python’
I want it to look like
‘crazy wind today no birding’
Instead, it still includes the hashtag text:
‘crazy wind today no birding python’
I've tried a few different patterns and can't seem to get it right. Here is the code:
`import re
from nltk.stem import WordNetLemmatizer

def process(text):
    processed_text = []
    wordLemm = WordNetLemmatizer()
    # -- Regex patterns --
    # Remove urls pattern
    url_pattern = r"https?://\S+"
    # Remove usernames pattern
    user_pattern = r'@[A-Za-z0-9_]+'
    # Remove all characters except digits and alphabet pattern
    alpha_pattern = "[^a-zA-Z0-9]"
    # Remove twitter hashtags
    hashtag_pattern = r'#\w+\b'
    for tweet_string in text:
        # Change text to lower case
        tweet_string = tweet_string.lower()
        # Remove urls
        tweet_string = re.sub(url_pattern, '', tweet_string)
        # Remove usernames
        tweet_string = re.sub(user_pattern, '', tweet_string)
        # Remove non alphabet
        tweet_string = re.sub(alpha_pattern, " ", tweet_string)
        # Remove hashtags
        tweet_string = re.sub(hashtag_pattern, " ", tweet_string)
        tweetwords = ''
        for word in tweet_string.split():
            # Checking if the word is a stopword.
            #if word not in stopwordlist:
            if len(word)>1:
                # Lemmatizing the word.
                word = wordLemm.lemmatize(word)
                tweetwords += (word+' ')
        processed_text.append(tweetwords)
    return processed_text`
Answers:
The problem is that you replace the non-alphanumeric characters before you remove the hashtags. By the time the hashtag pattern runs, the ‘#’ has already been turned into a space, so nothing matches and the word ‘python’ stays in the output. You should swap these two steps:
# Remove hashtags
tweet_string = re.sub(hashtag_pattern, " ", tweet_string)
# Remove non alphabet
tweet_string = re.sub(alpha_pattern, " ", tweet_string)
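To check the reordering on the example tweet, here is a minimal sketch of the cleaning steps only. It assumes the patterns from the question with their backslashes in place (`\S`, `\w`, `\b`), it leaves out the lemmatizer so that it runs without NLTK, and the name `strip_tweet` is made up for this illustration.

```python
import re

# Patterns from the question, compiled once.
URL_PATTERN = re.compile(r"https?://\S+")
USER_PATTERN = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_PATTERN = re.compile(r"#\w+\b")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")


def strip_tweet(tweet):
    """Lowercase a tweet and drop URLs, usernames, and whole hashtags.

    Hypothetical helper for illustration; lemmatization is omitted.
    """
    tweet = tweet.lower()
    tweet = URL_PATTERN.sub("", tweet)
    tweet = USER_PATTERN.sub("", tweet)
    # Hashtags are removed first, while the '#' marker is still present.
    tweet = HASHTAG_PATTERN.sub(" ", tweet)
    # Only now replace every remaining non-alphanumeric character.
    tweet = NON_ALNUM_PATTERN.sub(" ", tweet)
    # Keep words longer than one character, as in the question's loop.
    return " ".join(word for word in tweet.split() if len(word) > 1)


print(strip_tweet("@twitterusername Crazy wind today no birding #Python"))
# crazy wind today no birding
```

Joining with `" ".join(...)` also avoids the trailing space that `tweetwords += (word+' ')` leaves at the end of each processed tweet.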