Unable to import process_tweets from utils
Question:
Thanks for looking into this. I have a Python program that needs process_tweet
and build_freqs
for an NLP task. nltk
is already installed; utils
wasn't, so I installed it via pip install utils,
but the two functions mentioned above apparently aren't in it. The error I get is the standard one:
ImportError: cannot import name 'process_tweet' from
'utils' (C:\Python\lib\site-packages\utils\__init__.py)
What have I done wrong, or is something missing?
I also referred to this Stack Overflow answer, but it didn't help.
Answers:
Try this code; it should work:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')  # needed once for stopwords.words('english')

def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = re.sub(r'\$\w*', '', tweet)               # remove stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)            # remove old-style retweet "RT"
    tweet = re.sub(r'https?://.*[\r\n]*', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)                   # remove only the '#' sign
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and
                word not in string.punctuation):
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)
    return tweets_clean
If you are following the NLP course on deeplearning.ai, then I believe the utils.py file was created by the instructors of that course for use within the lab sessions, and shouldn't be confused with the unrelated utils package on PyPI.
You can easily view any function's source with ?? in a notebook, for example in this case: process_tweet?? (the code below is from the deeplearning.ai NLP course's custom utils library):
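As a quick illustration of why pip install utils can't help here (the PyPI utils package simply doesn't contain process_tweet) and why saving the course's utils.py next to your script does: a utils.py in a directory that comes first on sys.path shadows the site-packages package. The temp-directory setup and the trivial process_tweet body below are only stand-ins for demonstration:

```python
import os
import sys
import tempfile

# Write a stand-in utils.py, as if you had saved the course file locally.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "utils.py"), "w") as f:
    f.write("def process_tweet(tweet):\n    return tweet.lower().split()\n")

sys.modules.pop("utils", None)   # forget any previously imported 'utils'
sys.path.insert(0, workdir)      # mimics running a script from that folder

from utils import process_tweet  # now resolves to the local file
print(process_tweet("Hello World"))  # → ['hello', 'world']
```

In practice you would just run pip uninstall utils and copy utils.py from the course lab into your project directory.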
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?://.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean
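The question also asks about build_freqs. I don't have the exact course file at hand, but in the course it builds a dictionary mapping each (word, sentiment label) pair to its frequency across the labelled tweets. The sketch below reconstructs that idea, with a plain lowercase split standing in for process_tweet so it runs on its own:

```python
def build_freqs(tweets, ys):
    """Map each (word, label) pair to its count across the corpus.

    tweets: list of tweet strings
    ys:     list of 0/1 sentiment labels, one per tweet
    """
    freqs = {}
    for y, tweet in zip(ys, tweets):
        # The course version tokenizes with process_tweet(tweet); a plain
        # lowercase split stands in here so this sketch is self-contained.
        for word in tweet.lower().split():
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

print(build_freqs(["happy happy day", "sad day"], [1, 0]))
# → {('happy', 1): 2, ('day', 1): 1, ('sad', 0): 1, ('day', 0): 1}
```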
I guess you don't need to use process_tweet
at all. The code in the course is just a shortcut that bundles everything from the initial cleaning through the stemming step; hence, just skip the helper and print out tweet_stem
to see the difference between the original text and the preprocessed text.
You can try this:
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def preprocess_tweet(tweet):
    # cleaning
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'@', '', tweet)
    # tokenization
    token = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)
    tweet_tokenized = token.tokenize(tweet)
    # stop words
    stopwords_english = stopwords.words('english')
    tweet_processed = []
    for word in tweet_tokenized:
        if (word not in stopwords_english and
                word not in string.punctuation):
            tweet_processed.append(word)
    # stemming
    tweet_stem = []
    stem = PorterStemmer()
    for word in tweet_processed:
        stem_word = stem.stem(word)
        tweet_stem.append(stem_word)
    return tweet_stem
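For reference, the regex cleaning steps used in the answers above can be exercised on their own with just the standard library (the nltk tokenizer and stemmer steps are omitted here):

```python
import re

def clean_tweet(tweet):
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # drop old-style "RT "
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # drop hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # keep the word, drop '#'
    return tweet.strip()

print(clean_tweet("RT #NLP is fun https://example.com"))  # → 'NLP is fun'
```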