ValueError: empty vocabulary; perhaps the documents only contain stop words

Question:

I’m using (for the first time) the scikit library and I got this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words
File "C:UsersA605563DesktopvelibProjetPresoTraitementTwitterDico.py", line 33, in <module>
X_train_counts = count_vect.fit_transform(FileTweets)
File "C:Python27Libsite-packagessklearnfeature_extractiontext.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:Python27Libsite-packagessklearnfeature_extractiontext.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only contain stop words

But I don’t understand why that’s happening.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk


TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets =  TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
    f2.write(line.replace('n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()

My data is raw tweets:

11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"

Does anybody know what’s happening here?

Asked By: Honolulu

||

Answers:

I found a solution:

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk 
from io import StringIO


TweetsFile = open('tweets2015-08-13.csv','r+')
yourResult = [line.split(',') for line in TweetsFile.readlines()]
count_vect = CountVectorizer(input="file")
docs_new = [ StringIO.StringIO(x) for x in yourResult ]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print X_train_counts.shape
Answered By: Honolulu

This is a much simpler solution:

x = open('bad_words_train.txt', 'r+')
count_vect = CountVectorizer(input=file)
X_train = count_vect.fit_transform(x)
print(X_train)
Answered By: LoveMeow
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.