AttributeError: 'list' object has no attribute 'lower' with CountVectorizer
Question:
I am trying to make a prediction on a pandas dataframe in Python. Somehow the CountVectorizer can’t convert the data. Does anyone know what’s causing the problem?
This is my code:
import pickle

import pandas as pd
from pandas.io.json import json_normalize
from sklearn.feature_extraction.text import CountVectorizer

filename = 'final_model.sav'

# 'response' is the result of an earlier API call (not shown)
print(response.status_code)
data = response.json()
print(data)

# Vocabulary source: comments from the dictionary file
dictionary = pd.read_json('rating_company_small.json', lines=True)
dictionary_df = pd.DataFrame()
dictionary_df["comment text"] = dictionary["comment"]

data = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
print(data)

df = pd.DataFrame()
df["comment text"] = data["Text"]
df["status"] = data["Status"]
print(df)

# Processing is a custom helper module (not shown)
Processing.dataframe_cleaning(df)
comment_data = df['comment text']

tfidf = CountVectorizer()
tfidf.fit(dictionary_df["comment text"])
Test_X_Tfidf = tfidf.transform(df["comment text"])  # <-- fails here
print(comment_data)
print(Test_X_Tfidf)

loaded_model = pickle.load(open(filename, 'rb'))
predictions_NB = loaded_model.predict(Test_X_Tfidf)
This is the dataframe:
                         comment text    status
0                   [slecht, bedrijf]    string
1  [leuk, bedrijfje, goed, behandeld]  Approved
2  [leuk, bedrijfje, goed, behandeld]  Approved
3                   [leuk, bedrijfje]  Approved
Full error message:
Traceback (most recent call last):
  File "Request.py", line 36, in <module>
    Test_X_Tfidf = tfidf.transform(df["comment text"])
  File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1112, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
I’m expecting it to return the predictions on the dataframe.
Answers:
CountVectorizer cannot directly handle a Series of lists, which is why you're getting that error (lower is a string method). It looks like you want a MultiLabelBinarizer instead, which can handle this input structure:
from sklearn.preprocessing import MultiLabelBinarizer

# Fit on the token lists and build a 0/1 indicator matrix
mlb = MultiLabelBinarizer()
mlb.fit(df["comment text"])
pd.DataFrame(mlb.transform(df["comment text"]), columns=mlb.classes_)
   bedrijf  bedrijfje  behandeld  goed  leuk  slecht
0        1          0          0     0     0       1
1        0          1          1     1     1       0
2        0          1          1     1     1       0
3        0          1          0     0     1       0
However, this approach won't account for duplicate elements in the lists: the output elements are either 0 or 1. If you do want actual counts, you can join the lists into strings and then use a CountVectorizer, since it expects strings:
text = df["comment text"].map(' '.join)

count_vec = CountVectorizer()
count_vec.fit(text)
# use get_feature_names_out() on scikit-learn >= 1.0
pd.DataFrame(count_vec.transform(text).toarray(),
             columns=count_vec.get_feature_names())
   bedrijf  bedrijfje  behandeld  goed  leuk  slecht
0        1          0          0     0     0       1
1        0          1          1     1     1       0
2        0          1          1     1     1       0
3        0          1          0     0     1       0
Note that this is not the same as the tf-idf of the input strings; here you just have the actual counts. For tf-idf you have TfidfVectorizer, which for the same example would produce:
    bedrijf  bedrijfje  behandeld      goed      leuk    slecht
0  0.707107   0.000000   0.000000  0.000000  0.000000  0.707107
1  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
2  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
3  0.000000   0.707107   0.000000  0.000000  0.707107  0.000000
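For reference, a minimal sketch of how that output can be reproduced, assuming the same joined text Series from the previous example:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the same joined strings
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(text)
# use get_feature_names_out() on scikit-learn >= 1.0
pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vec.get_feature_names())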
CountVectorizer expects an iterable of raw strings, so it will crash if you pass it a nested list of tokens.
Instead of a tokenized list like
['ham', 'go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']
send the data to the CountVectorizer as a raw string:
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
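Applied to the code in the question, a minimal sketch of the fix would look like the following; this assumes final_model.sav was trained on count features built the same way, and that the dictionary comments are token lists as well (skip the join for any column that already holds plain strings):
# Join the token lists into plain strings before vectorizing
dictionary_text = dictionary_df["comment text"].map(' '.join)
comment_text = df["comment text"].map(' '.join)

vectorizer = CountVectorizer()
vectorizer.fit(dictionary_text)  # vocabulary comes from the dictionary file
Test_X = vectorizer.transform(comment_text)

loaded_model = pickle.load(open(filename, 'rb'))
predictions_NB = loaded_model.predict(Test_X)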