get_feature_names not found in countvectorizer()

Question

I’m mining the Stack Overflow data dump of posts about deep learning libraries. I’d like to identify stop words in my corpus (like ‘python’ for instance). I want to get my feature names so I can identify the words with highest term frequencies.

I create my documents and my corpus as follows:

with open("StackOverflow_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '

corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = []
feat = x.get_feature_names()
for i,arr in enumerate(x):
    for x, ele in enumerate(arr):
        if i == 0:
            Dict += ('pytorch', feat[x], ele)
        if i == 1:
            Dict += ('tensorflow', feat[x], ele)
        if i == 2:
            Dict += ('keras', feat[x], ele)

sorted_arr = sorted(Dict, key=lambda tup: tup[2])

However, I am getting:

  File "sklearn_stopwords.py", line 83, in <module>
    main()
  File "sklearn_stopwords.py", line 50, in main
    feat = x.get_feature_names()
  File "/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found

Asked By: maddie

Answer 1

get_feature_names is a method of the CountVectorizer object. You are calling it on the result of fit_transform, which is a scipy.sparse matrix and has no such method.

You need to call vectorizer.get_feature_names() instead.

Try this MVCE:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

X = vectorizer.fit_transform(corpus)

features = vectorizer.get_feature_names()

features

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Answered By: Scott Boston

Answer 2

Make sure that you are using scikit-learn 1.0 or later.

The method get_feature_names_out() replaces get_feature_names(), which was deprecated in version 1.0 and removed in version 1.2.

Example:

from sklearn.feature_extraction.text import CountVectorizer

n_gram_range = (1, 1)
stop_words = "english"


doc = """
         Supervised learning is the machine learning task of 
         learning a function that maps an input to an output based 
         on example input-output pairs.
      """

# Extract candidate words/phrases
count = CountVectorizer(ngram_range=n_gram_range,
                        stop_words=stop_words).fit([doc])

# candidates = count.get_feature_names()
candidates = count.get_feature_names_out()
candidates

Output:

array(['based', 'example', 'function', 'input', 'learning', 'machine',
       'maps', 'output', 'pairs', 'supervised', 'task'], dtype=object)
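If the same script has to run on both old and new scikit-learn installations, one option is a small wrapper that picks whichever method exists. This is only a sketch: feature_names is a hypothetical helper name, not part of scikit-learn, and the corpus is a set of short sample sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the fitted vocabulary as a list (hypothetical helper)."""
    # get_feature_names_out() exists from scikit-learn 1.0 onward.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Fallback for older releases that only have get_feature_names().
    return list(vectorizer.get_feature_names())

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer().fit(corpus)
names = feature_names(vectorizer)
print(names)
```

Wrapping the result in list() gives the same type on every version, since the newer method returns a NumPy array and the older one returns a list.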

Answered By: Nikita Malviya
