Count distinct words from a Pandas Data Frame


I have a Pandas DataFrame in which one column contains text. I’d like to get a list of the unique words appearing across the entire column (splitting on spaces only).

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])


The output should look like this:

['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']

It wouldn’t hurt to get a count as well, but it is not required.

Asked By: ADJ



uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)
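
If the order of first appearance matters (a set does not keep it), here is a minimal sketch using `dict.fromkeys`. It assumes Python 3.7+, where dicts keep insertion order, and the name `unique_words` is only illustrative. Note that `split()` with no argument also collapses repeated whitespace, unlike `split(" ")`.

```python
r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# Lowercase everything, then split on any run of whitespace.
words = " ".join(r1).lower().split()

# dict.fromkeys drops duplicates while keeping first-seen order
# (guaranteed for dicts in Python 3.7+).
unique_words = list(dict.fromkeys(words))

print(unique_words)
print(len(unique_words))
```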
Answered By: Brionius

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
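
In Python 3, `.items()` returns a view rather than a list, so the output above reflects Python 2. A small sketch of the same idea using `Counter.most_common()`, which returns a list of `(word, count)` pairs in either version:

```python
from collections import Counter

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# Count case-sensitively, as in the answer above ('My' and 'my' stay separate).
counts = Counter(" ".join(r1).split(" "))

# most_common(n) returns the n most frequent (word, count) pairs.
top = counts.most_common(1)
print(top)          # [('is', 2)]
print(len(counts))  # 9 distinct words
```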
Answered By: Ofir Israel

Building on @Ofir Israel’s answer, specific to Pandas:

from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()

This gives you what you want: it converts the text column's values to a list, joins them into one string, splits on spaces, and counts the occurrences of each word.
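
One caveat: `" ".join(...)` raises a `TypeError` if the column contains missing values, because NaN is a float. A hedged sketch that drops them first, assuming a `text` column as in the question (the extra `None` row is made up for illustration):

```python
from collections import Counter

import pandas as pd

# Sample frame with one missing value (hypothetical, for illustration).
df = pd.DataFrame({'text': ['My nickname is ft.jgt',
                            None,
                            'Someone is going to my place']})

# dropna() removes missing rows so that join only sees strings.
joined = " ".join(df['text'].dropna())
result = Counter(joined.lower().split())

print(result['my'])   # 2
print(len(result))    # 8
```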

Answered By: EdChum

Use a set to create the sequence of unique elements.

Do some clean-up on df to get the strings in lower case and split:

df['text'].str.lower().str.split()

0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to the set.update function to collect the unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}

Or use a Counter(), as suggested in the comments:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
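
A pandas-only variant of the same idea can be sketched with `Series.explode` (available since pandas 0.25) and `value_counts`. This is an alternative formulation, not the approach described above:

```python
import pandas as pd

df = pd.DataFrame({'text': ['My nickname is ft.jgt',
                            'Someone is going to my place']})

# Lowercase, split each row into a list of words, then put one word per row.
words = df['text'].str.lower().str.split().explode()

# value_counts() tallies each distinct word; the index holds the unique words.
counts = words.value_counts()

print(counts['is'])   # 2
print(len(counts))    # 8
```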
Answered By: Zeugma

If you want to do it from the DataFrame construct:

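import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])

# Count the words in each row, then sum the per-row counts column-wise.
# (pd.Series(...).value_counts() replaces the top-level pd.value_counts, deprecated in pandas 2.1.)
df['text'].apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)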

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64

If you want more flexible tokenization, use nltk and its tokenize module.
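
As a lightweight illustration of what a different tokenization changes, here is a sketch using the standard library's `re` module as a stand-in. This is not nltk, and the pattern `\w+` is just one possible choice:

```python
import re
from collections import Counter

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# \w+ matches runs of letters, digits and underscores, so punctuation
# acts as a separator: 'ft.jgt' becomes 'ft' and 'jgt'.
tokens = re.findall(r"\w+", " ".join(r1).lower())
counts = Counter(tokens)

print(sorted(counts))
```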

Answered By: cwharland

If the DataFrame has columns ‘a’, ‘b’, ‘c’, etc. and you want to count the distinct words in each column, you could use:
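
A minimal sketch of per-column counting, assuming every column holds space-separated text. The column names `a` and `b`, the sample values, and the name `per_column` are hypothetical:

```python
from collections import Counter

import pandas as pd

# Hypothetical two-column frame for illustration.
df = pd.DataFrame({'a': ['red green', 'green blue'],
                   'b': ['one two', 'two two']})

# Build one Counter per column: join the column's rows, then split on whitespace.
per_column = {col: Counter(" ".join(df[col].astype(str)).split())
              for col in df.columns}

print(per_column['a']['green'])   # 2
print(len(per_column['b']))       # 2 distinct words in column 'b'
```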

Answered By: Rakesh Chaudhari


Use collections.Counter to get the counts of unique words in a DataFrame column (without stopwords):


$ cat test.csv
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...


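from collections import Counter
from string import punctuation

import pandas as pd

from nltk.corpus import stopwords
from nltk import word_tokenize

# English stopwords plus single punctuation characters.
# Note: the stoplist is built here but not applied in the count below.
stoplist = set(stopwords.words('english') + list(punctuation))

# The file is read as tab-separated; the code expects a 'Description' column.
df = pd.read_csv("test.csv", sep='\t')

texts = df['Description'].str.lower()

# Join all rows with newlines, tokenize, and count each token.
word_counts = Counter(word_tokenize('\n'.join(texts)))

word_counts.most_common()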

[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
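
The output above still contains punctuation tokens such as `','` and `'&'`. A hedged sketch of filtering a `Counter` against a stoplist follows. To stay self-contained it uses a tiny hand-written stopword set rather than nltk's list, and the sample counts are a few entries copied from the output above:

```python
from collections import Counter
from string import punctuation

# A few (token, count) pairs copied from the output above.
word_counts = Counter({'california': 2, 'data': 2, '&': 1, ',': 1, '$': 1, 'new': 1})

# Hypothetical mini stopword set standing in for nltk's English list,
# combined with single punctuation characters.
stoplist = {'the', 'a', 'of', 'and'} | set(punctuation)

# Keep only tokens that are not in the stoplist.
filtered = Counter({w: c for w, c in word_counts.items() if w not in stoplist})

print(filtered.most_common())
```

Multi-character tokens such as `'...'` are not single punctuation characters, so they would need their own entry in the stoplist.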
Answered By: alvas

Adding to the discussion, here are the timings for three of the proposed solutions (skipping conversion to list) on a 92,816-row DataFrame:

from collections import Counter
results = set()

%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))

323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)

316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))

365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))




len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())


I tried the Pandas-only approach too, but it took far longer and used more than 25 GB of RAM, making my 32 GB laptop swap.

All the others are pretty fast. I would use solution 1 for being a one-liner, or solution 3 if word counts are needed.
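
To reproduce this kind of comparison without the original data, here is a rough sketch using the standard library's `timeit` on synthetic rows. The data is made up, plain lists stand in for the DataFrame column, and absolute timings will differ from the numbers above:

```python
import timeit
from collections import Counter

# Synthetic stand-in for the text column (made-up data).
rows = ['Alpha beta gamma', 'beta Gamma delta', 'delta epsilon alpha'] * 1000

def unique_with_set():
    # Solution 1: join, lowercase, split, deduplicate.
    return set(" ".join(rows).lower().split(" "))

def counts_with_counter():
    # Solution 3: same tokens, but keep per-word counts.
    return Counter(" ".join(rows).lower().split(" "))

# Both approaches should agree on the number of distinct words.
print(len(unique_with_set()), len(counts_with_counter()))

# Time 10 runs of each; results depend on the machine.
print(timeit.timeit(unique_with_set, number=10))
print(timeit.timeit(counts_with_counter, number=10))
```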

Answered By: Ludecan