Count distinct words from a Pandas Data Frame
Question:
I’ve a Pandas data frame, where one column contains text. I’d like to get a list of unique words appearing across the entire column (space being the only split).
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
The output should look like this:
['my','nickname','is','ft.jgt','someone','going','to','place']
It wouldn’t hurt to get a count as well, but it is not required.
Answers:
uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)
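A set does not keep the words in any particular order, while the output shown in the question lists them in order of first appearance. If that order matters, a minimal sketch (assuming Python 3.7+, where dicts keep insertion order) could look like this; the name `unique_words` is only illustrative:

```python
# Sample rows from the question.
r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# Lowercase everything, then split on whitespace.
words = " ".join(r1).lower().split()

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while preserving first-appearance order.
unique_words = list(dict.fromkeys(words))

print(unique_words)
print(len(unique_words))
```

For the two sample rows this should reproduce the list requested in the question, with a count of 8.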
Use collections.Counter:
>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
Building on @Ofir Israel’s answer, specific to Pandas:
from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result
This gives you what you want: it converts the values of the text column to a list, joins them into one string, splits on spaces, and counts each word.
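Two things may trip this up on real data: the join fails if the column contains missing values, and 'My' and 'my' are counted separately. A hedged variant that handles both is sketched below; the extra `None` row is a made-up example, not part of the question's data:

```python
from collections import Counter

import pandas as pd

# Hypothetical frame: the question's two rows plus one missing value.
df = pd.DataFrame({'text': ['My nickname is ft.jgt',
                            'Someone is going to my place',
                            None]})

# Drop missing rows first; " ".join() raises TypeError on non-string items.
texts = df['text'].dropna().astype(str)

# Lowercase so 'My' and 'my' are counted together,
# then split on any run of whitespace.
word_counts = Counter(" ".join(texts).lower().split())

print(word_counts['my'])
print(len(word_counts))
```

Calling `split()` with no argument also avoids empty-string "words" when the text contains double spaces.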
Use a set to create the sequence of unique elements.
Do some clean-up on df to get the strings in lower case and split:
df['text'].str.lower().str.split()
Out[43]:
0 [my, nickname, is, ft.jgt]
1 [someone, is, going, to, my, place]
Each list in this column can be passed to the set.update function to get unique values. Use apply to do so:
results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)
set(['someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'])
Or use it with Counter(), as suggested in the comments:
from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
If you want to do it from the DataFrame construct:
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
df.text.apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0)
My 1
Someone 1
ft.jgt 1
going 1
is 2
my 1
nickname 1
place 1
to 1
dtype: float64
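The apply call above builds an intermediate DataFrame with one row per input row and one column per distinct word, filling the gaps with NaN, which is why the summed counts come back as float64. A lighter sketch, assuming pandas 0.25 or later for `Series.explode`, keeps everything in a single Series:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Split each row into a list of words, then give every word its own row.
words = df['text'].str.split().explode()

# Count occurrences; the result is a Series indexed by word.
counts = words.value_counts()

print(counts['is'])
print(counts.sort_index())
```

Add `.str.lower()` before `.str.split()` if 'My' and 'my' should be merged.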
If you want more flexible tokenization, use nltk and its tokenize module.
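As a dependency-free illustration of what "more flexible" can mean, here is a sketch that uses the standard-library `re` module instead of nltk. The pattern is an assumption chosen for this sample and would need tuning for real text:

```python
import re
from collections import Counter

# The question's rows, with a trailing full stop added for illustration.
r1 = ['My nickname is ft.jgt', 'Someone is going to my place.']

# Illustrative pattern (an assumption): a token is a run of word characters,
# optionally continued by '.' or "'" plus more word characters, so 'ft.jgt'
# stays whole while a trailing '.' is dropped.
TOKEN = re.compile(r"\w+(?:[.']\w+)*")

tokens = TOKEN.findall(" ".join(r1).lower())
word_counts = Counter(tokens)

print(sorted(word_counts))
```

Splitting on spaces alone would count 'place.' and 'place' as different words; a tokenizer, whether this regex or nltk's, avoids that.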
If the DataFrame has columns 'a', 'b', 'c', etc., and you want to count the distinct words in a column (one word per cell), you could use:
Counter(dataframe['a']).items()
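To cover every column rather than just 'a', the same call can be wrapped in a dict comprehension. This is a sketch on a made-up frame with one word per cell; `counts_by_column` is an illustrative name:

```python
from collections import Counter

import pandas as pd

# Hypothetical frame with one word per cell.
dataframe = pd.DataFrame({'a': ['red', 'blue', 'red'],
                          'b': ['cat', 'cat', 'cat']})

# One Counter per column, keyed by column name.
counts_by_column = {col: Counter(dataframe[col]) for col in dataframe.columns}

print(counts_by_column['a'])
print(len(counts_by_column['b']))
```

Note that this counts whole cell values; cells that hold several words would still need to be split first.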
TL;DR
Use collections.Counter to get the counts of unique words in a dataframe column (without stopwords)
Given:
$ cat test.csv
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...
Code:
from collections import Counter
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
stoplist = set(stopwords.words('english') + list(punctuation))
df = pd.read_csv("test.csv", sep='\t')
texts = df['Description'].str.lower()
word_counts = Counter(word_tokenize('\n'.join(texts)))
word_counts.most_common()
[out]:
[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
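The output above still contains tokens such as '...', '&' and ',', which suggests the `stoplist` built in the code was never applied to the tokens. A minimal sketch of that filtering step follows; it uses a tiny hand-written stoplist and a made-up token list so that it runs without nltk:

```python
from collections import Counter
from string import punctuation

# Tiny stand-in stoplist (an assumption for illustration); in the code
# above this role is played by the nltk-based `stoplist`.
# '...' is added explicitly because string.punctuation only holds
# single characters.
stoplist = {'is', 'to', 'my'} | set(punctuation) | {'...'}

# Tokens as a tokenizer might return them (hypothetical sample).
tokens = ['my', 'nickname', 'is', 'ft.jgt', ',', 'someone', 'is',
          'going', 'to', 'my', 'place', '...']

# Keep only tokens that are not in the stoplist, then count them.
word_counts = Counter(tok for tok in tokens if tok not in stoplist)

print(word_counts.most_common())
```

With the real data, the same generator expression would wrap the `word_tokenize(...)` call before it is passed to `Counter`.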
Adding to the discussion, here are the timings for three of the proposed solutions (skipping conversion to list) on a 92816 row dataframe:
from collections import Counter
results = set()
%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))
323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)
316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))
365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))
13561
len(results)
13561
len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())
13561
I tried the Pandas-only approach too, but it took far longer and used more than 25 GB of RAM, making my 32 GB laptop swap.
All the others are pretty fast. I would use solution 1 for being a one-liner, or solution 3 if word counts are needed.