Count hashtag frequency in a dataframe
Question:
I am trying to count the frequency of hashtag words in the ‘text’ column of my dataframe.
index text
1 ello ello ello ello #hello #ello
2 red green blue black #colours
3 Season greetings #hello #goodbye
4 morning #goodMorning #hello
5 my favourite animal #dog
word_freq = df.text.str.split(expand=True).stack().value_counts()
The above code performs a frequency count on every word in the text column, but I just want to return the hashtag frequencies.
For example after running the code on my dataframe above, it should return
#hello 3
#goodbye 1
#goodMorning 1
#ello 1
#colours 1
#dog 1
Is there a way of slightly re-jigging my word_freq code so it only counts hashtag words and returns them in the way I put above? Thanks in advance.
Answers:
Use Series.str.findall on column text to find all hashtag words, then use Series.explode + Series.value_counts:
counts = df['text'].str.findall(r'(#\w+)').explode().value_counts()
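For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample dataframe from the question and applies the findall approach. The way the dataframe is constructed is an assumption; only the 'text' column and its contents come from the question.

```python
import pandas as pd

# Sample data copied from the question; the construction itself is illustrative.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    },
    index=[1, 2, 3, 4, 5],
)

# findall returns a list of matches per row; explode puts one hashtag on each row.
counts = df["text"].str.findall(r"#\w+").explode().value_counts()
print(counts)
```

Rows without any hashtag produce an empty list, which explode turns into NaN, and value_counts drops NaN by default, so such rows should not affect the result.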
Another idea using Series.str.split + DataFrame.stack:
s = df['text'].str.split(expand=True).stack()
counts = s[lambda x: x.str.startswith('#')].value_counts()
Result:
print(counts)
#hello 3
#dog 1
#colours 1
#ello 1
#goodMorning 1
#goodbye 1
Name: text, dtype: int64
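The two approaches above give the same counts on the sample data, but they can differ once punctuation is attached to a hashtag. A small sketch with made-up input (the two strings below are hypothetical, not from the question) shows the difference. The na=False argument is only a guard in case padded missing values survive stack.

```python
import pandas as pd

# Hypothetical input: the first hashtag is followed directly by a comma.
texts = pd.Series(["nice day #hello, everyone", "bye #hello"], name="text")

# Whitespace split keeps the trailing comma as part of the token.
tokens = texts.str.split(expand=True).stack()
by_split = tokens[tokens.str.startswith("#", na=False)].value_counts()

# The regex stops at the first non-word character, so the comma is left out.
by_regex = texts.str.findall(r"#\w+").explode().value_counts()

print(by_split)
print(by_regex)
```

If the text is free-form, the regex version is probably the safer choice, since '#hello,' and '#hello' are counted as one hashtag.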
Another way is to use str.extractall, which removes the # from the result, followed by value_counts:
s = df['text'].str.extractall(r'(?<=#)(\w*)')[0].value_counts()
print(s)
hello 3
colours 1
goodbye 1
ello 1
goodMorning 1
dog 1
Name: 0, dtype: int64
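To make the [0] in that line less mysterious, here is a short sketch on the question's sample data (rebuilt here as an assumption). extractall returns a DataFrame with one column per capture group, indexed by the original row label and a match number, so column 0 holds the text captured after each '#'.

```python
import pandas as pd

# Sample data copied from the question; the construction itself is illustrative.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    },
    index=[1, 2, 3, 4, 5],
)

# One row per match, indexed by (original row label, match number).
matches = df["text"].str.extractall(r"(?<=#)(\w*)")
print(matches)

# Column 0 is the first (and only) capture group: the hashtag without '#'.
s = matches[0].value_counts()
print(s)
```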
A slightly more verbose solution, but it does the trick: convert the counts to a dictionary, then delete every key that does not contain a #.
dictionary_count=data_100.TicketDescription.str.split(expand=True).stack().value_counts().to_dict()
dictionary_count={'accessgtgtjust': 1,
'sent': 1,
'investigate': 1,
'edit': 1,
'#prd': 1,
'getting': 1}
ert=[i for i in list(dictionary_count.keys()) if '#' in i]
ert
Out[238]: ['#prd']
unwanted = set(dictionary_count.keys()) - set(ert)
for unwanted_key in unwanted:
del dictionary_count[unwanted_key]
dictionary_count
Out[241]: {'#prd': 1}
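The same filtering can probably be written more compactly with a dict comprehension instead of collecting and deleting the unwanted keys. This is a sketch using the dictionary shown above; the name hashtag_counts is made up for illustration.

```python
# Word counts as shown in the output above.
dictionary_count = {
    "accessgtgtjust": 1,
    "sent": 1,
    "investigate": 1,
    "edit": 1,
    "#prd": 1,
    "getting": 1,
}

# Keep only the entries whose key contains '#', as the original filter does.
hashtag_counts = {word: n for word, n in dictionary_count.items() if "#" in word}
print(hashtag_counts)
```

This builds a new dictionary and leaves dictionary_count untouched, which avoids deleting keys one by one.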