Count hashtag frequency in a dataframe

Question:

I am trying to count the frequency of hashtag words in the ‘text’ column of my dataframe.

index        text
1            ello ello ello ello #hello #ello
2            red green blue black #colours
3            Season greetings #hello #goodbye 
4            morning #goodMorning #hello
5            my favourite animal #dog

word_freq = df.text.str.split(expand=True).stack().value_counts()

The above code performs a frequency count on every word in the text column, but I only want to return the hashtag frequencies.

For example after running the code on my dataframe above, it should return

#hello        3
#goodbye      1
#goodMorning  1
#ello         1
#colours      1
#dog          1

Is there a way of slightly re-jigging my word_freq code so it only counts hashtag words and returns them in the way I put above? Thanks in advance.

Asked By: user12194362

||

Answers:

Use Series.str.findall on column text to find all hashtag words then use Series.explode + Series.value_counts:

counts = df['text'].str.findall(r'(#\w+)').explode().value_counts()
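
For reference, here is a minimal self-contained sketch of the findall approach. The DataFrame is rebuilt from the sample in the question, and the pattern assumes a hashtag is a `#` followed by one or more word characters (letters, digits, underscore); the name `hashtag_counts` is just illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "text": [
        "ello ello ello ello #hello #ello",
        "red green blue black #colours",
        "Season greetings #hello #goodbye",
        "morning #goodMorning #hello",
        "my favourite animal #dog",
    ]
})

# findall returns a list of matches per row; explode flattens those lists
# to one hashtag per row; value_counts then tallies each distinct hashtag.
hashtag_counts = df["text"].str.findall(r"#\w+").explode().value_counts()
print(hashtag_counts)
```

Rows without any hashtag produce an empty list, which explode turns into NaN; value_counts drops NaN by default, so such rows do not show up in the result.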

Another idea using Series.str.split + DataFrame.stack:

s = df['text'].str.split(expand=True).stack()
counts = s[lambda x: x.str.startswith('#')].value_counts()

Result:

print(counts)
#hello          3
#dog            1
#colours        1
#ello           1
#goodMorning    1
#goodbye        1
Name: text, dtype: int64
Answered By: Shubham Sharma

Another way is to use str.extractall with a lookbehind, which drops the # from the result, followed by value_counts:

s = df['text'].str.extractall(r'(?<=#)(\w*)')[0].value_counts()
print(s)
hello          3
colours        1
goodbye        1
ello           1
goodMorning    1
dog            1
Name: 0, dtype: int64
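
A variant of the same idea, sketched under the assumption that a plain capture group is acceptable in place of the lookbehind: the `#` is matched but left outside the group, so only the word itself is extracted. The DataFrame is rebuilt from the question's sample, and `tag_counts` is an illustrative name.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "text": [
        "ello ello ello ello #hello #ello",
        "red green blue black #colours",
        "Season greetings #hello #goodbye",
        "morning #goodMorning #hello",
        "my favourite animal #dog",
    ]
})

# extractall returns one row per match and one column per capture group.
# Column 0 holds the text captured after '#', without the '#' itself.
tag_counts = df["text"].str.extractall(r"#(\w+)")[0].value_counts()
print(tag_counts)
```

Using `\w+` rather than `\w*` also skips a bare `#` with nothing after it, which would otherwise be counted as an empty string.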
Answered By: Ben.T

A more verbose solution, but it does the trick: build a dictionary of all word counts, then delete every key that does not contain '#'.

dictionary_count=data_100.TicketDescription.str.split(expand=True).stack().value_counts().to_dict()

dictionary_count={'accessgtgtjust': 1,
'sent': 1,
'investigate': 1,
'edit': 1,
'#prd': 1,
'getting': 1}

ert=[i for i in list(dictionary_count.keys()) if '#' in i]

ert
Out[238]: ['#prd']

unwanted = set(dictionary_count.keys()) - set(ert)

for unwanted_key in unwanted: 
   del dictionary_count[unwanted_key]

dictionary_count
Out[241]: {'#prd': 1}
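
The same filtering can probably be written more compactly with a dict comprehension. This is only a sketch: `word_counts` is a hypothetical stand-in with the same shape as dictionary_count above, and it uses startswith('#') rather than '#' in the key, which is slightly stricter because it ignores words that merely contain a '#' somewhere in the middle.

```python
# Hypothetical word counts, mirroring the shape of dictionary_count above.
word_counts = {
    "accessgtgtjust": 1,
    "sent": 1,
    "investigate": 1,
    "edit": 1,
    "#prd": 1,
    "getting": 1,
}

# Keep only the entries whose key begins with '#'; the original dict is untouched.
hashtag_only = {word: n for word, n in word_counts.items() if word.startswith("#")}
print(hashtag_only)
```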