How can I count the number of occurrences of a given string in a string array in pandas?

Question:

I want to see which tags occur most frequently in my dataset. When I try to do this on my own, I get something like this:

df['tags'].value_counts()

['Startup'] 80
['Bitcoin'] 79
['The Daily Pick'] 78
['Addiction', 'Health', 'Body', 'Alcohol', 'Mental Health'] 62

Some articles have many tags, but I would like to count the occurrences of each tag separately.

Asked By: IvonaK


Answers:

IIUC, you need to use ast.literal_eval to parse each string into a list, then explode() to get one row per tag, and then value_counts().

from ast import literal_eval
import pandas as pd

res = df['tags'].apply(literal_eval).explode().value_counts()
print(res)

Output:

Startup      4
Bitcoin      3
Addiction    2
Health       2
Name: tags, dtype: int64

Sample input DataFrame:

df = pd.DataFrame({
    "tags" : [
        "['Startup']", "['Startup']", "['Startup']", "['Startup']",
        "['Bitcoin']", "['Bitcoin']", "['Bitcoin']", 
        "['Addiction', 'Health']", "['Addiction', 'Health']"
    ]
})

Thanks to @ljmc for pointing this out:

NB: ast.literal_eval is not always safe. From the docs:

This function had been documented as “safe” in the past without defining what that meant. That was misleading. This is specifically designed not to execute Python code, unlike the more general eval(). […] But it is not free from attack: A relatively small input can lead to memory exhaustion or to C stack exhaustion, crashing the process. There is also the possibility for excessive CPU consumption denial of service on some inputs. Calling it on untrusted data is thus not recommended.
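
If the strings come from a source you do not fully trust, one way to sidestep literal_eval is plain string splitting. This is only a minimal sketch, and it assumes that no tag contains a comma, a quote character, or a square bracket; an empty list string ("[]") would show up as an empty tag and would need to be filtered out separately.

```python
import pandas as pd

df = pd.DataFrame({
    "tags": [
        "['Startup']", "['Startup']",
        "['Bitcoin']",
        "['Addiction', 'Health']",
    ]
})

counts = (
    df["tags"]
    .str.strip("[]")      # drop the surrounding brackets
    .str.split(",")       # split into individual quoted tags
    .explode()            # one row per tag
    .str.strip(" '\"")    # remove surrounding whitespace and quotes
    .value_counts()
)
print(counts)
```

With this sample, Startup is counted twice and every other tag once.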

Answered By: I'mahdi

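You can use a collections.Counter and pass its update method to apply or agg on your series. This assumes the column holds actual lists, not their string representations.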

import pandas as pd
from collections import Counter

df = pd.DataFrame({
    "tags": [['Startup'], ["Bitcoin"], ["Startup", "Ethereum"]]
})

c = Counter()
df["tags"].apply(c.update)

c now contains:

Counter({'Startup': 2, 'Bitcoin': 1, 'Ethereum': 1})
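
Since apply is used above only for its side effect, a variant that builds the Counter in one step may read more clearly. A minimal sketch using only the standard library: tag_lists is a hypothetical stand-in for the column, and df["tags"] could be passed to chain.from_iterable the same way, assuming it holds actual lists.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df["tags"]; each element is a list of tags.
tag_lists = [["Startup"], ["Bitcoin"], ["Startup", "Ethereum"]]

# Flatten the lists of tags into one stream and count each tag.
counts = Counter(chain.from_iterable(tag_lists))

# most_common(n) returns the n most frequent tags with their counts.
print(counts.most_common(1))
```

most_common() without an argument returns every tag, ordered from most to least frequent.
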
Answered By: ljmc