Count the number of times a column value contains elements of a list in Python
Question:
I have a set like this
list = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
and a dataframe like this
ColA | ColB | ColC |
---|---|---|
ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN | 45 | Three |
TUY,XAL,MUI,AUS,OPP,YTE,ERT | 32 | Three |
I would like to count how many elements of each ColA value are in the set and store the result in two new columns: ColD for the number of unique matches and ColE for all occurrences. So far, I have been using
df['ColD'] = df['ColA'].apply(lambda x: sum(i in list for i in x))
but with no success. I would very much appreciate it if someone could help solve the issue. Thank you.
ColA | ColD | ColE |
---|---|---|
ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN | 2 | 3 |
TUY,XAL,MUI,AUS,OPP,YTE,ERT | 3 | 3 |
Answers:
You can split each string and explode it to get one token per row, then keep only the tokens that are in the set. Finally, group by the original index level and aggregate with nunique (counts each matching value once) and size (counts all occurrences):
s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
out = (df.join(df['ColA'].str.split(',').explode()
                         .loc[lambda x: x.isin(s)]
                         .groupby(level=0)
                         .agg(ColD='nunique', ColE='size')))
Output:
>>> out
ColA ColB ColC ColD ColE
0 ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN 45 Three 2 3
1 TUY,XAL,MUI,AUS,OPP,YTE,ERT 32 Three 3 3
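For a self-contained run, the sketch below rebuilds the question's frame and applies the same split/explode/groupby steps. The third row (which has no matches), the intermediate names `tokens` and `counts`, and the `fillna(0)` step are additions for illustration and not part of the original answer: rows without any match drop out of the groupby result, so the join leaves NaN in those rows unless it is filled.

```python
import pandas as pd

s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
df = pd.DataFrame({
    'ColA': ['ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN',
             'TUY,XAL,MUI,AUS,OPP,YTE,ERT',
             'JAX,ATL'],  # hypothetical row with no matches
    'ColB': [45, 32, 10],
    'ColC': ['Three', 'Three', 'Three'],
})

# One token per row; the original row label is repeated in the index
tokens = df['ColA'].str.split(',').explode()

# Keep matching tokens, then count unique and total matches per original row
counts = (tokens[tokens.isin(s)]
          .groupby(level=0)
          .agg(ColD='nunique', ColE='size'))

# Rows without matches are absent from counts, so the join yields NaN there
out = df.join(counts)
out[['ColD', 'ColE']] = out[['ColD', 'ColE']].fillna(0).astype(int)
print(out)
```

Whether 0 or NaN is the right value for rows without matches depends on how the counts are used downstream.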
You can write a function that keeps the elements of each value that exist in the set, then returns the unique count and the total count, as shown below:
st = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
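def count_items(x):
    # keep only the comma-separated tokens that are in the set
    lst = [item for item in x.split(',') if item in st]
    # (unique matches, all matches)
    return len(set(lst)), len(lst)
# apply returns a Series of tuples; tolist() turns it into rows of two values,
# one per target column
df[['ColD', 'ColE']] = df['ColA'].apply(count_items).tolist()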
print(df)
ColA ColB ColC ColD ColE
0 ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN 45 Three 2 3
1 TUY,XAL,MUI,AUS,OPP,YTE,ERT 32 Three 3 3
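The split step is also what the attempt in the question is missing: iterating over a string yields single characters, and no single character is in a set of three-letter codes. A minimal sketch of the difference on the question's first row; the names `row`, `targets`, `by_char`, `by_token` and `unique` are illustrative only.

```python
row = 'ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN'
targets = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}

# Iterating over the string itself compares single characters with the set
by_char = sum(ch in targets for ch in row)

# Splitting on commas first compares whole tokens
by_token = sum(tok in targets for tok in row.split(','))

# Set intersection gives the number of distinct matches
unique = len(targets.intersection(row.split(',')))

print(by_char, by_token, unique)  # 0 3 2
```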
Here is an option using pd.Series.str.count. We do '|'.join(s) to build a regex pattern from your set, for example 'AGB|ENN|YTE|XAL|TAP|MUI' (the order of the alternatives can vary because sets are unordered). The pipe is the OR operator in regex, which is what str.count uses, so we are essentially counting the number of times AGB OR ENN OR ... MUI appears in df['ColA'].
To get the unique count, we split the string into a list and reduce it to its unique values before using str.count.
Note that this regex matches substrings, so a value such as ENNN would also be counted as a match for ENN.
s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
df['D'] = df['ColA'].str.split(',').agg(set).astype(str).str.count('|'.join(s))
df['E'] = df['ColA'].str.count('|'.join(s))
ColA ColB ColC D E
0 ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN 45 Three 2 3
1 TUY,XAL,MUI,AUS,OPP,YTE,ERT 32 Three 3 3
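If substring matches such as ENNN are a concern, one possible variation (not part of the answer above) is to wrap the alternatives in word boundaries. The sketch below assumes the tokens are plain alphanumeric codes separated by commas; the sample values in `col` and the names `pattern`, `loose` and `strict` are made up for illustration.

```python
import re
import pandas as pd

s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}

# sorted() only makes the pattern deterministic; re.escape guards against
# codes that contain regex metacharacters
alternatives = '|'.join(re.escape(code) for code in sorted(s))
pattern = r'\b(?:' + alternatives + r')\b'

# Hypothetical values containing near-misses (ENNN, XENN)
col = pd.Series(['ENN,ENNN,TAP', 'XENN,MUI'])

loose = col.str.count(alternatives)   # counts substring hits as well
strict = col.str.count(pattern)       # counts whole tokens only

print(loose.tolist())   # [3, 2]
print(strict.tolist())  # [2, 1]
```

The word-boundary approach relies on the separator being a non-word character such as a comma; with other separators, splitting and comparing tokens is the safer route.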