How can I create a column in one DataFrame containing the count of matches in another DataFrame's column?

Question:

I have two pandas DataFrames. The first one, df1, contains a column of file paths and a column of lists of the users who have read access to each path. The second DataFrame, df2, contains a list of all possible users. I’ve created an example below:

df1 = pd.DataFrame()
df1['path'] = ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD']
df1['read'] = [['userA', 'userC', 'userD'], 
               ['userA', 'userB'], 
               ['userB', 'userD'], 
               ['userA', 'userB', 'userC', 'userD']]
print(df1)

       path                          read
0  C:/pathA         [userA, userC, userD]
1  C:/pathB                [userA, userB]
2  C:/pathC                [userB, userD]
3  C:/pathD  [userA, userB, userC, userD]
df2 = pd.DataFrame(data=['userA', 'userB', 'userC', 'userD'], columns=['user'])
print(df2)

    user
0  userA
1  userB
2  userC
3  userD

The end goal is to create a new column df2['read_count'], which should take each user string from df2['user'] and find the total number of matches in the column df1['read'].

The expected output is exactly that: a count of matches of each user string in the column of lists in df1['read']. Here is what I expect based on the example:

df2

    user  read_count
0  userA           3
1  userB           3
2  userC           2
3  userD           3

I tried putting something together based on another question, using a list comprehension, but had no luck. Here is what I currently have:

df2['read_count'] = [sum(all(val in cell for val in row)
                     for cell in df1['read'])
                     for row in df2['user']]
print(df2)

    user  read_count
0  userA           0
1  userB           0
2  userC           0
3  userD           0

What is wrong with the code I currently have? I’ve tried stepping through the loops and everything seemed right, yet my code never detects the matches I want.
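A note on the bug itself: in the comprehension above, `row` is a single user string, so the inner `for val in row` iterates over its *characters* ('u', 's', 'e', …), and `all(val in cell ...)` asks whether every character appears in the list of usernames, which is never true. Dropping the inner loop and testing membership directly gives the intended count (a minimal sketch using the question's example data):

```python
import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# 'row' is one user string; test its membership in each list directly
df2['read_count'] = [sum(row in cell for cell in df1['read'])
                     for row in df2['user']]
# df2['read_count'] is now [3, 3, 2, 3]
```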

Asked By: gojacketsanddawgs


Answers:

You can use:

df2.merge(df1['read'].explode().value_counts(),
          left_on='user', right_index=True)

Or, if you really need to use a different kind of aggregation that depends on "path":

df2.merge(df1.explode('read').groupby('read')['path'].count(),
          left_on='user', right_index=True)

output:

    user  path
0  userA     3
1  userB     3
2  userC     2
3  userD     3
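If the column should be named `read_count` as in the expected output, a `rename` can be chained onto the merge (a small sketch building on the snippet above):

```python
import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# one row per (path, user) pair, count paths per user, then rename the column
out = (df2.merge(df1.explode('read').groupby('read')['path'].count(),
                 left_on='user', right_index=True)
          .rename(columns={'path': 'read_count'}))
```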

Without df2:

df1['read'].explode().value_counts().reset_index()

# or
# df1.explode('read').groupby('read', as_index=False)['path'].count()

output:

   index  path
0  userA     3
1  userB     3
2  userC     2
3  userD     3
Answered By: mozway

Let’s use np.hstack to flatten the lists, then count the unique values with Counter and map the counts back to df2:

import numpy as np
from collections import Counter

# flatten the column of lists into one flat array, tally, then map per user
df2['count'] = df2['user'].map(Counter(np.hstack(df1.read)))

Result

    user  count
0  userA      3
1  userB      3
2  userC      2
3  userD      3
Answered By: Shubham Sharma
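The same mapping also works without NumPy: itertools.chain.from_iterable flattens the column of lists just as well (a standard-library-only sketch of the approach above):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# flatten the lists of lists, tally occurrences, then map each user to its count
counts = Counter(chain.from_iterable(df1['read']))
df2['count'] = df2['user'].map(counts)
```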