pandas value_counts(): directly compare two instances

Question:

I have done .value_counts() on two dataFrames (similar column) and would like to compare the two.
I also tried with converting the resulting Series to dataframes (.to_frame('counts') as suggested in this thread), but it doesn’t help.

first = df1['company'].value_counts()
second = df2['company'].value_counts()

I tried to merge but I think the main problem is that I dont have the company name as a column but its the index (?). Is there a way to resolve it or to use a different way to get the comparison?

GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).

Asked By: ArieAI

||

Answers:

To compare the value counts between the two data frames, you can use the merge function and specify the left_index and right_index parameters as True. This will use the index values (in this case, the company names) as the keys to join the two data frames. You can then specify the how parameter as 'outer' to get the union of the two value counts, and use the fillna function to fill any missing values with 0.

Here’s an example of how you can do this:

merged = first.to_frame('first').merge(second.to_frame('second'), left_index=True, right_index=True, how='outer').fillna(0)

This will give you a new data frame with the company names as the index and two columns 'first' and 'second' containing the value counts for each data frame. You can then use the apply function to compare the two value counts and create a new column with the difference:

merged['difference'] = merged.apply(lambda x: x['second'] - x['first'], axis=1)

You can then use the sort_values function to sort the data frame by the 'difference' column to see which companies occur more in df2 than in df1.

sorted_df = merged.sort_values('difference', ascending=False)
Answered By: Jan

You might use collections.Counter ability to subtract as follows

import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)

gives output

Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})

Explanation: where value is positive means more instances are in df1, where value is zero then number is equal, where value is negative means more instances are in df2.

Answered By: Daweo

Use from this code

df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
df = pd.pivot_table(df, index=['company'], columns=['x'], aggfunc={'values': 'sum'}).reset_index()

Now filter on df for related data

Answered By: Alireza
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.