How do I filter a pandas DataFrame based on value counts?

Question:

I’m working in Python with a pandas DataFrame of video games, each with a genre. I’m trying to remove any video game with a genre that appears less than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can’t decipher the solution at all (possibly because I’ve never heard of R and my memory of functional programming is rusty at best).

Help?

Asked By: uchuujin

||

Answers:

Use groupby filter:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4

I recommend reading the split-combine-section of the docs.

Answered By: Andy Hayden

Solutions with better performance should be GroupBy.transform with size for count per groups to Series with same size like original df, so possible filter by boolean indexing:

df1 = df[df.groupby("A")['A'].transform('size') > 1]

Or use Series.map with Series.value_counts:

df1 = df[df['A'].map(df['A'].value_counts()) > 1]
Answered By: jezrael

@jezael solution works very well, Here is a different approach to filter based on values count :

For example, if the dataset is :

df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})

Convert and save the count as a dictionary

ount_freq = dict(df['a'].value_counts())

Create a new column and copy the target column, map the dictionary with newly created column

df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)

Now we have a new column with count freq, you can now define a threshold and filter easily with this column.

df[df.count_freq>1]
Answered By: Aaditya Ura

Additionlly, in case one wants to filter and have ‘count’ column:

attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)

#outputs rows with grouped 'A' count > 10 and columns ==> index, count, A

I might be a little late to this party but:

df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)

This is how you create a new dataframe and then just filter it…

df[df['A']>100]
Answered By: Nik Cadez
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.