How to only keep rows in a Pandas DataFrame based on its count in a given column

Question:

I have a Pandas DataFrame with categorical data in one of its columns. Calling value_counts() on that column gives something like:

HR                          176
Coding                       81
Reject                       74
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10
Medical Science               9
Core Mechanical               8
Web Development               4
Puzzles                       3
behavioural                   3
not a question                2
civil engineering             1
Mathematics                   1
Finance, Medical Science      1
Sales, HR                     1

I’d like to keep only the categories with a count >= some threshold (e.g. 10). All the smaller categories should be lumped together into a separate "Other" category, i.e. the result should look like:

HR                          176
Coding                       81
Reject                       74
Other                        33
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10

I’ve done this in the past by hacking together a defaultdict(int) and only taking the instances where count >= threshold. Is there a canonical Pandas way of achieving the same thing?

Asked By: Abirbhav G.


Answers:

Is this existing question what you’re looking for?

Pandas: Selecting rows based on value counts of a particular column

If not, maybe this is what you want:

import pandas as pd

data = pd.DataFrame(
    [["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]],
    columns=["category", "count"],
)
filter_value = 10

# Split on the threshold; .copy() avoids SettingWithCopyWarning when adding "tag".
d1 = data[data["count"] >= filter_value].copy()
d2 = data[data["count"] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
     category  count            tag
0  researcher    150  filter_passed
1  politician     15  filter_passed
2     builder      1         Others
3     teacher      5         Others
Answered By: bvittrant

I would use a mask to perform boolean indexing and concat:

# s is the Series returned by value_counts() on the column
m = s >= 10
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
      )

output:

HR                         176
Coding                      81
Reject                      74
Others                      33
Database Administration     21
Finance                     17
Project Management          16
Sales                       15
DevOps                      13
Core Electronics            10
Networking                  10
Answered By: mozway
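
If the goal is to relabel the rows of the DataFrame itself, rather than the value_counts() output, one possible sketch is to map each row to the count of its own category and then use Series.where. The DataFrame `df`, the column names `category` and `category_grouped`, the sample data and the threshold below are all assumptions made for illustration; they are not taken from the question.

```python
import pandas as pd

# Hypothetical data: a DataFrame with one categorical column named "category".
df = pd.DataFrame(
    {"category": ["HR"] * 5 + ["Coding"] * 3 + ["Puzzles"] * 2 + ["Mathematics"]}
)
threshold = 3  # assumed cutoff: keep categories seen at least this many times

# For every row, look up how often that row's category occurs in the column.
counts = df["category"].map(df["category"].value_counts())

# Keep the original label where the count meets the threshold, else "Other".
df["category_grouped"] = df["category"].where(counts >= threshold, "Other")

# Counting the relabelled column gives the clubbed summary.
grouped_counts = df["category_grouped"].value_counts()
print(grouped_counts)
```

Writing the result to a new column keeps the original labels available; assigning back to `df["category"]` instead would overwrite them in place.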