Only keep pandas columns where the value counts of all values are greater than some threshold
Question:
I need to drop columns where:
- The value_counts of any unique value is below some threshold
(s.value_counts() > THRESHOLD).all()
- OR number of unique values is greater than some other threshold
nunique() > OTHER_THRESH
I tried to use the approach from "Pandas: Get values from column that appear more than X times" to get the value counts across all columns, but I'm stuck on the indexing.
>>> test
col1 col2 a b c
col1
1 0.0 3 5.0 6.0 7.0
2 0.0 4 8.0 9.0 10.0
>>> test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
col1 True
col2 False
a False
b False
c False
>>> test[test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1).index]
col1 col2 a b c
col1
1 0.0 3 5.0 6.0 7.0
2 0.0 4 8.0 9.0 10.0
I wanted just col1 in the example, but got everything again. I could just iterate over column names:
>>> asdf = test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
>>> test[asdf[asdf == True].index]
col1
col1
1 0.0
2 0.0
But I'm not sure that this is the "correct"/standard way to do it (standard meaning efficient and legible). Assigning asdf to the result of an entire apply call and then extracting its index seems overly hacky/complicated. How can I use pandas more effectively here to ensure efficient computation?
Answers:
You just need to use .loc with the boolean mask on the column axis:
m = test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
test.loc[:, m]
Out[742]:
col1
col1
1 0.0
2 0.0
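A minimal, self-contained sketch of this approach (the frame, column names, and threshold below are illustrative, reconstructed from the question's example):

```python
import pandas as pd

# Hypothetical frame matching the question: col1's value repeats,
# every other column holds only unique values
test = pd.DataFrame({
    "col1": [0.0, 0.0],
    "col2": [3, 4],
    "a": [5.0, 8.0],
    "b": [6.0, 9.0],
    "c": [7.0, 10.0],
})

THRESHOLD = 1
# Boolean Series indexed by column name: True where every value
# in the column appears more than THRESHOLD times
m = test.apply(lambda s: (s.value_counts() > THRESHOLD).all())

# .loc[:, m] keeps only the columns where the mask is True;
# plain test[m.index] would keep everything, since .index is just
# the list of all column names regardless of the boolean values
kept = test.loc[:, m]
print(list(kept.columns))  # ['col1']
```

The key difference from the question's attempt is that `.loc[:, m]` interprets `m` as a boolean mask over columns, while indexing with `m.index` ignores the True/False values entirely.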
The code snippet below keeps those columns of a DataFrame where the number of unique values is less than or equal to a given threshold, say 20 unique values.
counter = []
for col in df.columns:
    # Keep column names whose cardinality is within the limit
    if df[col].nunique() <= 20:
        counter.append(col)
print(counter)
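The same filter can be written as a list comprehension; a small runnable sketch (the frame and the cutoff of 2 here are illustrative stand-ins for the 20-value threshold above):

```python
import pandas as pd

df = pd.DataFrame({
    "low_card": ["a", "b", "a", "b"],  # 2 unique values
    "high_card": [1, 2, 3, 4],         # 4 unique values
})

LIMIT = 2
# Keep column names whose unique-value count is within the limit
keep = [col for col in df.columns if df[col].nunique() <= LIMIT]
print(keep)  # ['low_card']

# df[keep] then gives the filtered frame
filtered = df[keep]
```

This avoids the explicit accumulator list while doing the same per-column nunique check.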