Only keep pandas columns where the value counts of all values are greater than some threshold
Question:
I need to drop columns where:
- The value_counts of any unique value is below some threshold
(s.value_counts() > THRESHOLD).all()
- OR number of unique values is greater than some other threshold
nunique() > OTHER_THRESH
I tried to use the approach from "Pandas: Get values from column that appear more than X times" to get the value counts across all columns, but I'm stuck on the indexing.
>>> test
col1 col2 a b c
col1
1 0.0 3 5.0 6.0 7.0
2 0.0 4 8.0 9.0 10.0
>>> test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
col1 True
col2 False
a False
b False
c False
>>> test[test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1).index]
col1 col2 a b c
col1
1 0.0 3 5.0 6.0 7.0
2 0.0 4 8.0 9.0 10.0
I wanted just col1 in the example, but got everything again. I could just iterate over column names:
>>> asdf = test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
>>> test[asdf[asdf == True].index]
col1
col1
1 0.0
2 0.0
But I'm not sure that this is the "correct"/standard way to do it (standard meaning efficient and legible). Assigning asdf to the result of an entire apply call and then extracting its index seems overly hacky/complicated. How can I use pandas more effectively here to ensure efficient computation?
Answers:
You just need to use .loc with the boolean mask on the column axis:
m = test.apply(lambda s: (s.value_counts() > 1).all() if s.nunique() < 3 else s.nunique() > 1)
test.loc[:, m]
Out[742]:
col1
col1
1 0.0
2 0.0
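A minimal, self-contained sketch of this approach (the frame, column names, and threshold below are illustrative, reconstructed from the question's example):

```python
import pandas as pd

# Hypothetical frame matching the question: col1's value repeats,
# every other column holds only unique values
test = pd.DataFrame({
    "col1": [0.0, 0.0],
    "col2": [3, 4],
    "a": [5.0, 8.0],
    "b": [6.0, 9.0],
    "c": [7.0, 10.0],
})

THRESHOLD = 1
# Boolean Series indexed by column name: True where every value
# in the column appears more than THRESHOLD times
m = test.apply(lambda s: (s.value_counts() > THRESHOLD).all())

# .loc[:, m] keeps only the columns where the mask is True;
# plain test[m.index] would keep everything, since .index is just
# the list of all column names regardless of the boolean values
kept = test.loc[:, m]
print(list(kept.columns))  # ['col1']
```

The key difference from the question's attempt is that `.loc[:, m]` interprets `m` as a boolean mask over columns, while indexing with `m.index` ignores the True/False values entirely.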
The code snippet below keeps those columns of a DataFrame where the number of unique values is less than or equal to a given threshold, say 20 unique values.
counter = []
for col in df.columns:
    # Keep column names whose cardinality is within the limit
    if df[col].nunique() <= 20:
        counter.append(col)
print(counter)
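The same filter can be written as a list comprehension; a small runnable sketch (the frame and the cutoff of 2 here are illustrative stand-ins for the 20-value threshold above):

```python
import pandas as pd

df = pd.DataFrame({
    "low_card": ["a", "b", "a", "b"],  # 2 unique values
    "high_card": [1, 2, 3, 4],         # 4 unique values
})

LIMIT = 2
# Keep column names whose unique-value count is within the limit
keep = [col for col in df.columns if df[col].nunique() <= LIMIT]
print(keep)  # ['low_card']

# df[keep] then gives the filtered frame
filtered = df[keep]
```

This avoids the explicit accumulator list while doing the same per-column nunique check.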