Select only columns that have at most N unique values
Question:
I want to count the number of unique values in each column and select only those columns that have fewer than 32 unique values.
I tried using
df.filter(nunique<32)
and
df[[ c for df.columns in df if c in c.nunique<32]]
but because nunique is a method and not a function, they don't work. I thought len(set()) would work and tried
df.apply(lambda x : len(set(x))
but that doesn't work either. Any ideas, please? Thanks in advance!
Answers:
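nunique can be called on the entire DataFrame (you have to call it, i.e. write df.nunique() with parentheses). It returns the number of unique values per column, so you can then select the matching columns with a boolean mask in loc: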
df.loc[:, df.nunique() < 32]
Minimal Verifiable Example
import pandas as pd
df = pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})
df
A B
0 a a
1 b b
2 b a
3 c b
4 d a
5 e b
df.nunique()
A 5
B 2
dtype: int64
df.loc[:, df.nunique() < 3]
B
0 a
1 b
2 a
3 b
4 a
5 b
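For comparison, the attempts from the question can be repaired along these lines. This is only a sketch that reuses the example frame above; the names keep, counts, by_list, by_apply and by_nunique are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})

# List-comprehension version: iterate over the column names and
# call nunique() on each column (a Series) individually.
keep = [c for c in df.columns if df[c].nunique() < 3]
by_list = df[keep]

# apply version: count distinct values per column, then use the
# resulting Series as a boolean mask over the columns.
counts = df.apply(lambda col: len(set(col)))
by_apply = df.loc[:, counts < 3]

# The approach from the answer, for reference.
by_nunique = df.loc[:, df.nunique() < 3]
```

All three select the same column on this data. Note that nunique() ignores missing values by default (dropna=True), whereas len(set(col)) does not, so the two counts can differ on columns that contain NaN.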
If you want to do this in a method-chaining fashion, you can pass a callable to loc; it receives the DataFrame and returns the mask:
df.loc[:, lambda x: x.nunique() < 3]
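The callable form is mainly useful when the frame being filtered is an intermediate result with no name of its own. A minimal sketch, assuming the same example data; the threshold variable and the rename step are only there to illustrate a chain and are not from the original answer.

```python
import pandas as pd

threshold = 3  # illustrative cutoff; the question uses 32

result = (
    pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})
    .rename(columns=str.lower)  # some earlier step in the chain
    # x is the renamed frame, so the mask is computed on the
    # intermediate result rather than on a named variable.
    .loc[:, lambda x: x.nunique() < threshold]
)
```

Here result contains only the column b, because the lambda is evaluated against the frame as it exists at that point in the chain.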