Remove low frequency values from pandas.dataframe
Question:
How can I remove values that occur rarely, i.e. with a low frequency, from a column in a pandas.DataFrame? Example:
In [4]: df[col_1].value_counts()
Out[4]:
0       189096
1       110500
2        77218
3        61372
         ...
2065         1
2067         1
1569         1
dtype: int64
So, my question is: how do I remove values like 2065, 2067, 1569
and others? And how can I do this for ALL columns that have a .value_counts()
like this?
UPDATE: By ‘low’ I mean values like 2065
. This value occurs in col_1
only 1 (one) time, and I want to remove values like this.
Answers:
You probably don’t want to remove the entire row in your DataFrame if only one column has values below your threshold, so I’ve simply removed these data points and replaced them with None
.
I loop through each column and call value_counts
on each. I then get the index values for each item that occurs at or below the target threshold. Finally, I use .loc
to locate the rows containing these values in the column and replace them with None
.
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})
>>> df
   A  B  C
0  a  a  a
1  b  a  a
2  b  b  b
3  c  c  b
4  c  c  c
threshold = 1  # Remove items that occur this many times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    # Assign through df.loc to avoid chained-assignment warnings
    df.loc[df[col].isin(vals_to_remove), col] = None
>>> df
      A     B     C
0  None     a     a
1     b     a     a
2     b  None     b
3     c     c     b
4     c     c  None
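If you would rather drop any row that ended up with a None after this masking, you can follow the loop with dropna. A minimal sketch (df_clean is a name I've introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})

threshold = 1  # Remove items that occur this many times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    df.loc[df[col].isin(vals_to_remove), col] = None

# Keep only rows where every column survived the frequency filter
df_clean = df.dropna()
```

Here only rows 1 and 3 survive, since every other row contains at least one rare value.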
I see there are two ways you might want to do this.
For the entire DataFrame
This method removes the values that occur infrequently across the entire DataFrame. We can do it without loops, using built-in functions to speed things up.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10  # Values that occur this many times or fewer will be removed
value_counts = df.stack().value_counts()  # Counts across the entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df = df.replace(to_remove, np.nan)
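As a quick sanity check, you can confirm afterwards that no surviving value occurs at or below the threshold. A sketch, using an assumed fixed seed so the run is reproducible (the seed is not part of the original answer):

```python
import pandas as pd
import numpy as np

np.random.seed(0)  # assumed seed, for reproducibility only
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10
value_counts = df.stack().value_counts()
to_remove = value_counts[value_counts <= threshold].index
df = df.replace(to_remove, np.nan)

# Every value that remains must occur more than `threshold` times overall,
# because replace removed all instances of each rare value at once
remaining = df.stack().value_counts()
assert (remaining > threshold).all()
```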
Column-by-column
This method removes the entries that occur infrequently in each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10  # Values that occur this many times or fewer will be removed
for col in df.columns:
    value_counts = df[col].value_counts()  # Counts for this column only
    to_remove = value_counts[value_counts <= threshold].index
    # Assign back rather than using inplace replace on a column view
    df[col] = df[col].replace(to_remove, np.nan)
With apply
, this can be accomplished in one line. value_counts
returns the counts of unique values.
The occurrences are mapped onto each value with map
. where
replaces all values where the condition
is not met (here: the number of occurrences is greater than 1) with NaN.
In [6]: df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 'B': ['a', 'a', 'b', 'c', 'c'], 'C': ['a', 'a', 'b', 'b', 'c']})
In [7]: df
Out[7]:
   A  B  C
0  a  a  a
1  b  a  a
2  b  b  b
3  c  c  b
4  c  c  c
In [8]: df.apply(lambda col: col.where(col.map(col.value_counts()) > 1))
Out[8]:
     A    B    C
0  NaN    a    a
1    b    a    a
2    b  NaN    b
3    c    c    b
4    c    c  NaN
Credits to @Alexander for the dataset.
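If the goal is to drop the affected rows entirely rather than mask individual values, a groupby/transform sketch for a single column works as well (col_1 here stands in for your own column name; this variant is not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b', 'b', 'c', 'c']})

# Keep only rows whose value in col_1 appears more than once
mask = df.groupby('col_1')['col_1'].transform('count') > 1
filtered = df[mask]
```

Here the single occurrence of 'a' is dropped and the four rows with 'b' and 'c' remain. This is handy when a rare value in one column should disqualify the whole row.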