Python: Removing Rows on Count condition
Question:
I have a problem filtering a pandas DataFrame.
city
NYC
NYC
NYC
NYC
SYD
SYD
SEL
SEL
...
df.city.value_counts()
I would like to remove the rows for cities that appear fewer than 4 times, which would be SYD and SEL in this example.
How can I do this without manually dropping them city by city?
Answers:
This is one way, using pd.Series.value_counts.
counts = df['city'].value_counts()
res = df[~df['city'].isin(counts[counts < 4].index)]
counts is a pd.Series object. counts < 4 returns a Boolean series. We filter the counts series by the Boolean counts < 4 series (that’s what the square brackets achieve). We then take the index of the resulting series to find the cities that occur fewer than 4 times. ~ is the negation operator, so the isin mask is inverted and only the rows of the remaining cities are kept.
Remember that a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but uniqueness is guaranteed for the output of value_counts.
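To make the steps concrete, here is a minimal sketch run on data modeled on the question. The sample frame is an assumption (the rows hidden behind "..." are left out), and the name rare is introduced only for illustration.

```python
import pandas as pd

# Assumed sample data: only the rows visible in the question
df = pd.DataFrame({"city": ["NYC"] * 4 + ["SYD"] * 2 + ["SEL"] * 2})

counts = df["city"].value_counts()   # one entry per distinct city
rare = counts[counts < 4].index      # cities seen fewer than 4 times
res = df[~df["city"].isin(rare)]     # keep rows whose city is not rare

print(sorted(rare))                  # ['SEL', 'SYD']
print(res["city"].tolist())          # ['NYC', 'NYC', 'NYC', 'NYC']
```

The original row index is preserved, so res still carries the labels 0 to 3 here.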
I think you’re looking for value_counts()
# Import the great and powerful pandas
import pandas as pd
# Create some example data
df = pd.DataFrame({
'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
})
# Get the count of each value
value_counts = df['city'].value_counts()
# Select the values where the count is 3 or less (adjust the threshold as needed)
to_remove = value_counts[value_counts <= 3].index
# Keep rows where the city column is not in to_remove
df = df[~df.city.isin(to_remove)]
Here is a solution using groupby with filter:
df.groupby('city').filter(lambda x: len(x) > 3)
Out[1743]:
city
0 NYC
1 NYC
2 NYC
3 NYC
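If the same filtering is needed in several places, the filter call can be wrapped in a small helper. This is only a sketch: drop_rare and its parameters are hypothetical names, not part of pandas, and the sample data is assumed from the question.

```python
import pandas as pd

def drop_rare(df: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    # filter() calls the function once per group and keeps the groups where it returns True
    return df.groupby(column).filter(lambda group: len(group) >= min_count)

# Assumed sample data: only the rows visible in the question
df = pd.DataFrame({"city": ["NYC"] * 4 + ["SYD"] * 2 + ["SEL"] * 2})
kept = drop_rare(df, "city", min_count=4)
print(kept["city"].tolist())  # ['NYC', 'NYC', 'NYC', 'NYC']
```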
A second solution uses transform:
sub_df = df[df.groupby('city').city.transform('count') > 3].copy()
# .copy() avoids a SettingWithCopyWarning if you later modify sub_df
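It may help to look at what transform('count') produces before it is used as a mask. The sketch below assumes the sample data from the question; group_sizes and mask are names chosen for illustration.

```python
import pandas as pd

# Assumed sample data: only the rows visible in the question
df = pd.DataFrame({"city": ["NYC"] * 4 + ["SYD"] * 2 + ["SEL"] * 2})

# transform('count') returns one value per original row:
# the number of rows in that row's group, aligned to df's index
group_sizes = df.groupby("city")["city"].transform("count")
print(group_sizes.tolist())  # [4, 4, 4, 4, 2, 2, 2, 2]

mask = group_sizes > 3       # Boolean series, True for rows to keep
sub_df = df[mask].copy()
print(sub_df["city"].tolist())  # ['NYC', 'NYC', 'NYC', 'NYC']
```

Because the result is aligned to the original index, it can be used directly for Boolean indexing without any merge or lookup step.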
Another solution:
threshold = 4
# Per-row count of how often that row's city occurs
df['Count'] = df.groupby('city')['city'].transform('count')
df = df[df['Count'] >= threshold]
df = df.drop(columns=['Count'])
print(df)
city
0 NYC
1 NYC
2 NYC
3 NYC