Dropping rows that match a specific condition
Question:
I have a dataset and I want to drop a few unusable rows. I built a filter for the condition under which I want the rows to be dropped:
filter = df.groupby(['Bairro'], group_keys=False, sort=True).size() > 1
print(filter.to_string())
Bairro
01     True
02    False
All the data for which the condition is False is useless. I've tried a few things, but none of them work.
So, I'd like the dataframe to keep only the rows where the condition is True:
Bairro
01    True
df2 = ((df.groupby(['Bairro']).size()) != 1)
I even planned to drop the values one by one, but that didn't work either:
df2 = df[~df.isin(['02']).any(axis=1)]
I also tried passing the filter as a condition:
df.drop(df[df.groupby(['Bairro'], group_keys=False, sort=True).size() > 1], inplace = True)
Answers:
It seems like the df.loc method could help you in this instance. In your example:
new_df = df.loc[df['col2'] == "True"]
Or if you would like to use multiple conditions:
new_df = df.loc[(df['col1'] == "True") & (df['col2'] == "True")]
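For the original 'Bairro' problem specifically, here is a sketch under the assumption that the goal is to keep rows whose 'Bairro' value appears more than once (the sample values mirror the question; the exact data is hypothetical). groupby(...).size() is indexed by group, not by row, which is why passing it straight to df.drop fails; transform('size') broadcasts each group's size back onto every row so it aligns with df:

```python
import pandas as pd

# Hypothetical sample mirroring the question's 'Bairro' column
df = pd.DataFrame({'Bairro': ['01', '01', '02']})

# transform('size') returns a Series aligned with df's rows,
# so it can be compared and used directly as a boolean mask
mask = df.groupby('Bairro')['Bairro'].transform('size') > 1
df2 = df.loc[mask]
```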
I think you're over-engineering your solution, so I've opted for a more detailed explanation of the answer.
One way to filter a dataframe is to simply subscript with a list/array of booleans. If the length of the array is the same as the length of the dataframe, this outputs a new dataframe containing only the rows aligned with the True values.
Here is an example:
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple']
})
df
Which outputs:
   numbers letters  colors
0        0       a     red
1        1       b    blue
2        2       c  yellow
3        3       d   green
4        4       e  purple
This is what I mean by subscripting with a boolean list (not sure if this is accepted terminology):
boolean_list = [True, True, False, True, False]
filtered_df = df[boolean_list]
filtered_df
Which outputs:
   numbers letters colors
0        0       a    red
1        1       b   blue
3        3       d  green
We can use simple comparisons to produce this boolean list from a dataframe:
df['numbers']>2
Outputs:
0    False
1    False
2    False
3     True
4     True
Name: numbers, dtype: bool
We can streamline the filtering with this redundant-looking piece of code:
df[df['numbers']>2]
Outputs:
   numbers letters  colors
3        3       d   green
4        4       e  purple
While it looks redundant, all we've done there is subscript with a list of booleans. As written, this does not change df at all; for that we would need to do df = df[filter_argument].
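To make that last point concrete, a minimal sketch (using a toy dataframe along the lines of the example above):

```python
import pandas as pd

df = pd.DataFrame({'numbers': [0, 1, 2, 3, 4]})

filtered = df[df['numbers'] > 2]  # new object; df itself is untouched
assert len(df) == 5               # original still has every row

df = df[df['numbers'] > 2]        # rebind df to actually keep the change
```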
For more complicated filtering we can use .apply() to get our list of booleans. Say we only want rows where the letter in ‘letters’ is present in the color in ‘colors’:
def letter_in_color(row):
    return row['letters'] in row['colors']

boolean_arr = df.apply(letter_in_color, axis=1)
print(boolean_arr)
0    False
1     True
2    False
3    False
4     True
dtype: bool
letter_in_color_df = df[boolean_arr]
letter_in_color_df
   numbers letters  colors
1        1       b    blue
4        4       e  purple
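As an aside, .apply(..., axis=1) runs a Python function once per row, which can be slow on large frames; the same mask can be built with a plain comprehension (a sketch using the example dataframe from above):

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple']
})

# Same row-wise test as letter_in_color, without .apply
boolean_arr = [letter in color
               for letter, color in zip(df['letters'], df['colors'])]
letter_in_color_df = df[boolean_arr]
```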
I gave this long explanation because, while the concept of filtering a df with a boolean array is quite simple, code which does that often looks weird or redundant, and it isn't clear what is really going on.
I hope you didn't stop reading there, because there is an important and powerful tool you can add to the above situations to preclude many errors and unexpected behaviors: .loc[]. This is a more explicit and powerful indexer, and in all of the above cases we can gain its benefits with very few changes:
df[boolean_array] becomes df.loc[boolean_array]
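One concrete benefit: .loc accepts a column label alongside the row mask, so you can both read and write the selected rows without chained indexing (a sketch on the example dataframe; the replacement value 'x' is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e']
})

mask = df['numbers'] > 2

# Reading: same rows either way
assert df[mask].equals(df.loc[mask])

# Writing: one step, avoiding the chained form df[mask]['letters'] = ...
df.loc[mask, 'letters'] = 'x'
```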
For more information about df.loc[] instead of df[] see this answer