drop rows from pandas dataframe using for loop and if statements
Question:
I am trying to clean a dataset, but I run into an error where red is not recognised, and I am not sure whether I have written the function correctly. Ideally, I want to drop rows based on length tolerances per colour. I am trying to write a function for this: I want to pass a colour, an upper tolerance and a lower tolerance, and have the out-of-tolerance rows removed from the dataset.
Thanks!
import pandas as pd

df = pd.DataFrame(
    {
        "Colour": [
            "Red",
            "Red",
            "Red",
            "Red",
            "Red",
            "Blue",
            "Blue",
            "Blue",
            "Green",
            "Green",
            "Green",
        ],
        "Length": [14, 15, 16, 20, 15, 15, 18, 17, 15, 19, 18],
    }
)

def tolerance_drop(Colour, Upper, Lower):
    for i in range(0, len(df)):
        if (df.loc[i, "Colour"] == Colour) & (df.loc[i, "Length"] > Upper):
            df.drop([i])
        elif (df.loc[i, "Colour"] == Colour) & (df.loc[i, "Length"] < Lower):
            df.drop([i])
        else:
            break

# should remove 2 red rows giving 9 remaining rows
tolerance_drop("Red", 19.150, 14.5)
print(df)
Output:
It simply prints the dataframe unchanged. No rows are deleted.
Answers:
As pointed out in the comments, there are better ways of doing this.
But if you are learning and want to know why your function doesn’t work, you should try this:
def tolerance_drop(Colour, Upper, Lower):
    for i in range(0, len(df)):
        if df.loc[i, "Colour"] == Colour and (df.loc[i, "Length"] > Upper or df.loc[i, "Length"] < Lower):
            df.drop([i], inplace=True)

tolerance_drop("Red", 19.150, 14.5)
print(df)
In your version, the break statement exits the for-loop as soon as it is reached, which happens at the first row that matches neither condition, so the remaining rows are never checked. You don't want that.
In Python, & is a bitwise operator with a different meaning. To combine plain True/False conditions like these, use and/or.
When you drop a row, the resulting dataframe is not saved back into the same variable: drop returns a new dataframe unless you pass the inplace=True argument.
Output:
    Colour  Length
1      Red      15
2      Red      16
4      Red      15
5     Blue      15
6     Blue      18
7     Blue      17
8    Green      15
9    Green      19
10   Green      18
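To make the point about drop concrete, here is a minimal sketch on a small made-up frame (three rows chosen for illustration, not the question's data). It shows that drop leaves the original untouched unless the result is reassigned, which is an alternative to inplace=True:

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the question's dataset)
df = pd.DataFrame({"Colour": ["Red", "Red", "Blue"], "Length": [14, 15, 18]})

# drop returns a new DataFrame; df itself still has all three rows
dropped = df.drop([0])
rows_before = len(df)        # 3
rows_in_result = len(dropped)  # 2

# Reassigning the result has the same visible effect as inplace=True
df = df.drop([0])
print(rows_before, rows_in_result, len(df))  # 3 2 2
```

Note that the index labels of the remaining rows are kept (here 1 and 2), which is why the loop version above can keep using df.loc[i, ...] for later values of i.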
Avoid explicit loops when you are able to apply pandas vectorized operations.
Simple filtering:
In [466]: df = df[~((df.Colour == 'Red') & ((df.Length > 19.150) | (df.Length < 14.5)))]
In [467]: df
Out[467]:
    Colour  Length
1      Red      15
2      Red      16
4      Red      15
5     Blue      15
6     Blue      18
7     Blue      17
8    Green      15
9    Green      19
10   Green      18
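Since the question asks for a reusable function, the same filter can be wrapped in one. This is only a sketch: the signature (passing the frame in as the first argument and returning a new frame instead of modifying a global) and the lowercase parameter names are my assumptions, not something from the question.

```python
import pandas as pd

def tolerance_drop(frame, colour, upper, lower):
    """Return a new DataFrame without the rows of `colour` whose
    Length is above `upper` or below `lower`. Other colours are kept."""
    is_colour = frame["Colour"] == colour
    out_of_tolerance = (frame["Length"] > upper) | (frame["Length"] < lower)
    # Element-wise & and | are correct here because these are boolean Series
    return frame[~(is_colour & out_of_tolerance)]

df = pd.DataFrame(
    {
        "Colour": ["Red"] * 5 + ["Blue"] * 3 + ["Green"] * 3,
        "Length": [14, 15, 16, 20, 15, 15, 18, 17, 15, 19, 18],
    }
)

cleaned = tolerance_drop(df, "Red", 19.150, 14.5)
print(len(cleaned))  # 9
```

Because the function returns a new frame, calls can be chained for several colours, e.g. tolerance_drop(tolerance_drop(df, "Red", 19.150, 14.5), "Blue", 17.5, 14.5), where the Blue limits are made up for illustration.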