Iterate through df and apply function if condition is met
Question:
Considering the df below:
import pandas as pd
import numpy as np
data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])
I want to iterate through each row and check whether the value in column C is greater than 8, as follows:
- If the value in column C is less than 8, append that row to another df called df2.
- If the value in column C is greater than 8, apply a function that takes the max value of column B and adds it to the current value.
This function looks like this:
def increase_value(max_value):
    max_value = max(df['B'])
    df['C'] = df['C'] + max_value
    return df
I tried to write a function that iterates over the df, takes each row and applies the two conditions:
def check_value(df):
    df2 = pd.DataFrame(columns=df.columns)
    for index, row in df.iterrows():
        if row['C'] < 8:
            df2 = df2.append(row)
        else:
            max_b_value = max(df['B'])
            row = increase_value(max_b_value)
            df2 = df2.append(row)
    return df2

df2 = check_value(df)
This is not what I want because it duplicates rows instead of just applying the increase_value function.
I know there are other ways of doing this, but I need to use those two functions. Can someone please tell me what I am doing wrong? Also, is this the best way to extract each row from the df?
Expected output:
For the last row, 9 is greater than 8, so the increase_value function should be applied. The value of column C in the last row will then be 9 + 8 = 17:
----------------
| A | B | C |
| 3 | 5 | 7 |
| 2 | 4 | 6 |
| 5 | 8 | 17 |
----------------
Answers:
"Also is this the best way to extract each row from the df?"
You should make use of vectorisation rather than iteration. Try the following instead:
df2 = pd.DataFrame({
    'A': [0],
    'B': [0],
    'C': [0],
})
df2 = pd.concat([
    df2,
    df.loc[df['C'] < 8],
])
And for the other task:
df['C'] = np.where(
    df['C'] > 8,
    df['C'] + max(df['B']),
    df['C']
)
Output:
print(df)
A B C
0 3 5 7
1 2 4 6
2 5 8 17
print(df2)
A B C
0 0 0 0
0 3 5 7
1 2 4 6
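As a variation on the snippets above, the same result can probably be reached with a boolean mask and a .loc assignment in place of np.where, which also avoids the placeholder row of zeros in df2. This is only a sketch: it assumes df2 should hold the untouched rows with C below 8, and the name mask is introduced here for illustration.

```python
import pandas as pd

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Rows with C below the threshold, copied before df is modified.
df2 = df.loc[df['C'] < 8].copy()

# Boolean mask (hypothetical name) selecting the rows to increase.
mask = df['C'] > 8

# Add the max of column B only to the selected rows of column C.
df.loc[mask, 'C'] += df['B'].max()
```

Because only the masked rows are written, the other values in column C are left as they were.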
If you want to keep your functions, use:
def increase_value(row, max_value):
    row['C'] = row['C'] + max_value
    return row

def check_value(df):
    max_b_value = df['B'].max()
    data = []
    for index, row in df.iterrows():
        if row['C'] < 8:
            data.append(row)
        else:
            row = increase_value(row, max_b_value)
            data.append(row.copy())
    return pd.concat(data, axis=1).T

df2 = check_value(df)
Output:
>>> df2
A B C
0 3 5 7
1 2 4 6
2 5 8 17
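If the explicit loop is not a hard requirement, the same two-function structure could also be driven by DataFrame.apply with axis=1. The following is a minimal sketch rather than a definitive implementation: the row.copy() inside increase_value and the lambda are additions made here so that the input df is left unchanged.

```python
import pandas as pd

def increase_value(row, max_value):
    # Work on a copy so the caller's data is not modified (an addition in this sketch).
    row = row.copy()
    row['C'] = row['C'] + max_value
    return row

def check_value(df):
    max_b_value = df['B'].max()
    # Keep rows with C < 8 as they are; pass the others through increase_value.
    return df.apply(
        lambda row: row if row['C'] < 8 else increase_value(row, max_b_value),
        axis=1,
    )

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
result = check_value(df)
```

This is still a row-by-row operation, so on large frames the vectorised version is likely to be faster.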
I took the liberty of refactoring and annotating sharmu1's more concise and performant answer so that its functions have inputs and outputs similar to yours:
def check_value_new(df):
    temp_df = df.loc[df['C'] < 8]  # temp_df that only has rows that meet this condition
    df2 = temp_df.copy()  # create a copy to set as df2, don't need to use concat
    return df2

def add_max(df):
    max_b_val = df['B'].max()
    df['C'] = np.where(
        df['C'] >= 8,         # 'if' column "C" >= 8
        df['C'] + max_b_val,  # add max value of column B
        df['C']               # else: stay the same
    )
    return df

data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])
# get df2
df2 = check_value_new(df)
# process df
df = add_max(df)
output (printed):
df:
A B C
0 3 5 7
1 2 4 6
2 5 8 17
df2:
A B C
0 3 5 7
1 2 4 6
Solution for the original functions:
import pandas as pd

def check_value(df):
    df2_list = []  # to be converted into a dataframe
    max_b_value = df['B'].max()
    for index, row in df.iterrows():
        if row['C'] < 8:
            # only append if you want to add the row to df2
            df2_list.append(row)
        else:
            # write the increased value back to df; `row` may be a copy,
            # so changing it alone is not guaranteed to update df
            df.at[index, 'C'] = row['C'] + max_b_value
            # no need to append because this row is not added to df2
    df2 = pd.DataFrame(df2_list)
    return df2

data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])
df2 = check_value(df)
output (printed):
df:
A B C
0 3 5 7
1 2 4 6
2 5 8 17
df2:
A B C
0 3 5 7
1 2 4 6
Main reasons why your original code is not working:
1. increase_value(): your increase_value() function returns the whole df every time, and that df is then appended to df2 in place of the row. So every row that reaches the else branch is replaced by a copy of the entire df, which is where the duplicated rows come from.
2. df.append() was deprecated in pandas 1.4 and removed in pandas 2.0. A better way to create a new dataframe (df2) from rows of an existing one is to collect them in a list_of_rows and then instantiate a dataframe with pd.DataFrame(list_of_rows).
3. Setting and editing global variables: you're probably running this code in a Jupyter notebook, so you are used to being able to access your variables from any cell. In a typical Python script, setting global variables (like your df) is only required in specific cases.
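The list-of-rows approach from point 2 can be sketched as follows. It is an illustration only: the names rows and from_rows are made up here, and the threshold of 8 is taken from the question.

```python
import pandas as pd

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Collect the wanted rows in a plain Python list instead of calling df.append().
rows = []
for index, row in df.iterrows():
    if row['C'] < 8:
        rows.append(row)

# Build the new dataframe once, after the loop; each Series becomes one row
# and keeps its original index label.
from_rows = pd.DataFrame(rows)
```

Building the frame once at the end also avoids copying the growing dataframe on every iteration, which is what repeated appends would do.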