Iterate through df and apply function if condition is met

Question:

Considering the df below:

import pandas as pd
import numpy as np
data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])

I want to iterate through each row and check whether the value in column C is greater than 8, in this way:

  • If the value in column C is less than 8, append that row to another df called df2

  • If the value in column C is greater than 8, apply a function that takes the max value of column B and adds it to the current value of C

This function looks like this:

def increase_value(max_value):
     max_value= max(df['B'])
     df['C'] = df['C'] + max_value
     return df

I tried to write a function that iterates over the df, takes each row, and applies the two conditions:

def check_value(df):
    df2 = pd.DataFrame(columns=df.columns)
    for index, row in df.iterrows():
        if row['C'] < 8:
            df2 = df2.append(row)
        else:
            max_b_value = max(df['B'])
            row = increase_value(max_b_value)
            df2 = df2.append(row)
    return df2
df2 = check_value(df)

This is not what I want, because it duplicates rows instead of just applying the increase_value function.

I know there are other ways of doing this, but I need to use those 2 functions. Can someone please tell me what I am doing wrong? Also, is this the best way to extract each row from the df?

Expected output:
For the last row, 9 is greater than 8, so the increase_value function should be applied. The value of column C in the last row then becomes 9 + 8 = 17.

----------------
| A |  B | C   |
| 3 |  5 | 7   |  
| 2 |  4 | 6   |
| 5 |  8 | 17  |
----------------
Asked By: user3619789


Answers:

"Also is this the best way to extract each row from the df?"

You should make use of vectorisation rather than iteration. Try the following instead:

df2 = pd.DataFrame({
    'A': [0],
    'B': [0],
    'C': [0],
})

df2 = pd.concat([
    df2,
    df.loc[df['C'] < 8],
])

And for the other task:

df['C'] = np.where(
    df['C'] > 8,
    df['C'] + max(df['B']),
    df['C']
)

Output:

print(df)

   A  B   C
0  3  5   7
1  2  4   6
2  5  8  17

print(df2)

   A  B  C
0  0  0  0
0  3  5  7
1  2  4  6
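
For comparison, here is a minimal sketch of the same idea written with a boolean mask and .loc instead of np.where. It assumes the df from the question and skips the placeholder row of zeros; the name mask is only for illustration.

```python
import pandas as pd

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Rows with C below 8, copied so later edits to df do not affect df2
df2 = df.loc[df['C'] < 8].copy()

# Add the maximum of B to C only in the rows where C is above 8
mask = df['C'] > 8
df.loc[mask, 'C'] = df.loc[mask, 'C'] + df['B'].max()

print(df)
print(df2)
```

Taking the copy before modifying df keeps df2 independent of the later update.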
Answered By: sharmu1

If you want to keep your functions, use:

def increase_value(row, max_value):
     row['C'] = row['C'] + max_value
     return row

def check_value(df):
    max_b_value = df['B'].max()
    data = []
    for index, row in df.iterrows():
        if row['C'] < 8:
            data.append(row)
        else:
            row = increase_value(row, max_b_value)
            data.append(row.copy())
    return pd.concat(data, axis=1).T

df2 = check_value(df)

Output:

>>> df2
   A  B   C
0  3  5   7
1  2  4   6
2  5  8  17
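
If the explicit loop is not a hard requirement, the same two functions could also be driven by df.apply with axis=1. This is only a sketch under that assumption, not part of the code above; increase_value copies the row first so the original df is left as it was.

```python
import pandas as pd

def increase_value(row, max_value):
    # Work on a copy so the caller's row is not modified
    row = row.copy()
    row['C'] = row['C'] + max_value
    return row

def check_value(df):
    max_b_value = df['B'].max()
    # Rows with C below 8 pass through unchanged; the rest get increased
    return df.apply(
        lambda row: row if row['C'] < 8 else increase_value(row, max_b_value),
        axis=1,
    )

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

result = check_value(df)
print(result)
```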
Answered By: Corralien

I took the liberty of refactoring and annotating sharmu1's more concise and performant answer into functions with similar input/output:

def check_value_new(df):
    temp_df = df.loc[df['C'] < 8] # temp_df that only has rows that meet this condition
    df2 = temp_df.copy() # create a copy to set as df2, don't need to use concat
    return df2

def add_max(df):
    max_b_val = df['B'].max()
    
    df['C'] = np.where(
        df['C'] >= 8, # 'if' column "C" >= 8
        df['C'] + max_b_val, # add max value of b columns
        df['C'] # else: stay the same
    )
    return df

data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C']) 

# get df2
df2 = check_value_new(df)

# process df
df = add_max(df)

output (printed):

df:
   A  B   C
0  3  5   7
1  2  4   6
2  5  8  17

df2:
   A  B  C
0  3  5  7
1  2  4  6

solution for original functions:

import pandas as pd

def check_value(df):
    df2_list = [] # to be converted into dataframe
    max_b_value = df['B'].max()
    for index, row in df.iterrows():
        if row['C'] < 8:
            df2_list.append(row)
            # only append if you want to add to df2
        else:
            # write the increased value back to df explicitly; 'row' may be
            # a copy, so editing it is not guaranteed to change df
            df.at[index, 'C'] = row['C'] + max_b_value
            # no append here because this row does not go into df2
    df2 = pd.DataFrame(df2_list)
    return df2

data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])

df2 = check_value(df)

output (printed):

df:
   A  B   C
0  3  5   7
1  2  4   6
2  5  8  17

df2:
   A  B  C
0  3  5  7
1  2  4  6

Main reasons why your original code is not working:

1 increase_value()

Your increase_value() function updates column C of the whole df and returns the entire df every time. In check_value(), that returned df is what gets appended to df2, so every row that hits the else branch adds a full copy of df instead of a single row. That is where the duplicated rows come from.

2 df.append() has been deprecated in pandas (and removed in pandas 2.0).

A better way to create a new dataframe (df2) from rows of an existing one is to collect the rows in a list (list_of_rows), then instantiate a dataframe using pd.DataFrame(list_of_rows).

3 setting and editing of global variables

You're probably running this code in a Jupyter notebook, so you are used to being able to access your variables from any cell. In a typical Python script, relying on global variables (like your df) inside functions is only needed in specific cases. It is usually clearer to pass the df in as an argument and return the result.
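
The three points above can be seen in a small sketch. The first part reruns increase_value() exactly as written in the question to show what it returns. The second part collects rows in a list and takes the frame as a parameter; build_df2, frame and threshold are hypothetical names used only for illustration.

```python
import pandas as pd

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Point 1: the question's increase_value() edits the global df and
# returns the whole DataFrame, not a single row.
def increase_value(max_value):
    max_value = max(df['B'])
    df['C'] = df['C'] + max_value
    return df

returned = increase_value(8)
print(returned.shape)  # (3, 3): three rows come back, not one

# Points 2 and 3: collect rows in a list, build the DataFrame once,
# and pass the frame in as an argument instead of using a global.
def build_df2(frame, threshold=8):
    rows = [row for _, row in frame.iterrows() if row['C'] < threshold]
    return pd.DataFrame(rows, columns=frame.columns)

fresh = pd.DataFrame(data, columns=['A', 'B', 'C'])
df2 = build_df2(fresh)
print(df2)
```

Appending that (3, 3) frame inside the loop is what produced the extra rows in the original check_value().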

Answered By: marcus jwt