Copy values to next rows based on multiple conditions without a for loop
Question:
I have a dataframe with 3 columns. For the most part, column C is empty. There is only a value for each instance when the value in column A is not equal to the value in column A on the previous row. As example:
A B C
123 10-11-2023 100
123 10-11-2023
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023
789 28-10-2023
789 03-11-2023
Now I want to copy the value in column C for to all rows where the value in column A and B equal the value in column A and B of the row where column C is not empty. Then the result should be:
A B C
123 10-11-2023 100
123 10-11-2023 100
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023 130
789 28-10-2023 130
789 03-11-2023
Currently I have this as a for loop:
non_empty_C = df0['C'].dropna()
for index, value in non_empty_C.iteritems():
match_A = df0.loc[index, 'A']
match_B = df0.loc[index, 'B']
mask_match = (df0['A'] == match_A) & (df0['B'] == match_B)
df0.loc[mask_match, 'C'] = df0['C'].ffill()
This works, but as I have a dataframe of up to 1 million rows it’s much too slow. I tried to rewrite this to vectorized operations, but I don’t exactly know how to do this, since I have the complexity of 2 conditions and comparing values to previous rows. Any suggestions?
Answers:
You can actually achieve this without using any loops. The key is to leverage the groupby
and transform
methods provided by pandas
. These functions make it very easy to apply a function to each group of values, in your case, filling missing 'C'
values with the first available value in each ('A', 'B')
group.
import pandas as pd
data = {'A': [123, 123, 123, 123, 456, 456, 456, 789, 789, 789, 789],
'B': ['10-11-2023', '10-11-2023', '19-11-2023', '27-11-2023', '08-12-2023', '10-12-2023', '14-12-2023', '28-10-2023', '28-10-2023', '28-10-2023', '03-11-2023'],
'C': [100, None, None, None, 80, None, None, 130, None, None, None]}
df = pd.DataFrame(data)
# Vectorized operation to fill missing values in C based on the first value in each (A, B) group
df['C'] = df.groupby(['A', 'B'])['C'].transform('first')
print(df)
I have a dataframe with 3 columns. For the most part, column C is empty. There is only a value for each instance when the value in column A is not equal to the value in column A on the previous row. As example:
A B C
123 10-11-2023 100
123 10-11-2023
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023
789 28-10-2023
789 03-11-2023
Now I want to copy the value in column C for to all rows where the value in column A and B equal the value in column A and B of the row where column C is not empty. Then the result should be:
A B C
123 10-11-2023 100
123 10-11-2023 100
123 19-11-2023
123 27-11-2023
456 08-12-2023 80
456 10-12-2023
456 14-12-2023
789 28-10-2023 130
789 28-10-2023 130
789 28-10-2023 130
789 03-11-2023
Currently I have this as a for loop:
non_empty_C = df0['C'].dropna()
for index, value in non_empty_C.iteritems():
match_A = df0.loc[index, 'A']
match_B = df0.loc[index, 'B']
mask_match = (df0['A'] == match_A) & (df0['B'] == match_B)
df0.loc[mask_match, 'C'] = df0['C'].ffill()
This works, but as I have a dataframe of up to 1 million rows it’s much too slow. I tried to rewrite this to vectorized operations, but I don’t exactly know how to do this, since I have the complexity of 2 conditions and comparing values to previous rows. Any suggestions?
You can actually achieve this without using any loops. The key is to leverage the groupby
and transform
methods provided by pandas
. These functions make it very easy to apply a function to each group of values, in your case, filling missing 'C'
values with the first available value in each ('A', 'B')
group.
import pandas as pd
data = {'A': [123, 123, 123, 123, 456, 456, 456, 789, 789, 789, 789],
'B': ['10-11-2023', '10-11-2023', '19-11-2023', '27-11-2023', '08-12-2023', '10-12-2023', '14-12-2023', '28-10-2023', '28-10-2023', '28-10-2023', '03-11-2023'],
'C': [100, None, None, None, 80, None, None, 130, None, None, None]}
df = pd.DataFrame(data)
# Vectorized operation to fill missing values in C based on the first value in each (A, B) group
df['C'] = df.groupby(['A', 'B'])['C'].transform('first')
print(df)