Copy values to next rows based on multiple conditions without a for loop

Question:

I have a dataframe with 3 columns. For the most part, column C is empty. There is only a value for each instance when the value in column A is not equal to the value in column A on the previous row. As example:

A     B             C
123   10-11-2023    100
123   10-11-2023
123   19-11-2023
123   27-11-2023
456   08-12-2023    80
456   10-12-2023
456   14-12-2023
789   28-10-2023    130
789   28-10-2023
789   28-10-2023
789   03-11-2023

Now I want to copy the value in column C for to all rows where the value in column A and B equal the value in column A and B of the row where column C is not empty. Then the result should be:

A     B             C
123   10-11-2023    100
123   10-11-2023    100
123   19-11-2023
123   27-11-2023
456   08-12-2023    80
456   10-12-2023
456   14-12-2023
789   28-10-2023    130
789   28-10-2023    130
789   28-10-2023    130
789   03-11-2023

Currently I have this as a for loop:

non_empty_C = df0['C'].dropna()

for index, value in non_empty_C.iteritems():
    match_A = df0.loc[index, 'A']
    match_B = df0.loc[index, 'B']
    mask_match = (df0['A'] == match_A) & (df0['B'] == match_B)
    df0.loc[mask_match, 'C'] = df0['C'].ffill()

This works, but as I have a dataframe of up to 1 million rows it’s much too slow. I tried to rewrite this to vectorized operations, but I don’t exactly know how to do this, since I have the complexity of 2 conditions and comparing values to previous rows. Any suggestions?

Asked By: Prmake

||

Answers:

You can actually achieve this without using any loops. The key is to leverage the groupby and transform methods provided by pandas. These functions make it very easy to apply a function to each group of values, in your case, filling missing 'C' values with the first available value in each ('A', 'B') group.

import pandas as pd

data = {'A': [123, 123, 123, 123, 456, 456, 456, 789, 789, 789, 789],
        'B': ['10-11-2023', '10-11-2023', '19-11-2023', '27-11-2023', '08-12-2023', '10-12-2023', '14-12-2023', '28-10-2023', '28-10-2023', '28-10-2023', '03-11-2023'],
        'C': [100, None, None, None, 80, None, None, 130, None, None, None]}

df = pd.DataFrame(data)

# Vectorized operation to fill missing values in C based on the first value in each (A, B) group
df['C'] = df.groupby(['A', 'B'])['C'].transform('first')

print(df)
Answered By: DataJanitor
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.