Efficiently replace values in one column with values from another column in a Pandas DataFrame

Question

I have a Pandas DataFrame like this:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I want to replace the values in col1 with the values from the second column (col2), but only where col1 equals 0, and then, for any zeros that remain, do the same with the third column (col3). The desired result is:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3   0.4  0.4  0.4
4   0.3    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish this.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do this, using some other function instead of pd.replace?

Asked By: Pablo

Answer 1

I’m not sure if it’s faster, but you’re right that you can slice the dataframe to get your desired result.

df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4

Alternatively, if you want it to be more terse (though I don’t know if it’s faster), you can combine what you did with what I did.

df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4

Answered By: Andrew

Answer 2

Using np.where is faster. Using a similar pattern as you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
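For anyone who wants to run the nested version end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample in the question; the intermediate name `fallback` is only for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Value to use where col1 is 0: col2, unless col2 is also 0, then col3.
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])

# Keep col1 where it is non-zero, otherwise take the fallback value.
df["col1"] = np.where(df["col1"] == 0, fallback, df["col1"])

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

Row 5 stays at 0 because all three columns are 0 there, which matches the desired result in the question.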

Timings

Using the following setup to produce a larger sample DataFrame and timing functions:

df = pd.concat([df]*10**4, ignore_index=True)

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

I tried timing your method, but it’s been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.
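If you are not in IPython and cannot use %timeit, a rough equivalent can be sketched with the standard timeit module. The helper name `best_ms` and the `number`/`repeat` values are my own assumptions for illustration; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
# Same enlargement as in the setup above: 6 rows repeated 10**4 times.
big = pd.concat([small] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


def best_ms(func, frame, number=10, repeat=3):
    """Best average time per call in milliseconds (hypothetical helper)."""
    # Copy inside the timed call, as in the %timeit lines, so every run
    # starts from unmodified data.
    runs = timeit.repeat(lambda: func(frame.copy()), number=number, repeat=repeat)
    return min(runs) / number * 1000


timings = {f.__name__: best_ms(f, big) for f in (root_nested, root_split)}
for name, ms in timings.items():
    print(f"{name}: {ms:.2f} ms per call")
```

Note that the copy is included in the measured time, just as it is in the %timeit calls above.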

Answered By: root

Answer 3

An approach using pd.DataFrame.where and pd.DataFrame.bfill:

df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df

Another approach using np.argmax

def pir2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

I know there is a better way to use numpy to slice. I just can’t think of it at the moment.
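One possible alternative for the indexing step is np.take_along_axis (available in NumPy 1.15 and later), which picks one element per row given a column position. This is only a sketch, not necessarily the "better way" meant above, and like the argmax version it assumes the DataFrame contains only col1, col2, col3 in that priority order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# Column position of the first non-zero entry in each row.
# For an all-zero row argmax returns 0, so that row keeps its col1 value (0).
first_nonzero = (values != 0).argmax(axis=1)

# Pick values[i, first_nonzero[i]] for every row i.
picked = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

df["col1"] = picked
print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```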

Answered By: piRSquared

Answer 4

Generally speaking, there are three kinds of methods for this conditional replacement task:

numpy.where
pandas.Series.mask, or pandas.Series.where, which is the inverse of Series.mask
pandas.DataFrame.loc

You can try pandas.Series.mask

df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col2'])
df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col3'])

   col1  col2  col3
1   0.2   0.3   0.3
2   0.2   0.3   0.3
3   0.4   0.4   0.4
4   0.3   0.0   0.3
5   0.0   0.0   0.0
6   0.1   0.4   0.4

Or pandas.Series.where

df['col1'] = df['col1'].where(df['col1'].ne(0), df['col2'])
df['col1'] = df['col1'].where(df['col1'].ne(0), df['col3'])

Finally, you can use loc

df.loc[df['col1'].eq(0), 'col1'] = df['col2']
df.loc[df['col1'].eq(0), 'col1'] = df['col3']
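As a quick sanity check, the three variants can be run side by side on copies of the sample data to confirm they give the same result. The wrapper function names below are hypothetical and exist only for this comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def with_mask(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # mask replaces values where the condition is True.
        out["col1"] = out["col1"].mask(out["col1"].eq(0), out[other])
    return out


def with_where(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # where keeps values where the condition is True, replaces the rest.
        out["col1"] = out["col1"].where(out["col1"].ne(0), out[other])
    return out


def with_loc(frame):
    out = frame.copy()
    for other in ("col2", "col3"):
        # The right-hand Series is aligned on the index of the selected rows.
        out.loc[out["col1"].eq(0), "col1"] = out[other]
    return out


results = [with_mask(df), with_where(df), with_loc(df)]
print(results[0]["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
print(all(r.equals(results[0]) for r in results))  # True
```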

Answered By: Ynjxsjmh

Answer 5

Alternatively, you can use Series.combine, which applies a function element-wise to two Series:

replace_zeros = lambda x, y: y if x == 0 else x
df['col1'].combine(df['col2'], func=replace_zeros).combine(df['col3'], func=replace_zeros)

Output:

1    0.2
2    0.2
3    0.4
4    0.3
5    0.0
6    0.1
dtype: float64
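Since combine returns a new Series rather than modifying the DataFrame, the result has to be assigned back to update col1. A minimal sketch with the sample data from the question; because the function is called once per element in Python, I would not expect this to be among the faster options on large frames.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def replace_zeros(x, y):
    # Take the value from the other Series only when the current one is 0.
    return y if x == 0 else x


# Chain two combine calls and write the result back into col1.
df["col1"] = (
    df["col1"]
    .combine(df["col2"], func=replace_zeros)
    .combine(df["col3"], func=replace_zeros)
)

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```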

Answered By: rachwa
