Efficiently replacing values in each row of pandas dataframe based on condition

Question:

I would like to transform a pandas DataFrame into an unusual but required output format. In each row, every value of 0.0 should be replaced with an empty string ('') and every value of 1.0 with that row's index label. Every value in the frame is either 1.0 or 0.0.

Here’s some example data:

import pandas as pd

# starting df
df = pd.DataFrame.from_dict({'A':[1.0,0.0,0.0],'B':[1.0,1.0,0.0],'C':[0.0,1.0,1.0]})
df.index=['x','y','z']
print(df)

What the input df looks like:

     A    B    C
x  1.0  1.0  0.0
y  0.0  1.0  1.0
z  0.0  0.0  1.0

What I would like the output df to look like:

   A  B  C
x  x  x   
y     y  y
z        z

So far I’ve got this pretty inefficient but seemingly working code:

for idx in df.index:
    df.loc[idx] = df.loc[idx].map(str).replace('1.0',str(idx))
    df.loc[idx] = df.loc[idx].map(str).replace('0.0','')

Could anyone please suggest an efficient way to do this?

The real data frame I’ll be working with has a shape of (4548, 2044) and the values will always be floats (1.0 or 0.0), like in the example. I’m manipulating the usher_barcodes.csv data from "raw.githubusercontent.com/andersen-lab/Freyja/main/freyja/data/…" into a format required by another pipeline, where the column headers are lineage names and the values are mutations (taken from the index). The column headers and index values will likely be different each time I need to run this code because the lineage assignments are constantly changing.

Thanks!

Answers:

You can simply do:

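df = df.astype(object)  # object dtype lets cells hold strings; recent pandas warns when strings are written into float columns
for idx, row in df.iterrows():
    df.loc[idx] = ['' if val == 0 else idx for val in row]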

which gives:

   A  B  C
x  x  x   
y     y  y
z        z

Use numpy.where, broadcasting the index (converted to a NumPy array and reshaped into a column) across the columns:

import numpy as np

df = pd.DataFrame(np.where(df.eq(1),
                           df.index.to_numpy()[:, None],
                           ''),
                  index=df.index,
                  columns=df.columns)

print(df)
   A  B  C
x  x  x   
y     y  y
z        z
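
To make the broadcasting step explicit, here is a minimal sketch on the question's example data. The names mask, labels and out are only illustrative and do not appear in the answer above. The index is reshaped into a (3, 1) column, which np.where stretches across the (3, 3) boolean mask so that each row is filled with its own label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 0.0, 0.0], "B": [1.0, 1.0, 0.0], "C": [0.0, 1.0, 1.0]},
    index=["x", "y", "z"],
)

mask = df.eq(1).to_numpy()             # boolean array, shape (3, 3)
labels = df.index.to_numpy()[:, None]  # one label per row, shape (3, 1)

# The (3, 1) label column is broadcast across all columns of the mask;
# cells where the mask is False receive the empty string instead.
out = pd.DataFrame(np.where(mask, labels, ""), index=df.index, columns=df.columns)

print(mask.shape, labels.shape)
print(out)
```

Without the [:, None] reshape the labels would have shape (3,) and would be broadcast along the columns instead of the rows, which only happens to run here because the example frame is square.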

Performance on data of shape (4548, 2044):

np.random.seed(2023)
df = pd.DataFrame(np.random.choice([0.0,1.0], size=(4548,2044))).add_prefix('c')
df.index = df.index.astype(str) + 'r'
# print (df)

In [87]: %timeit df.eq(1).mul(df.index, axis=0)
684 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit pd.DataFrame(np.where(df.eq(1),df.index.to_numpy()[:, None],''),index = df.index, columns = df.columns)
449 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
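
Outside IPython, a comparison along these lines could be run with the standard timeit module. This is only a sketch: the helper names with_loop and with_where are hypothetical, the frame is deliberately much smaller than (4548, 2044) so that the row loop finishes quickly, and the timings it prints will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(2023)
# Small stand-in for the real (4548, 2044) frame
small = pd.DataFrame(rng.choice([0.0, 1.0], size=(200, 50))).add_prefix("c")
small.index = small.index.astype(str) + "r"


def with_loop(frame):
    # Pure Python row loop: build each output row with a list comprehension
    rows = [
        [idx if val == 1 else "" for val in row]
        for idx, row in zip(frame.index, frame.to_numpy())
    ]
    return pd.DataFrame(rows, index=frame.index, columns=frame.columns)


def with_where(frame):
    # Vectorized variant: broadcast the index column across the boolean mask
    values = np.where(frame.eq(1), frame.index.to_numpy()[:, None], "")
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)


# Check that both variants agree cell by cell before timing them
same = bool((with_loop(small).to_numpy() == with_where(small).to_numpy()).all())
print("identical results:", same)

for fn in (with_loop, with_where):
    seconds = timeit.timeit(lambda: fn(small), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```
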
Answered By: jezrael

Take advantage of the fact that 1*'x' -> 'x' and 0*'x' -> '':

out = df.eq(1).mul(df.index, axis=0)

NB. eq(1) converts the floats to booleans, and True behaves like 1 in the multiplication. If the data only contains 0.0 and 1.0, astype(int) would also work.

Output:

   A  B  C
x  x  x   
y     y  y
z        z
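
The string multiplication this relies on can be checked in plain Python, without pandas. The sketch below uses the question's example values as nested lists; labels, rows and result are illustrative names only.

```python
# Multiplying a string by an int (or bool) repeats it that many times,
# so 1 keeps the label and 0 yields the empty string.
assert 1 * "x" == "x"
assert 0 * "x" == ""
assert True * "x" == "x"
assert False * "x" == ""

labels = ["x", "y", "z"]
rows = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]

# A float cannot multiply a string, hence the int() conversion
# (the role played by eq(1) or astype(int) in the pandas version).
result = [[int(val) * label for val in row] for label, row in zip(labels, rows)]
print(result)
```
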
Answered By: mozway