Efficiently replacing values in each row of pandas dataframe based on condition

Question

I would like to transform a pandas data frame into a strange yet desired output data frame. For each row, I'd like every value of 0.0 to be replaced with an empty string ('') and every value of 1.0 to be replaced with that row's index label. Every value in the frame is either 1.0 or 0.0.

Here’s some example data:

# starting df
import pandas as pd

df = pd.DataFrame.from_dict({'A':[1.0,0.0,0.0],'B':[1.0,1.0,0.0],'C':[0.0,1.0,1.0]})
df.index=['x','y','z']
print(df)

What the input df looks like:

     A    B    C
x  1.0  1.0  0.0
y  0.0  1.0  1.0
z  0.0  0.0  1.0

What I would like the output df to look like:

   A  B  C
x  x  x   
y     y  y
z        z

So far I’ve got this pretty inefficient but seemingly working code:

for idx in df.index:
    df.loc[idx] = df.loc[idx].map(str).replace('1.0',str(idx))
    df.loc[idx] = df.loc[idx].map(str).replace('0.0','')

Could anyone please suggest an efficient way to do this?

The real data frame I’ll be working with has a shape of (4548, 2044) and the values will always be floats (1.0 or 0.0), like in the example. I’m manipulating the usher_barcodes.csv data from "raw.githubusercontent.com/andersen-lab/Freyja/main/freyja/data/…" into a format required by another pipeline, where the column headers are lineage names and the values are mutations (taken from the index). The column headers and index values will likely be different each time I need to run this code because the lineage assignments are constantly changing.

Thanks!

Asked By: frustrated_bioinformatician

Answer 1

You can simply do:

for idx, row in df.iterrows():
    df.loc[idx] = ['' if val == 0 else idx for val in row]

which gives:

   A  B  C
x  x  x   
y     y  y
z        z
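As a hedged variant of the loop above: newer pandas releases may warn about, or reject, writing strings into float64 columns, so this sketch first casts a copy of the frame to object dtype and writes the labels into that copy. The name `out` is only illustrative, and the sketch assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, 0.0, 0.0], 'B': [1.0, 1.0, 0.0], 'C': [0.0, 1.0, 1.0]},
    index=['x', 'y', 'z'],
)

# Assumption: an object-dtype copy can hold strings without any upcasting.
out = df.astype(object)

# Read each row from the untouched float frame, write labels into the copy.
for idx in df.index:
    out.loc[idx] = [idx if val == 1 else '' for val in df.loc[idx]]

print(out)
```

This is still a Python-level loop over rows, so it is meant for small frames or as a readable baseline rather than for speed.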

Answered By: Serge de Gosson de Varennes

Answer 2

Use numpy.where, converting the index to a NumPy array and adding a second axis so that each row label broadcasts across the columns:

df = pd.DataFrame(np.where(df.eq(1), 
                           df.index.to_numpy()[:, None], 
                           ''),
                   index = df.index, 
                   columns = df.columns)

print(df)
   A  B  C
x  x  x   
y     y  y
z        z
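To make the broadcasting step explicit, here is a small sketch on the example frame that looks at the shapes involved. The names `mask`, `labels` and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, 0.0, 0.0], 'B': [1.0, 1.0, 0.0], 'C': [0.0, 1.0, 1.0]},
    index=['x', 'y', 'z'],
)

mask = df.eq(1).to_numpy()             # boolean array, one entry per cell
labels = df.index.to_numpy()[:, None]  # column vector: one label per row

print(mask.shape, labels.shape)        # (3, 3) (3, 1)

# The (3, 1) labels are stretched across the 3 columns of the mask,
# so every True cell picks up the label of its own row.
filled = np.where(mask, labels, '')
print(filled)
```

Without the `[:, None]`, the labels would have shape (3,) and would be matched against columns instead of rows.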

Performance on data of shape (4548, 2044):

np.random.seed(2023)
df = pd.DataFrame(np.random.choice([0.0,1.0], size=(4548,2044))).add_prefix('c')
df.index = df.index.astype(str) + 'r'
# print (df)

In [87]: %timeit df.eq(1).mul(df.index, axis=0)
684 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit pd.DataFrame(np.where(df.eq(1),df.index.to_numpy()[:, None],''),index = df.index, columns = df.columns)
449 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
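The %timeit lines above are IPython magics. Outside IPython, a rough comparison can be sketched with the standard timeit module. The frame here is deliberately much smaller than (4548, 2044) so that the row loop finishes quickly, and the helper names `with_where` and `with_row_loop` are made up for this example. Timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(2023)
# Assumption: a small (200, 100) frame is enough to see the trend.
df = pd.DataFrame(rng.choice([0.0, 1.0], size=(200, 100))).add_prefix('c')
df.index = df.index.astype(str) + 'r'


def with_where(frame):
    """Vectorised version: broadcast the row labels over a boolean mask."""
    values = np.where(frame.eq(1).to_numpy(), frame.index.to_numpy()[:, None], '')
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)


def with_row_loop(frame):
    """Row-by-row baseline writing into an object-dtype copy."""
    out = frame.astype(object)
    for idx in frame.index:
        out.loc[idx] = [idx if val == 1 else '' for val in frame.loc[idx]]
    return out


t_where = timeit.timeit(lambda: with_where(df), number=5)
t_loop = timeit.timeit(lambda: with_row_loop(df), number=5)
print(f"np.where: {t_where:.4f} s, row loop: {t_loop:.4f} s (5 runs each)")
```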

Answered By: jezrael

Answer 3

Take advantage of the fact that 1*'x' -> 'x' and 0*'x' -> '':

out = df.eq(1).mul(df.index, axis=0)

NB. eq(1) converts the floats to booleans, and True behaves like 1 in the multiplication. You could also use astype(int) if the data contains only 0.0 and 1.0.

Output:

   A  B  C
x  x  x   
y     y  y
z        z
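The string-repetition rule this answer relies on can be checked in plain Python, without pandas. The sketch below also shows why the floats need converting first: multiplying a string by a float is a TypeError. The names `labels`, `columns` and `out` are illustrative.

```python
# bool and int repeat a string; True counts as 1 and False as 0.
assert True * 'x' == 'x'
assert False * 'x' == ''

# A float cannot repeat a string, hence eq(1) or astype(int) in the answer.
try:
    1.0 * 'x'
except TypeError as exc:
    print('float * str fails:', exc)

labels = ['x', 'y', 'z']
columns = {'A': [1.0, 0.0, 0.0], 'B': [1.0, 1.0, 0.0], 'C': [0.0, 1.0, 1.0]}

# Same idea as the one-liner, spelled out cell by cell.
out = {
    col: [int(val) * label for val, label in zip(vals, labels)]
    for col, vals in columns.items()
}
print(out)
# {'A': ['x', '', ''], 'B': ['x', 'y', ''], 'C': ['', 'y', 'z']}
```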

Answered By: mozway
