Add prefix to ffill, identifying values which were carried forward

Question:

Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information like so:

| Kingdom  | Phylum        | Class       | Order           | Family           | Genus         |
|----------|---------------|-------------|-----------------|------------------|---------------|
| Bacteria | Firmicutes    | Bacilli     | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales   |                  |               |
| Bacteria | Bacteroidetes |             |                 |                  |               |

Since not all of the taxa in my dataframe can be fully classified, I have some empty cells. After replacing the blanks with NA, I can use ffill to fill them with the last valid string in each row, but I would like to add a prefix to the filled values (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.

So far I have tried taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1), but this of course adds the "unknown_" prefix to every value in the dataframe, not just the filled ones.
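For reference, the setup described in the question can be reproduced roughly as follows. This is a minimal sketch: the sample frame is rebuilt by hand from the table above, empty strings stand in for the blank cells, and the name `filled` is only illustrative. It uses `DataFrame.ffill(axis=1)` in place of `fillna(method='ffill', axis=1)`, which newer pandas releases deprecate.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the table in the question; blanks are empty strings.
taxonomy = pd.DataFrame(
    {
        "Kingdom": ["Bacteria", "Bacteria", "Bacteria"],
        "Phylum": ["Firmicutes", "Bacteroidetes", "Bacteroidetes"],
        "Class": ["Bacilli", "Bacteroidia", ""],
        "Order": ["Lactobacillales", "Bacteroidales", ""],
        "Family": ["Lactobacillaceae", "", ""],
        "Genus": ["Lactobacillus", "", ""],
    }
)

# Turn empty or whitespace-only cells into real NaN so ffill can see them.
taxonomy = taxonomy.replace(r"^\s*$", np.nan, regex=True)

# Fill each row from left to right with the last valid rank.
filled = taxonomy.ffill(axis=1)

# The attempt from the question: the prefix lands on every cell,
# including cells that were never missing.
taxa_formatted = "unknown_" + filled
print(taxa_formatted.loc[0, "Kingdom"])  # unknown_Bacteria
```

The missing piece is a record of which cells were NaN before the fill, which is what the answers capture with `isnull()` / `isna()`.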

Asked By: adamsorbie


Answers:

You need to use mask and update:

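import numpy as np

# Convert empty strings to real NaN first, if needed.
# df = df.replace('', np.nan)

s = df.isnull()        # remember which cells were missing
df = df.ffill(axis=1)  # fill each row from the left

# Prefix only the cells that were filled; update() ignores the NaN cells.
df.update('unknown_' + df.mask(~s))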

print(df)

   Bacteria     Firmicutes                Bacilli        Lactobacillales  
0  Bacteria  Bacteroidetes            Bacteroidia          Bacteroidales   
1  Bacteria  Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes   

        Lactobacillaceae          Lactobacillus  
0  unknown_Bacteroidales  unknown_Bacteroidales  
1  unknown_Bacteroidetes  unknown_Bacteroidetes  
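A self-contained version of the mask-and-update approach, with the column names from the question, might look like the sketch below. The frame is rebuilt by hand from the question's table, and `was_missing` is just an illustrative name for the boolean mask. It assumes object-dtype string columns, where adding a string to NaN yields NaN rather than raising.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus"],
        ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
         np.nan, np.nan],
        ["Bacteria", "Bacteroidetes", np.nan, np.nan, np.nan, np.nan],
    ],
    columns=["Kingdom", "Phylum", "Class", "Order", "Family", "Genus"],
)

was_missing = df.isna()   # remember the gaps before filling
df = df.ffill(axis=1)     # carry the last valid rank along each row

# mask(~was_missing) blanks out the originally present values, so only the
# carried-forward cells receive the prefix; update() then skips the NaN
# cells and leaves the original values untouched.
df.update("unknown_" + df.mask(~was_missing))

print(df.loc[1, "Family"])  # unknown_Bacteroidales
```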
Answered By: Umar.H

You can do this using boolean masking with df.isna.

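df = df.replace("", np.nan)  # skip this step if the gaps are already NaN
d = df.ffill()  # no axis given, so values are carried down each column

d[df.isna()] += "(Copy)"  # tag only the cells that were filled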
d
    Kingdom         Phylum              Class                Order                  Family                Genus
0  Bacteria     Firmicutes            Bacilli      Lactobacillales        Lactobacillaceae        Lactobacillus
1  Bacteria  Bacteroidetes        Bacteroidia        Bacteroidales  Lactobacillaceae(Copy)  Lactobacillus(Copy)
2  Bacteria  Bacteroidetes  Bacteroidia(Copy)  Bacteroidales(Copy)  Lactobacillaceae(Copy)  Lactobacillus(Copy)

To fill along each row and add a prefix instead, you can use df.add with fill_value:

d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()], fill_value='')

    Kingdom         Phylum                  Class                  Order                 Family                  Genus
0  Bacteria     Firmicutes                Bacilli        Lactobacillales       Lactobacillaceae          Lactobacillus
1  Bacteria  Bacteroidetes            Bacteroidia          Bacteroidales  unknown_Bacteroidales  unknown_Bacteroidales
2  Bacteria  Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes
Answered By: Ch3steR
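The answers above share one idea: record which cells were missing, forward-fill along the rows, then prefix only the recorded cells. As a variation that is not taken from either answer, the same idea can be wrapped in a small helper built on `DataFrame.where`. The function name `ffill_with_prefix` and its `prefix` parameter are hypothetical, and the demo frame is made up for illustration.

```python
import pandas as pd


def ffill_with_prefix(frame: pd.DataFrame, prefix: str = "unknown_") -> pd.DataFrame:
    """Forward-fill along rows and prefix only the cells that were filled."""
    filled = frame.ffill(axis=1)
    # A cell was carried forward if it was missing before and has a value now.
    carried = frame.isna() & filled.notna()
    # fillna("") only makes the string concatenation safe; cells that are
    # still NaN are not selected by `carried`, so they stay NaN in the result.
    return filled.where(~carried, prefix + filled.fillna(""))


# Hypothetical demo data: the second row is classified only to Kingdom level.
demo = pd.DataFrame(
    {
        "Kingdom": ["Bacteria", "Bacteria"],
        "Phylum": ["Firmicutes", None],
        "Class": ["Bacilli", None],
    }
)
result = ffill_with_prefix(demo, prefix="Unknown_")
print(result.loc[1, "Class"])  # Unknown_Bacteria
```

Unlike `update`, this returns a new frame and leaves the input unchanged, which can be convenient when the unfilled data is still needed later.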