Filter columns containing values and NaN using specific characters and create seperate columns

Question:

I have a dataframe containing columns in the below format

df =

ID     Folder Name    Country
300    ABC 12345      CANADA
1000   NaN            USA
450    AML 2233       USA
111    ABC 2234       USA
550    AML 3312       AFRICA

Output needs to be in the below format

ID     Folder Name    Country    Folder Name - ABC   Folder Name - AML
300    ABC 12345      CANADA      ABC 12345             NaN
1000     NaN          USA         NaN                   NaN
450    AML 2233       USA         NaN                   AML 2233
111    ABC 2234       USA         ABC 2234              NaN
550    AML 3312       AFRICA      NaN                   AML 3312

I tried using the below python code:-

df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))

Can you please help me where i am going wrong?

Asked By: Coder_Alex

||

Answers:

the startswith methode return True or False so your column will contains just a boolean values instead you can try this :

df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x if x.str.startswith('ABC',na = False))
Answered By: to_data

You should not use apply but boolean indexing:

df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']

However, a better approach that would not require you to loop over all possible codes would be to extract the code, pivot_table and merge:

out = df.merge(
         df.assign(col=df['Folder Name'].str.extract('(w+)'))
           .pivot_table(index='ID', columns='col',
                        values='Folder Name', aggfunc='first')
           .add_prefix('Folder Name - '),
         on='ID', how='left'
)

output:

     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
Answered By: mozway

does this code do the trick?

df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC'))
Answered By: SergFSM

If you have a list with the substrings to be matched at the start of each string in df['Folder Name'], you could also achieve the result as follows:

lst = ['ABC','AML']
pat = f'^({".*)|(".join(lst)}.*)'
# '^(ABC.*)|(AML.*)'

df[[f'Folder Name - {x}' for x in lst]] = 
    df['Folder Name'].str.extract(pat, expand=True)

print(df)

     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312

If you do not already have this list, you can simply create it first by doing:

lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list, 
# but that doesn't affect the functionality here

N.B. If your list contains items that won’t match, you’ll end up with extra columns filled completely with NaN values. You can get rid of these at the end. E.g.:

lst = ['ABC','AML','NON']
# 'NON' won't match

pat = f'^({".*)|(".join(lst)}.*)'
df[[f'Folder Name - {x}' for x in lst]] = 
    df['Folder Name'].str.extract(pat, expand=True)

df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
Answered By: ouroboros1