Filter columns containing values and NaN using specific characters and create separate columns
Question:
I have a dataframe containing columns in the below format
df =
ID    Folder Name  Country
300   ABC 12345    CANADA
1000  NaN          USA
450   AML 2233     USA
111   ABC 2234     USA
550   AML 3312     AFRICA
Output needs to be in the below format
ID    Folder Name  Country  Folder Name - ABC  Folder Name - AML
300   ABC 12345    CANADA   ABC 12345          NaN
1000  NaN          USA      NaN                NaN
450   AML 2233     USA      NaN                AML 2233
111   ABC 2234     USA      ABC 2234           NaN
550   AML 3312     AFRICA   NaN                AML 3312
I tried the Python code below:
df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))
Can you please help me see where I am going wrong?
Answers:
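The startswith method returns True or False, so your column would contain only boolean values. Also, inside apply each x is a single value rather than a Series, so it has no .str accessor. Instead, you can try this:
import numpy as np
df['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x if isinstance(x, str) and x.startswith('ABC') else np.nan)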
You should not use apply but boolean indexing:
df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']
However, a better approach that would not require you to loop over all possible codes would be to extract the code, then use pivot_table and merge:
out = df.merge(
    # r'(\w+)' captures the leading code, e.g. 'ABC' from 'ABC 12345'
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)', expand=False))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)
output:
     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
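The chain above is easier to follow when split into its three steps. The following is a sketch only: the sample frame is rebuilt from the question, and the intermediate names `code` and `wide` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'ID': [300, 1000, 450, 111, 550],
    'Folder Name': ['ABC 12345', np.nan, 'AML 2233', 'ABC 2234', 'AML 3312'],
    'Country': ['CANADA', 'USA', 'USA', 'USA', 'AFRICA'],
})

# Step 1: pull the leading code out of each folder name (NaN stays NaN)
code = df['Folder Name'].str.extract(r'(\w+)', expand=False)

# Step 2: one column per code, indexed by ID;
# the row whose code is NaN does not appear in the pivoted table
wide = (
    df.assign(col=code)
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - ')
)

# Step 3: the left merge restores the rows that had no code, filled with NaN
out = df.merge(wide, on='ID', how='left')
print(out)
```

Because the column names come from the extracted codes, a new code in the data produces a new column without any change to the script.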
Does this code do the trick?
df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC', na=False))
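The same `where` call can be repeated for several prefixes. A minimal sketch, assuming the sample data from the question and a hand-written list called `prefixes` (a name chosen here, not taken from the question):

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'ID': [300, 1000, 450, 111, 550],
    'Folder Name': ['ABC 12345', np.nan, 'AML 2233', 'ABC 2234', 'AML 3312'],
    'Country': ['CANADA', 'USA', 'USA', 'USA', 'AFRICA'],
})

prefixes = ['ABC', 'AML']  # assumed list of codes

for prefix in prefixes:
    # na=False turns missing folder names into False instead of NaN
    mask = df['Folder Name'].str.startswith(prefix, na=False)
    # where() keeps the value where the mask is True and puts NaN elsewhere
    df[f'Folder Name - {prefix}'] = df['Folder Name'].where(mask)

print(df)
```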
If you have a list with the substrings to be matched at the start of each string in df['Folder Name'], you could also achieve the result as follows:
lst = ['ABC','AML']
# one capture group per substring; each alternative carries its own '^'
# so that every substring is matched only at the start of the string
pat = '|'.join(f'^({x}.*)' for x in lst)
# '^(ABC.*)|^(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)
print(df)
     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
If you do not already have this list, you can simply create it first by doing:
lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list,
# but that doesn't affect the functionality here
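One caveat: a non-null folder name that does not start with three capital letters yields NaN from `str.extract`, and that NaN would then show up in the result of `unique()`. A small sketch of a variant that drops such entries after extracting; the Series `names` and the value `'misc 1'` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative folder names, including one that does not match the pattern
names = pd.Series(['ABC 12345', np.nan, 'AML 2233', 'ABC 2234', 'misc 1'])

codes = (
    names.str.extract(r'^([A-Z]{3})', expand=False)  # Series of codes or NaN
         .dropna()                                   # drop missing and non-matching rows
         .unique()                                   # keep first-appearance order
         .tolist()                                   # plain Python list
)
print(codes)
```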
N.B. If your list contains items that won’t match, you’ll end up with extra columns filled completely with NaN values. You can get rid of these at the end. E.g.:
lst = ['ABC','AML','NON']
# 'NON' won't match
pat = '|'.join(f'^({x}.*)' for x in lst)
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values