Align two columns of strings in pandas (merge strings recursively until match)
Question:
I have the following pandas df
import pandas as pd
df = pd.DataFrame(data={'col1': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN"],
'col2': ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2", 'Hong', '-', 'Kong', 'Fish'],
'col3': ["H", "Q", "S", "X", "Y", "Z", "L", "M", 'A', 'B', 'C', 'O']})
df
col1 | col2 | col3 |
---|---|---|
Sun | Sun | H |
Sea: | Sea | Q |
SARS-COV-2 | : | S |
Hong-Kong | SARS | X |
Fish | - | Y |
NaN | COV | Z |
NaN | - | L |
NaN | 2 | M |
NaN | Hong | A |
NaN | - | B |
NaN | Kong | C |
NaN | Fish | O |
I need to align col1 and col2, just as shown in df2
df2 = pd.DataFrame(data={'col1': ["Sun","Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
'col2': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
'col3': ["H", "Q", "X", "A",'O']})
df2
col1 | col2 | col3 |
---|---|---|
Sun | Sun | H |
Sea: | Sea: | Q |
SARS-COV-2 | SARS-COV-2 | X |
Hong-Kong | Hong-Kong | A |
Fish | Fish | O |
That is, I have to keep merging consecutive strings of col2 until the result matches the next value in col1, while keeping the col3 value of the first string in each merged group.
My initial approach was to use nested loops but it became very confusing.
Any thoughts? Thanks in advance
Answers:
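You can split col1 on each non-word character (regex \W), keeping the separators with a capture group, to get the same tokens as col2. The exploded pieces keep the index of the col1 row they came from, so that index can serve as the grouping key for col2 and col3:
# Assumes the missing values are real np.nan, not the string 'NaN';
# otherwise use df['col1'].replace('NaN', np.nan).dropna() instead.
# The capture group in r'(\W)' keeps the separators; empty pieces are dropped.
grp = (df['col1'].dropna().str.split(r'(\W)').explode().loc[lambda x: x != ''])
# grp has one row per col2 token, labelled with the index of its source col1 row.
df1 = df.groupby(grp.index).agg({'col2': lambda x: ''.join(x), 'col3': 'first'})
out = pd.concat([df['col1'].dropna(), df1], axis=1)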
Output:
>>> out
         col1        col2 col3
0         Sun         Sun    H
1        Sea:        Sea:    Q
2  SARS-COV-2  SARS-COV-2    X
3   Hong-Kong   Hong-Kong    A
4        Fish        Fish    O
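For comparison, the "merge until there is a match" idea from the question can also be written as a single pass rather than nested loops. This is only a sketch and not part of the answer above: `align_by_accumulating` is a hypothetical helper name, and it assumes that the col2 tokens, joined in order, reproduce the col1 values exactly, and that missing col1 entries are either real NaN or the literal string 'NaN' as in the question's data.

```python
import pandas as pd


def align_by_accumulating(df: pd.DataFrame) -> pd.DataFrame:
    """Join consecutive col2 tokens until they equal the next col1 target."""
    # Keep only real strings; skip np.nan and the literal 'NaN' placeholder.
    targets = [s for s in df["col1"] if isinstance(s, str) and s != "NaN"]
    pending = iter(targets)
    target = next(pending, None)
    rows = []
    buffer, first_col3 = "", None
    for token, col3 in zip(df["col2"], df["col3"]):
        if target is None:
            break  # no targets left; remaining tokens are ignored
        if not buffer:
            first_col3 = col3  # col3 of the first token in the current group
        buffer += token
        if buffer == target:
            rows.append((target, buffer, first_col3))
            buffer, target = "", next(pending, None)
        elif not target.startswith(buffer):
            # The tokens cannot be extended to the target: fail loudly.
            raise ValueError(f"{buffer!r} cannot be extended to {target!r}")
    return pd.DataFrame(rows, columns=["col1", "col2", "col3"])


df = pd.DataFrame(data={
    "col1": ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"] + ["NaN"] * 7,
    "col2": ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2",
             "Hong", "-", "Kong", "Fish"],
    "col3": ["H", "Q", "S", "X", "Y", "Z", "L", "M", "A", "B", "C", "O"],
})
out = align_by_accumulating(df)
print(out)
```

Unlike the split-based answer, this version does not depend on which characters act as separators, but it is a plain Python loop and will be slower on large frames.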