Align two columns of strings in pandas (merge strings recursively until match)

Question:

I have the following pandas DataFrame:

import pandas as pd
df = pd.DataFrame(data={'col1': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN"],
                        'col2': ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2", 'Hong', '-', 'Kong', 'Fish'],
                        'col3': ["H", "Q", "S", "X", "Y", "Z", "L", "M", 'A', 'B', 'C', 'O']})
df
col1        col2  col3
Sun         Sun   H
Sea:        Sea   Q
SARS-COV-2  :     S
Hong-Kong   SARS  X
Fish        -     Y
NaN         COV   Z
NaN         -     L
NaN         2     M
NaN         Hong  A
NaN         -     B
NaN         Kong  C
NaN         Fish  O

I need to align col1 and col2, as shown in df2:

df2 = pd.DataFrame(data={'col1': ["Sun","Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
                        'col2': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
                        'col3': ["H", "Q", "X",  "A",'O']})
df2
col1        col2        col3
Sun         Sun         H
Sea:        Sea:        Q
SARS-COV-2  SARS-COV-2  X
Hong-Kong   Hong-Kong   A
Fish        Fish        O

That is, I have to keep merging consecutive strings of col2 until the result matches the current col1 value, while keeping the col3 value of the first merged row.

My initial approach was to use nested loops but it became very confusing.

Any thoughts? Thanks in advance

Asked By: moon289


Answers:

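You can split col1 on each non-word character (regex '\W'), keeping the separators with a capture group, so that it yields the same pieces as col2. The index of the exploded result then tells you which col1 row each row of df belongs to:

# Assumes the missing values are np.nan, not the string 'NaN';
# otherwise use df['col1'].replace('NaN', np.nan).dropna() instead.
grp = df['col1'].dropna().str.split(r'(\W)').explode().loc[lambda x: x != '']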

df1 = df.groupby(grp.index).agg({'col2': lambda x: ''.join(x), 'col3': 'first'})
out = pd.concat([df['col1'].dropna(), df1], axis=1)
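
To see why the grouping works, here is a minimal, self-contained sketch of the intermediate values, using the question's data where the missing entries are the string 'NaN'. The names `col1`, `tokens` and `keys` are only illustrative and are not part of the code above. The approach relies on col2 containing exactly the pieces that the '\W' split produces, so it is worth checking that before grouping.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"] + ["NaN"] * 7,
    'col2': ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2",
             "Hong", "-", "Kong", "Fish"],
    'col3': ["H", "Q", "S", "X", "Y", "Z", "L", "M", "A", "B", "C", "O"],
})

# The question stores missing values as the string 'NaN', so filter on that
# instead of calling dropna().
col1 = df.loc[df['col1'] != 'NaN', 'col1']

# Split on every non-word character; the capture group keeps the separator
# as its own piece. A trailing separator ("Sea:") leaves an empty string,
# which is dropped afterwards.
tokens = col1.str.split(r'(\W)', regex=True).explode()
tokens = tokens[tokens != '']

# Each piece carries the index label of the col1 row it came from; these
# labels, taken positionally, are the group key for the rows of df.
keys = tokens.index.to_numpy()

# Sanity check: the pieces must line up one-to-one with col2.
assert tokens.tolist() == df['col2'].tolist()
print(keys)
```

If the assertion fails, col2 was tokenized by some other rule and the group keys would not line up with the rows of df.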

Output:

>>> out
         col1        col2 col3
0         Sun         Sun    H
1        Sea:        Sea:    Q
2  SARS-COV-2  SARS-COV-2    X
3   Hong-Kong   Hong-Kong    A
4        Fish        Fish    O
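
If col2 is not split exactly on non-word characters, the "merge until match" idea from the question can also be written as a plain loop. The following is only a sketch of that alternative, not the approach shown above. The function name `align_by_accumulation` is hypothetical, and it assumes every col1 value is non-empty and can be rebuilt from consecutive col2 values in order.

```python
import pandas as pd

def align_by_accumulation(targets, pieces, values):
    """Join consecutive pieces until they equal the next target.

    Returns one row per target holding the target, the joined pieces and
    the value that belonged to the first joined piece.
    Raises ValueError if the pieces cannot rebuild a target.
    """
    rows = []
    i = 0  # position of the next unused piece
    for target in targets:
        start = i
        joined = ""
        while joined != target:
            # Stop when the pieces run out or the join has diverged.
            if i == len(pieces) or not target.startswith(joined):
                raise ValueError(
                    f"cannot rebuild {target!r} from pieces at position {start}"
                )
            joined += pieces[i]
            i += 1
        rows.append((target, joined, values[start]))
    return pd.DataFrame(rows, columns=["col1", "col2", "col3"])

df = pd.DataFrame({
    'col1': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"] + ["NaN"] * 7,
    'col2': ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2",
             "Hong", "-", "Kong", "Fish"],
    'col3': ["H", "Q", "S", "X", "Y", "Z", "L", "M", "A", "B", "C", "O"],
})

# The question uses the string 'NaN' as its missing-value marker.
targets = [s for s in df['col1'] if s != 'NaN']
out_loop = align_by_accumulation(targets, df['col2'].tolist(), df['col3'].tolist())
print(out_loop)
```

This is a single pass over col2, so it stays linear in the number of rows, and it fails loudly instead of silently misaligning when the two columns disagree.
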
Answered By: Corralien