Align two columns of strings in pandas (merge strings recursively until match)

Question

I have the following pandas DataFrame:

import pandas as pd
df = pd.DataFrame(data={'col1': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN"],
                        'col2': ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2", 'Hong', '-', 'Kong', 'Fish'],
                        'col3': ["H", "Q", "S", "X", "Y", "Z", "L", "M", 'A', 'B', 'C', 'O']})
df

col1	col2	col3
Sun	Sun	H
Sea:	Sea	Q
SARS-COV-2	:	S
Hong-Kong	SARS	X
Fish	-	Y
NaN	COV	Z
NaN	-	L
NaN	2	M
NaN	Hong	A
NaN	-	B
NaN	Kong	C
NaN	Fish	O

I need to align col1 and col2, as shown in df2:

df2 = pd.DataFrame(data={'col1': ["Sun","Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
                        'col2': ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"],
                        'col3': ["H", "Q", "X",  "A",'O']})
df2

col1	col2	col3
Sun	Sun	H
Sea:	Sea:	Q
SARS-COV-2	SARS-COV-2	X
Hong-Kong	Hong-Kong	A
Fish	Fish	O

That is, I have to keep concatenating consecutive strings of col2 until the result matches the next value in col1, while keeping the col3 value of the first merged row.

My initial approach was to use nested loops but it became very confusing.

Any thoughts? Thanks in advance

Asked By: moon289

Answer 1

You can split col1 on non-word characters (regex '\W', wrapped in a capture group so the separators themselves are kept) to get the same tokens as col2:

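# Assume NaN is np.nan and not 'NaN' else use .replace('NaN', np.nan).dropna()
# One token per row of col2; the index repeats the col1 row each token came from
grp = (df['col1'].dropna().str.split(r'(\W)').explode().loc[lambda x: x != ''])

# Group the rows of df by that repeated index: join col2, keep the first col3
df1 = df.groupby(grp.index).agg({'col2': lambda x: ''.join(x), 'col3': 'first'})
out = pd.concat([df['col1'].dropna(), df1], axis=1)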

Output:

>>> out
         col1        col2 col3
0         Sun         Sun    H
1        Sea:        Sea:    Q
2  SARS-COV-2  SARS-COV-2    X
3   Hong-Kong   Hong-Kong    A
4        Fish        Fish    O
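
For reference, here is a minimal end-to-end sketch of the same idea that runs against the question's exact frame, where the missing values are the literal string 'NaN' rather than np.nan. It makes a few assumptions that are not in the answer above: the 'NaN' rows are dropped with a boolean filter, re.split is called directly with the same pattern instead of .str.split, and a length check is added because the grouping only works when the tokens of col1 line up one-to-one with the rows of col2.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col1": ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"] + ["NaN"] * 7,
    "col2": ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2",
             "Hong", "-", "Kong", "Fish"],
    "col3": ["H", "Q", "S", "X", "Y", "Z", "L", "M", "A", "B", "C", "O"],
})

# Assumption: missing values are the literal string 'NaN', as in the question.
col1 = df.loc[df["col1"] != "NaN", "col1"]

# Split each col1 value on non-word characters, keeping the separators
# (capture group) and dropping the empty strings re.split can produce.
tokens = col1.map(lambda s: [t for t in re.split(r"(\W)", s) if t]).explode()

# The approach relies on one token per row of df, in the same order.
if len(tokens) != len(df):
    raise ValueError("col1 tokens do not line up with the rows of col2")

# tokens.index repeats the col1 row label once per token, so it can serve
# as the group key for the rows of df (matched by position).
keys = tokens.index.to_numpy()
merged = df.groupby(keys).agg({"col2": "".join, "col3": "first"})

out = pd.concat([col1, merged], axis=1)
print(out)
```

Note that '\W' treats the underscore as a word character, so this only reproduces col2 when col2 was tokenized the same way.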

Answered By: Corralien
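
If col2 was not produced by a simple split on non-word characters, the "merge until match" rule from the question can also be written as a single pass with one running position instead of nested loops. The sketch below is only an illustration under stated assumptions: align_greedy is a hypothetical helper name, missing col1 values are the literal string 'NaN', every col1 value is non-empty, and the col2 pieces appear in the same order as the col1 values they build.

```python
import pandas as pd


def align_greedy(targets, pieces, labels):
    """Concatenate consecutive pieces until they equal each target.

    Returns (target, merged, first_label) tuples, where first_label is the
    label of the first piece used for that target.
    """
    rows = []
    pos = 0
    for target in targets:
        start = pos
        merged = ""
        while merged != target:
            # Stop early if the pieces run out or stop matching the target.
            if pos >= len(pieces) or not target.startswith(merged + pieces[pos]):
                raise ValueError(
                    f"cannot build {target!r} from pieces starting at row {start}"
                )
            merged += pieces[pos]
            pos += 1
        rows.append((target, merged, labels[start]))
    return rows


df = pd.DataFrame({
    "col1": ["Sun", "Sea:", "SARS-COV-2", "Hong-Kong", "Fish"] + ["NaN"] * 7,
    "col2": ["Sun", "Sea", ":", "SARS", "-", "COV", "-", "2",
             "Hong", "-", "Kong", "Fish"],
    "col3": ["H", "Q", "S", "X", "Y", "Z", "L", "M", "A", "B", "C", "O"],
})

# Assumption: missing values are the literal string 'NaN', as in the question.
targets = [s for s in df["col1"] if s != "NaN"]
rows = align_greedy(targets, df["col2"].tolist(), df["col3"].tolist())
out = pd.DataFrame(rows, columns=["col1", "col2", "col3"])
print(out)
```

This version raises a ValueError as soon as the pieces cannot form the next col1 value, which makes misaligned input easier to spot than a silent wrong grouping.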
