Pandas split corresponding rows based on separator in two columns duplicating everything else

Question:

I have an Excel sheet:

Col1    Col2                          Col3            Col4
John    English\nMaths                34\n33          Pass
Sam     Science                       40              Pass
Jack    English\nHistory\nGeography   89\n07\n98      Pass

I need to convert it to:

Col1    Col2      Col3    Col4
John    English   34      Pass
John    Maths     33      Pass
Sam     Science   40      Pass
Jack    English   89      Pass
Jack    History   07      Pass     
Jack    Geography 98      Pass

The Excel sheet uses \n as the separator between corresponding values in Col2 and Col3. I just need to put each subject on its own row with its corresponding marks and copy the contents of all the other columns unchanged.

I tried:

split_cols = ['Col2', 'Col3']

# loop over the columns and split them
separator = '\n'
for col in split_cols:
    df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')

# create two new dataframes with the desired columns
df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})

# concatenate the two dataframes
final_df = pd.concat([df1, df2], ignore_index=True)

# print the final dataframe
print(final_df)
Asked By: spd

||

Answers:

EDITED.

You can do this with the .str.split and .explode methods:

import pandas

df = pandas.DataFrame([
  ["John", "EnglishnMaths", "34n33", "Pass"],
  ["Sam", "Science", "40", "Pass"],
  ["Jack", "EnglishnHistorynGeography", "89n07n98", "Pass"],
])

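# Split both columns on the newline separator so each cell holds a list
df[1] = df[1].str.split("\n")
df[2] = df[2].str.split("\n")
# Explode both list columns together; the other columns are repeated
df = df.explode([1, 2])
print(df)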
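
The same idea with the question's column names can be sketched as follows. This is only a minimal sketch: it assumes the cells are separated by a newline character, that pandas is version 1.3 or later (needed for exploding several columns at once), and that Col2 and Col3 always split into the same number of parts. The name split_cols is borrowed from the question's attempt.

```python
import pandas as pd

# Sample data from the question, with named columns (an assumption;
# the snippet above relies on default integer column labels).
df = pd.DataFrame({
    "Col1": ["John", "Sam", "Jack"],
    "Col2": ["English\nMaths", "Science", "English\nHistory\nGeography"],
    "Col3": ["34\n33", "40", "89\n07\n98"],
    "Col4": ["Pass", "Pass", "Pass"],
})

split_cols = ["Col2", "Col3"]

# Turn each cell into a list of parts, one column at a time.
out = df.copy()
for col in split_cols:
    out[col] = out[col].str.split("\n")

# Explode both list columns in lockstep; Col1 and Col4 are repeated.
out = out.explode(split_cols, ignore_index=True)
print(out)
```

If the two columns do not split into the same number of parts for some row, the multi-column explode raises a ValueError, so unbalanced rows have to be padded first.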
Answered By: EyuelDK

You can explode multiple columns at once (this requires pandas >= 1.3) after splitting each string into a list:

# First pass
out = (df.assign(Col2=df['Col2'].str.split('\n'), 
                 Col3=df['Col3'].str.split('\n')))

# Fix unbalanced lists: pad the shorter list so both have the same length
# (np.pad fills with 0 by default)
def pad(sr):
    n = max(sr.str.len())
    sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
    sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
    return sr

m = out['Col2'].str.len() != out['Col3'].str.len()
out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)

# Second pass
out = out.explode(['Col2', 'Col3'], ignore_index=True)
print(out)

# Output
   Col1       Col2 Col3    Col4
0  John    English   34    Pass
1  John      Maths   33    Pass
2   Sam    Science   40    Pass
3  Jack    English   89    Pass
4  Jack    History   07    Pass
5  Jack  Geography   98    Pass
6  Ryan      Maths   12  Failed
7  Ryan    Science   10  Failed
8  Ryan    History    0  Failed

Input dataframe:

import pandas as pd
import numpy as np

data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
        'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
        'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
        'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
df = pd.DataFrame(data)
print(df)

# Output
   Col1                         Col2        Col3    Col4
0  John               English\nMaths      34\n33    Pass
1   Sam                      Science          40    Pass
2  Jack  English\nHistory\nGeography  89\n07\n98    Pass
3  Ryan      Maths\nScience\nHistory      12\n10  Failed
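
As an alternative to np.pad, the unbalanced rows could also be handled with itertools.zip_longest from the standard library. This is a hedged sketch rather than part of the answer above: the FILL placeholder is a hypothetical choice that mirrors the 0 shown in the output, and the two-row frame is just a reduced version of the input data.

```python
from itertools import zip_longest

import pandas as pd

# Reduced sample: John is balanced, Ryan has three subjects but two marks.
df = pd.DataFrame({
    "Col1": ["John", "Ryan"],
    "Col2": ["English\nMaths", "Maths\nScience\nHistory"],
    "Col3": ["34\n33", "12\n10"],
    "Col4": ["Pass", "Failed"],
})

FILL = "0"  # hypothetical placeholder for a missing subject or mark

subjects, marks = [], []
for subj, mark in zip(df["Col2"], df["Col3"]):
    # zip_longest pairs the i-th subject with the i-th mark and fills
    # the shorter side with FILL.
    pairs = list(zip_longest(subj.split("\n"), mark.split("\n"), fillvalue=FILL))
    subjects.append([p[0] for p in pairs])
    marks.append([p[1] for p in pairs])

# Store the padded lists as object columns, then explode them together.
out = df.assign(
    Col2=pd.Series(subjects, index=df.index),
    Col3=pd.Series(marks, index=df.index),
).explode(["Col2", "Col3"], ignore_index=True)
print(out)
```

Padding every row this way avoids the separate mask for unbalanced rows; if a missing mark should stay visibly missing, None may be a better fill value than "0".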
Answered By: Corralien