Pandas split corresponding rows based on separator in two columns duplicating everything else
Question:
I have an Excel sheet:
Col1  Col2                         Col3        Col4
John  English\nMaths               34\n33      Pass
Sam   Science                      40          Pass
Jack  English\nHistory\nGeography  89\n07\n98  Pass
I need to convert it to:
Col1  Col2       Col3  Col4
John  English    34    Pass
John  Maths      33    Pass
Sam   Science    40    Pass
Jack  English    89    Pass
Jack  History    07    Pass
Jack  Geography  98    Pass
The Excel sheet uses \n as the separator inside the corresponding Col2 and Col3 cells. I just need to pull each subject into a new row with its corresponding marks and copy all the other column contents as they are.
I tried:
split_cols = ['Col2', 'Col3']
# loop over the columns and split them
separator = '\n'
for col in split_cols:
    df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')
# create two new dataframes with the desired columns
df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})
# concatenate the two dataframes
final_df = pd.concat([df1, df2], ignore_index=True)
# print the final dataframe
print(final_df)
Answers:
EDITED.
You can achieve your goal using the .str.split and .explode methods.
import pandas

df = pandas.DataFrame([
    ["John", "English\nMaths", "34\n33", "Pass"],
    ["Sam", "Science", "40", "Pass"],
    ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
])
# Split both columns into lists on the newline separator
df[1] = df[1].str.split("\n")
df[2] = df[2].str.split("\n")
# Explode both list columns together (needs pandas >= 1.3)
df = df.explode([1, 2])
print(df)
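The same split-then-explode idea can be wrapped in a small helper for a frame with named columns. This is only a sketch: the function name split_rows and its parameters are made up for illustration, it assumes pandas >= 1.3 (for multi-column explode), and it assumes every row has the same number of items in each of the split columns.

```python
import pandas as pd


def split_rows(df, cols, sep="\n"):
    """Hypothetical helper: split `cols` on `sep` and give each item its own row."""
    out = df.copy()
    for col in cols:
        # Turn "English\nMaths" into ["English", "Maths"]
        out[col] = out[col].str.split(sep)
    # Explode the list columns in lockstep; the other columns are repeated.
    # pandas raises ValueError if the lists in a row differ in length.
    return out.explode(cols, ignore_index=True)


df = pd.DataFrame({
    "Col1": ["John", "Sam", "Jack"],
    "Col2": ["English\nMaths", "Science", "English\nHistory\nGeography"],
    "Col3": ["34\n33", "40", "89\n07\n98"],
    "Col4": ["Pass", "Pass", "Pass"],
})
result = split_rows(df, ["Col2", "Col3"])
print(result)
```

If the sheet is loaded from Excel and some marks arrive as numbers rather than text, .str.split returns NaN for those cells, so casting the column with astype(str) before splitting may be needed.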
You can explode on multiple columns (with pandas >= 1.3) after splitting each string into a list:
# First pass: split each string into a list
out = (df.assign(Col2=df['Col2'].str.split('\n'),
                 Col3=df['Col3'].str.split('\n')))

# Fix unbalanced lists by padding the shorter one (np.pad fills with 0)
def pad(sr):
    n = max(sr.str.len())
    sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
    sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
    return sr

m = out['Col2'].str.len() != out['Col3'].str.len()
out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)

# Second pass: explode both list columns together
out = out.explode(['Col2', 'Col3'], ignore_index=True)
print(out)
# Output
   Col1       Col2  Col3    Col4
0  John    English    34    Pass
1  John      Maths    33    Pass
2   Sam    Science    40    Pass
3  Jack    English    89    Pass
4  Jack    History    07    Pass
5  Jack  Geography    98    Pass
6  Ryan      Maths    12  Failed
7  Ryan    Science    10  Failed
8  Ryan    History     0  Failed
Input dataframe:
import pandas as pd
import numpy as np

data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
        'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
        'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
        'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
df = pd.DataFrame(data)
print(df)
# Output
   Col1                         Col2        Col3    Col4
0  John               English\nMaths      34\n33    Pass
1   Sam                      Science          40    Pass
2  Jack  English\nHistory\nGeography  89\n07\n98    Pass
3  Ryan      Maths\nScience\nHistory      12\n10  Failed
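The pad function in this answer fills the shorter list with 0, which is np.pad's default and can be mistaken for a real mark. As a possible variation, the sketch below pads with None so the gap shows up as a missing value instead. The helper name pad_pair is hypothetical, and the two-row frame is a cut-down copy of the input above, used only for illustration.

```python
from itertools import zip_longest

import pandas as pd

df = pd.DataFrame({
    "Col1": ["John", "Ryan"],
    "Col2": ["English\nMaths", "Maths\nScience\nHistory"],
    "Col3": ["34\n33", "12\n10"],
    "Col4": ["Pass", "Failed"],
})


def pad_pair(subjects, marks, fill=None):
    """Hypothetical helper: pad the shorter of two lists with `fill`."""
    pairs = list(zip_longest(subjects, marks, fillvalue=fill))
    return [p[0] for p in pairs], [p[1] for p in pairs]


# Split both columns, then balance the two lists row by row
split2 = df["Col2"].str.split("\n")
split3 = df["Col3"].str.split("\n")
padded = [pad_pair(s, m) for s, m in zip(split2, split3)]

out = df.assign(
    Col2=pd.Series([p[0] for p in padded], index=df.index),
    Col3=pd.Series([p[1] for p in padded], index=df.index),
)
# The lists now have matching lengths, so multi-column explode works
out = out.explode(["Col2", "Col3"], ignore_index=True)
print(out)
```

Whether a missing value or 0 is the better filler depends on how the marks are used afterwards; the missing value at least keeps an absent mark distinguishable from a genuine zero.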