Stack and explode columns in pandas
Question:
I have a DataFrame to which I want to apply explode and stack at the same time: explode the ‘Attendees’ column and assign each value to the correct course. For example, in session1 Course 1 ‘intro to’ had 24 attendees, while Course 2 ‘Computer skill’ had 46. In addition, I want all the course names in one column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Session': ['session1', 'session2', 'session3'],
                   'Course 1': ['intro to', 'advanced', 'Cv'],
                   'Course 2': ['Computer skill', np.nan, 'Write cover letter'],
                   'Attendees': ['24 & 46', '23', '30']})
# The snippets that follow work on a copy named Course_df
Course_df = df.copy()
If I apply explode to ‘Attendees’, I get this result:
Course_df = Course_df.assign(Attendees=Course_df['Attendees'].str.split(' & ')).explode('Attendees')
    Session  Course 1        Course 2 Attendees
0  session1  intro to  Computer skill        24
0  session1  intro to  Computer skill        46
1  session2  advanced             NaN        23
and when I apply the stack function
Course_df = (Course_df.set_index(['Session','Attendees']).stack().reset_index().rename({0:'Courses'}, axis = 1))
This is the result I get
    Session   level_1         Courses Attendees
0  session1  Course 1        intro to        24
1  session1  Course 2  Computer skill        46
2  session2  Course 1        advanced        23
3  session3  Course 1              Cv        30
Whereas the result I want is
    Session   level_1             Courses Attendees
0  session1  Course 1            intro to        24
1  session1  Course 2      Computer skill        46
2  session2  Course 1            advanced        23
3  session3  Course 1                  Cv        30
4  session3  Course 2  Write cover letter        30
Answers:
Melt the DataFrame first:
melted_df = pd.melt(df, id_vars=['Session', 'Attendees'], value_vars=['Course 1', 'Course 2'])
and then explode
result = melted_df.assign(Attendees=melted_df['Attendees'].str.split(' & ')).explode('Attendees')
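Note that exploding after melting repeats every count for every course in a session, so session1 ends up with both 24 and 46 against each of its courses. The following is a minimal sketch of one way to pair counts with courses by position instead. It assumes the n-th count belongs to the n-th course column and that a single count applies to every course in the session (an assumption inferred from the desired output in the question). The names course_cols, long_df and slot are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Session": ["session1", "session2", "session3"],
    "Course 1": ["intro to", "advanced", "Cv"],
    "Course 2": ["Computer skill", np.nan, "Write cover letter"],
    "Attendees": ["24 & 46", "23", "30"],
})
course_cols = ["Course 1", "Course 2"]  # assumed order of the course slots

# Long format: one row per (session, course slot); drop empty slots.
long_df = (
    df.melt(id_vars=["Session", "Attendees"], value_vars=course_cols,
            var_name="level_1", value_name="Courses")
      .dropna(subset=["Courses"])
)

# Zero-based position of each course column, used to pick the matching count.
slot = long_df["level_1"].map({name: i for i, name in enumerate(course_cols)})
counts = long_df["Attendees"].str.split(" & ")

# Assumption: when a session lists fewer counts than courses, reuse the last count.
long_df["Attendees"] = [c[min(s, len(c) - 1)] for c, s in zip(counts, slot)]

result = (
    long_df.sort_values(["Session", "level_1"])
           .reset_index(drop=True)[["Session", "level_1", "Courses", "Attendees"]]
)
print(result)
```

The counts stay as strings here; convert them with result["Attendees"].astype(int) if numbers are needed.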
A general solution, as far as I understand the problem, is to iterate over either the attendee counts or the courses. Here I loop over the attendee counts: I do the explode step manually and set every course except the current/intended one to pd.NA. With df = Course_df (the original, unexploded frame):
df = df.assign(Attendees=df["Attendees"].str.split(" & "))
dfn = df.iloc[:0]  # Create empty dataframe with same columns as df
for didx, d in df.iterrows():
    # Explode manually
    for ci, attend_count in enumerate(d.Attendees):
        dfr = d.to_frame().T
        dfr.Attendees = attend_count
        # Set other courses than "Course <ci+1>" to NaN
        other_courses = [x for x in d.index if x.startswith("Course ") and x != f'Course {ci + 1}']
        # other_courses = d.index.to_series().filter(regex = f'Course [^{ci+1}]').index # Alternative
        for c in other_courses:
            dfr[c] = pd.NA
        dfn = pd.concat([dfn, dfr])
dfn.set_index(["Session", "Attendees"]).stack().reset_index().rename({0: "Courses"}, axis=1)
This returns:
    Session Attendees   level_2         Courses
0  session1        24  Course 1        intro to
1  session1        46  Course 2  Computer skill
2  session2        23  Course 1        advanced
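The same manual pairing can also be written with a single multi-column explode instead of repeated concat calls. This is only a sketch under stated assumptions: it needs pandas 1.3 or later (for exploding several columns at once), and it treats a single count as applying to every course in the session, which is inferred from the question's desired output and differs from the loop above. The names course_cols, records and wide are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Session": ["session1", "session2", "session3"],
    "Course 1": ["intro to", "advanced", "Cv"],
    "Course 2": ["Computer skill", np.nan, "Write cover letter"],
    "Attendees": ["24 & 46", "23", "30"],
})
course_cols = ["Course 1", "Course 2"]

records = []
for _, row in df.iterrows():
    # Keep only the course slots that are filled in for this session.
    names = [c for c in course_cols if pd.notna(row[c])]
    counts = row["Attendees"].split(" & ")
    # Assumption: a single count is shared by every course in the session.
    if len(counts) == 1:
        counts = counts * len(names)
    records.append({
        "Session": row["Session"],
        "Attendees": counts,
        "level_2": names,
        "Courses": [row[c] for c in names],
    })

wide = pd.DataFrame(records)  # each cell in the last three columns holds a list

# Explode the three list columns in lockstep; the lists in a row must have equal length.
result = wide.explode(["Attendees", "level_2", "Courses"], ignore_index=True)
print(result)
```

If a session has several counts but a different number of courses, the lists no longer line up and explode raises a ValueError, which makes inconsistent rows easy to spot.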