Fastest way to iterate and update dataframe
Question:
PROBLEM: I have a dataframe showing which assignments students chose to do and what grades they got on them. I am trying to determine which subsets of assignments were done by the most students and the total points earned on them. The method I’m using is very slow, so I’m wondering what the fastest way is.
My data has this structure:
STUDENT | ASSIGNMENT1 | ASSIGNMENT2 | ASSIGNMENT3 | … | ASSIGNMENT20 |
---|---|---|---|---|---|
Student1 | 50 | 75 | 100 | … | 50 |
Student2 | 75 | 25 | NaN | … | NaN |
… | |||||
Student2000 | 100 | 50 | NaN | … | 50 |
TARGET OUTPUT:
For every possible combination of assignments, I’m trying to get the number of completions and the sum of total points earned on each individual assignment by the subset of students who completed that exact assignment combo:
ASSIGNMENT_COMBO | NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO | ASSIGNMENT1 TOTAL POINTS | ASSIGNMENT2 TOTAL POINTS | ASSIGNMENT3 TOTAL POINTS | … | ASSIGNMENT20 TOTAL POINTS |
---|---|---|---|---|---|---|
Assignment 1, Assignment 2 | 900 | 5000 | 400 | NaN | … | NaN |
Assignment 1, Assignment 2, Assignment 3 | 100 | 3000 | 500 | … | NaN | |
Assignment 2, Assignment 3 | 750 | NaN | 7000 | 750 | … | NaN |
… | ||||||
All possible combos, including any number of assignments |
WHAT I’VE TRIED: First, I’m using itertools to make my assignment combos and then iterating through the dataframe to classify each student by what combos of assignments they completed:
```python
for combo in itertools.product(list_of_assignment_names, repeat=20):
    for i, row in starting_data.iterrows():
        ifor = str(combo)
        ifor_val = 'no'
        for item in combo:
            if row[str(item)] > 0:
                ifor_val = 'yes'
        starting_data.at[i, ifor] = ifor_val
```
Then, I make a second dataframe (assignmentcombostats) that has each combo as a row to count up the number of students who did each combo:
```python
numberofstudents = []
for combo in assignmentcombostats['combo']:
    column = str(combo)
    number = len(starting_data[starting_data[column] == 'yes'])
    numberofstudents.append(number)
assignmentcombostats['numberofstudents'] = numberofstudents
```
This works, but it is very slow.
RESOURCES: I’ve looked at a few resources –
Answers:
One approach to speed up your code is to avoid row-by-row Python loops and instead use pandas' built-in vectorized operations. Here's an example implementation that should produce your desired output:
```python
import itertools
import pandas as pd

# sample data
data = {
    'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
    'ASSIGNMENT1': [50, 75, 100, 100],
    'ASSIGNMENT2': [75, 25, 50, 75],
    'ASSIGNMENT3': [100, None, 75, 50],
    'ASSIGNMENT4': [50, None, None, 100]
}
df = pd.DataFrame(data)

# create a list of all possible assignment combinations
assignments = df.columns[1:].tolist()
combinations = []
for r in range(1, len(assignments) + 1):
    combinations += list(itertools.combinations(assignments, r))

# create a dictionary to hold the results
results = {'ASSIGNMENT_COMBO': [],
           'NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO': [],
           'ASSIGNMENT_TOTAL_POINTS': []}

# iterate over the combinations and compute the results
for combo in combinations:
    # filter the dataframe for students who have completed this combo
    combo_df = df.loc[df[list(combo)].notnull().all(axis=1)]
    num_students = len(combo_df)
    # compute the total points for each assignment in the combo
    points = combo_df[list(combo)].sum()
    # append the results to the dictionary
    results['ASSIGNMENT_COMBO'].append(combo)
    results['NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO'].append(num_students)
    results['ASSIGNMENT_TOTAL_POINTS'].append(points.to_dict())

# create a new dataframe from the results dictionary
combo_stats_df = pd.DataFrame(results)

# create a separate TOTAL POINTS column for each assignment
# (NaN where the assignment is not part of the combo)
for assignment in assignments:
    combo_stats_df[f'{assignment} TOTAL POINTS'] = (
        combo_stats_df['ASSIGNMENT_TOTAL_POINTS']
        .apply(lambda d: d.get(assignment, float('nan')))
    )

# drop the intermediate ASSIGNMENT_TOTAL_POINTS column
combo_stats_df = combo_stats_df.drop('ASSIGNMENT_TOTAL_POINTS', axis=1)
print(combo_stats_df)
```
This code first builds the list of all possible assignment combinations using itertools.combinations. It then iterates over each combo, filters the dataframe down to the students who completed every assignment in it, and computes the student count and per-assignment point totals with built-in pandas operations like notnull, all, and sum. Finally, it turns the results dictionary into a new dataframe and spreads the stored totals into one TOTAL POINTS column per assignment, with NaN wherever an assignment is not part of the combo. Because the per-combo work is vectorized, this should be much faster than iterating over rows, especially for large dataframes.
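As a quick sanity check (my addition, not part of the original answer), one combo can be recomputed by hand on the four-student sample above; only Student1 and Student4 completed both ASSIGNMENT3 and ASSIGNMENT4:

```python
import pandas as pd

df = pd.DataFrame({
    'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
    'ASSIGNMENT1': [50, 75, 100, 100],
    'ASSIGNMENT2': [75, 25, 50, 75],
    'ASSIGNMENT3': [100, None, 75, 50],
    'ASSIGNMENT4': [50, None, None, 100],
})

combo = ['ASSIGNMENT3', 'ASSIGNMENT4']
# students with a non-null grade on every assignment in the combo
done = df[combo].notnull().all(axis=1)
print(done.sum())                            # 2
print(df.loc[done, combo].sum().to_dict())   # {'ASSIGNMENT3': 150.0, 'ASSIGNMENT4': 150.0}
```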
I had a go at tidying up Bryan's Answer:
- Make a list of all possible combinations
- Iterate over each combination to find the totals and number of students
- Combine the results into a dataframe
Setup: (Makes a dataset of 20,000 students and 10 assignments)
```python
import itertools
import pandas as pd
import numpy as np

# Bigger random sample data
def make_data(rows, cols, nans, non_nans):
    df = pd.DataFrame()
    df["student"] = list(range(rows))
    for i in range(1, cols + 1):
        a = np.random.randint(low=1 - nans, high=non_nans, size=rows).clip(0).astype(float)
        a[a <= 0] = np.nan
        df[f"a{i:02}"] = a
    return df

rows = 20000
cols = 10
df = make_data(rows, cols, 50, 50)

# dummy columns, make the aggregates easier
df["students"] = 1
df["combo"] = ""
```
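As an aside (my addition), the generator's missing-data rate can be checked in isolation: `randint(1 - nans, non_nans)` draws uniformly from -49..49 when both parameters are 50, `clip(0)` maps negatives to 0, and everything at or below 0 becomes NaN, so roughly half the grades should be missing:

```python
import numpy as np

np.random.seed(0)  # seeded for reproducibility
a = np.random.randint(low=1 - 50, high=50, size=100_000).clip(0).astype(float)
a[a <= 0] = np.nan
# 50 of the 99 possible draws land at or below 0, so expect ~0.505
nan_share = np.isnan(a).mean()
print(round(nan_share, 2))
```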
Transformation:
```python
# create a list of all possible assignment combinations
# (skip the student column and the last two dummy columns)
assignments = df.columns[1:-2].tolist()
combos = []
for r in range(1, len(assignments) + 1):
    new_combos = list(itertools.combinations(assignments, r))
    combos += new_combos

# create a list to hold the results
results = list(range(len(combos)))

# ignore the student identifier column
df_source = df.iloc[:, 1:]

# iterate over the combinations and compute the results
for ix, combo in enumerate(combos):
    # filter the dataframe for students who have completed this combo
    df_filter = df_source.loc[df_source[list(combo)].notnull().all(axis=1)]
    # aggregate the results to a single row (summing the dummy students column counts the rows)
    df_agg = df_filter.groupby("combo", as_index=False).sum().reset_index(drop=True)
    # store the assignment combination in the results
    df_agg["combo"] = ",".join(combo)
    # add the results to the list
    results[ix] = df_agg

# create a new dataframe from the results list
combo_stats_df = pd.concat(results).reset_index(drop=True)
```
In this demo it takes ~6 seconds to return ~1,000 rows of results.
For 20 assignments that's ~1,000,000 rows of results, so ~6,000 seconds (over 1.5 hours).
Even on my desktop, processing 1,000 combinations takes ~2 seconds, so ~1,000,000 combinations from 20 assignments would still take ~0.5 hours.
I initially tried to write it without the loop, but the process was killed for using too much memory. I like this kind of puzzle since it helps me learn, so I'll keep pondering whether there's a way to avoid the loop while staying within memory.
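One loop-free angle worth considering (a sketch of my own, not benchmarked at the 20-assignment scale): instead of iterating over all 2^20 possible combos, group students by the exact set of assignments they completed. There are at most as many distinct patterns as students (2,000 in the question), which can be far fewer than the number of possible combos. The column names here follow the `a01`-style naming of the demo above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': [1, 2, 3, 4],
    'a01': [50, 75, 100, 100],
    'a02': [75, 25, 50, 75],
    'a03': [100, np.nan, 75, 50],
    'a04': [50, np.nan, np.nan, 100],
})

scores = df.drop(columns='student')
# key each student by the exact set of assignments they completed
pattern = scores.notnull().apply(
    lambda r: ','.join(scores.columns[r.to_numpy()]), axis=1)

grouped = scores.groupby(pattern)
# per-assignment totals; min_count=1 keeps all-NaN columns as NaN
combo_stats = grouped.sum(min_count=1)
combo_stats.insert(0, 'n_students', grouped.size())
print(combo_stats)
```

This gives the "exact combo" counts the target output describes, one row per pattern that actually occurs. If "at least these assignments" counts are needed instead, they can be derived afterwards by summing each pattern's row into every subset of that pattern, which touches far fewer rows than filtering the full dataframe once per combo.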