Fastest way to iterate and update dataframe
Question:
PROBLEM: I have a dataframe showing which assignments students chose to do and what grades they got on them. I am trying to determine which subsets of assignments were done by the most students and the total points earned on them. The method I’m using is very slow, so I’m wondering what the fastest way is.
My data has this structure:
STUDENT | ASSIGNMENT1 | ASSIGNMENT2 | ASSIGNMENT3 | … | ASSIGNMENT20 |
---|---|---|---|---|---|
Student1 | 50 | 75 | 100 | … | 50 |
Student2 | 75 | 25 | NaN | … | NaN |
… | |||||
Student2000 | 100 | 50 | NaN | … | 50 |
TARGET OUTPUT:
For every possible combination of assignments, I’m trying to get the number of completions and the sum of total points earned on each individual assignment by the subset of students who completed that exact assignment combo:
ASSIGNMENT_COMBO | NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO | ASSIGNMENT1 TOTAL POINTS | ASSIGNMENT2 TOTAL POINTS | ASSIGNMENT3 TOTAL POINTS | … | ASSIGNMENT20 TOTAL POINTS |
---|---|---|---|---|---|---|
Assignment 1, Assignment 2 | 900 | 5000 | 400 | NaN | … | NaN |
Assignment 1, Assignment 2, Assignment 3 | 100 | 3000 | 500 | … | NaN | |
Assignment 2, Assignment 3 | 750 | NaN | 7000 | 750 | … | NaN |
… | ||||||
All possible combos, including any number of assignments |
WHAT I’VE TRIED: First, I’m using itertools to make my assignment combos and then iterating through the dataframe to classify each student by what combos of assignments they completed:
```python
for combo in itertools.product(list_of_assignment_names, repeat=20):
    for i, row in starting_data.iterrows():
        ifor = str(combo)
        ifor_val = 'no'
        for item in combo:
            if row[str(item)] > 0:
                ifor_val = 'yes'
        starting_data.at[i, ifor] = ifor_val
```
Then, I make a second dataframe (assignmentcombostats) that has each combo as a row to count up the number of students who did each combo:
```python
numberofstudents = []
for combo in assignmentcombostats['combo']:
    column = str(combo)
    number = len(starting_data[starting_data[column] == 'yes'])
    numberofstudents.append(number)
assignmentcombostats['numberofstudents'] = numberofstudents
```
This works, but it is very slow.
RESOURCES: I’ve looked at a few resources –
Answers:
One approach to speed up your code is to avoid row-by-row Python loops and instead use pandas' built-in vectorized operations. Here's an example implementation that should produce your desired output:
```python
import itertools
import pandas as pd

# sample data
data = {
    'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
    'ASSIGNMENT1': [50, 75, 100, 100],
    'ASSIGNMENT2': [75, 25, 50, 75],
    'ASSIGNMENT3': [100, None, 75, 50],
    'ASSIGNMENT4': [50, None, None, 100]
}
df = pd.DataFrame(data)

# create a list of all possible assignment combinations
assignments = df.columns[1:].tolist()
combinations = []
for r in range(1, len(assignments) + 1):
    combinations += list(itertools.combinations(assignments, r))

# create a dictionary to hold the results
results = {'ASSIGNMENT_COMBO': [],
           'NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO': [],
           'ASSIGNMENT_TOTAL_POINTS': []}

# iterate over the combinations and compute the results
for combo in combinations:
    # filter the dataframe for students who have completed this combo
    combo_df = df.loc[df[list(combo)].notnull().all(axis=1)]
    num_students = len(combo_df)
    # compute the total points for each assignment in the combo
    points = combo_df[list(combo)].sum()
    # append the results to the dictionary
    results['ASSIGNMENT_COMBO'].append(combo)
    results['NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO'].append(num_students)
    results['ASSIGNMENT_TOTAL_POINTS'].append(points.to_dict())

# create a new dataframe from the results dictionary
combo_stats_df = pd.DataFrame(results)

# create a separate TOTAL POINTS column for each assignment
# (NaN where the assignment is not part of the combo)
for assignment in assignments:
    combo_stats_df[f'{assignment} TOTAL POINTS'] = (
        combo_stats_df['ASSIGNMENT_TOTAL_POINTS']
        .apply(lambda d: d.get(assignment, float('nan')))
    )

# drop the intermediate ASSIGNMENT_TOTAL_POINTS column
combo_stats_df = combo_stats_df.drop('ASSIGNMENT_TOTAL_POINTS', axis=1)
print(combo_stats_df)
```
This code first builds the list of all possible assignment combinations using itertools.combinations. It then iterates over each combo, filters the dataframe down to the students who completed every assignment in it, and computes the student count and per-assignment point totals with built-in pandas operations like notnull, all, and sum. Finally, it turns the results dictionary into a new dataframe and spreads the stored totals into one TOTAL POINTS column per assignment, with NaN wherever an assignment is not part of the combo. Because the per-combo work is vectorized, this should be much faster than iterating over rows, especially for large dataframes.
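As a quick sanity check (my addition, not part of the original answer), one combo can be recomputed by hand on the four-student sample above; only Student1 and Student4 completed both ASSIGNMENT3 and ASSIGNMENT4:

```python
import pandas as pd

df = pd.DataFrame({
    'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
    'ASSIGNMENT1': [50, 75, 100, 100],
    'ASSIGNMENT2': [75, 25, 50, 75],
    'ASSIGNMENT3': [100, None, 75, 50],
    'ASSIGNMENT4': [50, None, None, 100],
})

combo = ['ASSIGNMENT3', 'ASSIGNMENT4']
# students with a non-null grade on every assignment in the combo
done = df[combo].notnull().all(axis=1)
print(done.sum())                            # 2
print(df.loc[done, combo].sum().to_dict())   # {'ASSIGNMENT3': 150.0, 'ASSIGNMENT4': 150.0}
```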
I had a go at tidying up Bryan's Answer:
- Make a list of all possible combinations
- Iterate over each combination to find the totals and number of students
- Combine the results into a dataframe
Setup: (Makes a dataset of 20,000 students and 10 assignments)
```python
import itertools
import pandas as pd
import numpy as np

# Bigger random sample data
def make_data(rows, cols, nans, non_nans):
    df = pd.DataFrame()
    df["student"] = list(range(rows))
    for i in range(1, cols + 1):
        a = np.random.randint(low=1 - nans, high=non_nans, size=rows).clip(0).astype(float)
        a[a <= 0] = np.nan
        df[f"a{i:02}"] = a
    return df

rows = 20000
cols = 10
df = make_data(rows, cols, 50, 50)

# dummy columns, make the aggregates easier
df["students"] = 1
df["combo"] = ""
```
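As an aside (my addition), the generator's missing-data rate can be checked in isolation: `randint(1 - nans, non_nans)` draws uniformly from -49..49 when both parameters are 50, `clip(0)` maps negatives to 0, and everything at or below 0 becomes NaN, so roughly half the grades should be missing:

```python
import numpy as np

np.random.seed(0)  # seeded for reproducibility
a = np.random.randint(low=1 - 50, high=50, size=100_000).clip(0).astype(float)
a[a <= 0] = np.nan
# 50 of the 99 possible draws land at or below 0, so expect ~0.505
nan_share = np.isnan(a).mean()
print(round(nan_share, 2))
```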
Transformation:
```python
# create a list of all possible assignment combinations
# (skip the student column and the last two dummy columns)
assignments = df.columns[1:-2].tolist()
combos = []
for r in range(1, len(assignments) + 1):
    new_combos = list(itertools.combinations(assignments, r))
    combos += new_combos

# create a list to hold the results
results = list(range(len(combos)))

# ignore the student identifier column
df_source = df.iloc[:, 1:]

# iterate over the combinations and compute the results
for ix, combo in enumerate(combos):
    # filter the dataframe for students who have completed this combo
    df_filter = df_source.loc[df_source[list(combo)].notnull().all(axis=1)]
    # aggregate the results to a single row (summing the dummy students column counts the rows)
    df_agg = df_filter.groupby("combo", as_index=False).sum().reset_index(drop=True)
    # store the assignment combination in the results
    df_agg["combo"] = ",".join(combo)
    # add the results to the list
    results[ix] = df_agg

# create a new dataframe from the results list
combo_stats_df = pd.concat(results).reset_index(drop=True)
```
In this demo it takes ~6 seconds to return ~1,000 rows of results.
For 20 assignments that's ~1,000,000 rows of results, so ~6,000 seconds (over 1.5 hours).
Even on my desktop, processing 1,000 combinations takes ~2 seconds, so ~1,000,000 combinations from 20 assignments would still take ~0.5 hours.
I initially tried to write it without the loop, but the process was killed for using too much memory. I like this kind of puzzle since it helps me learn, so I'll keep pondering whether there's a way to avoid the loop while staying within memory.
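One loop-free angle worth considering (a sketch of my own, not benchmarked at the 20-assignment scale): instead of iterating over all 2^20 possible combos, group students by the exact set of assignments they completed. There are at most as many distinct patterns as students (2,000 in the question), which can be far fewer than the number of possible combos. The column names here follow the `a01`-style naming of the demo above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'student': [1, 2, 3, 4],
    'a01': [50, 75, 100, 100],
    'a02': [75, 25, 50, 75],
    'a03': [100, np.nan, 75, 50],
    'a04': [50, np.nan, np.nan, 100],
})

scores = df.drop(columns='student')
# key each student by the exact set of assignments they completed
pattern = scores.notnull().apply(
    lambda r: ','.join(scores.columns[r.to_numpy()]), axis=1)

grouped = scores.groupby(pattern)
# per-assignment totals; min_count=1 keeps all-NaN columns as NaN
combo_stats = grouped.sum(min_count=1)
combo_stats.insert(0, 'n_students', grouped.size())
print(combo_stats)
```

This gives the "exact combo" counts the target output describes, one row per pattern that actually occurs. If "at least these assignments" counts are needed instead, they can be derived afterwards by summing each pattern's row into every subset of that pattern, which touches far fewer rows than filtering the full dataframe once per combo.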