How to efficiently process a large dataset with Python?

Question:

I am currently working with a large dataset of approximately 100 million records, which is too large to fit into memory. I need to perform some operations on this dataset using Python, such as filtering, grouping, and aggregating.

I have tried using Pandas and other Python libraries, but they all seem to require loading the entire dataset into memory, which is not feasible given the size of the dataset.

What would be the best way to efficiently process this large dataset with Python? Is there a way to load only a portion of the dataset into memory at a time and process it in batches?

Let’s say I have a CSV file with the following columns:

user_id: A unique identifier for each user
date: The date the user performed a certain action
action_type: The type of action performed by the user (e.g., "click", "view", "purchase")
item_id: The ID of the item associated with the action

The file contains approximately 100 million rows.

What I want to do is filter the data to only include actions of a certain type (e.g., "purchase"), group the data by user and item, and then calculate the total number of actions and the total number of unique items per user.

Here’s some example code I’ve tried:

import pandas as pd

df = pd.read_csv('large_dataset.csv')

# Filter the data to only include "purchase" actions
purchase_df = df[df['action_type'] == 'purchase']

# Group the data by user
grouped_df = purchase_df.groupby('user_id')

# Calculate the total number of actions and the number of unique items per user
results_df = grouped_df.agg({'action_type': 'count', 'item_id': 'nunique'})

However, this code causes a MemoryError due to the size of the dataset.

In this code, I am trying to filter the dataset on a single column value, group the filtered rows by user, and aggregate each group, but the very first step, pd.read_csv, already tries to pull the entire file into memory.

Asked By: AnthonyTechDev


Answers:

filter the data to only include actions of … "purchase", …

Either grep or awk is certainly equal to that task.

$ grep ',purchase,'  < large_dataset.csv  > purchase.csv
$
$ awk -F',' '$3 == "purchase" { print $1, $4 }' ...

… group the data by user and item, …

Let’s sort!

$ grep ',purchase,' ... | sort  > purchase.csv
$
$ awk ... | sort  > purchase.csv

Clearly, Python code could do the same; it would just take more lines of code to express than grep / awk. Start with import csv. The key is to stream the data rather than slurping it all into RAM before doing any processing.
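
For instance, a minimal streaming filter with the csv module might look like this (a sketch only: it assumes the file has a header row naming the columns from the question, and you would still sort the result as above):

import csv

# Stream large_dataset.csv one row at a time; only the current row is in RAM.
with open('large_dataset.csv', newline='') as src, \
        open('purchase.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)   # expects header: user_id, date, action_type, item_id
    writer = csv.writer(dst)
    for row in reader:
        if row['action_type'] == 'purchase':
            writer.writerow([row['user_id'], row['date'],
                             row['action_type'], row['item_id']])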

/usr/bin/sort will use temp files if RAM is too small for all records to fit at once.
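
If you are curious what that looks like, here is a purely illustrative Python sketch of an external sort (sort chunks that fit in memory, spill each sorted run to a temp file, then merge the runs lazily); the chunk_size value and the spill helper name are arbitrary choices for this example:

import heapq
import tempfile

def external_sort(lines, chunk_size=1_000_000):
    # Sort fixed-size chunks in memory, spill each run to disk, then
    # lazily k-way merge the sorted runs with heapq.merge.
    runs = []
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(spill(sorted(chunk)))
    return heapq.merge(*runs)

def spill(sorted_chunk):
    # Write one sorted run to a temporary file and rewind it for reading.
    run = tempfile.TemporaryFile('w+t')
    run.writelines(sorted_chunk)
    run.seek(0)
    return run

You would not normally write this yourself; the point is that sort(1) and database engines already do it for you.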

Any relational database can accomplish such tasks with an ORDER BY clause.
You might start with SQLite, which is very easy for a project to adopt, especially if you’re using SQLAlchemy.
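
For example, a rough sketch with the standard-library sqlite3 module (no SQLAlchemy; the database, table, and column names are just placeholders matching the question, and a GROUP BY stands in for the explicit ORDER BY pass):

import csv
import sqlite3

# Load the rows into an on-disk SQLite database, then let the database
# group and count on disk instead of in RAM.
conn = sqlite3.connect('actions.db')           # file name is arbitrary
conn.execute('CREATE TABLE IF NOT EXISTS actions '
             '(user_id TEXT, date TEXT, action_type TEXT, item_id TEXT)')

with open('large_dataset.csv', newline='') as src:
    reader = csv.DictReader(src)
    rows = ((r['user_id'], r['date'], r['action_type'], r['item_id'])
            for r in reader)                   # generator: streams, never slurps
    conn.executemany('INSERT INTO actions VALUES (?, ?, ?, ?)', rows)
conn.commit()

# Total purchase actions and distinct items per user.
query = '''
    SELECT user_id,
           COUNT(*)                AS total_actions,
           COUNT(DISTINCT item_id) AS unique_items
    FROM actions
    WHERE action_type = 'purchase'
    GROUP BY user_id
'''
for user_id, total_actions, unique_items in conn.execute(query):
    print(user_id, total_actions, unique_items)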

… and then calculate the total number of actions and the total number of unique items per user.

Given ordered rows, this task becomes trivial. All you need is a Python int counter and a set of item IDs. Keep track of current_user; whenever it changes, output the counter and the length of the set, then re-initialize them so they track the subsequent user. Don’t forget to output them at EOF.
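
A minimal sketch of that loop (assuming purchase.csv is the sorted, purchase-only output of the pipeline above, still with the question’s four columns; flush is just a hypothetical helper for this example):

import csv

current_user = None
action_count = 0
items = set()

def flush(user, count, item_set):
    # Emit one output line per user: user_id, total actions, distinct items.
    if user is not None:
        print(user, count, len(item_set))

with open('purchase.csv', newline='') as f:
    for user_id, _date, _action, item_id in csv.reader(f):
        if user_id != current_user:
            flush(current_user, action_count, items)   # previous user is finished
            current_user, action_count, items = user_id, 0, set()
        action_count += 1
        items.add(item_id)

flush(current_user, action_count, items)               # last user at EOF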

Or use /usr/bin/uniq. Or SELECT DISTINCT …

Answered By: J_H