Split a parquet file by groups

Question:

I have a large-ish dataframe in a Parquet file and I want to split it into multiple files to leverage Hive partitioning with pyarrow.
Preferably without loading all data into memory.

(This question has been asked before, but I have not found a solution that is both fast and with low memory consumption.)

As a small example consider the following dataframe:

import polars as pl
from random import choice, randint
from string import ascii_letters

N = 10_000_000
pl.DataFrame({
    'id': [choice(ascii_letters) for _ in range(N)],
    'a': [randint(0, 100) for _ in range(N)],
}).write_parquet('stackoverflow.parquet')

I know that pyarrow can help out, but it’s super slow for big files.

import pyarrow.dataset as ds

ds_df = ds.dataset('stackoverflow.parquet')
ds.write_dataset(ds_df, 'stackoverflow_data', format='parquet', partitioning=['id'])

Polars can also help out, but the fastest solution I have made only works if I have the dataframe in memory:

import os
import polars as pl

df = pl.read_parquet('stackoverflow.parquet')
split_df = df.partition_by('id', as_dict=True)
for id in split_df:
    save_path = os.path.join('stackoverflow_data', f'id={id}')
    os.makedirs(save_path, exist_ok=True)
    split_df[id].write_parquet(os.path.join(save_path, 'data.parquet'))

However, for large files I prefer to work with LazyFrames.
This can be done by repeatedly filtering a LazyFrame and writing the result to disk:

df_query = pl.scan_parquet('stackoverflow.parquet')
ids = df_query.select(pl.col('id').unique()).collect().get_column('id').to_list()
for id in ids:
    save_path = os.path.join('stackoverflow_data', f'id={id}')
    os.makedirs(save_path, exist_ok=True)
    df = df_query.filter(pl.col('id') == id).collect()
    df.write_parquet(os.path.join(save_path, 'data.parquet'))

Unfortunately, this is much slower due to the repeated filtering.

Any suggestions for a better tradeoff between speed and memory usage?

Asked By: robertdj


Answers:

You’re never going to do better than the approach where all your data is in memory. If it fits in memory then it’s unclear what you would define as a better speed/memory tradeoff. Typically you only trade away speed for memory savings if you can’t fit your data in memory. Incidentally though, when you say:

Unfortunately, this is much slower due to the repeated filtering.

this isn’t quite right. It’s slower because of the repeated IO against the physical disk: if the file doesn’t have multiple row groups with statistics, then it has to scan the whole file on every pass.
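
A quick way to see what you're dealing with is to inspect the row-group metadata with pyarrow (a small illustrative check against the file from the question, not something from the original post):

import pyarrow.parquet as pq

# One big row group, or row groups without statistics, means every filtered
# pass has to read essentially the whole file.
md = pq.ParquetFile('stackoverflow.parquet').metadata
print(md.num_row_groups)
print(md.row_group(0).column(0).statistics)  # column 0 is 'id'; None if no stats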

My benchmarks: the partition_by approach takes 5.8s.

The native write_dataset approach takes 6.9s.

The scan_parquet approach takes 88.1s, which is about 15x the first approach. Given that there are 26 ids, each of which triggers another pass over the file, that isn’t too surprising.

The reason the pyarrow write_dataset approach comes so close to the optimal is that it tries to keep all of the final destination files open at once, so that as it reads the data it writes each row straight to where it will ultimately go. That way it doesn’t reread the data the way your scan_parquet approach does.
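
If you want write_dataset to produce the same id=<value> directories as your polars loop, the call can be tuned along these lines (a sketch; the parameter values here are just examples, not what I benchmarked):

import pyarrow.dataset as ds

# Single pass over the source: pyarrow streams the data and keeps the
# destination files open as it goes, so nothing is re-read.
ds.write_dataset(
    ds.dataset('stackoverflow.parquet'),
    'stackoverflow_data',
    format='parquet',
    partitioning=['id'],
    partitioning_flavor='hive',          # gives id=<value> directories
    max_open_files=64,                   # cap on simultaneously open outputs
    existing_data_behavior='overwrite_or_ignore',
)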

If you had saved the initial file with row groups separated by id and with statistics, then your last approach would have been much faster (although still not as fast as the native dataset approach). The initialization would look something like this:

import pyarrow.parquet as pq

# N, choice, randint and ascii_letters are as in the question's setup.
df = pl.DataFrame({
    'id': [choice(ascii_letters) for _ in range(N)],
    'a': [randint(0, 100) for _ in range(N)],
})
ids = df.get_column('id').unique()
saveschema = df.to_arrow().schema
with pq.ParquetWriter(
    "stackoverflow2.parquet",
    saveschema,
    compression='ZSTD',
    version="2.6",
) as writer:
    # Each write_table call creates one row group holding a single id.
    for id in ids:
        writer.write_table(df.filter(pl.col('id') == id).to_arrow())

Using ParquetWriter with the for loop creates a row group each time write_table is called. Since pyarrow has statistics on by default (polars has them off by default), we don’t have to specify this. The stats include the min and max of each column for each row group, and since there is only one id per group, the min and max of 'id' will both be that id. On subsequent scans the reader can tell from the stats which row groups it needs to read, which massively saves on IO relative to having the ids placed randomly amongst all row groups: it only reads the relevant row groups.
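
You can verify the effect by reading the per-row-group statistics back from the new file (again just an illustrative check, not something you need to run):

import pyarrow.parquet as pq

# Each row group of stackoverflow2.parquet holds exactly one id, so the 'id'
# statistics have min == max, which is what lets the reader skip groups.
md = pq.ParquetFile('stackoverflow2.parquet').metadata
for i in range(md.num_row_groups):
    stats = md.row_group(i).column(0).statistics  # column 0 is 'id'
    print(i, stats.min, stats.max)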

Using this file, which is effectively partitioned internally by its row groups, brings the scan_parquet approach down to just 7.5s.

Of course, this only helps you if you can change your upstream file creation but it is illustrative of what you’re facing.
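
If that upstream writer happens to be polars itself, a variation on the same idea (not what I benchmarked) is to sort by id and write with statistics enabled and a modest row group size, so each row group covers a narrow range of ids:

import polars as pl

# Sorting clusters each id into contiguous row groups; statistics=True writes
# the per-row-group min/max that later scans use to skip irrelevant groups.
# row_group_size is just an example value.
(
    pl.read_parquet('stackoverflow.parquet')
    .sort('id')
    .write_parquet(
        'stackoverflow_sorted.parquet',
        statistics=True,
        row_group_size=200_000,
    )
)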

More reading here

Answered By: Dean MacGregor

A 2-pass method which partitions by batch can improve performance (memory and speed) by an order of magnitude.

import pyarrow.dataset as ds

ds_df = ds.dataset('stackoverflow.parquet')
for index, batch in enumerate(ds_df.to_batches()):
    ds.write_dataset(batch, f'temp/batch={index}', format='parquet',
                     partitioning=['id'], partitioning_flavor='hive')
ds.write_dataset(ds.dataset('temp', partitioning='hive', schema=ds_df.schema),
                 'stackoverflow_data', format='parquet', partitioning=['id'])

The first pass partitions by (batch, id) without loading the entire table, using hive format for convenience. Then the second pass can take advantage of already being partitioned by id.
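
As an optional follow-up (not part of the answer's measurements), the temp directory is only an intermediate, so it can be removed once the second pass finishes, and the final output read back as a directory-partitioned dataset:

import shutil
import pyarrow.dataset as ds

# Drop the intermediate batch=<n> tree and open the final id-partitioned output.
shutil.rmtree('temp')
result = ds.dataset('stackoverflow_data', partitioning=['id'])
print(result.files[:3])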

Answered By: A. Coady

I asked a similar question here.

I was able to take inspiration from your looped LazyFrame approach and parallelize the loop using spawn multiprocessing with polars. This gave me a 5-10x speed-up. It looks something like this:

import multiprocessing
import polars as pl

df_query = pl.scan_parquet('stackoverflow.parquet')

def write_split_df(id):
    # Each worker filters one id out of the LazyFrame and writes its own file.
    df = df_query.filter(pl.col('id') == id).collect()
    df.write_parquet(f'{id}.parquet')

def main():
    ids = df_query.select(pl.col('id').unique()).collect().get_column('id').to_list()
    mp = multiprocessing.get_context('spawn')
    with mp.Pool(10) as p:
        # Consume the iterator so every task finishes before the pool exits.
        for _ in p.imap_unordered(write_split_df, ids):
            pass

if __name__ == '__main__':
    main()

Answered By: Udit Ranasaria