Polars scan S3 multi-part parquet files

Question:

I have a multi-part, partitioned parquet dataset on S3. Each partition contains multiple parquet files. The code below narrows in on a single partition, which may contain around 30 parquet files. When I use scan_parquet on an S3 address that includes a *.parquet wildcard, it only looks at the first file in the partition. I verified this with the customer count: it matches the count from just the first file in the partition. Is there a way for it to scan across all the files?

import polars as pl

s3_loc = "s3://some_bucket/some_parquet/some_partion=123/*.parquet"
df = pl.scan_parquet(s3_loc)
cus_count = df.select(pl.count('customers')).collect()

If I leave the *.parquet off the S3 address, I get the following error:

exceptions.ArrowErrorException: ExternalFormat("File out of specification: A parquet file must containt a header and footer with at least 12 bytes")

Asked By: bvmcode


Answers:

It looks like, from the user guide section on dealing with multiple files, that doing this requires a loop creating many lazy DataFrames that you then combine together, as in the sketch below.

Another approach is to use the scan_ds function, which takes a pyarrow dataset object:

import polars as pl
import s3fs
import pyarrow.dataset as ds

fs = s3fs.S3FileSystem()
# you can also make a file system with anything fsspec supports
# S3FileSystem is just a wrapper for fsspec

# point the dataset at the partition directory (no wildcard);
# pyarrow discovers every parquet file under that prefix
s3_loc = "s3://some_bucket/some_parquet/some_partion=123"
myds = ds.dataset(s3_loc, filesystem=fs)

lazy_df = pl.scan_ds(myds)
cus_count = lazy_df.select(pl.count('customers')).collect()
Answered By: Dean MacGregor