Schema for pyarrow.ParquetDataset > partition columns

Question:

  1. I have a pandas DataFrame:
import pandas as pd

df = pd.DataFrame(data={"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]})
  2. Using s3fs:
from s3fs import S3FileSystem

s3fs = S3FileSystem(**kwargs)

  3. I can write this as a Parquet dataset:
import pyarrow as pa
import pyarrow.parquet as pq

tbl = pa.Table.from_pandas(df)
root_path = "../parquet_dataset/foo"

pq.write_to_dataset(
    table=tbl,
    root_path=root_path,
    filesystem=s3fs,
    partition_cols=["col3"],
    partition_filename_cb=lambda _: "data.parquet",
)
  4. Later, I need the pq.ParquetSchema for the dumped DataFrame:
import pyarrow as pa
import pyarrow.parquet as pq


dataset = pq.ParquetDataset(root_path, filesystem=s3fs)
schema = dataset.schema

However, dataset.schema does not include the partition columns (here, col3).

How do I get the schema for the partition columns?

Asked By: mishbah


Answers:

I think you need to give ParquetDataset a hint about the schema of the partition keys.

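import pyarrow as pa
import pyarrow.dataset as ds  # pa.dataset is only available after this import
import pyarrow.parquet as pq

# Declare the partition column's type. write_to_dataset creates hive-style
# "col3=foo" directories, hence flavor="hive".
partition_schema = pa.schema([pa.field("col3", pa.string())])
partitioning = ds.partitioning(schema=partition_schema, flavor="hive")

partitioned_dataset = pq.ParquetDataset(
    root_path,
    partitioning=partitioning,
)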

Which gives you this schema:

col1: int64
col2: double
col3: string

PS: I couldn’t completely reproduce your example (I don’t have access to S3) and I had to add use_legacy_dataset=False when writing and reading the dataset.

Answered By: 0x26res

It turns out I have to write the "_common_metadata" file explicitly.

table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table=table,
    root_path=path,
    filesystem=s3fs,
    partition_cols=partition_cols,
    partition_filename_cb=lambda _: "data.parquet",
)

# Write metadata-only Parquet file from schema
pq.write_metadata(
    schema=table.schema, where=path + "/_common_metadata", filesystem=s3fs
)

Docs: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files

I only care about the "common metadata", but you can also dump row-group statistics.

Answered By: mishbah