Is it possible to have one meta file for multiple parquet data files?

Question:

I have a process that generates millions of small dataframes and saves them to parquet in parallel.

All dataframes have the same columns and index information, and the same number of rows (about 300).

Because each dataframe is small, when it is saved to a parquet file the metadata is quite big compared with the data. Since the metadata of each parquet file is essentially the same, disk space is wasted because the same metadata is repeated millions of times.
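For reference, the footer overhead of one file can be measured like this (a minimal sketch, assuming pyarrow is installed; the file name is a placeholder):

```python
# Minimal sketch, assuming pyarrow; "small.parquet" stands in for one of the
# small per-dataframe files.
import os
import pyarrow.parquet as pq

path = "small.parquet"
footer_bytes = pq.ParquetFile(path).metadata.serialized_size  # thrift-encoded footer size
total_bytes = os.path.getsize(path)
print(f"footer: {footer_bytes} B of {total_bytes} B total")
```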

Is it possible to save one copy of the metadata and have the other parquet files contain only the data? When I need to read a dataframe, I would read the metadata and the data from two different files.

Some updates:

Concatenating them into one big dataframe can save disk space, but it is not friendly to parallel processing of each small dataframe.

I also tried other formats such as feather, but it seems that feather does not store data as efficiently as parquet: the file size is smaller, but it is still larger than parquet metadata + parquet data.

Asked By: Lei Yu


Answers:

This is not possible, at least using Python pandas (fastparquet and pyarrow don't have any such feature).

I do see a parameter that disables writing statistics to the footer.

However, this will not save you much disk space: only a few statistics-related fields will be omitted from the parquet metadata footer.
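For example (a minimal sketch, assuming the pyarrow engine, where the relevant option is write_statistics):

```python
# Minimal sketch, assuming pyarrow as the parquet engine.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": range(300), "b": [1.0] * 300})  # stand-in small frame
table = pa.Table.from_pandas(df)
# write_statistics=False omits the min/max/null-count statistics from the
# footer, but the schema and row-group/column-chunk metadata are still written.
pq.write_table(table, "small.parquet", write_statistics=False)
```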

Is it possible to save one copy of the metadata and have the other parquet files contain only the data?

What you want is to write multiple data files that contain only row groups (no footer) and a single metadata file that contains only the footer. In that case, none of those files would be a valid parquet file. This should be theoretically possible, but no known implementation exists. Check out the comments on this thread, and maybe reach out to the parquet community on Slack to find out whether any such implementation exists.

My suggestion would be to combine the dataframes somehow before writing them to parquet on disk, or to run a job at a later stage that merges these files. Neither option is very efficient, since you have a huge number of small dataframes/files.
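For the first option, one way (a minimal sketch, assuming pyarrow and that all dataframes share the same schema, which the question states) is to append each small dataframe as a row group of a single parquet file. The footer is then written only once, and individual row groups can still be read back independently and in parallel:

```python
# Minimal sketch, assuming pyarrow and same-schema dataframes.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_combined(dfs, path):
    """Write an iterable of same-schema dataframes as row groups of one file."""
    writer = None
    try:
        for df in dfs:
            table = pa.Table.from_pandas(df)
            if writer is None:
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)  # each call adds one row group
    finally:
        if writer is not None:
            writer.close()

def read_one(path, i):
    """Read back the i-th small dataframe without loading the whole file."""
    return pq.ParquetFile(path).read_row_group(i).to_pandas()
```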

You could also write the 300 rows of each dataframe into some kind of intermediate database, and convert to parquet later.
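For example (a minimal sketch, assuming SQLite is acceptable as the staging store; the file name and table name are placeholders):

```python
# Minimal sketch, assuming SQLite as the intermediate store; "staging.db"
# and the "frames" table are placeholder names.
import sqlite3
import pandas as pd

def stage(df: pd.DataFrame, db_path: str = "staging.db") -> None:
    """Append one small dataframe (about 300 rows) to the staging table."""
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql("frames", conn, if_exists="append", index=False)  # index handling simplified
    finally:
        conn.close()

def convert(db_path: str = "staging.db", out_path: str = "frames.parquet") -> None:
    """Later job: dump the whole staging table into a single parquet file."""
    conn = sqlite3.connect(db_path)
    try:
        pd.read_sql("SELECT * FROM frames", conn).to_parquet(out_path)
    finally:
        conn.close()
```

Note that SQLite serializes concurrent writers, so with heavy parallel inserts a different database may be a better fit.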

Answered By: shadow0359