Losing index information when using dask.dataframe.to_parquet() with partitioning

Question:

With dask 1.2.2 and pyarrow 0.11.1 I did not observe this behavior. After updating to dask 2.10.1 and pyarrow 0.15.1, I can no longer save the index when I call to_parquet with both the partition_on and write_index arguments. Here is a minimal example that shows the issue:

from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

REPORT_DATE_TEST = pd.to_datetime('2019-01-01').date()
path = Path('/home/ludwik/Documents/YieldPlanet/research/trials/')

observations_nr = 3
dtas = range(0, observations_nr)
rds = [REPORT_DATE_TEST - timedelta(days=days) for days in dtas]
data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [REPORT_DATE_TEST] * observations_nr,
    }).set_index('dta')

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

file_name = 'trial.parquet'
data_to_export_dask.to_parquet(path / file_name,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               write_index=True
                              )

data_read = dd.read_parquet(path / file_name, engine='pyarrow')
print(data_read)

Which gives the following (dta comes back as an ordinary column, and every row has a default index of 0):

|   | stay_date  | dta | report_date |
| 0 | 2019-01-01 | 2   | 2018-12-30  |
| 0 | 2019-01-01 | 1   | 2018-12-31  |
| 0 | 2019-01-01 | 0   | 2019-01-01  |

I did not see that described anywhere in the dask documentation.

Does anyone know how to save the index while partitioning the parquet data?

Asked By: Ludwik

Answers:

It might seem like an attempt to sidestep the question, but my suggestion would be to partition along the index. This would also ensure non-overlapping indexes in the partitions.

That is, call dd.from_pandas(data_to_export, npartitions=3) and then omit partition_on and write_index in to_parquet. The index has to be sorted for this to work.

This preserves the index and sets the divisions correctly.

Note that you are not guaranteed to get exactly the number of partitions you request with npartitions, especially not with small data sets.

Answered By: PerJensen

The issue was in the pyarrow backend. I filed a bug report on their JIRA:
https://issues.apache.org/jira/browse/ARROW-7782

As stated by pavithraes, this issue was fixed with pyarrow 1.0.0. Thanks for letting me know! 🙂

Answered By: Ludwik