randomly accessing a row of Dask dataframe is taking a long time
Question:
I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe
to RAM.
For an experiment, I am trying to access the row with index equal to 1.
%time dask_df.loc[1].compute()
It took a whopping 8.88 s (wall time).
Why is it taking so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
```python
dask_df = dd.read_parquet(
    "/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip",
    chunksize=10000000,
)
```

Dask DataFrame Structure:

```
              avg_user_prod_aff_score internalItemID internalUserID
npartitions=1
                              float32          int16          int32
```

```python
len(dask_df)  # 100,000,000

%time dask_df.loc[1].compute()
```
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing time is actually very good, at around 2 minutes.
I must be doing something wrong here.
Answers:
Similarly to pandas, `dask_df[1]` would actually reference a column, not a row. So if you have a column named `1`, then you're just loading a single column from the whole frame. You can't access rows positionally: `df.iloc` only supports indexing along the second (column) axis. If your index contains the value `1`, you can select it with `df.loc`, e.g.:

```python
df.loc[1].compute()
```

See the dask.dataframe docs on indexing for more information and examples.
When performing `.loc` on an unindexed dataframe, Dask needs to decompress the full file. Since each partition has its own index, `.loc[N]` will check every partition for that `N`; see this answer.

One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. That way `.loc[N]` will only load information from the specific partition (or row group) that contains row `N`.
Use the `sample` method.

You're running into one of the necessary differences between Dask and pandas: the distributed index. It means that, at a minimum, `.loc` must check the metadata of every file, and you may also hit index values that occur in multiple (potentially index-unsorted) partitions.

Use `sample` if you need random data. `.loc` is for label-based lookup, and it is not the same as pandas' `loc`.