randomly accessing a row of Dask dataframe is taking a long time
Question:
I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe
to RAM.
For an experiment, I am trying to access the row with index equal to 1.
%time dask_df.loc[1].compute()
It took a whopping 8.88 s (wall time).
Why is it taking so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
```python
dask_df = dd.read_parquet(
    "/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip",
    chunksize=10000000,
)
```

Dask DataFrame Structure:

```
              avg_user_prod_aff_score internalItemID internalUserID
npartitions=1
                              float32          int16          int32
```

```python
len(dask_df)  # 100,000,000

%time dask_df.loc[1].compute()
```
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing time is actually very good, at around 2 minutes.
I must be doing something wrong here.
Answers:
Similarly to pandas, `dask_df[1]` would actually reference a column, not a row. So if you have a column named `1`, then you're just loading a single column from the whole frame. You can't access rows positionally: `df.iloc` only supports indexing along the second (column) axis. If your index contains the value `1`, you can select it with `df.loc`, e.g.:

```python
df.loc[1].compute()
```

See the dask.dataframe docs on indexing for more information and examples.
When performing `.loc` on an unindexed dataframe, Dask needs to decompress the full file. Since each partition has its own index, `.loc[N]` will check every partition for that `N`; see this answer.

One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. That way `.loc[N]` will only load information from the specific partition (or row group) that contains row `N`.
Use the `sample` method.

You're running into one of the necessary differences between Dask and pandas: the distributed index. It means that, at a minimum, `.loc` must check the metadata of every file, and you may also hit index values that occur in multiple (potentially index-unsorted) partitions.

Use `sample` if you need random data. `.loc` is for label-based lookup, and it is not the same as pandas' `loc`.