Will dask map_partitions(pd.cut, bins) actually operate on entire dataframe?

Question:

I need to use pd.cut on a dask dataframe.

This answer indicates that map_partitions will work by passing pd.cut as the function.

It seems that map_partitions passes only one partition at a time to the function. However, pd.cut will need access to an entire column of my df in order to create the bins. So, my question is: will map_partitions in this case actually operate on the entire dataframe, or am I going to get incorrect results with this approach?
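For illustration, a minimal sketch with a made-up two-partition dataframe, showing that the function receives one pandas partition at a time:

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)

# len is called once per partition, so this prints two
# per-partition row counts rather than a single total
print(ddf.map_partitions(len).compute())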

Asked By: dgLurn


Answers:

In your question you correctly identify why the bins should be provided explicitly.

By specifying the exact bin edges (whether from some calculation or from external reasoning), you ensure that the binning dask performs is comparable across partitions.

# this does not guarantee comparable cuts: with an integer bins
# argument, each partition computes its own bin edges
ddf['a'].map_partitions(pd.cut, 4)

# this ensures the same explicit bin edges are used in every partition
ddf['a'].map_partitions(pd.cut, bins)
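To make the difference concrete, here is a small sketch; the two-partition frame and its values are made up purely for illustration:

import dask.dataframe as dd
import pandas as pd

# two partitions with very different value ranges
pdf = pd.DataFrame({'a': [1, 2, 3, 100, 200, 300]})
ddf = dd.from_pandas(pdf, npartitions=2)

# with an integer bins argument, each partition derives its own edges
for part in ddf['a'].to_delayed():
    print(pd.cut(part.compute(), 3).cat.categories)

# with explicit edges, every partition is cut the same way
bins = [0, 150, 300]
print(ddf['a'].map_partitions(pd.cut, bins).compute())

The loop prints different interval edges for each partition, while the explicit-bins version yields the same categories everywhere.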

If you want to generate the bins automatically, one option is to compute the min/max of the column of interest and build the edges with np.linspace:

import dask
import numpy as np

# note that computation is needed to give
# actual (not delayed) values to np.linspace
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())

# specify the number of desired bin edges here
# (num edges give num - 1 bins)
bins = np.linspace(bmin, bmax, num=123)
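To close the loop, a short sketch of feeding those edges back into map_partitions; ddf and the column name 'a' stand in for the real data, and include_lowest=True is one way to keep the global minimum inside the first bin, since pd.cut's intervals are open on the left by default:

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# a made-up frame standing in for the real data
pdf = pd.DataFrame({'a': range(10)})
ddf = dd.from_pandas(pdf, npartitions=3)

# global edges computed once, shared by every partition
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())
bins = np.linspace(bmin, bmax, num=5)

# include_lowest=True keeps the global minimum inside the first bin
binned = ddf['a'].map_partitions(pd.cut, bins, include_lowest=True)
print(binned.compute())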
Answered By: SultanOrazbayev