does dask compute store results?

Question:

Consider the following code

import dask
import dask.dataframe as dd
import pandas as pd

data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]}
df_pd     = pd.DataFrame(data_dict) 
df_dask   = dd.from_pandas(df_pd,npartitions=2)

df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute()

print('-'*80)
print(df_dask['data1x2'])
print('-'*80)
print(df_dask['data1x2'].compute())
print('-'*80)

What I can't figure out is: why is there a difference between the output of the first and second print? After all, I called compute when I applied the function and stored the result in df_dask['data1x2'].

Asked By: Nachiket

Answers:

The first print will only show the lazy version of the dask series, df_dask["data1x2"]:

Dask Series Structure:
npartitions=2
0    int64
5      ...
9      ...
Name: data1x2, dtype: int64
Dask Name: getitem, 15 tasks

This shows the number of partitions, index values (if known), the number of tasks needed to get the final result, and some other information. At this stage, dask has not computed the actual series, so the values inside this dask series are not known. Calling .compute launches computation of the 15 tasks needed to get the actual values, and that's what is printed the second time.

Answered By: SultanOrazbayev

Dask does store results in memory on the workers or scheduler. But that's not what's driving the difference in the displayed results. The two are displayed differently because they are different types of objects.

df_dask['data1x2'] is a dask.dataframe.Series, which will only ever display a preview of the data structure and information about the number of tasks involved in calculating the values. Displaying any data requires at least moving data to the main thread, if not computation and possibly I/O, so dask will never do this unless explicitly asked to, e.g. with df.head().

df_dask['data1x2'].compute() is a pandas.Series. It no longer has anything to do with dask and is by definition in-memory. Since all pandas data structures are in memory, the data is displayed by default.

When you call compute on a dask object it ceases to be a dask object. In this case, the first compute returns a pandas series. When you assign a pandas series to a dask data frame, dask partitions and sends the data to the workers, and then can no longer display the whole series. So you have to call compute again if you’d like to see the series displayed.

Imagine how useful this would be if your whole data frame were too large to fit into memory, e.g. if you had 1000 columns and 10m rows. This is what dask is designed for.

Answered By: Michael Delgado