The documentation for Dask talks about repartitioning to reduce overhead here.
However, it seems to require some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the data expected).
Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force=True to expand the number of partitions if need be. This one-size-fits-all approach works but is definitely suboptimal, as my dataset varies in size.
The data is time-series data, but unfortunately not at regular intervals. I've used repartitioning by time frequency in the past, but this would be suboptimal because of how irregular the data is (sometimes nothing for minutes, then thousands of records within seconds).
After discussion with mrocklin, a decent strategy for partitioning is to aim for 100 MB partition sizes, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the additional work this might involve can be mitigated by placing df.persist() at relevant points.
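A minimal sketch of that heuristic (the 100 MB target and the memory_usage call chain are from the discussion above; the helper name and the example size are mine):

```python
# Sketch: choose a partition count targeting ~100 MB per partition.
# On a real Dask dataframe, total_bytes would come from
# df.memory_usage(deep=True).sum().compute(); here we only do the arithmetic.

TARGET_PARTITION_BYTES = 100 * 1024**2  # ~100 MB

def partitions_for(total_bytes, target=TARGET_PARTITION_BYTES):
    """Return a partition count giving roughly `target` bytes each."""
    return max(1, -(-total_bytes // target))  # ceiling division, at least 1

# e.g. a hypothetical 2.5 GB dataframe -> 26 partitions of ~100 MB
print(partitions_for(int(2.5 * 1024**3)))  # -> 26
# then: df = df.repartition(npartitions=partitions_for(total_bytes))
```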
Just to add to Samantha Hughes’ answer:
memory_usage() by default ignores memory consumption of object dtype columns. For the datasets I have been working with recently this leads to an underestimate of memory usage of about 10x.
Unless you are sure there are no object dtype columns, I would suggest specifying deep=True, that is, repartitioning using:
```python
df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)
```
Where n is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1 (// performs floor division).
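To see the underestimate, compare the two modes on a frame with an object column (a pandas sketch; the ~10x figure above is dataset-dependent):

```python
import pandas as pd

# An object-dtype column: the shallow memory_usage() counts only the
# 8-byte pointers, not the string payloads they reference.
df = pd.DataFrame({"text": ["some reasonably long string"] * 10_000})

shallow = df.memory_usage().sum()        # pointers only
deep = df.memory_usage(deep=True).sum()  # includes the string bytes

print(shallow, deep)  # deep is substantially larger
```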
I tried to check what the optimal number is for my case.
I have 100 GB of CSV files with 250M rows and 25 columns, and I work on a laptop with 8 cores.
I ran the describe function with 1, 5, 30, and 1000 partitions:
```python
df = df.repartition(npartitions=1)
a1 = df['age'].describe().compute()

df = df.repartition(npartitions=5)
a2 = df['age'].describe().compute()

df = df.repartition(npartitions=30)
a3 = df['age'].describe().compute()

df = df.repartition(npartitions=1000)
a4 = df['age'].describe().compute()
```
About speed:
5 and 30 partitions: around 3 minutes
1 and 1000 partitions: around 9 minutes
But I found that "order" statistics like median or percentile give wrong numbers when I use more than one partition.
One partition gives the right number (I checked it on small data against both pandas and Dask).
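This is expected: with more than one partition, Dask computes quantiles approximately rather than exactly. A toy illustration of why quantiles don't combine naively across chunks (this is not Dask's actual algorithm, just the underlying pitfall):

```python
from statistics import median

# Two "partitions" with skewed value distributions.
chunk_a = [1, 1, 1, 1, 9]
chunk_b = [9, 9, 9, 9, 9]

# The exact median over all the data:
exact = median(chunk_a + chunk_b)  # -> 9.0

# Naively combining the per-chunk medians gives a different answer:
naive = median([median(chunk_a), median(chunk_b)])  # -> 5.0

print(exact, naive)
```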