Dask will often have as many chunks in memory as twice the number of active threads – How to understand this?

Question:

I read the captioned sentence in dask’s website and wonder what it means. I have extracted the relevant part below for ease of reference:

A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).
While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.
You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.

Does it mean that the same chunk will co-exist on the mother node (or process, or thread?) and the child node? Is it not necessary to have the same chunk twice?

PS: I don’t quite understand the difference among node, process and thread so I just put all of them there.

Asked By: Ken T


Answers:

In many cases, a Dask graph will involve many more chunks than there are threads. This warning is noting that several of these chunks per worker might be in memory at the same time. For example, in the job:

avg = dask.array.random.random(
    size=(1000, 1000, 1000), chunks=(10, 1000, 1000)
).mean().compute()

there are 100 chunks, each of which is ~80MB in size, and you should anticipate roughly 80MB * 2 * the number of active threads to be in memory at once.
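Those numbers can be sanity-checked with plain arithmetic, no Dask required (this is just a sketch of the calculation, not anything Dask itself runs):

```python
from math import prod

shape = (1000, 1000, 1000)
chunks = (10, 1000, 1000)

# number of chunks along each axis, multiplied together (ceiling division
# handles shapes that don't divide evenly by the chunk size)
n_chunks = prod(-(-s // c) for s, c in zip(shape, chunks))

# float64 is 8 bytes per element
bytes_per_chunk = prod(chunks) * 8

print(n_chunks)               # 100 chunks
print(bytes_per_chunk / 1e6)  # 80.0 MB each
```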

Answered By: Michael Delgado

Answering this part:

I don’t quite understand the difference among node, process and thread so I just put all of them there.

  • a node is a computer machine. This can be a physical box somewhere, with a CPU, disks, etc. In the cloud, you likely have a "virtual machine" that runs on physical hardware that you don’t get to know about, but it still runs a single operating system kernel. Communication between nodes is via the network.

  • a container (you didn’t ask about this) is an isolated runtime on a node which takes up a specified amount of memory and CPU resources from the node (also called the "host") but shares the disk, network and GPU. Communication between containers is via the network, whether or not they are on the same node (it will be faster if they are). Kubernetes and YARN are examples of container frameworks. There may be several containers per node.

  • a process is a running instance of an executable. It may be within a container or not. It has its own isolated memory. A node will be running many processes, but a container typically runs one. dask-scheduler, dask-worker and your client session (ipython, jupyter, python…) are examples of processes. Dask processes communicate with other processes on the same machine using networking primitives (which still requires serialisation of data), although other possibilities exist.

  • threads are multiple execution points that might exist within a process. They share memory, so don’t need to copy anything between themselves, but not all operations in Python can run in parallel on threads, because of the global interpreter lock (GIL), which exists to make the single-threaded case safer and faster.
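The shared-memory point can be seen with a small stdlib sketch (the names here are illustrative): four threads update the same dictionary without copying anything between themselves, guarded by a lock because plain `+=` is not atomic across threads:

```python
import threading

counter = {"value": 0}
lock = threading.Lock()

def work():
    # every thread sees and mutates the same counter object
    for _ in range(1000):
        with lock:
            counter["value"] += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 4000
```

With processes instead of threads, each worker would get its own copy of `counter`, and the results would have to be serialised and combined explicitly.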

For Dask, the number of cores you can use is n_threads * n_processes. If you weight this more towards threads, you use memory more efficiently; if you weight it more towards processes, you get more parallelism for pure-Python (GIL-bound) code. Which is best depends on your workload.

Answered By: mdurant