Disable pure function assumption in dask distributed

Question:

The Dask distributed library documentation says:

By default, distributed assumes that all functions are pure.
[…]
The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it.

When benchmarking function runtimes, this caching behavior will get in the way, as we want to call the same function multiple times with the same inputs.
So is there a way to completely disable it?

I know that for submit and map there is an argument available. But for computations on dask collections I have not found any solution.
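For reference, the per-call escape hatch mentioned above is the `pure=False` keyword of `Client.submit()` (and `Client.map()`). A minimal sketch, assuming `distributed` is installed and using an in-process client for demonstration:

```python
from distributed import Client

client = Client(processes=False)  # in-process cluster, just for demonstration

def inc(x):
    return x + 1

# pure=False gives each call a unique key, so the scheduler
# does not reuse a cached result for repeated identical calls
f1 = client.submit(inc, 1, pure=False)
f2 = client.submit(inc, 1, pure=False)
assert f1.key != f2.key
assert f1.result() == f2.result() == 2
client.close()
```

This works per task, but there is no equivalent keyword when computing whole dask collections, which is what the question is about.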

Asked By: tierriminator


Answers:

After some digging in the source code of distributed, I believe I have found the answer myself, though someone may correct me if I have come to the wrong conclusion.

Short answer

It is not possible to globally disable the purity assumption in distributed.
However, for dask collections it is possible to separate computations from precomputed results with dask.graph_manipulation.clone().

Long answer

Internally, dask splits its computation up into labelled tasks.
A task label is called a "key" and is used to identify results from a computation (an execution of a task). Keys are used to identify dependencies between tasks and are therefore essential for how dask works.
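The key/graph structure described above can be inspected directly. A small sketch, assuming dask is installed:

```python
import dask

@dask.delayed
def add(a, b):
    return a + b

total = add(1, 2)

# Every task in the graph is labelled by a key; dependencies between
# tasks are expressed in terms of these keys.
graph = dict(total.__dask_graph__())
assert total.key in graph
print(total.key)  # a key of the form "add-<token>"
```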

When we submit a new computation graph (essentially a list of tasks with their dependencies) to the scheduler in distributed, the scheduler checks whether some tasks have already been computed by comparing their keys against the keys of finished tasks, which the scheduler still holds.
This happens near the beginning of Scheduler.update_graph(), the method called by the client when it wants to start a new computation.
There is no switch in the current implementation to disable this. The calls to plugin.update_graph() for the registered scheduler plugins also happen after this optimization phase, so we cannot regulate this behavior through plugins either.

So what can we do?
By manually modifying the keys of the individual tasks in the graph, we can trick the scheduler into thinking that we have not yet computed this task.
Task keys usually have the format prefix-token, where the prefix is the original task name (e.g. function name) and the token is a hash built from the arguments of the task.
Distributed uses the task prefix to group tasks together and to estimate future runtimes.
The token is primarily used to identify different executions of the same task with different arguments.
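The prefix-token format can be seen on any task key. A sketch, assuming dask is installed; `dask.base.tokenize` is the hashing function dask uses to build deterministic tokens:

```python
import dask
from dask.base import tokenize

@dask.delayed(pure=True)
def inc(x):
    return x + 1

d = inc(41)
prefix, _, token = d.key.rpartition("-")
assert prefix == "inc"  # the prefix comes from the function name

# tokenize is deterministic: equal inputs produce equal tokens,
# which is exactly why identical pure calls collide on one key
assert tokenize(41) == tokenize(41)
```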
So we can simply adjust the token of the key to make dask think we are running the task with different arguments.
This is, in principle, what dask.graph_manipulation.clone() does for us: it copies a dask collection and returns a new one in which the keys of the tasks in the internal graph are rewritten.
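Applied to the original benchmarking use case, this suggests cloning the collection once per run so that each run carries fresh keys. A sketch, assuming dask is installed; the `seed` argument of clone() (part of its public signature) is used here only to make the run deterministic:

```python
import time
import dask
from dask.graph_manipulation import clone

@dask.delayed(pure=True)
def work(x):
    time.sleep(0.01)  # stand-in for the workload being benchmarked
    return x * 2

base = work(5)

# Each clone carries fresh keys, so a distributed scheduler would
# execute every run instead of serving the first result from memory.
runs = [clone(base, seed=i) for i in range(3)]
assert len({r.key for r in runs}) == 3  # all keys distinct
assert all(r.compute() == 10 for r in runs)
```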

Answered By: tierriminator