Data locality via many queues in Celery?
Question:
We’re trying to design a distributed pipeline that crunches large numbers of data chunks in parallel. We’re moving towards adopting Celery, but one of the requirements is that we need to be able to map certain jobs to certain nodes in the cluster, e.g. if only one node has access to a certain data chunk.
The first answer that comes to mind is multiple queues, potentially even one queue per node, for a large (~64) number of nodes. Is this feasible, and efficient? Are Celery queues lightweight? Is there a better way?
Answers:
The best answer I’ve found to date is here:
Is Celery appropriate for use with many small, distributed systems?
That answer suggests Celery is indeed a good fit for this use case. Perhaps I’ll update again once we’ve implemented it.
There is an old (2012) feature request to leverage data locality (dropped in 2016).
The suggested way to achieve data locality is to give each worker its own dedicated queue by enabling the worker_direct setting (disabled by default, at least as of Celery 5.2), and then routing a task to a specific worker's direct queue with the worker_direct helper.