Data locality via many queues in Celery?

Question:

We’re trying to design a distributed pipeline that processes large numbers of data chunks in parallel. We’re moving towards adopting Celery, but one of the requirements is that we need to be able to map certain jobs to certain nodes in the cluster, e.g. if only one node has access to a particular data chunk.

The first answer that comes to mind is multiple queues, potentially even one queue per node, for a large (~64) number of nodes. Is this feasible and efficient? Are Celery queues lightweight? Is there a better way?
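For concreteness, here is a minimal sketch of the one-queue-per-node idea, assuming a Redis broker; the app, queue, and task names (pipeline, node-17, process_chunk) are purely illustrative and not part of the original question:

    from celery import Celery

    # Hypothetical app; the broker URL and all names are placeholders.
    app = Celery("pipeline", broker="redis://localhost:6379/0")

    @app.task
    def process_chunk(chunk_id):
        # Placeholder for the real per-chunk processing.
        return f"processed {chunk_id}"

    # Send a job to the node holding the chunk by naming its queue at call time;
    # by default Celery creates missing queues on demand.
    process_chunk.apply_async(args=["chunk-42"], queue="node-17")

    # Each node then runs a worker that consumes only its own queue, e.g.:
    #   celery -A pipeline worker -Q node-17 -n node17@%h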

Asked By: TimStaley


Answers:

The best answer I’ve found to date is here:

Is Celery appropriate for use with many small, distributed systems?

That post suggests Celery is indeed a good fit for this use case. Perhaps I’ll update again once we’ve implemented it.

Answered By: TimStaley

There is an old (2012) feature request to leverage data locality (dropped in 2016).

The suggested way to exploit data locality is to give each worker its own queue by enabling the worker_direct setting (disabled by default, at least on Celery 5.2) and then routing tasks to a specific worker with the worker_direct helper.
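A minimal sketch of that approach, assuming Celery 5.x, a Redis broker, and a hypothetical worker hostname; the app and task names below are illustrative only:

    from celery import Celery
    from celery.utils.nodenames import worker_direct

    app = Celery("pipeline", broker="redis://localhost:6379/0")
    # Give every worker its own dedicated queue in addition to the normal ones.
    app.conf.worker_direct = True

    @app.task
    def process_chunk(chunk_id):
        # Placeholder for the real per-chunk processing.
        return f"processed {chunk_id}"

    # worker_direct() returns the kombu Queue that routes straight to the named
    # worker, so the task runs on the node that holds the data chunk.
    process_chunk.apply_async(
        args=["chunk-42"],
        queue=worker_direct("celery@node17"),
    )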

Answered By: swimmer