Can I sort the items in an Apache beam PCollection using python?

Question:

I need to perform an operation (transformation) that relies on the items being sorted, but so far I cannot find any trace of a "sorting" mechanism in Apache Beam.

My use case is not for live streams. I understand that it is pointless to talk about sorting when the data is live and/or infinite. This is an operation on an offline dataset.

Is this possible?

Asked By: Mehran

Answers:

As far as I can tell, this is not possible. Beam does not guarantee the order of items in a PCollection, which in turn means that you cannot sort one. At least, so far I have not found any way of doing this. It also makes sense logically: Beam uses the same model for stream processing and batch processing, and an unbounded stream cannot be fully sorted, so the conclusion is that Beam does not support sorting a PCollection at all.

Still, some use cases that seem to rely on sorting can be implemented without actually sorting the items. Mine was one of those.

To expand on my use case, I wanted to find the nth item in the list in order to bucketize the data. For example, to bucketize a dataset of 100 items into 4 bins, I need the 1st, 25th, 50th, 75th, and 100th items of the list so that all my bins have the same number of items in them.
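As a rough illustration of which positions are needed, here is a minimal plain-Python sketch (not Beam code). The helper name `boundary_ranks` is hypothetical and only serves this example; it assumes 1-based ranks and integer division for the boundaries.

```python
def boundary_ranks(total_items, num_bins):
    """Return the 1-based ranks of the items that delimit equal-sized bins.

    Hypothetical helper for illustration; it is not part of Apache Beam.
    """
    if total_items < 1 or num_bins < 1:
        raise ValueError("total_items and num_bins must both be positive")
    ranks = [1]  # the first item opens the first bin
    for i in range(1, num_bins + 1):
        # Upper boundary of the i-th bin, using integer division.
        ranks.append(max(1, i * total_items // num_bins))
    return ranks


print(boundary_ranks(100, 4))  # [1, 25, 50, 75, 100]
```

For 100 items and 4 bins this gives the five ranks mentioned above.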

Initially, I thought I would need to sort the list and take the mentioned items from it. And since Beam does not support sorting, that was impossible. But then I found another way of doing the same thing:

import apache_beam as beam


with beam.Pipeline() as p:
    all_items = (
        p
        | 'Create dummy data' >> beam.Create(list(range(100)))
    )

    item_1st = (
        all_items
        | '1st item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_1 for 1st' >> beam.FlatMap(lambda record: record)
    )

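    # Keep the 75 largest items, then take the smallest of those.
    # Top.Largest / Top.Smallest each emit a single list, so FlatMap
    # is used to unpack that list back into individual elements.
    item_25th = (
        all_items
        | '75 largest items' >> beam.combiners.Top.Largest(75)
        | 'FlatMap_1 for 25' >> beam.FlatMap(lambda record: record)
        | '25th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 25' >> beam.FlatMap(lambda record: record)
    )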

    item_50th = (
        all_items
        | '50 largest items' >> beam.combiners.Top.Largest(50)
        | 'FlatMap_1 for 50' >> beam.FlatMap(lambda record: record)
        | '50th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 50' >> beam.FlatMap(lambda record: record)
    )

    item_75th = (
        all_items
        | '25 largest items' >> beam.combiners.Top.Largest(25)
        | 'FlatMap_1 for 75' >> beam.FlatMap(lambda record: record)
        | '75th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 75' >> beam.FlatMap(lambda record: record)
    )

    item_100th = (
        all_items
        | '100th item' >> beam.combiners.Top.Largest(1)
        | 'FlatMap_1 for 100th' >> beam.FlatMap(lambda record: record)
    )

    _ = (
        (item_1st, item_25th, item_50th, item_75th, item_100th)
        | beam.Flatten()
        | 'All bins' >> beam.combiners.ToList()
        | beam.io.WriteToText('data/bins.txt')
    )

This code writes something like this to the output file:

[99, 0, 50, 75, 25]

There are a couple of notes to make here. First of all, as you can see, the final output contains the numbers we were expecting but in the wrong order. That’s because Beam does not guarantee the order of items in the output. Secondly, if you run the code yourself, you might get the same values but in a different order. That’s because the order of the items in Beam is not deterministic.
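To make the selection logic easier to follow, here is a plain-Python sketch (not Beam code) of what each branch of the pipeline computes. The helper `smallest_of_n_largest` is a hypothetical name for this example, and `heapq.nlargest` only stands in for `Top.Largest`; it is an analogy, not how Beam implements it. The last lines also show that the small result list can be put in order with ordinary Python once it has been collected.

```python
import heapq


def smallest_of_n_largest(items, n):
    """Plain-Python stand-in for Top.Largest(n) followed by Top.Smallest(1).

    Hypothetical helper for illustration; it is not part of Apache Beam.
    """
    return min(heapq.nlargest(n, items))


items = list(range(100))

# One entry per branch of the pipeline, in ascending order.
boundaries = [
    min(items),                        # '1st item'
    smallest_of_n_largest(items, 75),  # '25th item'
    smallest_of_n_largest(items, 50),  # '50th item'
    smallest_of_n_largest(items, 25),  # '75th item'
    max(items),                        # '100th item'
]
print(boundaries)  # [0, 25, 50, 75, 99]

# The unordered list from the sample output can be sorted like any
# other Python list once it is a single in-memory value.
unordered = [99, 0, 50, 75, 25]
print(sorted(unordered))  # [0, 25, 50, 75, 99]
```

This only orders a handful of boundary values held in one list; it does not sort the PCollection itself.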

In the end, I just want to point out that the code I provided is not an answer to my original question. The answer is that Beam does not support sorting. But, there might be some other way to achieve what you want to do. Still, if you are sure that sorting is necessary for your case, then, unfortunately, Beam is not going to be practical for you.

Answered By: Mehran