What do operators '>>' and '|' mean in this case?

Question:

import apache_beam as beam
with beam.Pipeline() as pipeline:
  lines = pipeline | 'ReadMyFile' >> beam.io.ReadFromText(
      'gs://some/inputData.txt')

What I know is that ‘>>’ means shift right and ‘|’ means bitwise OR. However, I do not understand what their purpose is here.

Asked By: Ahmed Jabareen

Answers:

This is part of the Transform API.

The | operator applies a transform to a pipeline or to a PCollection. It is shorthand for the apply() method.

The >> operator attaches a label to a transform: 'label' >> transform gives the transform that name within the pipeline.

The docs include this annotated example:

# Create a pipeline object using a local runner for execution.
with beam.Pipeline('DirectRunner') as p:

  # Add to the pipeline a "Create" transform. When executed this
  # transform will produce a PCollection object with the specified values.
  pcoll = p | 'Create' >> beam.Create([1, 2, 3])

  # Another transform could be applied to pcoll, e.g., writing to a text file.
  # For other transforms, refer to transforms/ directory.
  pcoll | 'Write' >> beam.io.WriteToText('./output')

  # run() will execute the DAG stored in the pipeline.  The execution of the
  # nodes visited is done using the specified local runner.

With respect to the >> operator, the docs say:

All the transforms applied to the pipeline must have distinct full
labels. If same transform instance needs to be applied then the right
shift operator should be used to designate new names (e.g. input |
"label" >> my_transform).

Note that these particular meanings are specific to the Apache Beam Transform API, which defines __or__ and __rrshift__ to mean something different from the usual Python bitwise-or and bitwise-right-shift. The only thing that stays the same is the operator precedence.
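The precedence point can be checked without Beam installed. The following is a minimal sketch that only parses the expression with the standard ast module; the names pipeline and transform are placeholders and are never evaluated.

```python
import ast

# Parse the expression without evaluating it; the names need not exist.
tree = ast.parse("pipeline | 'Label' >> transform", mode="eval")
top = tree.body  # outermost operation in the expression

# >> binds tighter than |, so | is the outer node and >> is nested on its right.
outer_op = type(top.op).__name__
inner_op = type(top.right.op).__name__
print(outer_op, inner_op)  # BitOr RShift

# Writing the parentheses explicitly produces the same syntax tree.
explicit = ast.parse("pipeline | ('Label' >> transform)", mode="eval")
same_grouping = ast.dump(tree) == ast.dump(explicit)
print(same_grouping)  # True
```

So the label is attached to the transform first, and the labeled transform is then applied with |.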

The action of the | is vaguely reminiscent of Unix pipelines; however, it is used like a method call. The __or__ method calls the apply() method. Per the docs, the action of apply() is to "Apply PTransforms to each PCollection. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection."
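As a rough illustration of how | can delegate to apply(), here is a toy sketch in plain Python. ToyCollection is a hypothetical stand-in, not a Beam class, and it simplifies a transform to a per-element function, which real PTransforms are not.

```python
class ToyCollection:
    """Hypothetical stand-in for a PCollection; not part of Apache Beam."""

    def __init__(self, items):
        self.items = tuple(items)  # immutable snapshot of the elements

    def apply(self, transform):
        # Build a new collection; self.items is never modified.
        return ToyCollection(transform(item) for item in self.items)

    def __or__(self, transform):
        # `collection | transform` is shorthand for collection.apply(transform).
        return self.apply(transform)


numbers = ToyCollection([1, 2, 3])
doubled = numbers | (lambda x: x * 2)

print(numbers.items)  # (1, 2, 3)  input is unchanged
print(doubled.items)  # (2, 4, 6)  output is a new collection
```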

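Conceptually, the >> operator is loosely reminiscent of its use in C++ as an "extraction operator". It is quite different in practice, though: in 'label' >> transform, the __rrshift__ method receives the label string and returns the transform with that label attached. The docstring "Used to apply this PTransform to non-PValues, e.g., a tuple." describes the related __ror__ method, which handles | when the transform is the right operand.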

Note the extra leading "r" in __rrshift__. It marks the reflected form of the operator, which Python calls on the right operand when the left operand does not support the operation. In the example, the left operand is a plain string, and str does not define __rshift__, so it is the transform on the right that decides what >> means.
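The reflected-operator fallback can be reproduced in plain Python. This is a toy sketch: ToyTransform is a hypothetical class, not Beam's PTransform, and it assumes that labeling returns a new object instead of changing the original.

```python
class ToyTransform:
    """Hypothetical stand-in for a PTransform; not part of Apache Beam."""

    def __init__(self, name, label=None):
        self.name = name
        self.label = label

    def __rrshift__(self, left):
        # Called for `left >> self` when type(left) cannot handle >> itself.
        # Return a labeled copy; the original transform is left unchanged.
        return ToyTransform(self.name, label=left)


read = ToyTransform("ReadFromText")
labeled = "ReadMyFile" >> read  # str has no __rshift__, so Python falls back

print(labeled.label)  # ReadMyFile
print(read.label)     # None

# The same call spelled out explicitly:
also_labeled = read.__rrshift__("ReadMyFile")
```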

Answered By: Raymond Hettinger

In many shell languages, | is the pipe operator. It says to take the output of the left side and make it the input of the right side. Likewise, >> is the redirection operator that designates where to put the output (appending it to a file).

The authors of the Transform API took these meanings and carried them over into Python.

The answer from Raymond Hettinger above gives better details. This is just an explanation of why.

Answered By: Frank Yellin