Broadcast variables and mapPartitions

Question:

Context

In pySpark I broadcast a variable to all nodes with the following code:

import csv

from pyspark import SparkFiles

sc = spark.sparkContext  # Get the context

# Extract stopwords from a file in HDFS that was previously shipped
# to the cluster (e.g. via sc.addFile).
# The result looks like stopwords = {"and", "fu", "bar", ...}
stopwords = {line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), "r"))}

# Broadcast the set of stopwords
stopwords = sc.broadcast(stopwords)

After broadcasting the stopwords I want to make it accessible in mapPartitions:

# Some dummy-dataframe
df = spark.createDataFrame([("TESTA and TESTB",), ("TESTB and TESTA",)], ["text"])


# The method which will be applied via mapPartitions
def stopwordRemoval(partition, passed_broadcast):
    """
    Removes stopwords from the "text" column.

    @partition: iterator over the rows of one partition.
    @passed_broadcast: broadcast wrapper around the stopword set.
    """

    # Deserialize the broadcast here, i.e. inside the function that
    # runs on the executors
    passed_stopwords = passed_broadcast.value

    for row in partition:
        yield [" ".join(word for word in row["text"].split(" ") if word not in passed_stopwords)]


# Repartition so that mapPartitions operates on two partitions
df = df.repartition(2)

# Now apply the method
df = (df.select("text").rdd
        .mapPartitions(lambda partition: stopwordRemoval(partition, stopwords))
        .toDF(["text"]))

# Show the result
df.show()

# Output:
+-----------+
|       text|
+-----------+
|TESTA TESTB|
|TESTB TESTA|
+-----------+

Questions

Even though it works, I am not quite sure whether this is the right usage of broadcast variables. So my questions are:

  1. Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
  2. Is using a broadcast within mapPartitions useful at all, given that stopwords would be shipped to all nodes together with the function anyway (stopwords is never reused)?

The second question relates to this question, which partly answers my own. The specifics differ, however, which is why I have chosen to ask this question as well.

Asked By: Markus


Answers:

Some time has passed, and in the meantime I have read additional information that answered the question for me. I want to share these insights here.


Question 1: Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?

First, it is worth noting that SparkContext.broadcast() returns a wrapper around the variable to broadcast, as can be read in the docs. This wrapper serializes the variable and adds the information to the execution graph so that the serialized form is distributed over the nodes. Accessing the broadcast's .value attribute deserializes the variable again when it is used.
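As a minimal illustration of the wrapper (assuming an active SparkContext sc):

b = sc.broadcast({"and", "or", "the"})

print(type(b))   # <class 'pyspark.broadcast.Broadcast'> -- the wrapper, not the set
print(b.value)   # {'and', 'or', 'the'} -- .value deserializes the stored set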
Additionally, the docs state:

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.

Secondly, I found several sources stating that this works with UDFs (User Defined Functions), e.g. here. mapPartitions() and udf() can be considered analogous in this respect, since in PySpark both pass the data to a Python process on the respective nodes.
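For illustration, here is a minimal sketch of the same stopword removal expressed as a UDF. It assumes the broadcast stopwords and the DataFrame df from the question; the name remove_stopwords is my own:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The broadcast wrapper is captured in the closure; .value is
# called inside the UDF, i.e. on the executors
@udf(returnType=StringType())
def remove_stopwords(text):
    return " ".join(word for word in text.split(" ") if word not in stopwords.value)

df = df.withColumn("text", remove_stopwords("text"))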

Regarding this, here is the important part: deserialization has to be part of the Python function itself (the udf() or whatever function is passed to mapPartitions()), meaning the broadcast's .value must not be resolved beforehand and passed as a function parameter, as sketched below.
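A sketch of the distinction, reusing stopwordRemoval() from the question (stopwordRemovalPlain is a hypothetical variant that expects the plain set instead of the wrapper):

rdd = df.select("text").rdd

# Wrong: .value is resolved on the driver, so the whole deserialized
# set is pickled into the closure and the broadcast is bypassed
rdd.mapPartitions(lambda part: stopwordRemovalPlain(part, stopwords.value))

# Right: only the lightweight broadcast wrapper travels with the
# closure; .value is called inside stopwordRemoval() on the executors
rdd.mapPartitions(lambda part: stopwordRemoval(part, stopwords))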

Thus, the broadcast above is done the right way: the broadcast wrapper is passed as a parameter, and the variable is deserialized inside stopwordRemoval().


Question 2: Is using a broadcast within mapPartitions useful at all, given that stopwords would be shipped to all nodes together with the function anyway (stopwords is never reused)?

The documentation states that there is only an advantage if the task at hand actually benefits from the cached, serialized form:

The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

This might be the case when you have a large reference dataset to broadcast to your cluster:

[…] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

If this applies to your broadcast, broadcasting has an advantage.
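For example, a broadcast clearly pays off when the same lookup is reused across several jobs. A sketch building on the code above (the second job is purely illustrative):

# The same broadcast backs two jobs, so the stopword set is shipped
# to each executor at most once and then served from its local cache
clean = (df.select("text").rdd
           .mapPartitions(lambda part: stopwordRemoval(part, stopwords))
           .toDF(["text"]))

counts = clean.rdd.mapPartitions(
    lambda part: ([len(row["text"].split(" "))] for row in part)
).toDF(["n_words"])

clean.count()   # first job: executors fetch and cache the broadcast
counts.count()  # second job: the cached broadcast is reused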

Answered By: Markus