How to install external Python libraries in PySpark?

Question:

While writing some PySpark code, I needed to install a Python module called fuzzywuzzy (which I used to compute the Levenshtein distance).

This is a Python library, and it seems that PySpark doesn't have the module installed… so how can I install this module inside PySpark?

Asked By: Jeremy Sapienza


Answers:

You'd use pip as normal, with the caveat that Spark can run on multiple machines, so every machine in the Spark cluster (depending on your cluster manager) needs the same package (and the same version) installed. Once it is installed everywhere, you can use it like any other Python module, for example inside a UDF, as sketched below.
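A minimal sketch of that approach, assuming fuzzywuzzy has already been installed with pip on the driver and on every executor node; the DataFrame, column names, and app name here are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

spark = SparkSession.builder.appName("fuzzywuzzy-example").getOrCreate()

@udf(returnType=IntegerType())
def similarity(a, b):
    # fuzz.ratio returns a 0-100 similarity score based on Levenshtein distance
    return fuzz.ratio(a, b)

df = spark.createDataFrame([("apple", "appel"), ("spark", "sprak")], ["left", "right"])
df.withColumn("score", similarity("left", "right")).show()

Because the UDF runs on the executors, the import of fuzzywuzzy must succeed there as well, which is why the package has to be present on every node.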

Or you can pass .zip, .whl, or .egg files using the --py-files argument to spark-submit; they get unbundled and made importable during code execution (see the sketch below).
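A short sketch of that approach; the archive name deps.zip (assumed to contain fuzzywuzzy) and the script name my_job.py are hypothetical placeholders:

# Ship the dependency archive with the job at submit time, e.g.:
#   spark-submit --py-files deps.zip my_job.py
# or distribute it at runtime with SparkContext.addPyFile:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py-files-example").getOrCreate()
spark.sparkContext.addPyFile("deps.zip")  # hypothetical archive containing fuzzywuzzy

from fuzzywuzzy import fuzz  # import after the archive has been added
print(fuzz.ratio("apple", "appel"))

Either way, the modules inside the archive become importable on the executors without having to pip-install them on each machine.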

Answered By: OneCricketeer