cfg file not resolved when importing a Python library from a zip added to the path

Question:

I use Spark 2.4.0 + K8s cluster deployment mode + Python 3.5.

I pack all libraries into a zip archive, upload it to AWS S3, and then attach it to the context:

import pyspark

# Create the context, then ship the dependency archives to the executors.
sc = pyspark.SparkContext(appName=args.job_name, environment=environment)

sc.addPyFile('s3a://.../libs.zip')
sc.addPyFile('s3a://.../code.zip')
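For reference, building the archive itself looks roughly like this (a sketch; the local libs/ directory name is illustrative, and the upload to S3 happens separately, e.g. with the AWS CLI):

import shutil

# Pack an installed-dependencies directory, e.g. the target of
# `pip install -r requirements.txt -t libs/`, into libs.zip.
shutil.make_archive('libs', 'zip', 'libs')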

Imports work; I can import any package. But if I import a package that reads some files from package-relative folders, I get an error:

NotADirectoryError: [Errno 20] Not a directory: '/var/data/spark-ce45d34b-8d2f-4fd0-b3d6-d53ecede8ef1/spark-6ce9d14f-3d90-4c3c-ba2d-9dd6ddf32457/userFiles-08e6e9ec-03fa-447d-930f-bf1bd520f55a/libs.zip/airflow/config_templates/default_airflow.cfg'

How could I solve it?

P.S. Using sc.addFile('s3a:/..') and then unzipping does not work, because Spark is running in cluster mode.
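For clarity, that attempt looked roughly like this (a reconstruction, not the original code):

import os
import zipfile
from pyspark import SparkFiles

# Ship the archive as a plain file instead of a py-file.
sc.addFile('s3a://.../libs.zip')

# Resolve the local copy of the distributed file and unpack it next to itself.
local_zip = SparkFiles.get('libs.zip')
extract_dir = os.path.join(os.path.dirname(local_zip), 'libs')
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall(extract_dir)

# sys.path would then need to include extract_dir, but in cluster mode this
# approach did not work for me (see the note above).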

UPDATE:

I’ve temporarily solved this by installing all the packages I need into the Docker container I’m using for the Spark workers.

Asked By: ragelo


Answers:

Some pip-installed packages are not safe to run from a zip archive. For example, the Airflow v1.10.15 I was using was not zip-safe (I'm not sure about newer versions).
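To illustrate the difference (a minimal sketch with hypothetical names, not Airflow's actual code): a zip-unsafe package opens files through a filesystem path built from __file__, which breaks when the package lives inside libs.zip, while a zip-safe package reads bundled data through the package loader:

import os
import pkgutil

# Zip-unsafe pattern: builds a real filesystem path. Inside libs.zip this
# becomes '.../libs.zip/mypkg/templates/default.cfg', and open() fails with
# NotADirectoryError because libs.zip is a file, not a directory.
cfg_path = os.path.join(os.path.dirname(__file__), 'templates', 'default.cfg')
with open(cfg_path) as f:
    cfg_text = f.read()

# Zip-safe pattern: pkgutil.get_data goes through the package loader, so it
# can read a resource from a zip archive as well as from a normal directory.
# 'mypkg' and the resource path are hypothetical names.
cfg_bytes = pkgutil.get_data('mypkg', 'templates/default.cfg')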

Answered By: ragelo