Using Pandas in AWS Glue Python Shell Jobs

Question:

The AWS Documentation
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html

mentions that

The environment for running a Python shell job supports the following
libraries:

pandas (required to be installed via the python setuptools
configuration, setup.py)

But it does not explain how to perform the install.

How can I use Pandas in an AWS Glue Python Shell job?

Asked By: Hugo


Answers:

  1. Go to https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library and check the section
    "To create a Python .egg or .whl file" for how to create a setup file for a Python shell job.
  2. In the setup.py file, add the line install_requires=['pandas==0.25.1']:
from setuptools import setup

setup(name="<module name>",
        version="0.1",
        packages=['<package name if any or ignore>'],
        install_requires=['pandas==0.25.1']
    )

I also wrote a small shell script that deploys a Python shell job without the manual steps: it creates the egg file, uploads it to S3, and deploys via CloudFormation, all automatically.
You can find the code at https://github.com/fatangare/aws-python-shell-deploy

Answered By: Sandeep Fatangare

Just to clarify Sandeep’s answer, here is what worked for me:

1/ Ignore the AWS doc

2/ Create a setup.py file containing:

from setuptools import setup

setup(name="pandasmodule",
        version="0.1",
        packages=[],
        install_requires=['pandas==0.25.1']
    )

3/ Run this command in the folder containing the file:

python setup.py bdist_wheel

4/ Upload the .whl file to S3

5/ Set the “Python lib path” of your Glue ETL Job to the S3 path of the .whl file

You can now use “import pandas as pd” in your Glue ETL Job
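
Steps 2 to 4 could be scripted roughly as follows. This is only a sketch under assumptions: the helper names (render_setup_py, build_wheel, upload_wheel), the bucket and the key prefix are hypothetical, boto3 is assumed to be installed with AWS credentials configured, and the build step assumes the setuptools and wheel packages are available locally.

```python
import subprocess
import sys
from pathlib import Path


def render_setup_py(name, version, requirements):
    """Return the text of a minimal setup.py that only pins dependencies."""
    return (
        "from setuptools import setup\n"
        "\n"
        f"setup(name={name!r},\n"
        f"      version={version!r},\n"
        "      packages=[],\n"
        f"      install_requires={list(requirements)!r})\n"
    )


def build_wheel(project_dir):
    """Run `python setup.py bdist_wheel` in project_dir and return the .whl path."""
    subprocess.run(
        [sys.executable, "setup.py", "bdist_wheel"],
        cwd=project_dir,
        check=True,
    )
    # The wheel is written to the dist/ subfolder of the project.
    return next(Path(project_dir, "dist").glob("*.whl"))


def upload_wheel(wheel_path, bucket, key_prefix):
    """Upload the wheel and return the s3:// URI for the "Python lib path" field."""
    import boto3  # imported lazily; assumed to be installed and configured

    key = f"{key_prefix.rstrip('/')}/{Path(wheel_path).name}"
    boto3.client("s3").upload_file(str(wheel_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (hypothetical bucket and prefix):
#   Path("setup.py").write_text(
#       render_setup_py("pandasmodule", "0.1", ["pandas==0.25.1"]))
#   wheel = build_wheel(".")
#   print(upload_wheel(wheel, "my-bucket", "glue/libs"))
```

The printed s3:// URI is what would go into the “Python lib path” field in step 5.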

Answered By: Hugo

No need to do anything, just import pandas and start using it.

Answered By: user3267989

AWS Glue 2.0 supports pandas 1.0.1:
https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html

so in your script you can simply write: import pandas.
If you want to use another Python module that is not provided in Glue, you can download its .whl or .zip, store it in S3, and put its S3 path in the job's "Python library path". During a job run, Glue will then pip install the module.
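
Because the preinstalled pandas version (if any) depends on the Glue version and job type, a quick check at the top of the job script can show what is actually available. This is a minimal sketch using only the standard library; the function name installed_version is hypothetical, and it assumes Python 3.8 or later and that the distribution name matches the module name (true for pandas).

```python
from importlib import metadata, util


def installed_version(module_name):
    """Return the installed version string, or None if the module is not importable."""
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(module_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under that name.
        return "unknown"


print("pandas:", installed_version("pandas"))
```

If this prints None, pandas is not available in the job and has to be supplied through the "Python library path" as described above.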

Answered By: andrzejkuba

Using the Glue Python Shell, the following script works directly for pandas:

from setuptools import setup

setup(name="pandasmodule",
        version="0.1",
        packages=[],
        install_requires=['pandas==0.25.1']
    )

# use pandas
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s)

Answered By: Yufrend