How to use tabula in AWS Lambda to read a PDF table
Question:
Hello, I get the following error while trying to use tabula to read a table in a PDF.
I was aware of some of the difficulties (here) of using this package with AWS Lambda, so I zipped the tabula package on an EC2 instance (Ubuntu 20.02) and added it as a layer to the function.
Many thanks in advance!
{
  "errorMessage": "`java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`",
  "errorType": "JavaNotFoundError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 39, in lambda_handler\n    df = tabula.read_pdf(BytesIO(fs), pages=\"all\", area = [box],\n",
    "  File \"/opt/python/lib/python3.8/site-packages/tabula/io.py\", line 420, in read_pdf\n    output = _run(java_options, tabula_options, path, encoding)\n",
    "  File \"/opt/python/lib/python3.8/site-packages/tabula/io.py\", line 98, in _run\n    raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)\n"
  ]
}
Code
import boto3
import tabula
from io import BytesIO

def lambda_handler(event, context):
    client = boto3.client('s3')
    s3 = boto3.resource('s3')
    # Get most recent file name
    response = client.list_objects_v2(Bucket='S3bucket')
    objects = response['Contents']
    latest = max(objects, key=lambda x: x['LastModified'])
    latest_key = latest['Key']
    # Get file
    obj = s3.Object('S3bucket', latest_key)
    fs = obj.get()['Body'].read()
    # Read PDF
    # Table area in inches, converted to PDF points (72 points per inch)
    box = [3.99, .22, 8.3, 7.86]
    fc = 72
    for i in range(0, len(box)):
        box[i] *= fc
    df = tabula.read_pdf(BytesIO(fs), pages="all", area=[box], output_format="dataframe", lattice=True)
Answers:
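Tabula’s Python package is just a wrapper around Java code. A reference to the package is here.
Java 8+ must be installed for this to work. A layer that contains only the Python package is not enough, because tabula still needs to launch the `java` command, which is why the call fails with JavaNotFoundError. Your best bet is to build a Docker container image in which your script works and deploy that image as a Lambda function.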
AWS has a good walkthrough that might help.
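To confirm whether a given environment (a local container or the Lambda function itself) can see Java before calling tabula, a small check like the one below may help. It is only a sketch: it assumes tabula locates Java by looking up the `java` command on PATH, as the error message suggests, and the helper names `find_java` and `java_version` are made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def find_java(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the `java` executable, or None if it is not found.

    search_path is a hypothetical override used for testing; None means
    the lookup uses the PATH environment variable of this process.
    """
    return shutil.which("java", path=search_path)


def java_version(java_path: str) -> str:
    """Return the raw output of `java -version` (the JVM prints it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    path = find_java()
    if path is None:
        # Same condition that makes tabula raise JavaNotFoundError
        print("java not found on PATH")
    else:
        print(f"java found at {path}")
        print(java_version(path))
```

Calling `find_java()` at the top of the handler and logging the result makes it easy to tell a missing Java installation apart from a PATH problem.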
Here is the Dockerfile that ultimately worked and allowed me to run tabula in my Lambda function:
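# Base image: the FROM line is missing from the post as written. An AWS Lambda
# Python base image is assumed here because the steps below use yum and a
# handler-style CMD; adjust the tag to your runtime.
FROM public.ecr.aws/lambda/python:3.8
ARG FUNCTION_DIR="/var/task/"
COPY ./ ${FUNCTION_DIR}
# Install OpenJDK
RUN yum install -y java-1.8.0-openjdk
# Set up Python environment
# Install Python requirements
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy function code to container
COPY app.py ./
# Handler in the form <module>.<function>, i.e. handler() in app.py
CMD [ "app.handler" ]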