zipfile.BadZipFile: File is not a zip file when using "openpyxl" engine

Question:

I have created a script which dumps Excel sheets stored in S3 into my local Postgres database. I've used pandas' read_excel and ExcelFile methods to read the Excel sheets.
The code can be found below.

import boto3
import pandas as pd
import io
import os
from sqlalchemy import create_engine
import xlrd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxxxxxxxx"
s3 = boto3.client('s3')

obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
print(data.sheet_names)
a = len(data.sheet_names)

engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
for i in range(a):
    df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i], engine='openpyxl')
    df.to_sql("test"+str(i), engine1, index=False)

Basically, the code reads the file from the S3 bucket and loops over its sheets. For each sheet, it creates a table
and dumps the data from that sheet into the table.

Where I'm having trouble is: when I run this code, I get this error.

df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i-1], engine='openpyxl')
zipfile.BadZipFile: File is not a zip file

This started after I added the 'openpyxl' engine to the read_excel method. When I remove the engine, I get this error instead.

raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.

Please note that I can print the connection to the database, so there is no connectivity problem, and I'm using the latest versions of Python and pandas. I can also list all the sheet_names in the Excel file, so I'm able to reach that file as well.

Many Thanks!

Asked By: Suraj221b


Answers:

You are reading obj twice, in full:

  1. data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
  2. pd.read_excel(io.BytesIO(obj['Body'].read()), ...)

The body of your object can only be .read() once; the second read produces nothing, just an empty b"".
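That is exactly what the traceback is telling you: an .xlsx file is a zip archive, and openpyxl opens it with the standard zipfile module, so any payload that is not a valid zip archive (such as the empty bytes from a second read) fails with BadZipFile. A minimal standard-library sketch (the bogus bytes stand in for an exhausted or corrupted body):

```python
import io
import zipfile

# An .xlsx file is a zip archive under the hood; any payload that is
# not a valid zip archive fails with the error from the traceback.
bogus = io.BytesIO(b"x" * 100)  # stand-in for bad/exhausted body bytes
try:
    zipfile.ZipFile(bogus)
except zipfile.BadZipFile as exc:
    print(exc)  # File is not a zip file
```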

To avoid re-reading the S3 stream, store it once in a BytesIO, and rewind that buffer with seek between uses.

buf = io.BytesIO(obj["Body"].read())  # read the S3 body exactly once

data = pd.ExcelFile(buf)

buf.seek(0)  # rewind the buffer before the next read
df = pd.read_excel(buf, ...)

# repeat: seek(0) before each subsequent read
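The read-once behaviour can be demonstrated with just the standard library; OneShotStream below is a hypothetical stand-in for the S3 obj['Body'] stream:

```python
import io

# Hypothetical stand-in for the S3 body: a stream whose second .read()
# returns nothing, because the cursor is already at the end.
class OneShotStream:
    def __init__(self, payload: bytes) -> None:
        self._inner = io.BytesIO(payload)

    def read(self) -> bytes:
        return self._inner.read()

stream = OneShotStream(b"pretend xlsx bytes")
print(stream.read())  # b'pretend xlsx bytes'
print(stream.read())  # b'' -- the stream is exhausted

# The fix: buffer the payload once, then seek(0) between consumers.
buf = io.BytesIO(b"pretend xlsx bytes")
first = buf.read()
buf.seek(0)           # rewind to the start
second = buf.read()
assert first == second == b"pretend xlsx bytes"
```

This is why pd.ExcelFile succeeded (it got the real bytes) while the later pd.read_excel calls received an empty payload.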
Answered By: ljmc