Reading .xlsx file from a bytestring

Question:

I’m trying to read an attached .xlsx file from an e-mail.

I have been able to retrieve an email.message.Message object that has a part of type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. I should be able to read its payload using

file = part.get_payload(decode=True)

Which gives me a bytes object starting with

b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\x93\xe11\xb6\x93\x01\x00\x003\x07\x00\x00\x13\x00

I would like to parse this into a dictionary using

io.BytesIO(gzip.decompress(file))

For some e-mails with a zipped .csv file this works, but .xlsx files cannot be opened with this approach. I’ve looked online but have not been able to find a solution. Any help would be greatly appreciated.

Asked By: Nathan


Answers:

An .xlsx file is a ZIP archive, not a gzip archive. These are two completely different formats.
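
One way to see the difference is to look at the first few bytes of the payload. The sketch below is only an illustration: sniff_container and make_demo_zip are hypothetical helper names, and the check covers just the two signatures relevant here (gzip streams start with 1f 8b, non-empty ZIP archives start with PK\x03\x04, which matches the bytes shown in the question).

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"    # first two bytes of every gzip stream
ZIP_MAGIC = b"PK\x03\x04"   # local file header at the start of a non-empty ZIP


def sniff_container(data):
    """Guess the container format of a bytes object from its leading bytes."""
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def make_demo_zip():
    """Build a tiny in-memory ZIP archive as a stand-in for an .xlsx payload."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo.txt", "hello")
    return buffer.getvalue()


print(sniff_container(gzip.compress(b"a,b\n1,2\n")))
print(sniff_container(make_demo_zip()))
```

A payload reported as "zip" will make gzip.decompress fail, which is what happens with the .xlsx attachment.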

While you can use the zipfile module to get its contents, you’re still going to need some specialized package for Excel files to make sense of them.
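
As a rough sketch of the zipfile route, the following lists the members of an archive held in memory. The helper names are made up for this example, and the demo archive only mimics the usual layout of an .xlsx file with placeholder XML; it is not a valid workbook.

```python
import io
import zipfile


def list_archive_members(data):
    """Return the member names of a ZIP archive held in a bytes object."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def make_demo_archive():
    """Build an in-memory ZIP whose member names resemble those of an .xlsx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Placeholder content only; real workbooks contain full XML documents.
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return buffer.getvalue()


print(list_archive_members(make_demo_archive()))
```

With a real attachment you would pass the bytes from part.get_payload(decode=True) instead. The members are XML files, which is why a dedicated Excel package is still the practical choice for reading cell values.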

Answered By: ivan_pozdeev

Excel files come in compressed form and are automatically uncompressed when loaded into Excel itself.

The openpyxl library is able to directly load these Excel files, for example:

import openpyxl
import io

xlsx = io.BytesIO(part.get_payload(decode=True))
wb = openpyxl.load_workbook(xlsx)
ws = wb['Sheet1']

for row in ws.iter_rows(values_only=True):
    print(row)
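
Since the question asks for a dictionary, here is a minimal sketch that turns the row tuples into one dict per row. It assumes the first row of the sheet holds the column headers. rows_to_dicts and load_sheet_as_dicts are hypothetical helper names, and wb.active is used so the code does not depend on the sheet being called 'Sheet1'.

```python
import io


def rows_to_dicts(rows):
    """Convert an iterable of row tuples into a list of dicts.

    The first row is treated as the header (an assumption about the sheet).
    """
    rows = iter(rows)
    try:
        header = next(rows)
    except StopIteration:
        return []  # empty sheet
    return [dict(zip(header, row)) for row in rows]


def load_sheet_as_dicts(data, sheet_name=None):
    """Load .xlsx bytes with openpyxl and return the sheet rows as dicts."""
    import openpyxl  # third-party package, imported only when needed

    wb = openpyxl.load_workbook(io.BytesIO(data))
    ws = wb[sheet_name] if sheet_name else wb.active
    return rows_to_dicts(ws.iter_rows(values_only=True))


# The conversion itself works on any iterable of tuples:
print(rows_to_dicts([("name", "qty"), ("apple", 3), ("pear", 5)]))
```

For the attachment, the call would be load_sheet_as_dicts(part.get_payload(decode=True)).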

If you need extra information per cell:

for cells in ws.iter_rows():
    print([cell.value for cell in cells])

Answered By: Martin Evans

In your case,

import openpyxl
import io

# The bytes object (something like b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00...)
file = part.get_payload(decode=True)

xlsx = io.BytesIO(file)
wb = openpyxl.load_workbook(xlsx)
ws = wb['Sheet1']

for cells in ws.iter_rows():    
    print([cell.value for cell in cells])
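
If the message may carry several parts, a sketch like the following can pick out the .xlsx payloads before handing them to openpyxl. It uses only the standard library. find_xlsx_payloads and build_demo_message are hypothetical names, and the demo message carries a few placeholder bytes rather than a real workbook.

```python
from email import message_from_bytes
from email.message import EmailMessage

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def find_xlsx_payloads(msg):
    """Return the decoded bytes of every part whose content type is .xlsx."""
    payloads = []
    for part in msg.walk():
        if part.get_content_type() == XLSX_MIME:
            payloads.append(part.get_payload(decode=True))
    return payloads


def build_demo_message(data):
    """Create a message with one attachment and round-trip it through bytes."""
    msg = EmailMessage()
    msg["Subject"] = "report"
    msg.set_content("See attachment.")
    maintype, subtype = XLSX_MIME.split("/")
    msg.add_attachment(data, maintype=maintype, subtype=subtype,
                       filename="report.xlsx")
    return message_from_bytes(msg.as_bytes())


demo = build_demo_message(b"PK\x03\x04placeholder")
print(len(find_xlsx_payloads(demo)))
```

Each returned bytes object can then be wrapped in io.BytesIO and passed to openpyxl.load_workbook as shown above.
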

Answered By: Denis Biwott