Reading .xlsx file from a bytestring
Question:
I’m trying to read an attached .xlsx file from an e-mail.
I have been able to retrieve an email.message.Message object which has a part of type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. I should be able to read it using
file = part.get_payload(decode=True)
Which gives me a bytes object starting with
b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\x93\xe11\xb6\x93\x01\x00\x003\x07\x00\x00\x13\x00
I would like to parse this into a dictionary using
io.BytesIO(gzip.decompress(file))
For some e-mails with a zipped .csv file this works, but .xlsx files cannot be opened with this approach. I’ve looked online but have not been able to find a solution. Any help would be greatly appreciated.
Answers:
An .xlsx file is a ZIP archive rather than a GZip one. These are two completely different formats.
While you can use the zipfile module to get its contents, you’re still going to need a specialized package for Excel files to make sense of them.
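One way to tell the two formats apart before picking a decompressor is to look at the leading bytes: a ZIP archive normally starts with b'PK\x03\x04' (as in the payload from the question), while a gzip stream starts with b'\x1f\x8b'. A minimal sketch using only the standard library; the helper names sniff_format and list_zip_members are made up for illustration:

```python
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess the container format from the leading magic bytes."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    if data[:4] == b"PK\x03\x04":
        return "zip"
    return "unknown"


def list_zip_members(data: bytes) -> list[str]:
    """Return the names of the files stored in a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()
```

For an .xlsx payload the member list contains entries such as [Content_Types].xml and xl/workbook.xml, which is why a spreadsheet-aware library is still needed to interpret them.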
Excel files come in compressed form and are automatically uncompressed when loaded into Excel itself.
The openpyxl library is able to load these Excel files directly, for example:
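import io
import openpyxl

# Wrap the decoded attachment bytes in a file-like object
xlsx = io.BytesIO(part.get_payload(decode=True))
wb = openpyxl.load_workbook(xlsx)
ws = wb['Sheet1']  # or wb.active if the sheet name is not known
for row in ws.iter_rows(values_only=True):
    print(row)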
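Since the goal in the question was a dictionary, here is a rough sketch of turning the row tuples into one dict per row. It assumes the first row of the sheet holds the column headers, which may not be true for your file; rows_to_dicts is a hypothetical helper, not part of openpyxl:

```python
def rows_to_dicts(rows):
    """Turn an iterable of row tuples into a list of dicts keyed by the first row."""
    rows = iter(rows)
    header = next(rows, None)  # assumption: the first row contains the column names
    if header is None:
        return []  # empty sheet
    return [dict(zip(header, row)) for row in rows]


# Stand-in for ws.iter_rows(values_only=True), which yields one tuple per row
sample_rows = [("name", "qty"), ("apple", 3), ("pear", 5)]
records = rows_to_dicts(sample_rows)
```

With a loaded worksheet this would be called as rows_to_dicts(ws.iter_rows(values_only=True)).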
If you need extra information per cell, leave out values_only so that each row is a tuple of cell objects:
for cells in ws.iter_rows():
    print([cell.value for cell in cells])
In your case,
import io
import openpyxl

# The bytes object (something like b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00...)
file = part.get_payload(decode=True)
xlsx = io.BytesIO(file)
wb = openpyxl.load_workbook(xlsx)
ws = wb['Sheet1']
for cells in ws.iter_rows():
    print([cell.value for cell in cells])
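The snippets above take part as given. If it helps, here is a sketch of how the spreadsheet parts might be collected from the message using the standard library's email package. The function name find_xlsx_payloads is an assumption for illustration, and matching on the content type alone may miss attachments that a mail client labelled differently (for example as application/octet-stream):

```python
from email.message import Message

XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def find_xlsx_payloads(msg: Message) -> list[bytes]:
    """Return the decoded bytes of every part whose content type is .xlsx."""
    payloads = []
    for part in msg.walk():  # walk() visits the message and all nested parts
        if part.get_content_type() == XLSX_TYPE:
            # decode=True undoes the base64 / quoted-printable transfer encoding
            payloads.append(part.get_payload(decode=True))
    return payloads
```

Each returned bytes object can then be wrapped in io.BytesIO and passed to openpyxl.load_workbook as shown above.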