How to access a GCS blob that contains an XML file in a bucket with the pandas.read_xml() function in Python?

Question:

I would like to access a blob file via the pandas.read_xml() function.
Like this:

pandas.read_xml(blob.open())

When printing the blob it looks like this:

<Blob: Bucket, filename.0.xml.gz, 1612169959288959>

The blob.open() function gives this:

<_io.TextIOWrapper encoding='iso-8859-1'>

and I get the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. When I change the code to blob.open(mode='rt', encoding='iso-8859-1'), I get the error lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1.

Is there even a way to read an XML file from a bucket on GCS?

Asked By: julez8000

||

Answers:

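read_xml() can read GCS files directly: pass it the GCS URI and it returns a DataFrame. Reading gs:// paths requires the gcsfs package to be installed. Your file is gzip-compressed (the byte 0x8b in your error is the second byte of the gzip magic number 0x1f 0x8b), so it has to be decompressed before parsing, here via compression="gzip". See the sample code and test below: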

Sample file stored in GCS:

<?xml version="1.0" encoding="UTF-8"?>
<root >

Sample code:

import pandas as pd

df = pd.read_xml("gs://my-bucket/note.xml.gz", compression="gzip")

print(df)

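If you already hold a Blob object, as in the question, and would rather not go through a gs:// URI, a rough alternative is to open the blob in binary mode, decompress the bytes yourself, and hand the result to read_xml(). The sketch below is only an illustration under some assumptions: the helper name read_xml_maybe_gzipped is made up, an in-memory BytesIO stands in for blob.open("rb") so the snippet runs without GCS access, and parser="etree" is used only to avoid needing lxml installed.

```python
import gzip
import io

import pandas as pd

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def read_xml_maybe_gzipped(binary_file, **read_xml_kwargs):
    """Hypothetical helper: parse XML from a binary file object.

    If the payload starts with the gzip magic number it is decompressed
    first; otherwise the bytes are parsed as they are.
    """
    raw = binary_file.read()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return pd.read_xml(io.BytesIO(raw), **read_xml_kwargs)


# With a real blob this would look roughly like (assumption, not run here):
#     with blob.open("rb") as f:
#         df = read_xml_maybe_gzipped(f)

# Local stand-in for the blob: a small gzip-compressed XML document in memory.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row><id>1</id><name>a</name></row>
  <row><id>2</id><name>b</name></row>
</root>"""
fake_blob_file = io.BytesIO(gzip.compress(sample_xml))

df = read_xml_maybe_gzipped(fake_blob_file, parser="etree")
print(df)
```

Opening the blob in text mode is what triggers the errors in the question, because the compressed bytes get decoded as text before any XML parser sees them. Reading in binary mode and decompressing first avoids that.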

Answered By: Ricco D