How to access a GCS Blob that contains an XML file in a bucket with the pandas.read_xml() function in Python?
Question:
I would like to access a blob file via the pandas.read_xml() function.
Like this:
pandas.read_xml(blob.open())
When printing the blob, it looks like this:
<Blob: Bucket, filename.0.xml.gz, 1612169959288959>
The blob.open() function returns this:
<_io.TextIOWrapper encoding='iso-8859-1'>
and I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
When I change the code to blob.open(mode='rt', encoding='iso-8859-1'), I get the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Is there even a way to read an XML file from a bucket on GCS?
Answers:
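read_xml() can read files from GCS directly (reading gs:// paths requires the gcsfs package). Provide the GCS URI and it returns a DataFrame. The byte 0x8b in your error is part of the gzip magic number (0x1f 0x8b), so the file is gzip-compressed and has to be decompressed before it is parsed; passing compression="gzip" takes care of that. See the sample code and test below: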
Sample file stored in GCS:
<?xml version="1.0" encoding="UTF-8"?>
<root >
Code:
import pandas as pd
df = pd.read_xml("gs://my-bucket/note.xml.gz", compression="gzip")
print(df)
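If you would rather keep working with the Blob object from the question, one option is to fetch the raw bytes, decompress them yourself, and hand the result to read_xml(). The following is only a minimal sketch under some assumptions: read_gzipped_xml is a hypothetical helper name, the sample XML is made up for illustration (it is not the file from the question), and parser="etree" is used here only so the example does not depend on lxml.

```python
import gzip
import io

import pandas as pd


def read_gzipped_xml(raw: bytes, **read_xml_kwargs) -> pd.DataFrame:
    """Decompress gzip bytes and parse the XML payload into a DataFrame.

    Hypothetical helper, not part of pandas or google-cloud-storage.
    """
    xml_bytes = gzip.decompress(raw)
    # read_xml() accepts a binary file-like object, so wrap the bytes.
    return pd.read_xml(io.BytesIO(xml_bytes), **read_xml_kwargs)


# With a google-cloud-storage Blob, the raw bytes would typically come from:
#     raw = blob.download_as_bytes()
# Assumption: the object is stored without Content-Encoding: gzip metadata,
# so the download is still compressed (as the 0x8b error suggests).
# Here a small made-up document stands in for the bucket contents.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row><id>1</id><name>a</name></row>
  <row><id>2</id><name>b</name></row>
</root>"""
raw = gzip.compress(sample_xml)

df = read_gzipped_xml(raw, parser="etree")
print(df)
```

Opening the blob in text mode fails because the gzip stream is binary data, not encoded text, which is why both the UTF-8 and the iso-8859-1 attempts break. Reading bytes and decompressing first avoids that.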