How to pd.read_xml from zipfile with UTF-16 encoding?

Question:

I have a ZIP archive containing a number of XML files, which I would like to read into a pandas DataFrame. The XML files are UTF-16 encoded, so an extracted file can be read like this:

import pandas as pd

# works
with open("data1.xml", encoding='utf-16') as f:
    data = pd.read_xml(f)

# works
data = pd.read_xml("data1.xml", encoding='utf-16')

However, I cannot read the same file directly from the ZIP archive without manually extracting it first.

import zipfile
import pandas as pd

# does not work
with zipfile.open("data1.xml") as f:
    data = pd.read_xml(f, encoding='utf-16')

The problem seems to be the encoding, but I cannot manage to specify the UTF-16 correctly.

Many thanks for your help.

Asked By: CFW


Answers:

ZipFile.open reads in binary mode. To read the member as UTF-16 text, wrap it in an io.TextIOWrapper.

The example below assumes a test.zip file containing a UTF-16-encoded test.xml:

import zipfile
import pandas as pd
import io

# ZipFile.open returns a binary stream; wrap it so pandas receives decoded text.
with zipfile.ZipFile('test.zip') as z, z.open("test.xml") as f:
    t = io.TextIOWrapper(f, encoding='utf-16')
    data = pd.read_xml(t)
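
To see why the wrapper is needed, the difference between the raw stream and the wrapped stream can be checked with the standard library alone. This is a small sketch, not part of the original answer; the archive is built in memory and its XML content is made-up sample data:

```python
import io
import zipfile

# Build a small in-memory archive holding one UTF-16 XML file (sample data).
xml_text = "<data><row><a>1</a></row></data>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("test.xml", xml_text.encode("utf-16"))

with zipfile.ZipFile(buffer) as z:
    # ZipFile.open yields a binary stream: read() returns bytes, BOM included.
    with z.open("test.xml") as f:
        raw = f.read()
    # The wrapper decodes on the fly: read() returns str and the BOM is consumed.
    with z.open("test.xml") as f:
        text = io.TextIOWrapper(f, encoding="utf-16")
        decoded = text.read()

print(type(raw).__name__, raw[:2])
print(type(decoded).__name__, decoded == xml_text)
```

The first two bytes of raw are the byte order mark, whose value depends on the platform's byte order, while decoded matches the original string exactly.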

If the .zip file contains exactly one .xml file, pandas can also read it directly from the archive. This is documented under the compression parameter of pandas.read_xml:

data = pd.read_xml('test.zip', encoding='utf-16')
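
For an archive holding several XML files, as in the question, the single-file shortcut does not apply, but the TextIOWrapper approach can be repeated for each member. The following is a minimal sketch, assuming every .xml member is UTF-16 encoded and has the same layout; read_xml_members is a hypothetical helper name, not part of pandas:

```python
import io
import zipfile

import pandas as pd


def read_xml_members(zip_path, encoding="utf-16", **read_xml_kwargs):
    """Read every .xml member of a ZIP archive into a single DataFrame.

    Hypothetical helper: assumes all .xml members use the given encoding.
    Extra keyword arguments are forwarded to pd.read_xml.
    """
    frames = []
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip non-XML members
            with z.open(name) as f:
                # Decode the binary member stream before handing it to pandas.
                text = io.TextIOWrapper(f, encoding=encoding)
                frames.append(pd.read_xml(text, **read_xml_kwargs))
    # pd.concat raises ValueError if the archive had no .xml members.
    return pd.concat(frames, ignore_index=True)
```

A call such as read_xml_members("test.zip") should then return the rows of all members stacked in archive order. Keyword arguments like xpath or parser can be passed through if the files need them.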
Answered By: Mark Tolonen