Pandas read_parquet partially parses binary column

Question

I’m trying to read a parquet file that contains a binary column with multiple hex values, which causes problems when I read it with pandas. Pandas appears to convert some of the hex values to characters automatically but leaves others untouched, so the data is no longer really usable. When I read the same file with PySpark, all hex values are converted to decimal, and because that output is consistent, it is usable.

Any ideas why pandas parses this column differently, and how I can get the same output as Spark, or at least a consistent one (with no partial parsing applied)?

The code snippets and their outputs:

Pandas:

import pandas as pd

df = pd.read_parquet('data.parquet')

[Image: pd.read_parquet output]

Spark:

spark_df = spark.read.parquet("data.parquet")
df = spark_df.toPandas()

[Image: spark.read.parquet output]

Asked By: Vilks


Answer 1

Pandas is returning a byte string. Bytes that correspond to printable ASCII characters are displayed as those characters, and the rest are shown as escape sequences, but nothing is wrong with the data. For example:

x = bytes([1, 10, 100])  # x is shown as b'\x01\nd', where the last 'd' is an ASCII letter
list(x)  # get as a list of numbers: [1, 10, 100]
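As a slightly fuller sketch of the same point, the snippet below uses made-up sample bytes (not taken from the question's file) to show that the mixed display is only how Python prints a bytes object. The integer and hex views of the same value are fully consistent:

```python
# Hypothetical sample values; any bytes object behaves the same way.
raw = bytes([1, 10, 100, 255])

shown = repr(raw)     # how Python (and therefore pandas) displays the value
as_ints = list(raw)   # one integer in the range 0-255 per byte
as_hex = raw.hex()    # two hex digits per byte

print(shown)    # b'\x01\nd\xff'
print(as_ints)  # [1, 10, 100, 255]
print(as_hex)   # 010a64ff
```

Only byte 100 is printable ASCII ('d'), so it is the only one shown as a plain character. The underlying data is the same in all three views.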

To make your pandas DataFrame look like the Spark one, use:

df['BASE_PERIOD_VECTOR'].apply(list)
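A minimal sketch of that conversion on a small stand-in DataFrame. The column name is the one used above, but the values and the new column name BASE_PERIOD_VECTOR_INTS are made up for illustration, and the frame is built by hand rather than read from the parquet file:

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_parquet;
# the byte values here are invented for the example.
df = pd.DataFrame(
    {"BASE_PERIOD_VECTOR": [bytes([1, 10, 100]), bytes([0, 65, 255])]}
)

# apply(list) turns each bytes value into a list of integers.
# Storing the result in a new column keeps the original bytes available.
df["BASE_PERIOD_VECTOR_INTS"] = df["BASE_PERIOD_VECTOR"].apply(list)

print(df["BASE_PERIOD_VECTOR_INTS"].tolist())  # [[1, 10, 100], [0, 65, 255]]
```

Note that apply(list) returns a new Series and does not modify the column in place, so the result has to be assigned somewhere. If the column can contain missing values, they would likely need to be handled before calling list on each entry.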

Answered By: bzu
