Pandas read_parquet partially parses binary column
Question:
I’m trying to read a Parquet file that contains a binary column with multiple hex values, which causes issues when reading it with Pandas. Pandas converts some of the hex values to characters but leaves others untouched, so the data is no longer really usable. When reading it with PySpark, every hex value is converted to decimal, and since that output is consistent, it’s usable.
Any ideas why Pandas parses this column differently, and how I can get the same output as Spark, or at least a consistent one (no partial parsing applied)?
The code snippets and returned outputs:
Pandas:
df = pd.read_parquet('data.parquet')
pd.read_parquet output: [screenshot]
Spark:
spark_df = spark.read.parquet("data.parquet")
df = spark_df.toPandas()
spark.read.parquet output: [screenshot]
Answers:
Pandas is returning a byte string. Some bytes happen to be displayed as ASCII characters, but nothing is wrong with the data. For example:
x = bytes([1, 10, 100])  # x is shown as b'\x01\nd': \x01 is a hex escape, \n is byte 10, and 'd' is the ASCII character for byte 100
list(x)  # [1, 10, 100], gets the raw values back as a list of numbers
To convert your Pandas dataframe to look like the Spark one, use:
df['BASE_PERIOD_VECTOR'].apply(list)
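Putting it together, here is a minimal self-contained sketch of the conversion, assuming a binary column named BASE_PERIOD_VECTOR as in the question (the sample byte values are made up for illustration):

```python
import pandas as pd

# A small DataFrame standing in for the result of pd.read_parquet:
# the binary column holds Python bytes objects.
df = pd.DataFrame({'BASE_PERIOD_VECTOR': [bytes([1, 10, 100]), bytes([0, 2, 255])]})

# Iterating over a bytes object yields integers, so list() turns each
# value into a list of numbers, matching what Spark's toPandas() shows.
df['BASE_PERIOD_VECTOR'] = df['BASE_PERIOD_VECTOR'].apply(list)

print(df['BASE_PERIOD_VECTOR'].tolist())  # [[1, 10, 100], [0, 2, 255]]
```

This does not change the underlying data read from the Parquet file; it only re-expresses each byte string as a list of integer byte values so the display is consistent.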