Can two parquet files be compared?

Question:

I could not find an open source tool or library to compare two parquet files. Presuming I did not overlook the obvious, is there a technical reason for this?

What would a programmer need to consider before writing a parquet diff tool?

I am using Python.

Thank you.

Asked By: ziff


Answers:

The easiest option is to combine pandas with pyarrow. With both packages installed, use pandas.read_parquet (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_parquet.html) to load each Apache Parquet file into a pandas DataFrame, then compare the two resulting DataFrames with pandas.testing.assert_frame_equal.

Note that this compares the two resulting DataFrames, not the exact contents of the Parquet files. Not all Parquet types map 1:1 to pandas types, so some information, such as whether a column was a Date or a DateTime, is lost. In return, pandas offers a really good comparison infrastructure.

Alternatively, you could use Apache Arrow (the pyarrow package mentioned above): read each file into a pyarrow.Table and check the two tables for equality. This method preserves the type information much better, but it reports less detail about any differences it finds:

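import pyarrow.parquet as pq

# Read each Parquet file into an in-memory pyarrow.Table
table1 = pq.read_table('file1.parquet')
table2 = pq.read_table('file2.parquet')

# Table.equals returns a single boolean; it does not say what differs
assert table1.equals(table2)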
Answered By: Uwe L. Korn