parquet

Does any Python library support writing arrays of structs to Parquet files?

Question: I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files I …

Total answers: 1
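pyarrow can write nested list-of-struct columns directly. A minimal sketch, assuming a reasonably recent pyarrow; the column names and output path are illustrative, not taken from the question:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A column whose values are lists of structs (key-value pairs).
struct_type = pa.struct([("key", pa.string()), ("value", pa.string())])
table = pa.table({
    "id": [1, 2],
    "tags": pa.array(
        [
            [{"key": "env", "value": "prod"}],
            [{"key": "env", "value": "dev"}, {"key": "tier", "value": "web"}],
        ],
        type=pa.list_(struct_type),
    ),
})
pq.write_table(table, "structs.parquet")
```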

Can two parquet files be compared?

Question: I could not find an open-source tool or library to compare two parquet files. Presuming I did not overlook the obvious, is there a technical reason for this? What would a programmer need to consider before writing a parquet diff tool? I am using the Python language. Thank …

Total answers: 1
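One thing a diff tool would need to consider: a byte-level comparison is rarely meaningful, because two files holding identical rows can differ in compression codec, encodings, and row-group layout. A hedged sketch of a logical comparison with pyarrow (the helper function is my own, not a library API):

```python
import pyarrow.parquet as pq

def parquet_equal(path_a, path_b):
    """Compare logical contents, not bytes: compression and row-group
    layout can differ between files that hold identical data."""
    a = pq.read_table(path_a)
    b = pq.read_table(path_b)
    # equals() is sensitive to column order, so align it first.
    if sorted(a.column_names) != sorted(b.column_names):
        return False
    return a.equals(b.select(a.column_names))
```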

Save a CSV file that's too big to fit into memory into a parquet file

Question: My development environment is a single-user workstation with 4 cores but not running Spark or HDFS. I have a CSV file that’s too big to fit in memory. I want to save it as a parquet file and analyze …

Total answers: 3
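pyarrow’s streaming CSV reader makes this possible without Spark or HDFS by converting one block at a time. A sketch, assuming column types can be inferred from the first block (pass ConvertOptions otherwise); the paths are placeholders:

```python
import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

# open_csv() reads incrementally, so peak memory is bounded by the
# block size rather than the file size.
reader = pa_csv.open_csv("big.csv")
with pq.ParquetWriter("big.parquet", reader.schema) as writer:
    for batch in reader:  # each batch is a pyarrow RecordBatch
        writer.write_table(pa.Table.from_batches([batch]))
```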

What are the differences between feather and parquet?

Question: Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats differ? Should you always prefer feather when working …

Total answers: 2
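The usual rule of thumb (hedged): Feather is essentially the Arrow IPC format on disk, very cheap to read and write but lightly compressed, suited to short-lived interchange; Parquet applies heavier encoding and compression, making it the better choice for long-term storage. Writing the same table both ways with pyarrow:

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
feather.write_feather(table, "data.feather")  # fast, lightweight interchange
pq.write_table(table, "data.parquet")         # heavier encoding, archival storage
```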

Using pyarrow how do you append to parquet file?

Question: How do you append/update to a parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]}) table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', …

Total answers: 5
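A Parquet file cannot be appended to in place once closed, but a pq.ParquetWriter kept open can write several tables into one file as successive row groups, provided their schemas match. A minimal sketch reusing the question’s table2:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pa.Table.from_pandas(pd.DataFrame({
    "one": [-1, np.nan, 2.5],
    "two": ["foo", "bar", "baz"],
    "three": [True, False, True],
}))
with pq.ParquetWriter("example.parquet", table2.schema) as writer:
    writer.write_table(table2)  # first row group
    writer.write_table(table2)  # "appends" another row group; schema must match
```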

How to read partitioned parquet files from S3 using pyarrow in python

Question: I am looking for ways to read data from multiple partitioned directories from S3 using Python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet pyarrow's ParquetDataset module has the capability to read from partitions. So I have tried the following code: >>> import pandas as pd >>> import …

Total answers: 5
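With current pyarrow, the dataset API discovers hive-style directories (key=value, as in the paths above) and exposes them as columns. A sketch in which the bucket name is a placeholder and AWS credentials are assumed to come from the environment:

```python
import pyarrow.dataset as ds

# "hive" partitioning turns serial_number=1/cur_date=... into columns.
dataset = ds.dataset(
    "s3://my-bucket/data_folder/", format="parquet", partitioning="hive"
)
# Partition filters prune whole directories before any data is read.
table = dataset.to_table(filter=ds.field("serial_number") == 1)
df = table.to_pandas()
```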

How to read a Parquet file into Pandas DataFrame?

Question: How to read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script …

Total answers: 8
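No cluster is needed: pandas wraps pyarrow (or fastparquet) behind a one-liner. The file path and column names below are illustrative:

```python
import pandas as pd

# engine defaults to pyarrow when installed; fastparquet also works.
# columns= reads only the listed columns, keeping memory use down.
df = pd.read_parquet("data.parquet", columns=["one", "two"])
```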

Methods for writing Parquet files using Python?

Question: I’m having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it. Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support. …

Total answers: 7
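Both pyarrow and fastparquet write Parquet without Spark, and Snappy is pyarrow’s default codec. A sketch of the pandas shortcut alongside the explicit pyarrow route (paths are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"one": [-1.0, 2.5], "two": ["foo", "bar"]})

# pandas shortcut (delegates to pyarrow or fastparquet under the hood):
df.to_parquet("data.parquet", compression="snappy")

# Explicitly via pyarrow:
pq.write_table(pa.Table.from_pandas(df), "data2.parquet", compression="snappy")
```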