pyarrow | Page 2

Schema for pyarrow.ParquetDataset > partition columns

Question: I have a pandas DataFrame: import pandas as pd df = pd.DataFrame(data={"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]}) Using s3fs: from s3fs import S3FileSystem s3fs = S3FileSystem(**kwargs) I can write this as a parquet dataset import pyarrow as pa import pyarrow.parquet as pq tbl = pa.Table.from_pandas(df) …

Total answers: 2

How would I go about converting a .csv to an .arrow file without loading it all into memory?

Question: I found a similar question here: Read CSV with PyArrow. That answer references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name …

Total answers: 3

"Could NOT find Arrow" error occurs when using pip_pypy3 to install pyarrow

Question: I am trying to use pypy3 to install pyarrow, but some errors occur. Basic information is below: macOS 10.15.7, Xcode 12.3, Python version 3.7.9, pypy3 version 7.3.3, pyarrow version 0.17.1. The command is 'pip_pypy3 install pyarrow==0.17.1'. Some key information and error content in …

Total answers: 3

Write nested parquet format from Python

Question: I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data to a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any …

Total answers: 1

Does any Python library support writing arrays of structs to Parquet files?

Question: I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files I …

Total answers: 1

What are the differences between feather and parquet?

Question: Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (pyarrow package for python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats differ? Should you always prefer feather when working …

Total answers: 2

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

Question: I have a multi-million record SQL table that I’m planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file. However, I can’t seem …

Total answers: 1

Using pyarrow, how do you append to a parquet file?

Question: How do you append/update to a parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]}) table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', …

Total answers: 5

How to read partitioned parquet files from S3 using pyarrow in python

Question: I am looking for ways to read data from multiple partitioned directories from S3 using Python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet pyarrow’s ParquetDataset module has the capability to read from partitions. So I have tried the following code: >>> import pandas as pd >>> import …

Total answers: 5

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Question: I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this: import pyarrow.parquet as pq path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet' table = pq.read_table(path) df …

Total answers: 8