apache-arrow

Undefined symbol at runtime. Import Python C++ extension

Question: I have a Python package (my_python_package), part of which is a C++ extension (my_ext) with a single function (my_ext_func). The extension depends on my C++ library (libmycpp), which in turn depends on libarrow. The problem is that I get an error while importing a function …

Total answers: 1

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Question: Let’s say I have the following dataframe:

    Code  Price
    AA1   10
    AA1   20
    BB2   30

And I want to perform the following operation on it:

    df.groupby("code").aggregate({"price": "sum"})

I have tried playing with the new pyarrow dtypes introduced in Pandas 2.0 and …

Total answers: 1

Pyarrow slice pushdown for Azure data lake

Question: I want to access Parquet files on an Azure data lake, and only retrieve some rows. Here is a reproducible example, using a public dataset:

    import pyarrow.dataset as ds
    from adlfs import AzureBlobFileSystem

    abfs_public = AzureBlobFileSystem(account_name="azureopendatastorage")
    dataset_public = ds.dataset(
        'az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet',
        filesystem=abfs_public)

The processing time is the same …

Total answers: 2

Use batches to add a column with pyarrow

Question: I am currently loading a table, calculating a new column, adding the column to the table, and saving the table to disk, which all works fine. The question: I tried to do this batch-wise, but I get the error message: AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'append_column' …

Total answers: 2

How would I go about converting a .csv to an .arrow file without loading it all into memory?

Question: I found a similar question here: "Read CSV with PyArrow". The answer there references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name …

Total answers: 3

Convert Pandas DataFrame to & from In-Memory Feather

Question: Using the IO tools in pandas it is possible to convert a DataFrame to an in-memory feather buffer:

    import pandas as pd
    from io import BytesIO

    df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
    buf = BytesIO()
    df.to_feather(buf)

However, using the same buffer to convert back to a …

Total answers: 1