pyarrow

Split a parquet file by groups

Split a parquet file by groups Question: I have a large-ish dataframe in a Parquet file and I want to split it into multiple files to leverage Hive partitioning with pyarrow. Preferably without loading all data into memory. (This question has been asked before, but I have not found a solution that is both fast …

Total answers: 3

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend Question: Let’s say I have the following dataframe: Code Price AA1 10 AA1 20 BB2 30 And I want to perform the following operation on it: df.groupby("code").aggregate({ "price": "sum" }) I have tried playing with the new pyarrow dtypes introduced in Pandas 2.0 and …

Total answers: 1

Pyarrow slice pushdown for Azure data lake

Pyarrow slice pushdown for Azure data lake Question: I want to access Parquet files on an Azure data lake, and only retrieve some rows. Here is a reproducible example, using a public dataset: import pyarrow.dataset as ds from adlfs import AzureBlobFileSystem abfs_public = AzureBlobFileSystem( account_name="azureopendatastorage") dataset_public = ds.dataset('az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet', filesystem=abfs_public) The processing time is the same …

Total answers: 2

Read multiple parquet files to pandas with select columns where select columns exist

Read multiple parquet files to pandas with select columns where select columns exist Question: When running the below, I hit an error because some of the files are missing the required columns: li = [] for filename in parquet_filtered_list: df = pd.read_parquet(filename, columns = list_key_cols_aggregates ) li.append(df) df_raw_2021_to_2022 = pd.concat(li, axis=0, ignore_index=False) del li How …

Total answers: 1

AttributeError: module 'dill._dill' has no attribute 'log'

AttributeError: module 'dill._dill' has no attribute 'log' Question: I am using a Python nlp module to train a dataset and ran into the following error: File "/usr/local/lib/python3.9/site-packages/nlp/utils/py_utils.py", line 297, in save_code dill._dill.log.info("Co: %s" % obj) AttributeError: module 'dill._dill' has no attribute 'log' I noticed similar posts where no attribute 'extend' and no attribute 'stack' were …

Total answers: 1

Why can't I parse a timestamp in pyarrow?

Why can't I parse a timestamp in pyarrow? Question: I have a JSON file with this field: "BirthDate":"2022-09-05T08:08:46.000+00:00" And I want to create a parquet file based on it. I prepared a fixed schema for pyarrow where BirthDate is a pa.timestamp('s'). And when I try to convert the file I get an error: ERROR:root:Failed of conversion of JSON to …

Total answers: 1

pyarrow: how to save a cast column's values in the same table?

pyarrow: how to save a cast column's values in the same table? Question: I'm a beginner in pyarrow and I'm trying to cast my timestamps with an AM/PM suffix. I have a column ['Datetime'] with values such as: "2021/07/25 12:00:00 AM", "2022/06/28 11:58:00 PM", "2022/03/11 10:30:00 AM", and I'm trying to get these: 2021-07-25 12:00:00, 2022-06-28 11:58:00, 2022-03-11 10:30:00, Ideally, want …

Total answers: 1

Pyarrow Join (int8 and int16)

Pyarrow Join (int8 and int16) Question: I have two Pyarrow Tables and want to join them. A.join( right_table=B, keys="A_id", right_keys="B_id" ) Now I get the following error: {ArrowInvalid} Incompatible data types for corresponding join field keys: FieldRef.Name(A_id) of type int8 and FieldRef.Name(B_id) of type int16 What is the preferred way to solve this issue? I …

Total answers: 1

Use batches to add a column with pyarrow

Use batches to add a column with pyarrow Question: I am currently loading a table, calculating a new column, adding the column to the table and saving the table to disk, which all works fine. The question: I tried to do this batch-wise, but get the error message: AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'append_column' …

Total answers: 2

Convert nested dictionary of string keys and array values to pyarrow Table

Convert nested dictionary of string keys and array values to pyarrow Table Question: I have data in the form of a nested Python dictionary that I would like to serialize: { top_value: [ { "probabilities": prob_array, "metrics": { "metric_a": a_array, "metric_b": b_array, "metric_c": c_array } } ] } where all *_array variables are Numpy arrays. …

Total answers: 1