parquet

Numpy array to list of lists in polars dataframe

Question: I'm trying to save a dataframe with a 2D list in each cell to a parquet file. As an example I created a polars dataframe with a 2D list. As can be seen in the table, the dtype of both columns is list[list[i64]]. …

Total answers: 1
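A minimal sketch of what the question above describes, building a polars DataFrame whose cells are nested (2D) lists and round-tripping it through parquet; the array values and file name are illustrative:

```python
import numpy as np
import polars as pl

# Illustrative 2D numpy array; values and shape are arbitrary.
arr = np.arange(6).reshape(2, 3)

# Wrapping each 2D array in plain Python lists gives a list[list[i64]] cell per row.
df = pl.DataFrame({"a": [arr.tolist(), arr.tolist()]})
print(df.dtypes)  # [List(List(Int64))]

# Nested list columns can be written to and read back from parquet.
df.write_parquet("nested_lists.parquet")
print(pl.read_parquet("nested_lists.parquet"))
```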

With Python, is there a way to load a polars dataframe directly into an S3 bucket as parquet?

Question: Looking for something like this: Save Dataframe to csv directly to s3 Python. The API shows these arguments: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html but I'm not sure how to convert the df into a stream… Asked By: rnd om || …

Total answers: 1
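One common pattern for the question above, sketched under the assumption that AWS credentials are already configured: write_parquet accepts a file-like object, so the DataFrame can be serialized into an in-memory buffer and uploaded with boto3. The bucket and key names are placeholders:

```python
import io

import boto3
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

# Serialize the DataFrame to parquet in memory instead of on local disk.
buf = io.BytesIO()
df.write_parquet(buf)
buf.seek(0)

# Upload the buffer; "my-bucket" and the key are illustrative.
boto3.client("s3").upload_fileobj(buf, "my-bucket", "path/to/df.parquet")
```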

Comparing parquet file schema to db schema in python (including decimal precisions)

Question: If I have a parquet file with columns that have, for example, types Decimal(38, 22) or Decimal(20, 4), is there a way to compare them to the existing schema in the database in Python (for example, check if Decimal(38, 22) corresponds to the …

Total answers: 1
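A sketch of one way to read the decimal precision and scale out of a parquet file with pyarrow so they can be compared against the database's reported types; the file name is illustrative and the database side is not shown:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read only the footer schema; no data pages are loaded.
schema = pq.read_schema("data.parquet")

for field in schema:
    if pa.types.is_decimal(field.type):
        # e.g. Decimal(38, 22) -> precision=38, scale=22
        print(field.name, field.type.precision, field.type.scale)
```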

Avro, Hive or HBase – What to use for 10 mio. records daily?

Question: I have the following requirements: I need to process around 20,000 elements per day (let's call them baskets), which each generate between 100 and 1,000 records (let's call them products in basket). A single record has about 10 columns, each row …

Total answers: 1
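A quick back-of-the-envelope check of the volume described above (20,000 baskets, each producing 100–1,000 records) shows where the "10 mio." figure in the title comes from:

```python
baskets_per_day = 20_000
records_per_basket_low, records_per_basket_high = 100, 1_000

low = baskets_per_day * records_per_basket_low    # 2,000,000 rows/day
high = baskets_per_day * records_per_basket_high  # 20,000,000 rows/day
print(f"{low:,} to {high:,} rows per day")        # midpoint is roughly 10 million
```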

Splitting a large CSV file and converting into multiple Parquet files – Safe?

Question: I learned that the Parquet file format stores a bunch of metadata and uses various compressions to store data efficiently in terms of size and query speed. And it possibly generates multiple files out of, let's say, one input, …

Total answers: 2
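A minimal sketch of one way to do the split described above, assuming pandas is acceptable: read the CSV in chunks and write each chunk as its own parquet file, so the output directory can later be scanned as one logical dataset by pyarrow, dask, or polars. File names and the chunk size are illustrative:

```python
import os

import pandas as pd

os.makedirs("out", exist_ok=True)

# Stream the CSV in chunks; the chunk size is arbitrary.
for i, chunk in enumerate(pd.read_csv("big_input.csv", chunksize=1_000_000)):
    # One parquet file per chunk; readers can treat the directory as a dataset.
    chunk.to_parquet(f"out/part-{i:05d}.parquet", index=False)
```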

Can a parquet file exceed 2.1GB?

Question: I'm having an issue storing a large dataset (around 40GB) in a single parquet file. I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a parquet file until it crashes as the file size …

Total answers: 1
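The 2.1 GB figure matches the signed 32-bit integer limit, which is suggestive. One workaround, sketched here with pyarrow rather than fastparquet's append, is to stream chunks into a single file as separate row groups through ParquetWriter; the schema, chunk contents and file name are illustrative:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.float64())])

# Each write_table call appends a new row group to the same open file.
with pq.ParquetWriter("large.parquet", schema) as writer:
    for _ in range(10):  # illustrative number of chunks
        chunk = pd.DataFrame({"value": np.random.rand(1_000_000)})
        writer.write_table(pa.Table.from_pandas(chunk, schema=schema, preserve_index=False))
```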

Filtering in dask.read_parquet tries to compare NoneType and str

Question: I have a project where I pass the following load_args to read_parquet: filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]} According to the documentation, a List[Tuple] like this should be accepted and I should get all partitions which match the predicate (or equivalently, filter out …

Total answers: 2
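A sketch of how the filter is usually passed to dask, with the dataset path as a placeholder. If the partition column contains nulls, comparing them against a string predicate can raise the NoneType-vs-str error; cleaning those partition values or filtering after load are possible fallbacks:

```python
import dask.dataframe as dd

# Placeholder dataset path; filters is a list of (column, op, value) tuples.
ddf = dd.read_parquet(
    "path/to/dataset/",
    filters=[("itemId", "==", "9403cfde-7fe5-4c9c-916c-41ff0b595c5c")],
)
```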

Is it possible to have one meta file for multiple parquet data files?

Question: I have a process that generates millions of small dataframes and saves them to parquet in parallel. All dataframes have the same columns and index information, and have the same number of rows (about 300). As the dataframes are small, when …

Total answers: 1
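Parquet supports a _metadata sidecar file that aggregates the footers of many data files, which addresses the question above. A sketch of building one with pyarrow, assuming all the small files sit in a single directory (paths are illustrative):

```python
import glob
import os

import pyarrow.parquet as pq

paths = sorted(glob.glob("dataset/part-*.parquet"))  # illustrative layout

# Collect each file's footer and record its path relative to the sidecar.
metadata = None
for path in paths:
    md = pq.read_metadata(path)
    md.set_file_path(os.path.basename(path))
    if metadata is None:
        metadata = md
    else:
        metadata.append_row_groups(md)

# Readers that understand _metadata can plan the whole dataset from this one file.
metadata.write_metadata_file("dataset/_metadata")
```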

Why can't I parse a timestamp in pyarrow?

Question: I have a JSON file with this field: "BirthDate":"2022-09-05T08:08:46.000+00:00" and I want to create a parquet file based on it. I prepared a fixed schema for pyarrow where BirthDate is pa.timestamp('s'). When I try to convert the file I get the error: ERROR:root:Failed of conversion of JSON to …

Total answers: 1
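A sketch of one way around the error above: the value carries milliseconds and a UTC offset, which a plain pa.timestamp('s') cannot represent, so the schema below uses a timezone-aware millisecond timestamp instead. The file name is illustrative, and it assumes the pyarrow JSON reader's ISO-8601 parser accepts the offset; if it does not, reading the column as pa.string() and casting afterwards is a fallback:

```python
import pyarrow as pa
from pyarrow import json as pa_json

# "2022-09-05T08:08:46.000+00:00" has millisecond precision and a UTC offset.
schema = pa.schema([("BirthDate", pa.timestamp("ms", tz="UTC"))])

table = pa_json.read_json(
    "input.json",  # illustrative file name
    parse_options=pa_json.ParseOptions(explicit_schema=schema),
)
print(table.schema)
```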