parquet

Out of memory trying to convert csv file to parquet using python

Question: I am trying to convert a very large csv file to parquet. I have tried the following method: df1 = pd.read_csv('/kaggle/input/amex-default-prediction/train_data.csv') df1.to_parquet('/kaggle/input/amex-default-prediction/train.parquet') but pd.read_csv throws an Out Of Memory error. Is there any way to convert the file without loading it entirely …

Total answers: 1
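
Not taken from the listed answers, but a minimal sketch of one way around the memory error: stream the CSV with pyarrow's chunked reader and append each record batch to a ParquetWriter, so the full table is never held in memory. The paths are placeholders.

```python
# Chunked CSV -> Parquet conversion with pyarrow's streaming CSV reader,
# so the whole file never has to fit in memory. Paths are placeholders.
import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

csv_path = "train_data.csv"        # placeholder input path
parquet_path = "train.parquet"     # placeholder output path

reader = pv.open_csv(csv_path)                          # streaming reader, yields record batches
writer = pq.ParquetWriter(parquet_path, reader.schema)
try:
    for batch in reader:                                # one RecordBatch at a time
        writer.write_table(pa.Table.from_batches([batch]))
finally:
    writer.close()
```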

.Parquet to .Hyper file conversion for any schema

Question: I want to convert a parquet file to the hyper file format using python. There is the following GitHub sample for this: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/parquet-to-hyper/create_hyper_file_from_parquet.py. But in this case the parquet format/schema is known beforehand. What should I do if I want it to work for any parquet file, …

Total answers: 2
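
A hedged sketch of one way to make the conversion schema-agnostic, assuming a Hyper version whose SQL can read parquet through the external() function; the table and file names are illustrative, not taken from the linked sample.

```python
# Let Hyper infer the schema by creating the table directly from the parquet
# file, instead of declaring a TableDefinition by hand. Assumes external()
# supports parquet in your Hyper version; paths are placeholders.
from tableauhyperapi import Connection, CreateMode, HyperProcess, Telemetry

parquet_path = "input.parquet"   # placeholder
hyper_path = "output.hyper"      # placeholder

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint,
                    database=hyper_path,
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        # column names and types are derived from the parquet schema itself
        connection.execute_command(
            f"""CREATE TABLE "Extract" AS (SELECT * FROM external('{parquet_path}'))"""
        )
```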

Schema for pyarrow.ParquetDataset > partition columns

Question: I have a pandas DataFrame: import pandas as pd df = pd.DataFrame(data={"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]}) Using s3fs: from s3fs import S3FileSystem s3fs = S3FileSystem(**kwargs) I can write this as a parquet dataset: import pyarrow as pa import pyarrow.parquet as pq tbl = pa.Table.from_pandas(df) …

Total answers: 2
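
Not from the answers, but a small local-filesystem sketch (s3fs omitted) showing how partition columns can come back as part of the schema when the dataset is read with hive partitioning; the directory name is illustrative.

```python
# Write a partitioned dataset, then read it back so the partition column
# reappears in the inferred schema. Local paths stand in for the s3fs case.
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

df = pd.DataFrame({"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]})
tbl = pa.Table.from_pandas(df)

# creates col3=foo/ and col3=bar/ subdirectories under dataset_root/
pq.write_to_dataset(tbl, root_path="dataset_root", partition_cols=["col3"])

# hive-style partitioning restores col3 as a column in the schema
dataset = ds.dataset("dataset_root", format="parquet", partitioning="hive")
print(dataset.schema)                 # col1, col2 plus the partition key col3
print(dataset.to_table().to_pandas())
```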

Read / Write Parquet files without reading into memory (using Python)

Question: I looked at the standard documentation that I would expect to capture my need (Apache Arrow and Pandas), and I could not seem to figure it out. I know Python best, so I would like to use Python, but it is not a …

Total answers: 4
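
A minimal sketch of one batch-streaming approach with pyarrow, reading and rewriting a parquet file chunk by chunk instead of materializing the whole table; the file names and batch size are placeholders.

```python
# Stream an existing parquet file through memory in record batches and write
# it back out, without loading the full table. Paths are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

src = pq.ParquetFile("big_input.parquet")
writer = pq.ParquetWriter("big_output.parquet", src.schema_arrow)
try:
    for batch in src.iter_batches(batch_size=64_000):   # chunks of ~64k rows
        writer.write_table(pa.Table.from_batches([batch]))
finally:
    writer.close()
```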

AnalysisException: Path does not exist: dbfs:/databricks/python/lib/python3.7/site-packages/sampleFolder/data;

Question: I am packaging the following code in a whl file: from pkg_resources import resource_filename def path_to_model(anomaly_dir_name: str, data_path: str): filepath = resource_filename(anomaly_dir_name, data_path) return filepath def read_data(spark) -> DataFrame: return (spark.read.parquet(str(path_to_model("sampleFolder", "data")))) I confirmed that the whl file contains the parquet files under the sampleFolder/data/ directory correctly. When I …

Total answers: 2
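
A hedged sketch of one commonly suggested adjustment: on Databricks, spark.read resolves bare paths against dbfs:/, while resource_filename() returns a path on the driver's local filesystem, so prefixing the resolved path with file: points Spark at the local copy. Whether this matches your cluster layout is an assumption.

```python
# Read the packaged parquet files from the driver's local filesystem by
# prefixing the resolved path with "file:" (bare paths default to dbfs:/).
from pkg_resources import resource_filename
from pyspark.sql import DataFrame, SparkSession

def path_to_model(anomaly_dir_name: str, data_path: str) -> str:
    return resource_filename(anomaly_dir_name, data_path)

def read_data(spark: SparkSession) -> DataFrame:
    local_path = path_to_model("sampleFolder", "data")
    # tell Spark explicitly that this is a local (driver) path, not DBFS
    return spark.read.parquet("file:" + local_path)
```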

Write nested parquet format from Python

Question: I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data to a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any …

Total answers: 1
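
Not from the answer, but an illustrative sketch: parse the JSON string column into Python dicts and let pyarrow infer a struct type for it, which yields nested parquet on write. The column name, file names and JSON layout are made up.

```python
# Turn a JSON-string column into a nested struct column before writing.
# "payload" and the file names are hypothetical.
import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

flat = pd.read_parquet("flat.parquet")
flat["payload"] = flat["payload"].map(json.loads)      # str -> dict

# from_pandas infers a struct<...> type for the dict column; since the JSON
# schema is known in advance, an explicit pa.struct([...]) could be used instead
tbl = pa.Table.from_pandas(flat)
pq.write_table(tbl, "nested.parquet")
print(tbl.schema)                                      # payload: struct<...>
```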

Losing index information when using dask.dataframe.to_parquet() with partitioning

Question: When I was using dask=1.2.2 with pyarrow 0.11.1 I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: …

Total answers: 2
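
One possible workaround, sketched here rather than taken from the answers: keep the index as an ordinary column around the partitioned write and restore it on read. Column names and paths are illustrative.

```python
# Preserve the index across a partitioned to_parquet by writing it as a
# regular column and setting it back after reading. Names are illustrative.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"part": ["a", "a", "b"], "value": [1, 2, 3]},
                   index=pd.Index([10, 11, 12], name="idx"))
ddf = dd.from_pandas(pdf, npartitions=1)

# reset_index turns "idx" into a normal column that the partitioned write keeps
ddf.reset_index().to_parquet("out_dir", engine="pyarrow", partition_on=["part"])

restored = dd.read_parquet("out_dir", engine="pyarrow").set_index("idx")
print(restored.compute())
```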

Pandas : Reading first n rows from parquet file?

Question: I have a parquet file and I want to read the first n rows from the file into a pandas data frame. What I tried: df = pd.read_parquet(path='filepath', nrows=10) It did not work and gave me the error: TypeError: read_table() got an unexpected keyword …

Total answers: 7
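
A short sketch of one way to get only the first n rows: pandas.read_parquet has no nrows argument, but pyarrow can stream record batches and stop after the first one. The file path is a placeholder.

```python
# Read roughly the first n rows without loading the whole file.
import pyarrow.parquet as pq

n = 10
pf = pq.ParquetFile("filepath.parquet")              # placeholder path
first_batch = next(pf.iter_batches(batch_size=n))    # only the first batch is read
df = first_batch.to_pandas().head(n)
print(df)
```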

Transfer and write Parquet with python and pandas got timestamp error

Question: I tried to concat() two parquet files with pandas in Python. It works, but when I try to write and save the DataFrame to a parquet file, it displays the error: ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would …

Total answers: 5
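
A sketch of the two write options commonly reached for with this error when using the pyarrow engine: coerce nanosecond timestamps down to milliseconds and allow the sub-millisecond part to be truncated. File names are placeholders.

```python
# Concatenate two parquet files and write the result, coercing timestamp[ns]
# columns to timestamp[ms] so the write does not raise ArrowInvalid.
import pandas as pd

df1 = pd.read_parquet("part1.parquet")
df2 = pd.read_parquet("part2.parquet")
merged = pd.concat([df1, df2], ignore_index=True)

merged.to_parquet(
    "merged.parquet",
    engine="pyarrow",
    coerce_timestamps="ms",            # cast ns -> ms
    allow_truncated_timestamps=True,   # do not raise when precision is lost
)
```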