Pandas : Reading first n rows from parquet file?
Question:
I have a parquet file and I want to read the first n rows from the file into a pandas DataFrame.
What I tried:
df = pd.read_parquet(path= 'filepath', nrows = 10)
It did not work and gave me this error:
TypeError: read_table() got an unexpected keyword argument 'nrows'
I tried the skiprows argument as well, but that gave me the same error.
Alternatively, I could read the complete parquet file and keep only the first n rows, but that requires extra computation which I want to avoid.
Is there any way to achieve it?
Answers:
After digging around and getting in touch with the pandas dev team, the bottom line is that pandas does not support the nrows or skiprows arguments when reading a parquet file.
The reason is that pandas uses the pyarrow or fastparquet engine to process parquet files, and pyarrow has no support for reading a file partially or for skipping rows (not sure about fastparquet). There is an issue on the pandas GitHub repository discussing this.
Parquet is a column-oriented storage format and is designed for that, so it is normal to load the whole file just to access a single row.
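As a rough illustration of why row-level reads are awkward, you can inspect a file's layout with pyarrow; the smallest readable unit is a row group, not a row. The file name below is just a placeholder:
from pyarrow.parquet import ParquetFile

pf = ParquetFile('file_name.pq')  # hypothetical file name
print(pf.metadata.num_rows)        # total rows in the file
print(pf.metadata.num_row_groups)  # number of row groups (the read granularity)
print(pf.schema_arrow)             # column names and types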
The accepted answer is out of date. It is now possible to read only the first few lines of a parquet file into pandas, though it is a bit messy and backend dependent.
To read using PyArrow as the backend, do the following:
from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile('file_name.pq')
# Take the first batch of 10 rows without reading the whole file
first_ten_rows = next(pf.iter_batches(batch_size=10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
Change batch_size=10 to however many rows you want to read in.
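If n might span more than one batch (for example, more than one row group), you may need to keep collecting batches until you have enough rows. Here is a minimal sketch of that idea using only pyarrow; the helper name read_first_n_rows is just for illustration:
from pyarrow.parquet import ParquetFile
import pyarrow as pa

def read_first_n_rows(path, n):
    # Stream record batches and stop as soon as n rows have been collected
    pf = ParquetFile(path)
    batches = []
    rows = 0
    for batch in pf.iter_batches(batch_size=n):
        batches.append(batch)
        rows += batch.num_rows
        if rows >= n:
            break
    # Concatenate the batches and trim to exactly n rows
    return pa.Table.from_batches(batches).slice(0, n).to_pandas()

df = read_first_n_rows('file_name.pq', 10)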
As an alternative, you can use the S3 Select functionality from the AWS SDK for pandas, as proposed by Abdel Jaidi in this answer.
pip install awswrangler

import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://filepath",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
Using a pyarrow dataset scanner:
import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
# Scanner.head reads only enough data to return the first n rows
df = ds.dataset(src_path).scanner().head(n).to_pandas()
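Since parquet is columnar, you can also combine this with column selection so only the columns you need are read; columns= is a standard scanner option, but the column names below are placeholders:
import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
# 'col_a' and 'col_b' are hypothetical column names, for illustration only
df = ds.dataset(src_path).scanner(columns=["col_a", "col_b"]).head(n).to_pandas()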
The most straightforward option for me seems to be using the dask library:
import dask.dataframe as dd
df = dd.read_parquet(path='filepath').head(10)
Querying Parquet with DuckDB
To provide another perspective, if you’re comfortable with SQL, you might consider using DuckDB for this. For example:
import duckdb
nrows = 10
file_path = 'path/to/data/parquet_file.parquet'
df = duckdb.query(f'SELECT * FROM "{file_path}" LIMIT {nrows};').df()
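Because parquet is columnar, you can often make this cheaper still by selecting only the columns you need instead of *; the column names below are placeholders:
import duckdb

nrows = 10
file_path = 'path/to/data/parquet_file.parquet'
# 'col_a' and 'col_b' are hypothetical column names, for illustration only
df = duckdb.query(f'SELECT col_a, col_b FROM "{file_path}" LIMIT {nrows};').df()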
If you’re working with partitioned parquet, the above result won’t include any of the partition columns, since that information isn’t stored in the lower-level files. Instead, you should point at the top folder as a partitioned parquet dataset and register it with a DuckDB connection:
import duckdb
import pyarrow.dataset as ds

nrows = 10
dataset = ds.dataset('path/to/data',
                     format='parquet',
                     partitioning='hive')
con = duckdb.connect()
con.register('data_table_name', dataset)
df = con.execute(f"SELECT * FROM data_table_name LIMIT {nrows};").df()
You can register multiple datasets with the connection to enable more complex queries. I find DuckDB makes working with parquet files much more convenient, especially when trying to JOIN between multiple parquet datasets. Install it with conda install python-duckdb or pip install duckdb.
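As a rough sketch of the multi-dataset case (the paths, table names, and join key below are all hypothetical), registering two datasets and joining them looks like this:
import duckdb
import pyarrow.dataset as ds

# Hypothetical dataset paths and join key, for illustration only
orders = ds.dataset('path/to/orders', format='parquet', partitioning='hive')
customers = ds.dataset('path/to/customers', format='parquet', partitioning='hive')

con = duckdb.connect()
con.register('orders', orders)
con.register('customers', customers)

df = con.execute("""
    SELECT o.*, c.name
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    LIMIT 10
""").df()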