How to read a Parquet file into Pandas DataFrame?
Question:
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples seem all to be going through an external Hive runtime.
Answers:
Update: since the time I answered this there has been a lot of work on this look at Apache Arrow for a better read and write of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create python objects and then you will have to move them to a Pandas DataFrame so the process will be slower than pd.read_csv
for example.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The above link explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Aside from pandas, Apache pyarrow also provides way to transform parquet to dataframe
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the document from Apache pyarrow Reading and Writing Single Files
Parquet files are always large. so read it using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
files = glob.glob('data/*.parquet')
@delayed
def load_chunk(path):
return ParquetFile(path).to_pandas()
df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
When writing to parquet, consider using brotli compression. I’m getting a 70% size reduction of 8GB file parquet file by using brotli compression. Brotli makes for a smaller file and faster read/writes than gzip, snappy, pickle. Although pickle can do tuples whereas parquet does not.
df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
Considering the .parquet
file named data.parquet
parquet_file = '../data.parquet'
open( parquet_file, 'w+' )
Convert to Parquet
Assuming one has a dataframe parquet_df
that one wants to save to the parquet file above, one can use pandas.to_parquet
(this function requires either the fastparquet or pyarrow library) as follows
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df
, one can use pandas.read_parquet()
as follows
new_parquet_df = pd.read_parquet(parquet_file)
Parquet
Step 1: Data to play with
df = pd.DataFrame({
'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
'marks': [20,10,22,21,22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
you can use python to get parquet data
1.install package
pin install pandas pyarrow
2.read file
def read_parquet(file):
result = []
data = pd.read_parquet(file)
for index in data.index:
res = data.loc[index].values[0:-1]
result.append(res)
print(len(result))
file = "./data.parquet"
read_parquet(file)
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples seem all to be going through an external Hive runtime.
Update: since the time I answered this there has been a lot of work on this look at Apache Arrow for a better read and write of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create python objects and then you will have to move them to a Pandas DataFrame so the process will be slower than pd.read_csv
for example.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The above link explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Aside from pandas, Apache pyarrow also provides way to transform parquet to dataframe
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the document from Apache pyarrow Reading and Writing Single Files
Parquet files are always large. so read it using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
files = glob.glob('data/*.parquet')
@delayed
def load_chunk(path):
return ParquetFile(path).to_pandas()
df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
When writing to parquet, consider using brotli compression. I’m getting a 70% size reduction of 8GB file parquet file by using brotli compression. Brotli makes for a smaller file and faster read/writes than gzip, snappy, pickle. Although pickle can do tuples whereas parquet does not.
df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
Considering the .parquet
file named data.parquet
parquet_file = '../data.parquet'
open( parquet_file, 'w+' )
Convert to Parquet
Assuming one has a dataframe parquet_df
that one wants to save to the parquet file above, one can use pandas.to_parquet
(this function requires either the fastparquet or pyarrow library) as follows
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df
, one can use pandas.read_parquet()
as follows
new_parquet_df = pd.read_parquet(parquet_file)
Parquet
Step 1: Data to play with
df = pd.DataFrame({
'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
'marks': [20,10,22,21,22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
you can use python to get parquet data
1.install package
pin install pandas pyarrow
2.read file
def read_parquet(file):
result = []
data = pd.read_parquet(file)
for index in data.index:
res = data.loc[index].values[0:-1]
result.append(res)
print(len(result))
file = "./data.parquet"
read_parquet(file)