What are the differences between feather and parquet?

Question:

Both are columnar (disk-)storage formats for use in data analysis systems.
Both are integrated within Apache Arrow (pyarrow package for python) and are
designed to correspond with Arrow as a columnar in-memory analytics layer.

How do both formats differ?

Should you always prefer feather when working with pandas when possible?

What are the use cases where feather is more suitable than parquet and the
other way round?


Appendix

I found some hints here https://github.com/wesm/feather/issues/188,
but given the young age of this project, it’s possibly a bit out of date.

This is not a serious speed test, because I’m just dumping and loading a whole
DataFrame, but it should give you some impression if you have never
heard of the formats before:

# IPython
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', key='key_to_store', mode='w', format='table')
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html
Asked By: Darkonaut

||

Answers:

  • Parquet format is designed for long-term storage, whereas Arrow is intended more for short-term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then).

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

  • Parquet is a standard storage format for analytics that’s supported by many different systems: Spark, Hive, Impala, various AWS services, in future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems

The benchmarks you showed are going to be very noisy since the data you read and wrote is very small. You should try compressing at least 100 MB, or upwards of 1 GB, of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/

Hope this helps

Answered By: Wes McKinney

I would also include different compression methods in the comparison between parquet and feather, to check both import/export speed and how much storage each one uses.

I advocate for 2 options for the average user who wants a better csv alternative:

  • parquet with "gzip" compression (for storage): It is slightly faster to export than plain .csv (if the csv needs to be zipped, then parquet is much faster). Importing is about 1.3x faster than plain csv and about 2x faster than zipped csv. The file is around 22% of the original size, which is about the same as zipped csv files.
  • feather with "zstd" compression (for I/O speed): compared to csv, feather is more than 20x faster to export and about 5x faster to import. The file is around 32% of the original size, which is about 10 percentage points worse than parquet "gzip" and zipped csv but still decent.

Both are better options than plain csv files in all categories (I/O speed and storage).

I analysed the following formats:

  1. csv
  2. csv using "zip" compression
  3. feather using "zstd" compression
  4. feather using "lz4" compression
  5. parquet using "snappy" compression
  6. parquet using "gzip" compression
  7. parquet using "brotli" compression

import zipfile
import pandas as pd
folder_path = (r"...\intraday")
zip_path = zipfile.ZipFile(folder_path + r"\AAPL.zip")
test_data = pd.read_csv(zip_path.open('AAPL.csv'))


# EXPORT, STORAGE AND IMPORT TESTS
# ------------------------------------------
# - FORMAT .csv 

# export
%%timeit
test_data.to_csv(folder_path + r"\AAPL.csv", index=False)
# 12.8 s ± 399 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# storage
# AAPL.csv exported using python.
# 169.034 KB

# import
%%timeit
test_data = pd.read_csv(folder_path + r"\AAPL.csv")
# 1.56 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT zipped .csv 

# export
%%timeit
test_data.to_csv(folder_path + r"\AAPL.csv", index=False)
# 12.8 s ± 399 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# OBSERVATION: this does not include the time I spent manually zipping the .csv

# storage
# AAPL.csv zipped with .zip "normal" compression using 7-zip software.
# 36.782 KB

# import
zip_path = zipfile.ZipFile(folder_path + r"\AAPL.zip")
%%timeit
test_data = pd.read_csv(zip_path.open('AAPL.csv'))
# 2.31 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT .feather using "zstd" compression.

# export
%%timeit
test_data.to_feather(folder_path + r"\AAPL.feather", compression='zstd')
# 460 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# storage
# AAPL.feather exported with python using zstd
# 54.924 KB

# import
%%timeit
test_data = pd.read_feather(folder_path + r"\AAPL.feather")
# 310 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT .feather using "lz4" compression.
# Only works installing with pip, not with conda. Bad sign.

# export
%%timeit
test_data.to_feather(folder_path + r"\AAPL.feather", compression='lz4')
# 392 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# storage
# AAPL.feather exported with python using "lz4"
# 79.668 KB    

# import
%%timeit
test_data = pd.read_feather(folder_path + r"\AAPL.feather")
# 255 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT .parquet using compression "snappy"

# export
%%timeit
test_data.to_parquet(folder_path + r"\AAPL.parquet", compression='snappy')
# 2.82 s ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# storage
# AAPL.parquet exported with python using "snappy"
# 62.383 KB

# import
%%timeit
test_data = pd.read_parquet(folder_path + r"\AAPL.parquet")
# 701 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT .parquet using compression "gzip"

# export
%%timeit
test_data.to_parquet(folder_path + r"\AAPL.parquet", compression='gzip')
# 10.8 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# storage
# AAPL.parquet exported with python using "gzip"
# 37.595 KB

# import
%%timeit
test_data = pd.read_parquet(folder_path + r"\AAPL.parquet")
# 1.18 s ± 80.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------------------------------------
# - FORMAT .parquet using compression "brotli"

# export
%%timeit
test_data.to_parquet(folder_path + r"\AAPL.parquet", compression='brotli')
# around 5min each loop. I did not run %%timeit on this one.

# storage
# AAPL.parquet exported with python using "brotli"
# 29.425 KB    

# import
%%timeit
test_data = pd.read_parquet(folder_path + r"\AAPL.parquet")
# 1.04 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Observations:

  • Feather seems better for lightweight data, as it writes and loads faster. Parquet has better storage ratios.
  • I was initially concerned about Feather's library support and maintenance; however, the file format has good integration with pandas, and I could install the dependencies for the "zstd" compression method using conda.
  • The best storage by far is parquet with "brotli" compression; however, it takes too long to export. It has a good import speed once the exporting is done, but it is still roughly 3x to 4x slower to import than feather (1.04 s vs. 255 to 310 ms in the timings above).
Answered By: Artur Dutra