Using pyarrow, how do you append to a parquet file?
Question:
How do you append/update to a parquet file with pyarrow?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append to pqTest2 here?
I found nothing in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?
Answers:
Generally speaking, Parquet datasets consist of multiple files, so you append by writing an additional file into the directory where the data belongs. It would also be useful to be able to concatenate multiple files easily; I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this easy to do in C++ (and therefore Python).
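As a minimal sketch of this approach (the directory name my_dataset is hypothetical): each write adds a new file, and readers treat the whole directory as one dataset.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz']})
df2 = pd.DataFrame({'one': [3.0, 4.0], 'two': ['qux', 'quux']})

# Each call writes a new file into the directory; nothing is rewritten in place.
pq.write_to_dataset(pa.Table.from_pandas(df1), root_path='my_dataset')
pq.write_to_dataset(pa.Table.from_pandas(df2), root_path='my_dataset')

# Reading the directory yields the union of all files, i.e. the "appended" data.
print(pq.read_table('my_dataset').to_pandas())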
I ran into the same issue and I think I was able to solve it using the following:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # this is the number of lines
pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
    pqwriter.write_table(table)
# close the parquet writer
if pqwriter:
    pqwriter.close()
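Since ParquetWriter also works as a context manager (in recent pyarrow releases), a minimal variant of the same loop avoids the explicit close() bookkeeping:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunks = pd.read_csv('sample.csv', chunksize=10000)  # iterator of DataFrames
first = pa.Table.from_pandas(next(chunks))           # first chunk fixes the schema
with pq.ParquetWriter('sample.parquet', first.schema) as pqwriter:
    pqwriter.write_table(first)
    for df in chunks:
        # each later chunk is appended as an additional row group
        pqwriter.write_table(pa.Table.from_pandas(df))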
In your case the column names are not consistent across the dataframes. I made the column names consistent for the three sample dataframes, and the following code worked for me.
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write/append a dataframe in parquet format.

    This method writes a pandas DataFrame as a pyarrow Table in parquet format. If the method is invoked
    with a writer, it appends the dataframe to the already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for the parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed in subsequent method calls to append DataFrames
        to the pyarrow table.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer
if __name__ == '__main__':
    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]
    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)
    if writer:
        writer.close()
    df = pd.read_parquet(filepath)
    print(df)
Output:
one three two
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz
Demo of appending a Pandas dataframe to an existing .parquet file.
Note: Other answers cannot append to existing .parquet files. This can; see discussion at end.
Tested on Python v3.9 on Windows and Linux.
Install PyArrow using pip or conda:
pip install pyarrow==6.0.1
conda install -c conda-forge pyarrow=6.0.1 -y
Demo code:
# Q. Demo?
# A. Demo of appending to an existing .parquet file by memory mapping the original file, appending the new dataframe, then writing the new file out.
import os
import tempfile
from pathlib import Path
from typing import Union

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
filepath = "parquet_append.parquet"
Method 1 of 2
Simple way: Using pandas, read the original .parquet file in, append, write the entire file back out.
# Create parquet file.
df = pd.DataFrame({"x": [1.,2.,np.nan], "y": ["a","b","c"]}) # Create dataframe ...
df.to_parquet(filepath) # ... write to file.
# Append to original parquet file.
df = pd.read_parquet(filepath) # Read original ...
df2 = pd.DataFrame({"x": [3.,4.,np.nan], "y": ["d","e","f"]}) # ... create new dataframe to append ...
df3 = pd.concat([df, df2]) # ... concatenate together ...
df3.to_parquet(filepath) # ... overwrite original file.
# Demo that new data frame has been appended to old.
df_copy = pd.read_parquet(filepath)
print(df_copy)
# x y
# 0 1.0 a
# 1 2.0 b
# 2 NaN c
# 0 3.0 d
# 1 4.0 e
# 2 NaN f
Method 2 of 2
More complex but faster: using native PyArrow calls, memory-map the original file, append the new dataframe, and write the new file out.
# Write initial file using PyArrow.
df = pd.DataFrame({"x": [1.,2.,np.nan], "y": ["a","b","c"]}) # Create dataframe ...
table = pa.Table.from_pandas(df)
pq.write_table(table, where=filepath)
def parquet_append(filepath: Union[Path, str], df: pd.DataFrame) -> None:
    """
    Append a dataframe to an existing .parquet file. Reads the original .parquet file in, appends the new dataframe, writes the new .parquet file out.
    :param filepath: Filepath for parquet file.
    :param df: Pandas dataframe to append. Must be same schema as original.
    """
    table_original_file = pq.read_table(source=filepath, pre_buffer=False, use_threads=True, memory_map=True)  # Use memory map for speed.
    table_to_append = pa.Table.from_pandas(df)
    table_to_append = table_to_append.cast(table_original_file.schema)  # Attempt to cast new schema to existing, e.g. datetime64[ns] to datetime64[us] (may throw otherwise).
    handle = pq.ParquetWriter(filepath, table_original_file.schema)  # Overwrite old file with empty. WARNING: production-level code should be more atomic: write to a temporary file, delete the old, rename. Then failures will not lose data.
    handle.write_table(table_original_file)
    handle.write_table(table_to_append)
    handle.close()  # Writes binary footer. Until this occurs, the .parquet file is not usable.
# Append to original parquet file.
df = pd.DataFrame({"x": [3.,4.,np.nan], "y": ["d","e","f"]}) # ... create new dataframe to append ...
parquet_append(filepath, df)
# Demo that new data frame has been appended to old.
df_copy = pd.read_parquet(filepath)
print(df_copy)
# x y
# 0 1.0 a
# 1 2.0 b
# 2 NaN c
# 0 3.0 d
# 1 4.0 e
# 2 NaN f
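As the warning inside parquet_append notes, production code should write atomically. A minimal sketch of that pattern (parquet_append_atomic is a hypothetical helper name): write to a temporary file in the same directory, then swap it in with os.replace(), so a failure never loses the original file.

import os
import tempfile
from pathlib import Path
from typing import Union

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parquet_append_atomic(filepath: Union[Path, str], df: pd.DataFrame) -> None:
    """Same append-by-rewrite technique as parquet_append, but the original file survives failures."""
    table_original = pq.read_table(source=filepath, memory_map=True)
    table_to_append = pa.Table.from_pandas(df).cast(table_original.schema)
    # Temporary file in the same directory, so os.replace() stays on one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(suffix='.parquet', dir=os.path.dirname(os.path.abspath(filepath)))
    os.close(fd)  # ParquetWriter opens the path itself.
    try:
        with pq.ParquetWriter(tmp_path, table_original.schema) as writer:
            writer.write_table(table_original)
            writer.write_table(table_to_append)
        os.replace(tmp_path, filepath)  # Readers see either the old file or the new one, never a partial write.
    except BaseException:
        os.remove(tmp_path)
        raise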
Discussion
The answers from @Ibraheem Ibraheem and @yardstick17 cannot be used to append to existing .parquet files:
- Limitation 1: After .close() is called, the files cannot be appended to. Once the footer is written, everything is set in stone;
- Limitation 2: The .parquet file cannot be read by any other program until .close() is called (it will throw an exception as the binary footer is missing).
Combined, these limitations mean that they cannot be used to append to an existing .parquet file, they can only be used to write a .parquet file in chunks. The technique above removes these limitations, at the expense of being less efficient as the entire file has to be rewritten to append to the end. After extensive research, I believe that it is not possible to append to an existing .parquet file with the existing PyArrow libraries (as of v6.0.1).
It would be possible to modify this to merge multiple .parquet files in a folder into a single .parquet file.
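For example, a sketch of such a merge, assuming a folder input_folder/ of .parquet files that share a compatible schema (both names hypothetical):

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

paths = sorted(Path('input_folder').glob('*.parquet'))
tables = [pq.read_table(p) for p in paths]  # assumes all files share a compatible schema
pq.write_table(pa.concat_tables(tables), 'merged.parquet')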
It would be possible to perform an efficient upsert: pq.read_table() has filters on column and row, so if the rows in the original table were filtered out on load, the rows in the new table would effectively replace the old. This would be more useful for timeseries data.
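A sketch of that upsert idea for timeseries data (file name, column names, and cutoff are hypothetical): the filter drops the rows being replaced while the original is read, and the new rows are written in their place.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical timeseries file with columns 'ts' and 'value'.
df_old = pd.DataFrame({'ts': pd.to_datetime(['2021-12-31', '2022-01-02']), 'value': [1.0, 2.0]})
pq.write_table(pa.Table.from_pandas(df_old), 'timeseries.parquet')

# Upsert: replace every row with ts >= cutoff by the new rows.
cutoff = pd.Timestamp('2022-01-01')
df_new = pd.DataFrame({'ts': pd.to_datetime(['2022-01-02']), 'value': [42.0]})

table_kept = pq.read_table('timeseries.parquet', filters=[('ts', '<', cutoff)])  # surviving old rows
table_new = pa.Table.from_pandas(df_new).cast(table_kept.schema)

with pq.ParquetWriter('timeseries.parquet', table_kept.schema) as writer:  # same non-atomic caveat as above
    writer.write_table(table_kept)
    writer.write_table(table_new)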
The accepted answer works as long as you have the pyarrow parquet writer open. Once the writer is closed, you cannot append row groups to the parquet file; pyarrow has no implementation for appending to an already existing parquet file.
It is possible to append row groups to an already existing parquet file using fastparquet. Here is an SO answer on the same topic.
From the fastparquet docs:
append: bool (False) or ‘overwrite’ If False, construct data-set from
scratch; if True, add new row-group(s) to existing data-set. In the
latter case, the data-set must exist, and the schema must match the
input data.
from fastparquet import write

write('output.parquet', df)               # create the data-set
write('output.parquet', df, append=True)  # append new row group(s); the file must already exist