Writing a pickle file to an s3 bucket in AWS
Question:
I’m trying to write a pandas DataFrame as a pickle file to an S3 bucket in AWS. I know that I can write the DataFrame new_df as a CSV to an S3 bucket as follows:
from io import StringIO
import boto3

bucket = 'mybucket'
key = 'path'
csv_buffer = StringIO()
s3_resource = boto3.resource('s3')
new_df.to_csv(csv_buffer, index=False)
s3_resource.Object(bucket, key).put(Body=csv_buffer.getvalue())
I’ve tried using the same code as above with to_pickle(), but with no success.
Answers:
I’ve found the solution: for pickle files you need a BytesIO buffer instead of StringIO (which is for CSV files).
import io
import boto3
pickle_buffer = io.BytesIO()
s3_resource = boto3.resource('s3')
new_df.to_pickle(pickle_buffer)
s3_resource.Object(bucket, key).put(Body=pickle_buffer.getvalue())
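For completeness, reading the DataFrame back out of S3 is just the reverse; a minimal sketch, assuming the same bucket and key and a pandas version whose read_pickle accepts a file-like object:
import io
import boto3
import pandas as pd

s3_resource = boto3.resource('s3')
# fetch the pickled bytes and rebuild the DataFrame from an in-memory buffer
obj = s3_resource.Object(bucket, key).get()
restored_df = pd.read_pickle(io.BytesIO(obj['Body'].read()))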
Further to your answer, you don’t need to convert to CSV at all. The pickle.dumps method returns a bytes object; see here: https://docs.python.org/3/library/pickle.html
import boto3
import pickle
bucket='your_bucket_name'
key='your_pickle_filename.pkl'
pickle_byte_obj = pickle.dumps([var1, var2, ..., varn])
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket,key).put(Body=pickle_byte_obj)
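Reading those objects back is symmetric; a minimal sketch, assuming the same bucket and key:
import boto3
import pickle

s3_resource = boto3.resource('s3')
# the object's Body streams the pickled bytes back
body = s3_resource.Object(bucket, key).get()['Body'].read()
restored = pickle.loads(body)  # the same list that was dumped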
This worked for me with pandas 0.23.4 and boto3 1.7.80:
import boto3

bucket = 'your_bucket_name'
key = 'your_pickle_filename.pkl'
s3_resource = boto3.resource('s3')
new_df.to_pickle(key)  # writes the pickle to a local file first
with open(key, 'rb') as f:
    s3_resource.Object(bucket, key).put(Body=f)
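If the pickle is large, upload_file may be preferable to put, since it handles multipart uploads for you; a sketch under the same local-file assumption:
import boto3

s3_resource = boto3.resource('s3')
# streams the local pickle file; boto3 switches to multipart upload for big objects
s3_resource.Bucket(bucket).upload_file(Filename=key, Key=key)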
This solution (using s3fs) worked perfectly and elegantly for my team:
import s3fs
from pickle import dump

fs = s3fs.S3FileSystem(anon=False)
bucket = 'bucket1'
key = 'your_pickle_filename.pkl'
# use a context manager so the file is closed and the data is flushed to S3
with fs.open(f's3://{bucket}/{key}', 'wb') as f:
    dump(data, f)
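Reading it back goes through the same filesystem object; a minimal sketch, reusing fs, bucket and key from above:
from pickle import load

# open the object for reading and unpickle it
with fs.open(f's3://{bucket}/{key}', 'rb') as f:
    data = load(f)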
This adds some clarification to a previous answer:
import pandas as pd
import boto3

# make df
df = pd.DataFrame({'col1': [1, 2, 3]})

# bucket name
str_bucket = 'bucket_name'
# filename
str_key_file = 'df.pkl'
# key (path) inside the bucket
str_key_bucket = f'dir_1/dir2/{str_key_file}'

# write df to a local pkl file
df.to_pickle(str_key_file)

# put the object into s3
with open(str_key_file, 'rb') as f:
    boto3.resource('s3').Object(str_bucket, str_key_bucket).put(Body=f)
From the just-released book ‘Time Series Analysis with Python’ by Tarek Atwan, I learned this method:
import pandas as pd

df = pd.DataFrame(...)
df.to_pickle(
    's3://mybucket/pklfile.bz2',
    storage_options={
        'key': AWS_ACCESS_KEY,
        'secret': AWS_SECRET_KEY
    }
)
which I believe is more pythonic.
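For what it’s worth, reading it back is symmetric, and the compression is inferred from the .bz2 suffix; a sketch with the same placeholder credentials, assuming a pandas version whose read_pickle accepts storage_options:
restored_df = pd.read_pickle(
    's3://mybucket/pklfile.bz2',
    storage_options={
        'key': AWS_ACCESS_KEY,
        'secret': AWS_SECRET_KEY
    }
)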
I’ve found the best solution: just upgrade pandas and also install s3fs:
pip install s3fs==2022.8.2
pip install pandas==1.1.5
bucket, key = 'mybucket', 'path'
df.to_pickle(f"s3://{bucket}/{key}.pkl.gz", compression='gzip')
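Reading it back is also a one-liner; a sketch assuming the same bucket and key:
restored_df = pd.read_pickle(f"s3://{bucket}/{key}.pkl.gz", compression='gzip')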