Write pandas dataframe into AWS athena database

Question:

I have run a query using pyathena and created a pandas dataframe. Is there a way to write the pandas dataframe to an AWS Athena database directly, like data.to_sql for a MySQL database?

Sharing an example of the dataframe below for reference; this is the data that needs to be written into the AWS Athena database:

data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})
Asked By: PritamJ


Answers:

Storage for AWS Athena is S3, and it reads data from S3 files only. It was not possible earlier to write data directly to an Athena database the way you can with other databases, because Athena was missing support for insert into ....

As a workaround, you could use the following steps to make it work:

1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
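
A minimal sketch of that workaround, assuming boto3 credentials are configured and using a placeholder bucket and key:

import boto3
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'], 'score': [11, 22, 33]})

# Step 1: write the pandas output to a local file.
# Header and index are dropped so the rows match a plain CSV table definition.
data.to_csv("data.csv", index=False, header=False)

# Step 2: upload the file to the S3 prefix that backs the Athena table.
# Bucket name and key below are placeholders.
s3 = boto3.client("s3")
s3.upload_file("data.csv", "your-bucket", "path/to/athena/table/data.csv")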

I hope it gives you some pointers.

Update on 05/01/2020.

On Sep 19, 2019, AWS announced support for INSERT INTO in Athena, which makes one of the statements in the answer above incorrect. The solution I provided will still work, but the AWS announcement adds another possible approach going forward.

As the AWS documentation explains, this feature allows you to send INSERT statements and Athena will write the data back to a new file in the source table's S3 location. So essentially, AWS has resolved the headache of writing the data back to S3 files for you.

Just a note: Athena will write inserted data into separate files.
Here is the documentation.
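
As an illustration only, a hedged sketch of the INSERT-based approach using pyathena (the library from the question); the database, table, staging directory, and region below are placeholders:

from pyathena import connect

# Placeholder staging directory and region; adjust to your account.
conn = connect(
    s3_staging_dir="s3://your-query-results-bucket/staging/",
    region_name="us-east-1",
)
cursor = conn.cursor()

# Athena writes the inserted rows as new files under the table's S3 location.
cursor.execute(
    "INSERT INTO my_db.my_table (id, name, score) "
    "VALUES (1, 'a', 11), (2, 'b', 22), (3, 'c', 33)"
)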

Answered By: Red Boy

Another modern (as of February 2020) way to achieve this goal is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.

Applied to the case from the question, the code would look like this:

import pandas as pd
import awswrangler as wr

data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})

# Typical Pandas, Numpy or Pyarrow transformation HERE!

wr.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)

This is amazingly helpful, because aws-data-wrangler knows how to parse the table name from the path (though you can also provide the table name as a parameter) and defines proper types in the Glue catalog according to the dataframe.

It is also helpful for querying data with Athena directly into a pandas dataframe:

df = wr.pandas.read_table(database="database", table="table")

The whole process is fast and convenient.

Answered By: Robert Navado

One option is to use:

pandas_df.to_parquet(file, engine="pyarrow")

to save the dataframe first to a temporary file in Parquet format. For this you need to install the pyarrow dependency. Once the file is saved locally, you can push it to S3 using the AWS SDK for Python (boto3).
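
A minimal sketch of that flow, assuming boto3 credentials are configured and using placeholder bucket and key names:

import boto3
import pandas as pd

pandas_df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "score": [11, 22, 33]})

# Save the dataframe locally in Parquet format (requires pyarrow).
pandas_df.to_parquet("data.parquet", engine="pyarrow")

# Upload the file to the S3 prefix the Athena table will point at.
# Bucket and key are placeholders.
boto3.client("s3").upload_file("data.parquet", "your-bucket", "path/to/new/table/data.parquet")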

A new table in Athena can now be created by executing the following query:

    CREATE EXTERNAL TABLE IF NOT EXISTS `your_new_table`
        (col1 type1, col2 type2)
    PARTITIONED BY (col_partitions_if_necessary)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 's3 location of your parquet file'
    tblproperties ("parquet.compression"="snappy");
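
Note that if the table is partitioned, the partitions also have to be loaded (for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION) before the data shows up in queries.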

Another option is to use pyathena for that. Taking the example from their official documentation:

import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn_str = (
    "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/"
    "{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"
)

engine = create_engine(conn_str.format(
    region_name="us-west-2",
    schema_name="YOUR_SCHEMA",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
    s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")

In this case, the sqlalchemy dependency is needed.

Answered By: Cocomico

The top answer at the time of writing uses an older version of the API, which no longer works.

The documentation now outlines this round trip.

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

Answered By: mcskinner