Write a Pandas DataFrame to Google Cloud Storage or BigQuery

Question

Hello, and thanks for your time and consideration.
I am developing a Jupyter Notebook in Google Cloud Platform / Datalab.
I have created a Pandas DataFrame and would like to write it to Google Cloud Storage (GCS), BigQuery, or both. I have a bucket in GCS and have created the following objects with this code:

import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id    
bucket_name = 'steve-temp'           
bucket_path  = bucket_name   
bucket = storage.Bucket(bucket_path)
bucket.exists()

I have tried various approaches based on the Google Datalab documentation, but they all fail.
Thanks.

Asked By: EcoWarrior

Answer 1

I think you need to load it into a plain bytes variable and use a %%storage write --variable $sample_bucketpath (see the doc) in a separate cell. I’m still figuring it out, but that is roughly the inverse of what I needed to do to read a CSV file in. I don’t know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command. Hope it helps, let me know!
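
The bytes-buffer idea above can be sketched without any Datalab magic. This is only a minimal illustration, assuming pandas and the standard library; the helper names `df_to_csv_bytes` and `csv_bytes_to_df` are hypothetical, and the `%%storage` cell that would actually move the bytes to or from the bucket is left out.

```python
import io

import pandas as pd


def df_to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to UTF-8 encoded CSV bytes, without the index."""
    return df.to_csv(index=False).encode("utf-8")


def csv_bytes_to_df(payload: bytes) -> pd.DataFrame:
    """Parse CSV bytes (e.g. the result of a storage read) into a DataFrame."""
    return pd.read_csv(io.BytesIO(payload))


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
payload = df_to_csv_bytes(df)        # the bytes that would be written to GCS
restored = csv_bytes_to_df(payload)  # what a later read would reconstruct
```

The `payload` variable is the kind of plain bytes object that a storage write cell could take, and `io.BytesIO` is what turns a downloaded buffer back into something `pd.read_csv` accepts.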

Answered By: dartdog

Answer 2

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1,2,3],[4,5,6]],columns=['a','b','c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example and the _table.py file from the Datalab GitHub site as a reference. You can find other Datalab source code files at this link.

Answered By: Anthonios Partheniou

Answer 3

Using the Google Cloud Datalab documentation

import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')

Answered By: Jan Krynauw

Answer 4

Writing a Pandas DataFrame to BigQuery

Update on @Anthonios Partheniou’s answer.
The code is a bit different now (as of Nov. 29, 2017).

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset and table, then write to the table in BigQuery

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)

Answered By: Ekaba Bisong

Answer 5

I have a slightly simpler solution for this task using Dask. You can convert your DataFrame to a Dask DataFrame, which can then be written to CSV on Cloud Storage:

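import dask.dataframe as dd
import pandas

df  # your Pandas DataFrame
# gcs is not defined in this snippet; it is assumed to be an authenticated gcsfs.GCSFileSystem
ddf = dd.from_pandas(df, npartitions=1, sort=True)
# to_csv is called on the Dask DataFrame; '*' is replaced by the partition number
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})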

Answered By: Porada Kev

Answer 6

Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq.

The documentation has an example:

import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to ‘fail’, ‘replace’ or ‘append’.
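
As a rough sketch of how the three modes might be wrapped: 'fail' raises if the table already exists, 'replace' drops and recreates it, and 'append' adds rows to it. The wrapper name `write_to_bigquery` and the constant `VALID_IF_EXISTS` are hypothetical; the only library call is `pandas_gbq.to_gbq`, imported lazily so the validation can be tried without the package or credentials.

```python
import pandas as pd

# The three values accepted for if_exists.
VALID_IF_EXISTS = ("fail", "replace", "append")


def write_to_bigquery(df: pd.DataFrame, table: str, project_id: str,
                      if_exists: str = "fail") -> None:
    """Check if_exists locally, then hand the DataFrame to pandas_gbq.to_gbq."""
    if if_exists not in VALID_IF_EXISTS:
        raise ValueError(
            f"if_exists must be one of {VALID_IF_EXISTS}, got {if_exists!r}"
        )
    # Imported here so this module loads even where pandas-gbq is not installed.
    import pandas_gbq

    pandas_gbq.to_gbq(df, table, project_id=project_id, if_exists=if_exists)
```

A call such as `write_to_bigquery(df, 'my_dataset.my_table', 'my-project', if_exists='append')` would need pandas-gbq installed and valid Google credentials; the project name here is a placeholder.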

Uploading to Google Cloud Storage without writing a temporary file and only using the standard GCS module

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1,2,3],[4,5,6]],columns=['a','b','c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
    
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
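
The same in-memory upload could be wrapped in a small function. This is a sketch under assumptions, not the official pattern: the name `upload_df_as_csv` and its optional `client` argument are hypothetical, added so the function can be exercised with a stand-in client object instead of real credentials.

```python
import pandas as pd


def upload_df_as_csv(df: pd.DataFrame, bucket_name: str, blob_name: str,
                     client=None) -> str:
    """Upload df as a CSV object and return its gs:// URI."""
    if client is None:
        # Imported lazily so this loads without google-cloud-storage installed.
        from google.cloud import storage
        client = storage.Client()
    # index=False leaves out the row-index column that to_csv() writes by default.
    csv_text = df.to_csv(index=False)
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")
    return f"gs://{bucket_name}/{blob_name}"
```

Calling `upload_df_as_csv(df, 'my-bucket-name', 'upload_test/test.csv')` with no client falls back to `storage.Client()`, which picks up the credentials configured in the environment.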

Answered By: Theo

Answer 8

I spent a lot of time looking for the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

df.to_csv('gs://bucket/path')
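
Writing to a gs:// path this way needs the gcsfs package installed alongside pandas. The sketch below is a hedged illustration that uses a local temporary file as a stand-in, so it runs without a bucket or credentials; the helper name `write_csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_csv(df: pd.DataFrame, path: str) -> str:
    """Write df to path: a local file or, with gcsfs installed, 'gs://bucket/key'."""
    df.to_csv(path, index=False)
    return path


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

# Local stand-in for 'gs://bucket/path'; on GCP, pass a gs:// URI instead.
with tempfile.TemporaryDirectory() as tmp:
    local_path = write_csv(df, os.path.join(tmp, "data.csv"))
    round_trip = pd.read_csv(local_path)  # read back to confirm the contents
```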

Answered By: Vova Pytsyuk

Answer 9

To Google storage:

def write_df_to_gs(df, gs_key):
    df.to_csv(gs_key)

To BigQuery:

def upload_df_to_bq(df, project, bq_table):
    df.to_gbq(bq_table, project_id=project, if_exists='replace')

Answered By: Franco Piccolo

Answer 10

To save a Parquet file in GCS, authenticating with a service account:

df.to_parquet("gs://<bucket-name>/file.parquet",
              storage_options={"token": "<path-to-gcs-service-account-file>"})

Answered By: b0lle