Create a BigQuery table from pandas dataframe, WITHOUT specifying schema explicitly
Question:
I have a pandas dataframe and want to create a BigQuery table from it. I understand that there are many posts asking about this question, but all the answers I can find so far require explicitly specifying the schema of every column. For example:
from google.cloud import bigquery as bq
client = bq.Client()
dataset_ref = client.dataset('my_dataset', project='my_project')
table_ref = dataset_ref.table('my_table')
job_config = bq.LoadJobConfig(
    schema=[
        bq.SchemaField("a", bq.enums.SqlTypeNames.STRING),
        bq.SchemaField("b", bq.enums.SqlTypeNames.INT64),
        bq.SchemaField("c", bq.enums.SqlTypeNames.FLOAT64),
    ]
)
client.load_table_from_dataframe(my_df, table_ref, job_config=job_config).result()
However, my dataframes sometimes have many columns (for example, 100), so spelling out every column by hand is really non-trivial. Is there a way to do this efficiently?
Btw, I found this post with a similar question: Efficiently write a Pandas dataframe to Google BigQuery
But it seems that bq.Schema.from_dataframe
does not exist:
AttributeError: module 'google.cloud.bigquery' has no attribute 'Schema'
Answers:
Here’s a code snippet to load a DataFrame to BQ:
import pandas as pd
from google.cloud import bigquery
# Example data
df = pd.DataFrame({'a': [1,2,4], 'b': ['123', '456', '000']})
# Create the client
client = bigquery.Client(project='your-project-id')
# Define the destination table, in dataset.table_name format
table = 'your-dataset.your-table'
# Load data to BQ and wait for the job to finish
job = client.load_table_from_dataframe(df, table)
job.result()
If you want to specify only a subset of the schema and still load all the columns, replace the load call with
# Define a job config object, with a subset of the schema
job_config = bigquery.LoadJobConfig(schema=[bigquery.SchemaField('b', 'STRING')])
# Load data to BQ and wait for the job to finish
job = client.load_table_from_dataframe(df, table, job_config=job_config)
job.result()
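Since the real goal is avoiding 100 hand-written SchemaFields, one option is to derive the schema from the DataFrame's dtypes. This is a sketch, not an official API: the dtype-to-BigQuery-type mapping below is an assumption that covers common cases, and the `schema_from_dataframe` helper is hypothetical.

```python
import pandas as pd

# Assumed mapping from pandas dtype names to BigQuery types; extend as needed.
DTYPE_TO_BQ = {
    "int64": "INT64",
    "float64": "FLOAT64",
    "bool": "BOOL",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",
}

def schema_from_dataframe(df):
    """Return (column_name, bigquery_type) pairs for every column in df."""
    return [
        (name, DTYPE_TO_BQ.get(str(dtype), "STRING"))
        for name, dtype in df.dtypes.items()
    ]

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2], "c": [0.5, 1.5]})
print(schema_from_dataframe(df))
# [('a', 'STRING'), ('b', 'INT64'), ('c', 'FLOAT64')]
```

You could then build the job config as `bigquery.LoadJobConfig(schema=[bigquery.SchemaField(n, t) for n, t in schema_from_dataframe(df)])`. Note that `load_table_from_dataframe` already infers a schema when none is given, so this is mainly useful when you want to inspect or tweak a few types before loading.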
Here is the working code:
from google.cloud import bigquery
import pandas as pd

# Create the client and build a reference to the destination table
bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset("dataset-name").table("table-name")

# Read the source data and start the load job
data_frame = pd.read_csv("file-name")
bigquery_job = bigquery_client.load_table_from_dataframe(data_frame, table_ref)

# Wait for the load to finish
bigquery_job.result()
Now it’s as easy as installing pandas-gbq==0.18.1
and then
from google.oauth2 import service_account

df.to_gbq(
    destination_table="my_project_id.my_dataset.my_table",
    project_id="my_project_id",
    credentials=service_account.Credentials.from_service_account_info(
        my_service_account_info  # there are several ways to authenticate
    ),
)
Docs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html
See the How to authenticate with Google BigQuery guide for authentication instructions.