Pandas/Google BigQuery: Schema mismatch makes the upload fail
Question:
The schema in my google table looks like this:
price_datetime : DATETIME,
symbol : STRING,
bid_open : FLOAT,
bid_high : FLOAT,
bid_low : FLOAT,
bid_close : FLOAT,
ask_open : FLOAT,
ask_high : FLOAT,
ask_low : FLOAT,
ask_close : FLOAT
After I do a pandas.read_gbq, I get a dataframe with column dtypes like this:
price_datetime object
symbol object
bid_open float64
bid_high float64
bid_low float64
bid_close float64
ask_open float64
ask_high float64
ask_low float64
ask_close float64
dtype: object
Now I want to use to_gbq, so I convert my local dataframe (which I just made) from these dtypes:
price_datetime datetime64[ns]
symbol object
bid_open float64
bid_high float64
bid_low float64
bid_close float64
ask_open float64
ask_high float64
ask_low float64
ask_close float64
dtype: object
to these dtypes:
price_datetime object
symbol object
bid_open float64
bid_high float64
bid_low float64
bid_close float64
ask_open float64
ask_high float64
ask_low float64
ask_close float64
dtype: object
by doing:
df['price_datetime'] = df['price_datetime'].astype(object)
Now I (think) I am ready to use to_gbq, so I do:
import pandas
pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
but I get the error:
---------------------------------------------------------------------------
InvalidSchema Traceback (most recent call last)
<ipython-input-15-d5a3f86ad382> in <module>()
1 a = time.time()
----> 2 pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
3 b = time.time()
4
5 print(b-a)
C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
825 elif if_exists == 'append':
826 if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 827 raise InvalidSchema("Please verify that the structure and "
828 "data types in the DataFrame match the "
829 "schema of the destination table.")
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Answers:
This is probably an issue related to pandas. If you check the code for to_gbq, you’ll see that it runs this code:
table_schema = _generate_bq_schema(dataframe)
where _generate_bq_schema is given by:
def _generate_bq_schema(df, default_type='STRING'):
""" Given a passed df, generate the associated Google BigQuery schema.
Parameters
----------
df : DataFrame
default_type : string
The default big query type in case the type of the column
does not exist in the schema.
"""
type_mapping = {
'i': 'INTEGER',
'b': 'BOOLEAN',
'f': 'FLOAT',
'O': 'STRING',
'S': 'STRING',
'U': 'STRING',
'M': 'TIMESTAMP'
}
fields = []
for column_name, dtype in df.dtypes.iteritems():
fields.append({'name': column_name,
'type': type_mapping.get(dtype.kind, default_type)})
return {'fields': fields}
As you can see, there’s no mapping to DATETIME. Your price_datetime column, once cast to object, has dtype.kind “O”, so it inevitably gets mapped to STRING, and the conflict with the table’s DATETIME column occurs.
The only workaround I’m aware of for now is to change your table schema from DATETIME to either TIMESTAMP or STRING.
It would probably be a good idea to open an issue on the pandas-gbq repository asking for this code to accept DATETIME as well.
[EDIT]:
I’ve opened this issue in their repository.
I had to do two things to solve the issue. First, I deleted my table and re-uploaded it with the columns as TIMESTAMP types rather than DATETIME types. This made sure the schema matched when the pandas.DataFrame with column type datetime64[ns] was uploaded using to_gbq, which converts datetime64[ns] to TIMESTAMP type and not to DATETIME type (for now).
The second thing I did was upgrade from pandas 0.19 to pandas 0.20. These two things solved my problem of a schema mismatch.
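Under that setup (a TIMESTAMP column in BigQuery), the key step is keeping the column as datetime64[ns] rather than object. A minimal sketch: pd.to_datetime parses an object/string column back into datetime64[ns], which to_gbq then infers as TIMESTAMP. The sample values and table name are placeholders:

```python
import pandas as pd

# price_datetime arrives as plain strings (object dtype)
df = pd.DataFrame({"price_datetime": ["2017-05-01 09:30:00", "2017-05-01 09:31:00"]})

# Parse back to datetime64[ns] so to_gbq infers TIMESTAMP instead of STRING
df["price_datetime"] = pd.to_datetime(df["price_datetime"])

# df.to_gbq("<dataset.table>", "<project>", if_exists="append")  # table column must be TIMESTAMP
```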
I had this issue and determined that pandas was sending the columns in alphabetical order by column_name, which, in my case, mismatched the schema of the BigQuery table. Hence, a column was expecting a date value when it got an integer, and so on, and it thus throws the “Invalid Schema” error. Check your column order.
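A minimal sketch of that fix: reindex the DataFrame columns to match the destination table before uploading. The table_columns list here is a hypothetical destination order; in practice you would take it from the table’s schema:

```python
import pandas as pd

# Hypothetical column order of the destination table
# (normally taken from client.get_table(...).schema)
table_columns = ["price_datetime", "symbol", "bid_open"]

df = pd.DataFrame({"bid_open": [1.25],
                   "symbol": ["EURUSD"],
                   "price_datetime": pd.to_datetime(["2017-05-01 09:30:00"])})

# Reorder the DataFrame columns to match the table before calling to_gbq
df = df[table_columns]
```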
Extract the BigQuery schema, generate the schema pandas_gbq would produce if you loaded your data into BigQuery, and compare the differences.
import logging

import pandas as pd
from google.cloud import bigquery
from pandas_gbq import schema

# Load the BQ schema
client = bigquery.Client()
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Put it in a pandas DataFrame indexed by field name
f_tuples = [(field.name, field.field_type) for field in table.schema]
bq_schema = pd.DataFrame.from_records(f_tuples, columns=["name", "type"])
bq_schema = bq_schema.set_index("name")

# Generate the load schema pandas_gbq would infer from df
load_schema = pd.json_normalize(schema.generate_bq_schema(df)["fields"])
load_schema = load_schema.set_index("name")

# Check whether each field exists in BQ and whether the field types match
for i in load_schema.index:
    try:
        b_type = bq_schema.loc[i, "type"]
        l_type = load_schema.loc[i, "type"]
    except KeyError:
        logging.warning(f'{i} not in BigQuery Schema')
        continue  # no type to compare for a missing field
    try:
        assert b_type == l_type
    except AssertionError:
        logging.warning(f'{i} with {l_type} does not match {b_type}')
This snippet will give the following log if a field is not in the BQ schema:
2022-11-25 12:38:16,249 WARNING TEST not in BigQuery Schema (950218398.<cell line: 31>:36)
It will give the following log if the field types do not match:
2022-11-25 12:36:29,304 WARNING domain_userid with TEST does not match STRING (2236687881.<cell line: 30>:39)
As mentioned in the pandas-gbq documentation, you can supply the schema yourself – "If the data type inference does not suit your needs, supply a BigQuery schema as the table_schema parameter of to_gbq()."
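A sketch of that approach for the question’s table (the column list is abbreviated, and the to_gbq call is left commented since it needs real credentials and table names):

```python
import pandas as pd

# Explicit BigQuery schema; names and types mirror the question's table (abbreviated)
table_schema = [
    {"name": "price_datetime", "type": "TIMESTAMP"},
    {"name": "symbol", "type": "STRING"},
    {"name": "bid_open", "type": "FLOAT"},
]

df = pd.DataFrame({"price_datetime": pd.to_datetime(["2017-05-01 09:30:00"]),
                   "symbol": ["EURUSD"],
                   "bid_open": [1.25]})

# df.to_gbq("<dataset.table>", project_id="<project>",
#           if_exists="append", table_schema=table_schema)
```

Passing table_schema bypasses the dtype-based inference entirely, so the object-vs-datetime64 distinction no longer decides the uploaded type.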