Pandas/Google BigQuery: Schema mismatch makes the upload fail

Question:

The schema in my google table looks like this:

price_datetime : DATETIME,
symbol         : STRING,
bid_open       : FLOAT,
bid_high       : FLOAT,
bid_low        : FLOAT,
bid_close      : FLOAT,
ask_open       : FLOAT,
ask_high       : FLOAT,
ask_low        : FLOAT,
ask_close      : FLOAT

After I do a pandas.read_gbq I get a dataframe with column dtypes like this:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

Now I want to use to_gbq so I convert my local dataframe (which I just made) from these dtypes:

price_datetime    datetime64[ns]
symbol                    object
bid_open                 float64
bid_high                 float64
bid_low                  float64
bid_close                float64
ask_open                 float64
ask_high                 float64
ask_low                  float64
ask_close                float64
dtype: object

to these dtypes:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

by doing:

df['price_datetime'] = df['price_datetime'].astype(object)

Now I (think) I am read to use to_gbq so I do:

import pandas
pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')

but I get the error:

---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-15-d5a3f86ad382> in <module>()
      1 a = time.time()
----> 2 pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
      3 b = time.time()
      4 
      5 print(b-a)

C:UsersmeAppDataLocalContinuumAnaconda3libsite-packagespandasiogbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    825         elif if_exists == 'append':
    826             if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 827                 raise InvalidSchema("Please verify that the structure and "
    828                                     "data types in the DataFrame match the "
    829                                     "schema of the destination table.")

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Asked By: user1367204

||

Answers:

This is probably an issue related to pandas. If you check the code for to_gbq, you’ll see that it runs this code:

table_schema = _generate_bq_schema(dataframe)

Where _generate_bq_schema is given by:

def _generate_bq_schema(df, default_type='STRING'):
    """ Given a passed df, generate the associated Google BigQuery schema.
    Parameters
    ----------
    df : DataFrame
    default_type : string
        The default big query type in case the type of the column
        does not exist in the schema.
    """

    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'S': 'STRING',
        'U': 'STRING',
        'M': 'TIMESTAMP'
    }

    fields = []
    for column_name, dtype in df.dtypes.iteritems():
        fields.append({'name': column_name,
                       'type': type_mapping.get(dtype.kind, default_type)})

    return {'fields': fields}

As you can see, there’s no type mapping to DATETIME. This inevitably gets mapped to type STRING (since its dtype.kind is “O”) and then conflict occurs.

The only work around for now that I’m aware of would be to change your table schema from DATETIME to either TIMESTAMP or STRING.

It probably would be a good idea to start a new issue on pandas-bq repository asking to update this code to accept DATETIME as well.

[EDIT]:

I’ve opened this issue in their repository.

Answered By: Willian Fuks

I had to do two things that solved the issue for me. First, I deleted my table and reuploaded it with the columns as TIMESTAMP types rather than DATETIME types. This made sure that the schema matched when the pandas.DataFrame with column type datetime64[ns] was uploaded to using to_gbq, which converts datetime64[ns] to TIMESTAMP type and not to DATETIME type (for now).

The second thing I did was upgrade from pandas 0.19 to pandas 0.20. These two things solved my problem of a schema mismatch.

Answered By: user1367204

I had this issue and determined that Pandas is sending the columns in alphabetical order by column_name, which, in my case, mismatched the schema of the BigQuery Table. Hence, a column was expecting a date value when it got an integer, etc. It thus, throws the “Invalid Schema” error. Check your column order.

Answered By: rcfmonarch

Extract in the BigQuery Schema, generate the schema pandas_gbq would generate if you try loading your data into BigQuery and take a look at the differences.

import logging

from google.cloud import bigquery
from pandas_gbq import schema

# Load BQ schema
client =  bigquery.Client()
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Put it in Pandas
f_tuples = [(field.name, field.field_type) for field in table.schema]
bq_schema = pd.DataFrame.from_records(f_tuples, columns = ["name", "type"])
bq_schema = bq_schema.set_index("name")

# Generate the load schema
load_schema = pd.json_normalize(schema.generate_bq_schema(df)["fields"])
load_schema = load_schema.set_index("name")

# Use the following code to check whether the field is a field in BQ 
# and whether it's the same field type
for i in load_schema.index:
    try:
        b_type = bq_schema.loc[i, "type"]
        l_type = load_schema.loc[i, "type"]
    except KeyError:
        logging.warning(f'{i} not in BigQuery Schema')
    try:
        assert b_type == l_type
    except AssertionError:
        logging.warning(f'{i} with {l_type} does not match {b_type}')

This snippet will give the following log if the field is not in the BQ schema

2022-11-25 12:38:16,249 WARNING TEST not in BigQuery Schema (950218398.<cell line: 31>:36)

This snippet will give the following log if the field types do not match

2022-11-25 12:36:29,304 WARNING domain_userid with TEST does not match STRING (2236687881.<cell line: 30>:39)
Answered By: Sam

As mentioned in the pandas-gbq documentation, you can supply the schema yourself – "If the data type inference does not suit your needs, supply a BigQuery schema as the table_schema parameter of to_gbq()."

https://pandas-gbq.readthedocs.io/en/latest/writing.html

Answered By: Stav Hacohen
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.