Pandas to_gbq() TypeError "Expected bytes, got a 'int' object"

Question:

I am using the pandas_gbq module to try and append a dataframe to a table in Google BigQuery.

I keep getting this error:

ArrowTypeError: Expected bytes, got a 'int' object.

I can confirm the data types of the dataframe match the schema of the BQ table.

I found this post regarding Parquet files not being able to have mixed datatypes: Pandas to parquet file

In the error message I’m receiving, I see there is a reference to a Parquet file, so I’m assuming the df.to_gbq() call is creating a Parquet file and I have a mixed data type column, which is causing the error. The error message doesn’t specify.

I think that my challenge is that I can't seem to find which column has the mixed datatype – I've tried casting them all as strings and then specifying the table_schema parameter, but that hasn't worked either.

This is the full error traceback:

In [76]: df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-76-74cec633c5d0> in <module>
----> 1 df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')

~\Anaconda3\lib\site-packages\pandas\core\frame.py in to_gbq(self, destination_table,
project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, 
progress_bar, credentials)
   1708         from pandas.io import gbq
   1709
-> 1710         gbq.to_gbq(
   1711             self,
   1712             destination_table,

~\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
    209 ) -> None:
    210     pandas_gbq = _try_import()
--> 211     pandas_gbq.to_gbq(
    212         dataframe,
    213         destination_table,

~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, api_method, verbose, private_key)
   1191         return
   1192
-> 1193     connector.load_data(
   1194         dataframe,
   1195         destination_table_ref,

~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, destination_table_ref, chunksize, schema, progress_bar, api_method, billing_project)
    584
    585         try:
--> 586             chunks = load.load_chunks(
    587                 self.client,
    588                 dataframe,

~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_chunks(client, dataframe, destination_table_ref, chunksize, schema, location, api_method, billing_project)
    235 ):
    236     if api_method == "load_parquet":
--> 237         load_parquet(
    238             client,
    239             dataframe,

~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_parquet(client, dataframe, destination_table_ref, location, schema, billing_project)
    127
    128     try:
--> 129         client.load_table_from_dataframe(
    130             dataframe,
    131             destination_table_ref,

~\Anaconda3\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config, parquet_compression, timeout)
   2669                         parquet_compression = parquet_compression.upper()
   2670
-> 2671                     _pandas_helpers.dataframe_to_parquet(
   2672                         dataframe,
   2673                         job_config.schema,

~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_parquet(dataframe, bq_schema, filepath, parquet_compression, parquet_use_compliant_nested_type)
    584
    585     bq_schema = schema._to_schema_fields(bq_schema)
--> 586     arrow_table = dataframe_to_arrow(dataframe, bq_schema)
    587     pyarrow.parquet.write_table(
    588         arrow_table, filepath, compression=parquet_compression, **kwargs,

~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_arrow(dataframe, bq_schema)
    527         arrow_names.append(bq_field.name)
    528         arrow_arrays.append(
--> 529             bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
    530         )
    531         arrow_fields.append(bq_to_arrow_field(bq_field, arrow_arrays[-1].type))

~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in bq_to_arrow_array(series, bq_field)
    288     if field_type_upper in schema._STRUCT_TYPES:
    289         return pyarrow.StructArray.from_pandas(series, type=arrow_type)
--> 290     return pyarrow.Array.from_pandas(series, type=arrow_type)
    291
    292

~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.Array.from_pandas()

~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()

~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._ndarray_to_array()

~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected bytes, got a 'int' object
Asked By: markd227


Answers:

Not really an answer but a kludgy workaround. I’m having this exact same problem with dataframes which contain columns of the INT64 type. I’ve found that doing the following works:

import pandas as pd
from io import StringIO
# temporarily store the dataframe as a csv in a string variable
temp_csv_string = df.to_csv(sep=";", index=False)
temp_csv_string_IO = StringIO(temp_csv_string)
# create new dataframe from string variable
new_df = pd.read_csv(temp_csv_string_IO, sep=";")
# this new df can be uploaded to BQ with no issues
new_df.to_gbq(table_id, project_id, if_exists="append")

I have no idea why this works. Both dataframes seem to be identical if you look at df.info() and new_df.info(). I decided to try this after saving the offending dataframe as a CSV and uploading it to BigQuery in that format, which worked.

Note that this specifically happens with INT64 type columns. I'm uploading dataframes generated in the same way that don't contain INT64 values without any issues.
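
If you want to avoid the CSV detour, pandas can sometimes re-infer better dtypes in memory with infer_objects(). A sketch along the same lines (not guaranteed to work for every dataframe):

# let pandas try to convert object columns back to proper dtypes (e.g. object -> int64)
new_df = df.infer_objects()
new_df.info()  # check whether the suspect columns are now int64 instead of object
new_df.to_gbq(table_id, project_id, if_exists="append")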

Answered By: Óscar

Had this same issue – solved it simply with

df = df.astype(str)

and doing to_gbq on that instead.
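
For example, with the call from the question (this does leave you with all-STRING columns, as the last answer below points out, so it suits a freshly created or all-STRING table best):

df = df.astype(str)
df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')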

Answered By: John F

I had a similar issue when loading API data to BigQuery, and I believe this is a more efficient way to get rid of the Int64_field_0.

# replace the default integer index with blank strings to avoid the autogenerated Int64_field_0 column
blankIndex = [''] * len(df)
df.index = blankIndex
df

Answered By: Muyukani Kizito

Provide the expected schema of your table as a list of dictionaries, like so (replace the names and types with your own):

schema = [{'name': 'row', 'type': 'INTEGER'}, {'name': 'city', 'type': 'STRING'}, {'name': 'value', 'type': 'INTEGER'}]

and then pass table_schema=schema to the function.
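
For example, with the call from the question:

df.to_gbq(
    'Pricecrawler.Daily_Crawl_Data',
    project_id=project_id,
    if_exists='append',
    table_schema=schema,
)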

This function allows for a schema to be passed. Wrong!
This function requires a schema to be passed, but if you don't provide one, a little snippet will quickly autogenerate one for you.
When the contents further down don't match that autogenerated schema, it fails like this, because the default type is STRING and nothing at the top of the column indicated that anything other than the default was needed.
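
One quick way to hunt for the offending column (a sketch, assuming pyarrow is importable, which it has to be for this load path) is to list the Python types inside each object-dtype column and try converting those columns with an explicit string type, the way the autogenerated STRING default would:

import pyarrow as pa

# object-dtype columns are the ones that fall back to STRING in the autogenerated schema;
# any of them that actually hold ints (or a mix of types) will fail like the traceback above
for col in df.columns[df.dtypes == object]:
    print(col, df[col].map(type).value_counts().to_dict())
    try:
        pa.Array.from_pandas(df[col], type=pa.string())
    except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
        print(f"  -> {col} would fail: {exc}")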

Workaround alert -> if the row at the top has all the fields populated, chances are it will infer the schema correctly and you avoid the error without passing a schema.

This makes the function's behavior look random, which is what throws people off sometimes.

The astype(str) solution works because it makes everything a String, and with String being the default, well... it works.
However, you end up with a table whose columns you then have to re-type if you want all your fields to be proper.

Hope it helps. If so, upvote!

Answered By: Rick Il Grande