Pandas to_gbq() TypeError "Expected bytes, got a 'int' object"
Question:
I am using the pandas_gbq module to try to append a dataframe to a table in Google BigQuery.
I keep getting this error:
ArrowTypeError: Expected bytes, got a 'int' object.
I can confirm the data types of the dataframe match the schema of the BQ table.
I found this post regarding Parquet files not being able to have mixed datatypes: Pandas to parquet file
In the error message I'm receiving, I see there is a reference to a Parquet file, so I'm assuming the df.to_gbq() call is creating a Parquet file and that I have a mixed-datatype column, which is causing the error; the error message doesn't specify which column.
I think my challenge is that I can't seem to find which column has the mixed datatype. I've tried casting them all as strings and then specifying the table schema parameter, but that hasn't worked either.
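One quick way to hunt for a mixed-type column (a diagnostic sketch of my own, not part of the original question; it assumes df is the dataframe being uploaded) is to scan the object-dtype columns, since that is where mixed Python types hide:

for col in df.select_dtypes(include="object").columns:
    # collect the distinct Python types present in this column
    kinds = df[col].map(type).unique()
    if len(kinds) > 1:
        print(col, kinds)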
This is the full error traceback:
In [76]: df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')
ArrowTypeError Traceback (most recent call last)
<ipython-input-76-74cec633c5d0> in <module>
----> 1 df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')
~\Anaconda3\lib\site-packages\pandas\core\frame.py in to_gbq(self, destination_table,
project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location,
progress_bar, credentials)
1708 from pandas.io import gbq
1709
-> 1710 gbq.to_gbq(
1711 self,
1712 destination_table,
~\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
209 ) -> None:
210 pandas_gbq = _try_import()
--> 211 pandas_gbq.to_gbq(
212 dataframe,
213 destination_table,
~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, api_method, verbose, private_key)
1191 return
1192
-> 1193 connector.load_data(
1194 dataframe,
1195 destination_table_ref,
~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, destination_table_ref, chunksize, schema, progress_bar, api_method, billing_project)
584
585 try:
--> 586 chunks = load.load_chunks(
587 self.client,
588 dataframe,
~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_chunks(client, dataframe, destination_table_ref, chunksize, schema, location, api_method, billing_project)
235 ):
236 if api_method == "load_parquet":
--> 237 load_parquet(
238 client,
239 dataframe,
~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_parquet(client, dataframe, destination_table_ref, location, schema, billing_project)
127
128 try:
--> 129 client.load_table_from_dataframe(
130 dataframe,
131 destination_table_ref,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config, parquet_compression, timeout)
2669 parquet_compression = parquet_compression.upper()
2670
-> 2671 _pandas_helpers.dataframe_to_parquet(
2672 dataframe,
2673 job_config.schema,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_parquet(dataframe, bq_schema, filepath, parquet_compression, parquet_use_compliant_nested_type)
584
585 bq_schema = schema._to_schema_fields(bq_schema)
--> 586 arrow_table = dataframe_to_arrow(dataframe, bq_schema)
587 pyarrow.parquet.write_table(
588 arrow_table, filepath, compression=parquet_compression, **kwargs,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_arrow(dataframe, bq_schema)
527 arrow_names.append(bq_field.name)
528 arrow_arrays.append(
--> 529 bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
530 )
531 arrow_fields.append(bq_to_arrow_field(bq_field, arrow_arrays[-1].type))
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in bq_to_arrow_array(series, bq_field)
288 if field_type_upper in schema._STRUCT_TYPES:
289 return pyarrow.StructArray.from_pandas(series, type=arrow_type)
--> 290 return pyarrow.Array.from_pandas(series, type=arrow_type)
291
292
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.Array.from_pandas()
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._ndarray_to_array()
~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
ArrowTypeError: Expected bytes, got a 'int' object
Answers:
Not really an answer but a kludgy workaround. I’m having this exact same problem with dataframes which contain columns of the INT64 type. I’ve found that doing the following works:
import pandas as pd
from io import StringIO

# temporarily store the dataframe as a CSV in a string variable
temp_csv_string = df.to_csv(sep=";", index=False)
temp_csv_string_IO = StringIO(temp_csv_string)
# create a new dataframe from the string variable
new_df = pd.read_csv(temp_csv_string_IO, sep=";")
# this new df can be uploaded to BQ with no issues
new_df.to_gbq(table_id, project_id, if_exists="append")
I have no idea why this works. Both dataframes seem to be identical if you look at df.info() and new_df.info(). I decided to try this after saving the offending dataframe as a CSV and uploading it to BigQuery in that format, which worked.
Note that this specifically happens with INT64-type columns. I'm uploading dataframes generated in the same way that don't contain INT64 values without any issues.
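A plausible explanation (my speculation, not the answerer's): the CSV round trip converts pandas' nullable Int64 columns into plain NumPy int64 columns, which the Arrow/Parquet conversion handles. If that is the cause, a more direct sketch of the same fix is to cast those columns explicitly; note this plain astype call assumes the columns contain no missing values:

# cast nullable Int64 columns back to plain int64 before uploading
int64_cols = df.select_dtypes(include="Int64").columns
df[int64_cols] = df[int64_cols].astype("int64")
df.to_gbq(table_id, project_id, if_exists="append")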
Had this same issue; I solved it simply with
df = df.astype(str)
and calling to_gbq on that instead.
I had a similar issue when loading API data to BigQuery, and I believe this is a more efficient way to get rid of the Int64_field_0 column:
# replace the integer index with blank labels so it is not uploaded as Int64_field_0
blankIndex = [''] * len(df)
df.index = blankIndex
df
Provide the expected schema of your table as a list of dictionaries, like so:
schema = [{'name': 'row', 'type': 'INTEGER'}, {'name': 'city', 'type': 'STRING'}, {'name': 'value', 'type': 'INTEGER'}]
(replace these with your own fields), and then pass table_schema=schema to the function, as in the sketch below.
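Putting that together with the call from the question, a sketch (the schema list is the example above; the field names must match your actual table):

df.to_gbq(
    'Pricecrawler.Daily_Crawl_Data',
    project_id=project_id,
    if_exists='append',
    table_schema=schema,  # explicit schema instead of the autogenerated one
)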
This function allows a schema to be passed. Wrong!
This function requires a schema; if you don't pass one, a little snippet will quickly autogenerate one for you.
When the contents further down the column don't match that autogenerated schema, it fails like this, because the default type is STRING and nothing at the top of the column indicated that anything other than the default was needed.
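To see why the message complains about bytes, here is a minimal repro (my illustration, not from the post): asking Arrow to build a STRING column out of Python ints raises exactly this error, because Arrow expects string/bytes values for that type:

import pyarrow as pa

# a column declared (or inferred) as STRING but actually holding ints
pa.array([1, 2, 3], type=pa.string())
# ArrowTypeError: Expected bytes, got a 'int' object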
Workaround alert: if the top row has all the fields populated, chances are the schema will be inferred correctly and you avoid the error without passing one.
This is what makes the function's behavior look random, and it is what throws people off sometimes.
The first solution works because it makes everything a string, and string being the default, well... it works. However, you end up with a table that you now have to retype if you want all your fields to have proper types.
Hope it helps. If so, upvote!