BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
Question:
Environment details
- OS type and version: 1.5.29-debian10
- Python version: 3.7
- google-cloud-bigquery version: 2.8.0
I’m provisioning a Dataproc cluster that loads data from BigQuery into a pandas DataFrame.
As my data grows, I was looking to boost performance and heard about using the BigQuery Storage client.
I ran into the same problem in the past, and it was solved by pinning google-cloud-bigquery to version 1.26.1.
If I use that version, I get the following warning:
/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
"Cannot create BigQuery Storage client, the dependency "
The code snippet then executes, but at a much slower rate. If I do not pin the pip version, I encounter the error below.
Steps to reproduce
- Cluster creation on Dataproc
gcloud dataproc clusters create testing-cluster --region=europe-west1 --zone=europe-west1-b --master-machine-type n1-standard-16 --single-node --image-version 1.5-debian10 --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'
- Execute the following script on the cluster
from google.cloud import bigquery

# `query` holds the SQL text (defined elsewhere in the job script)
bqclient = bigquery.Client(project=project)
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("query_start", "STRING", '2021-02-09 00:00:00'),
        bigquery.ScalarQueryParameter("query_end", "STRING", '2021-02-09 23:59:59.99'),
    ]
)
df = bqclient.query(query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
2021-02-11 10:10:14,069 - preprocessing logger initialized
2021-02-11 10:10:14,069 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
Traceback (most recent call last):
File "/tmp/782503bcc80246258560a07d2179891f/immo_preprocessing-pageviews_kyero.py", line 104, in <module>
df = bqclient.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1333, in to_dataframe
date_as_object=date_as_object,
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
Using the pandas-gbq variant gives exactly the same error:
import pandas as pd

# `base_query` holds the SQL text (defined elsewhere in the job script)
query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'query_start',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 00:00:00'}
            },
            {
                'name': 'query_end',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 23:59:59.99'}
            },
        ]
    }
}
df = pd.read_gbq(base_query,
                 configuration=query_config,
                 progress_bar_type='tqdm',
                 use_bqstorage_api=True)
2021-02-11 09:21:19,532 - preprocessing logger initialized
2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
started
Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
Traceback (most recent call last):
File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
use_bqstorage_api=True)
File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
**kwargs,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
dtypes=dtypes,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
user_dtypes=dtypes,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
**to_dataframe_kwargs
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
Answers:
By default, Dataproc installs pyarrow 0.15.0, while the BigQuery Storage API needs a more recent version. Manually pinning pyarrow to 3.0.0 at install time solved the issue.
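Concretely, that means appending the pin to the PIP_PACKAGES metadata of the cluster-creation command from the question; the pyarrow==3.0.0 entry at the end is the only change:

gcloud dataproc clusters create testing-cluster --region=europe-west1 --zone=europe-west1-b --master-machine-type n1-standard-16 --single-node --image-version 1.5-debian10 --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq pyarrow==3.0.0'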
That being said, PySpark has a compatibility setting for pyarrow >= 0.15.0:
https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark
I took a look at the Dataproc release notes, and this environment variable (ARROW_PRE_0_15_IPC_FORMAT=1, per the linked docs) has been set by default since May 2020.
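For reference, a minimal sketch of what that compatibility setting amounts to (assuming, per the linked Spark docs, that ARROW_PRE_0_15_IPC_FORMAT is the variable in question), e.g. in spark-env.sh:

# Tell the Spark driver and workers to use the legacy Arrow IPC format,
# keeping pyarrow >= 0.15.0 compatible with Spark 2.3.x/2.4.x.
export ARROW_PRE_0_15_IPC_FORMAT=1

Note this only covers Spark's own Arrow usage; it does not make pyarrow 0.15.0's to_pandas() accept the timestamp_as_object keyword, which is why upgrading pyarrow is still the actual fix here.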
@Sam answered this, but I thought I’d just mention the actionable commands:
In a Jupyter notebook:
!pip install pyarrow==3.0.0
In your virtualenv:
pip install pyarrow==3.0.0
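To confirm the environment actually picked up the newer pyarrow before re-running the job, a quick sanity check might look like this (a minimal sketch; the SELECT 1 query is a placeholder, not from the original job):

import pyarrow
from google.cloud import bigquery

# pyarrow only added the timestamp_as_object kwarg to to_pandas() after
# 0.15.0, so anything that old fails with the TypeError above.
print(pyarrow.__version__)  # expect 3.0.0 after the pin, not 0.15.0

# Placeholder query just to exercise the BigQuery Storage download path.
bqclient = bigquery.Client()
df = bqclient.query("SELECT 1 AS x").to_dataframe(create_bqstorage_client=True)
print(df)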