How to convert pandas data types into BQ schema
Question:
I am trying to construct a BigQuery schema from the data types of a pandas DataFrame.
The schema should be in JSON format.
I started with the code below, but I am not able to build the base dictionary.
My code:
import pandas as pd
df = pd.DataFrame({'A': [1, 2],
                   'B': [1., 2.],
                   'C': ['a', 'b'],
                   'D': [True, False]})
dict1 = df.dtypes.apply(lambda x: x.name).to_dict()
new_dict = {}
for k, v in dict1.items():
    new_dict["name"] = k.lower()
    if v == 'bool':
        new_dict["dtype"] = "BOOL"
    elif v == 'object':
        new_dict["dtype"] = "STRING"
    elif v == 'int64':
        new_dict["dtype"] = "INTEGER"
    new_dict["mode"] = "NULLABLE"
With the above loop I only get the last record in new_dict.
Expected output is:
[
{
"name": "col1",
"mode": "NULLABLE",
"type": "STRING"
},
{
"name": "col2",
"mode": "NULLABLE",
"type": "INTEGER"
}
]
Please suggest.
Answers:
Here is the code snippet that achieves my goal.
json_list = []
for col_name, datatype in dict1.items():
    # Build a fresh dict for every column so earlier entries are not overwritten.
    # The key is "type" (not "dtype") to match the expected output; a dtype with
    # no mapping below keeps its pandas name.
    new_dict = {"name": col_name.lower(), "mode": "NULLABLE", "type": datatype}
    if datatype == 'bool':
        new_dict["type"] = "BOOL"
    elif datatype == 'object':
        new_dict["type"] = "STRING"
    elif datatype == 'int64':
        new_dict["type"] = "INTEGER"
    elif datatype == 'float64':
        new_dict["type"] = "FLOAT"
    json_list.append(new_dict)
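The same idea can be written with a lookup table instead of an if/elif chain. This is only a minimal sketch: the names DTYPE_TO_BQ and build_bq_schema are made up for illustration, the mapping covers just the four dtypes from the question, and falling back to STRING for anything else is an assumption. dict1 is hard-coded here to the values the question's df.dtypes line would produce for the example DataFrame.

```python
import json

# Assumed mapping from pandas dtype names to BigQuery types (extend as needed).
DTYPE_TO_BQ = {
    "bool": "BOOLEAN",
    "object": "STRING",
    "int64": "INTEGER",
    "float64": "FLOAT",
}


def build_bq_schema(dtype_names, default_type="STRING"):
    """Hypothetical helper: turn {column: dtype name} into a list of field dicts."""
    return [
        {
            "name": col.lower(),
            "mode": "NULLABLE",
            # Unknown dtype names fall back to default_type (an assumption).
            "type": DTYPE_TO_BQ.get(dtype_name, default_type),
        }
        for col, dtype_name in dtype_names.items()
    ]


# What df.dtypes.apply(lambda x: x.name).to_dict() yields for the example df.
dict1 = {"A": "int64", "B": "float64", "C": "object", "D": "bool"}

schema = build_bq_schema(dict1)
print(json.dumps(schema, indent=2))  # JSON text, ready to write to a file
```

Because a new dict is created for each column inside the list comprehension, every column ends up as its own entry, which is what the original loop was missing.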
The pandas_gbq library supports this.
import pandas as pd
import pandas_gbq
import pprint
df = pd.DataFrame({'A': [1, 2],
                   'B': [1., 2.],
                   'C': ['a', 'b'],
                   'D': [True, False]})
schema = pandas_gbq.schema.generate_bq_schema(df, default_type="STRING")['fields']
pprint.pprint(schema)
Gives the output:
[{'name': 'A', 'type': 'INTEGER'},
 {'name': 'B', 'type': 'FLOAT'},
 {'name': 'C', 'type': 'STRING'},
 {'name': 'D', 'type': 'BOOLEAN'}]
You can then add the mode manually.
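One way to add the mode, as a small sketch: the field list below is copied from the output shown above rather than generated, and using NULLABLE for every column is an assumption taken from the expected output in the question.

```python
import json

# Field list copied from the pandas_gbq output shown above.
schema = [
    {'name': 'A', 'type': 'INTEGER'},
    {'name': 'B', 'type': 'FLOAT'},
    {'name': 'C', 'type': 'STRING'},
    {'name': 'D', 'type': 'BOOLEAN'},
]

# Copy each field dict and add a "mode" key; the originals are left untouched.
schema_with_mode = [{**field, "mode": "NULLABLE"} for field in schema]

print(json.dumps(schema_with_mode, indent=2))
```

If a column must never contain nulls, its mode can be set to REQUIRED instead.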