Flatten nested JSON string to different columns in Google BigQuery
Question:
I have column in one of the BigQuery table which looks like this.
{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}
Is there any was to get the output like this in GBQ ?? (basically flatten the entire column into different columns)
name last_delivered.push_id last_delivered.time session_id source properties.UserId
name1 push_id1 time1 session_id1 SDK uid1
Let’s say
a = {“name”: “name1”, “last_delivered”: {“push_id”: “push_id1”,
“time”: “time1”}, “session_id”: “session_id1”, “source”: “SDK”,
“properties”: {“UserId”: “u1”}}
I have tried to get desired output in Pandas Python using json_normalize(a) , but every time I try get the following error
Does anyone has any idea how can I get my desired output. Am I missing something ??
Any help would be greatly appreciated!!
Answers:
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}' col
)
SELECT
JSON_EXTRACT_SCALAR(col, '$.name') name,
STRUCT(
JSON_EXTRACT_SCALAR(col, '$.last_delivered.push_id') AS push_id,
JSON_EXTRACT_SCALAR(col, '$.last_delivered.time') AS time
) last_delivered,
JSON_EXTRACT_SCALAR(col, '$.session_id') session_id,
JSON_EXTRACT_SCALAR(col, '$.source') source,
STRUCT(
JSON_EXTRACT_SCALAR(col, '$.properties.UserId') AS UserId
) properties
FROM `project.dataset.table`
and produces result as expected/asked
Row name last_delivered.push_id last_delivered.time session_id source properties.UserId
1 name1 push_id1 time1 session_id1 SDK u1
My guess as to why it’s not working is that your json data is actually a string:
from pandas.io.json import json_normalize
a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''
df = json_normalize(a)
Output:
AttributeError: 'str' object has no attribute 'values'
Versus:
from pandas.io.json import json_normalize
a = {"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}
df = json_normalize(a)
Output:
print(df.to_string())
last_delivered.push_id last_delivered.time name properties.UserId session_id source
0 push_id1 time1 name1 u1 session_id1 SDK
If this is the case, you can use json.loads()
right before normalize:
import json
from pandas.io.json import json_normalize
a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''
data = json.loads(a)
df = json_normalize(data)
This might be useful if you want do it in bigquery itself
https://medium.com/@vigneshmailappan/flattening-json-in-bigquery-f68e3a78a970
I have column in one of the BigQuery table which looks like this.
{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}
Is there any was to get the output like this in GBQ ?? (basically flatten the entire column into different columns)
name last_delivered.push_id last_delivered.time session_id source properties.UserId
name1 push_id1 time1 session_id1 SDK uid1
Let’s say
a = {“name”: “name1”, “last_delivered”: {“push_id”: “push_id1”,
“time”: “time1”}, “session_id”: “session_id1”, “source”: “SDK”,
“properties”: {“UserId”: “u1”}}
I have tried to get desired output in Pandas Python using json_normalize(a) , but every time I try get the following error
Does anyone has any idea how can I get my desired output. Am I missing something ??
Any help would be greatly appreciated!!
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}' col
)
SELECT
JSON_EXTRACT_SCALAR(col, '$.name') name,
STRUCT(
JSON_EXTRACT_SCALAR(col, '$.last_delivered.push_id') AS push_id,
JSON_EXTRACT_SCALAR(col, '$.last_delivered.time') AS time
) last_delivered,
JSON_EXTRACT_SCALAR(col, '$.session_id') session_id,
JSON_EXTRACT_SCALAR(col, '$.source') source,
STRUCT(
JSON_EXTRACT_SCALAR(col, '$.properties.UserId') AS UserId
) properties
FROM `project.dataset.table`
and produces result as expected/asked
Row name last_delivered.push_id last_delivered.time session_id source properties.UserId
1 name1 push_id1 time1 session_id1 SDK u1
My guess as to why it’s not working is that your json data is actually a string:
from pandas.io.json import json_normalize
a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''
df = json_normalize(a)
Output:
AttributeError: 'str' object has no attribute 'values'
Versus:
from pandas.io.json import json_normalize
a = {"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}
df = json_normalize(a)
Output:
print(df.to_string())
last_delivered.push_id last_delivered.time name properties.UserId session_id source
0 push_id1 time1 name1 u1 session_id1 SDK
If this is the case, you can use json.loads()
right before normalize:
import json
from pandas.io.json import json_normalize
a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''
data = json.loads(a)
df = json_normalize(data)
This might be useful if you want do it in bigquery itself
https://medium.com/@vigneshmailappan/flattening-json-in-bigquery-f68e3a78a970