Flatten nested JSON string to different columns in Google BigQuery

Question:

I have a column in one of my BigQuery tables which looks like this:

{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}

Is there any way to get output like this in GBQ? (Basically, flatten the entire column into separate columns.)

name    last_delivered.push_id   last_delivered.time   session_id    source   properties.UserId
name1   push_id1                 time1                 session_id1   SDK      u1

Let's say

a = {"name": "name1", "last_delivered": {"push_id": "push_id1",
"time": "time1"}, "session_id": "session_id1", "source": "SDK",
"properties": {"UserId": "u1"}}

I have tried to get the desired output in Python with pandas using json_normalize(a), but every time I try I get an error.

Does anyone have any idea how I can get my desired output? Am I missing something?

Any help would be greatly appreciated!!

Asked By: Munagala

||

Answers:

The example below is for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}' col
)
SELECT 
  JSON_EXTRACT_SCALAR(col, '$.name') name,
  STRUCT(
    JSON_EXTRACT_SCALAR(col, '$.last_delivered.push_id') AS push_id,
    JSON_EXTRACT_SCALAR(col, '$.last_delivered.time') AS time
  ) last_delivered,
  JSON_EXTRACT_SCALAR(col, '$.session_id') session_id,
  JSON_EXTRACT_SCALAR(col, '$.source') source,
  STRUCT(
    JSON_EXTRACT_SCALAR(col, '$.properties.UserId') AS UserId
  ) properties
FROM `project.dataset.table`   

It produces the result as asked:

Row name    last_delivered.push_id  last_delivered.time session_id  source  properties.UserId    
1   name1   push_id1                time1               session_id1 SDK     u1     
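
If the JSON has many keys, writing every JSON_EXTRACT_SCALAR line by hand gets tedious. Here is a rough Python sketch that generates the select list from one sample document. The helper names flatten_paths and build_select are made up for this illustration, and it assumes the keys are plain identifiers (no dots, spaces or quotes) and that every row has the same shape as the sample. It emits flat columns with underscore aliases rather than the STRUCTs used above.

```python
import json


def flatten_paths(obj, prefix=""):
    """Yield the dotted path to every non-dict leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as "last_delivered".
            yield from flatten_paths(value, path)
        else:
            yield path


def build_select(sample_json, column="col"):
    """Build a SELECT list with one JSON_EXTRACT_SCALAR per leaf path."""
    paths = flatten_paths(json.loads(sample_json))
    exprs = [
        # Dots are replaced by underscores to get a flat column alias.
        f"  JSON_EXTRACT_SCALAR({column}, '$.{p}') AS {p.replace('.', '_')}"
        for p in paths
    ]
    return "SELECT\n" + ",\n".join(exprs)


sample = (
    '{"name": "name1", "last_delivered": {"push_id": "push_id1", '
    '"time": "time1"}, "session_id": "session_id1", "source": "SDK", '
    '"properties": {"UserId": "u1"}}'
)

sql = build_select(sample)
print(sql)  # append your own FROM clause before running it in BigQuery
```

Arrays are treated as leaves here, and JSON_EXTRACT_SCALAR returns NULL for anything that is not a scalar, so array-valued keys would still need separate handling.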
Answered By: Mikhail Berlyant

My guess as to why it’s not working is that your json data is actually a string:

from pandas.io.json import json_normalize 

a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''  

df = json_normalize(a)

Output:

AttributeError: 'str' object has no attribute 'values'    

Versus:

from pandas.io.json import json_normalize 

a = {"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}  

df = json_normalize(a)

Output:

 print(df.to_string())
  last_delivered.push_id last_delivered.time   name properties.UserId   session_id source
0               push_id1               time1  name1                u1  session_id1    SDK

If this is the case, you can call json.loads() right before json_normalize:

import json
from pandas.io.json import json_normalize

a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''  

data = json.loads(a)
df = json_normalize(data)
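
Since the question is about a whole column rather than a single value, here is a minimal sketch of the same fix applied to several rows. It assumes pandas 1.0 or later, where the function is exposed as pandas.json_normalize, and that the column has already been loaded into a pandas Series. The raw Series and its second row are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical column of JSON strings, as it might come back from BigQuery.
raw = pd.Series([
    '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, '
    '"session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}',
    '{"name": "name2", "last_delivered": {"push_id": "push_id2", "time": "time2"}, '
    '"session_id": "session_id2", "source": "SDK", "properties": {"UserId": "u2"}}',
])

# Parse each string into a dict first; json_normalize expects dicts, not strings.
records = raw.apply(json.loads).tolist()
flat = pd.json_normalize(records)

# Column order can differ between pandas versions, so set it explicitly.
wanted = [
    "name",
    "last_delivered.push_id",
    "last_delivered.time",
    "session_id",
    "source",
    "properties.UserId",
]
flat = flat[wanted]

print(flat.to_string())
```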
Answered By: chitown88

This might be useful if you want to do it in BigQuery itself:
https://medium.com/@vigneshmailappan/flattening-json-in-bigquery-f68e3a78a970

Answered By: vignesh