Pandas transform json column into multiple columns

Question:

I need to transform following data frame with json values in column into dataframe columnar structure so that it will be taking less space and easy to compute.

Sample DataFrame:

obs_id date obs
I2213 2021-12-31 23:20:02.761008 "[{‘type’: ‘air’, ‘results’: {‘bat’: {‘F1’: 0.1, ‘F2’: 0.2}}, {‘type’: ‘water’, ‘results’: {‘neo’: {‘F1’: 0.3}}]"
I2213 2022-01-01 23:20:02.761008 "[{‘type’: ‘earth’, ‘results’: {‘cat’: {‘F1’: 0.4}}]"
I2213 2022-01-02 23:20:02.761008 "[{‘type’: ‘air’, ‘results’: {‘bat’: {‘F1’: 0.2, ‘F2’: 0.1}}]"

Required Transformation format:

obs_id date obs.air.bat.F1 obs.air.bat.F2 obs.water.neo.F1 obs.earth.cat.F1

not sure if multi-level columns will suit better here.

I tried to create a separate dataframe from obs column like:

df1 = pd.DataFrame(df['obs'].values.tolist())

but since it contains list instead of dictionary, it doesn’t work.
Is it possible to achieve the require format?

Asked By: ThomasBecker

||

Answers:

Let’s start from a correction to your source data.
As your sample contains unbalanced parentheses, probably its actual content
should be:

   obs_id  date        obs
0  I2213   2021-12-31  [{'type': 'air', 'results': {'bat': {'F1': 0.1, 'F2': 0.2}}}, {'type': 'water', 'results': {'neo': {'F1': 0.3}}}]
1  I2213   2022-01-01  [{'type': 'earth', 'results': {'cat': {'F1': 0.4}}}]
2  I2213   2022-01-02  [{'type': 'air', 'results': {'bat': {'F1': 0.2, 'F2': 0.1}}}]

To keep printouts of reasonable breadth, I dropped the time part from your date column.

Start the coding part from necessary imports:

import pandas as pd
import json

Then define a row processing function as:

def procRow(row):
    wrk1 = pd.json_normalize(json.loads(row.obs.replace("'", '"')))
    wrk2 = wrk1.set_index('type').stack().reset_index()
    return pd.Series(wrk2[0].values, index='obs.' + wrk2.type
        + wrk2.level_1.str.slice(7))

And concatenate first 2 columns of df with the result of application of this
function to each row:

result = pd.concat([df.iloc[:, 0:2], df.apply(procRow, axis=1)], axis=1)

The result is:

  obs_id       date  obs.air.bat.F1  obs.air.bat.F2  obs.earth.cat.F1  obs.water.neo.F1
0  I2213 2021-12-31             0.1             0.2               NaN               0.3
1  I2213 2022-01-01             NaN             NaN               0.4               NaN
2  I2213 2022-01-02             0.2             0.1               NaN               NaN
Answered By: Valdi_Bo
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.