How to json_normalize a column with NaNs

Question:

  • This question is specific to columns of data in a pandas.DataFrame
  • This question depends on whether the values in the column are str, dict, or list type.
  • This question addresses how to handle NaN values when df.dropna().reset_index(drop=True) isn’t a valid option.

Case 1

  • With a column of str type, the values in the column must be converted to dict type, with ast.literal_eval, before using .json_normalize.
import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
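
The error arises because the missing value is a float rather than a str, so literal_eval has nothing to parse. A minimal sketch of this in isolation, outside of any DataFrame (the variable names are illustrative only):

```python
from ast import literal_eval

import numpy as np

value = np.nan
print(type(value))  # NaN is a float, not a str

# literal_eval only accepts a str (or an AST node), so a float is rejected
try:
    literal_eval(value)
    error = None
except ValueError as exc:
    error = exc

print(repr(error))
```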

Case 2

  • With a column of dict type, use pandas.json_normalize to convert keys to column headers and values to rows
df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict

pd.json_normalize(df.col_dict)

Error:

pd.json_normalize(df.col_dict) results in AttributeError: 'float' object has no attribute 'items'

Case 3

  • With a column of str type, where the dicts are inside a list, normalize the column as follows:
    • apply literal_eval, because explode doesn’t work on str type
    • explode the column to separate the dicts into separate rows
    • normalize the column
df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str
    
df.col_str.apply(literal_eval)

Error:

df.col_str.apply(literal_eval) results in ValueError: malformed node or string: nan
Asked By: Trenton McKinney

Answers:

  • There is always the option to:
    • df = df.dropna().reset_index(drop=True)
    • That’s fine for the dummy data here, or when the other columns of the dataframe don’t matter.
    • It isn’t a good option for dataframes with additional columns that are required.
  • Tested in Python 3.10, pandas 1.4.3
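
To illustrate why dropping rows can be a problem, here is a small sketch with an extra column. The 'id' column and its values are hypothetical and exist only to show that the whole row is lost, not just the NaN:

```python
import numpy as np
import pandas as pd

# 'id' is a hypothetical extra column that should be preserved
df = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'col_dict': [{"a": "46"}, {"b": "2"}, {"c": "11"}, np.nan],
})

# dropna removes the entire row, including the 'id' value 13
dropped = df.dropna().reset_index(drop=True)

print(len(df), len(dropped))
print(dropped['id'].tolist())
```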

Case 1

  • Since the column contains str types, fillna with '{}' (a str)
import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan]})

                            col_str
0  {"a": "46", "b": "3", "c": "12"}
1              {"b": "2", "c": "7"}
2                       {"c": "11"}
3                               NaN

type(df.iloc[0, 0])
[out]: str

# fillna
df.col_str = df.col_str.fillna('{}')

# convert the column to dicts
df.col_str = df.col_str.apply(literal_eval)

# use json_normalize
df = df.join(pd.json_normalize(df.pop('col_str')))

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
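
As a possible variation on the steps above, the fill and the conversion can be folded into one helper that only parses str values. This is a sketch, not part of the tested answer: parse_or_empty and the 'id' column are hypothetical names introduced for illustration.

```python
from ast import literal_eval

import numpy as np
import pandas as pd


def parse_or_empty(value):
    """Hypothetical helper: parse str values, map anything else (e.g. NaN) to {}."""
    if isinstance(value, str):
        return literal_eval(value)
    return {}


df = pd.DataFrame({
    'id': [10, 11, 12, 13],  # hypothetical extra column that must be kept
    'col_str': ['{"a": "46", "b": "3", "c": "12"}', '{"b": "2", "c": "7"}', '{"c": "11"}', np.nan],
})

# convert each value to a dict, then normalize and join back on the index
parsed = df.pop('col_str').apply(parse_or_empty)
df = df.join(pd.json_normalize(parsed.tolist()))

print(df)
```

The row that held NaN is kept, with NaN in each of the new columns, so the 'id' values stay aligned.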

Case 2

From at least pandas 1.3.4 onward, pd.json_normalize(df.col_dict) works without issue for this simple example.

  • Since the column contains dict types, fillna with {} (not a str)
  • The fill values must be supplied through a dict-comprehension keyed by the index, since fillna({}) does not work
df = pd.DataFrame({'col_dict': [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}, {"c": "11"}, np.nan]})

                           col_dict
0  {'a': '46', 'b': '3', 'c': '12'}
1              {'b': '2', 'c': '7'}
2                       {'c': '11'}
3                               NaN

type(df.iloc[0, 0])
[out]: dict
    
# fillna
df.col_dict = df.col_dict.fillna({i: {} for i in df.index})

# use json_normalize
df = df.join(pd.json_normalize(df.pop('col_dict')))

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN  NaN   11
3  NaN  NaN  NaN
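
The dict-comprehension works because Series.fillna reads a dict argument as a mapping from index label to fill value, rather than as the fill value itself. A minimal sketch of that behavior on a numeric Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])

# the dict keys are index labels, so only label 1 is filled
filled = s.fillna({1: 99.0})

print(filled)
```

By the same reading, {i: {} for i in df.index} supplies an empty dict as the fill value for every index label.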

Case 3

  1. Fill the NaNs with '[]' (a str)
  2. Now literal_eval will work
  3. .explode can be used on the column to separate the dicts into rows
  4. The empty lists become NaN after .explode, so the NaNs need to be filled again, this time with {} (not a str)
  5. Then the column can be normalized
  • If the column already contains lists of dicts rather than str values, skip straight to .explode.
df = pd.DataFrame({'col_str': ['[{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]', '[{"b": "2", "c": "7"}, {"c": "11"}]', np.nan]})

                                                    col_str
0  [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}]
1                       [{"b": "2", "c": "7"}, {"c": "11"}]
2                                                       NaN

type(df.iloc[0, 0])
[out]: str
    
# fillna
df.col_str = df.col_str.fillna('[]')

# literal_eval
df.col_str = df.col_str.apply(literal_eval)

# explode
df = df.explode('col_str', ignore_index=True)

# fillna again
df.col_str = df.col_str.fillna({i: {} for i in df.index})

# use json_normalize
df = df.join(pd.json_normalize(df.pop('col_str')))

# display(df)
     a    b    c
0   46    3   12
1  NaN    2    7
2  NaN    2    7
3  NaN  NaN   11
4  NaN  NaN  NaN
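
For the case where the column already holds lists of dicts, a sketch starting from .explode might look like the following. The 'id' column and the col_list name are hypothetical, and the list comprehension is a swapped-in alternative to the second fillna, not the method shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [10, 11, 12],  # hypothetical extra column that must be kept
    'col_list': [
        [{"a": "46", "b": "3", "c": "12"}, {"b": "2", "c": "7"}],
        [{"b": "2", "c": "7"}, {"c": "11"}],
        np.nan,
    ],
})

# one row per dict; the NaN row stays as a single NaN row
df = df.explode('col_list', ignore_index=True)

# replace anything that is not a dict (the NaN) with an empty dict
records = [x if isinstance(x, dict) else {} for x in df.pop('col_list')]

# normalize and join back on the new index
df = df.join(pd.json_normalize(records))

print(df)
```

Each 'id' value is repeated once per dict in its list, so the normalized rows stay tied to their source row.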
Answered By: Trenton McKinney