Any depth nested dict to pandas dataframe

Question:

I’ve been fighting to go from a nested dictionary of depth D to a pandas DataFrame.

I’ve tried with recursive function, like the following one, but my problem is that when I’m iterating over a KEY, I don’t know what was the pervious key.

I’ve also tried with json.normalize, pandas from dict but I always end up with dots in the columns…

Example code:

def iterate_dict(d, i = 2, cols = []):
    
    for k, v in d.items():
        # missing here how to check for the previous key
        # so that I can create an structure to create the dataframe.
        if type(v) is dict:   
            print('this is k: ', k)  
            if i % 2  == 0:
                cols.append(k)
            i+=1
            iterate_dict(v, i, cols)            
        else:
            print('this is k2: ' , k, ': ', v)

  
iterate_dict(test2)

This is an example of how my dictionary looks like:

# example 2 
test = {
    'column-gender': {
        'male': {
            'column-country' : {
                'FRENCH': {
                    'column-class': [0,1]
                },
                ('SPAIN','ITALY') : {
                    'column-married' : {
                        'YES': {
                            'column-class' : [0,1]
                        },
                        'NO' : {
                            'column-class' : 2
                        }
                    }
                }
            }
        },
        'female': {
            'column-country' : {
                ('FRENCH', 'SPAIN') : {
                    'column-class' : [[1,2],'#']
                },
                'REST-OF-VALUES': {
                    'column-married' : '*'
                }
            }
        }
    }
}

And this is how I want the dataframe to look like:

enter image description here

Any suggestion is welcome 🙂

Asked By: Tonino Fernandez

||

Answers:

I’m not sure how that data going to be consistent but for just understanding we can do something like the below, remember this is just a little demo on the approach of how we can handle it, you can spend more time to polish it up accordingly:

I added comments on each step for better understanding.

import pandas as pd


def nested_dict_to_df(data, columns=None):

    if columns are None:
        columns = []

    # if the data is a dictionary, then we need to iterate over the keys
    if isinstance(data, dict):

        for key, value in data.items():
            columns.append(key)
            yield from nested_dict_to_df(value, columns)  # recursive call
            columns.pop()  # remove the last element
    else:
        yield columns + [data]


df = pd.DataFrame(nested_dict_to_df(data))

# Drop column [0, 2, 4, 6] from the dataframe that are not needed for the final output
df = df.drop(df.columns[[0, 2, 4, 6]], axis=1)

header = ["GENDER", "COUNTRY", "CLASS", "MARRIED"]  # Desired header
df.columns = header

print(df)

Output:

   GENDER          COUNTRY        CLASS MARRIED
0    male           FRENCH       [0, 1]    None
1    male   (SPAIN, ITALY)          YES  [0, 1]
2    male   (SPAIN, ITALY)           NO       2
3  female  (FRENCH, SPAIN)  [[1, 2], #]    None
4  female   REST-OF-VALUES            *    None
Answered By: Faisal Nazik

If the column-keys are consistently prefixed with column-, you can create a recursive function:

def data_to_df(data):
    rec_out = []
    def dict_to_rec(d, curr_row={}):
        for k, v in d.items():
            if 'column-' in k: # definition of a column
                if isinstance(v, dict):
                    for val, nested_dict in v.items():
                        dict_to_rec(nested_dict, dict(curr_row, **{k[7:]: val}))
                else:
                    rec_out.append(dict(curr_row, **{k[7:]: v}))
    dict_to_rec(data)
    return pd.DataFrame(rec_out)

print(data_to_df(test))

Edit: removing unnecessary variable and argument

Output:

   gender          country        class married
0    male           FRENCH       [0, 1]     NaN
1    male   (SPAIN, ITALY)          YES  [0, 1]
2    male   (SPAIN, ITALY)           NO       2
3  female  (FRENCH, SPAIN)  [[1, 2], #]     NaN
4  female   REST-OF-VALUES            *     NaN
Answered By: Tranbi