Speeding up dataframe generating code from dict

Question

Here is a reduced working example of a real dict I am working with. The actual dict, when dumped to a JSON file, is quite large (about 10 MB). I am trying to parse the dict and convert it to a dataframe with a specific layout. The objective is then to write this dataframe to Excel using the to_excel method.

import pandas as pd

data = {'kvk_1':
            {'link_1':
                {'header_1':
                    {'body_1':'value_1', 
                    'body_2':'value_2',
                    'body_3':'value_3'},
                 'header_2':
                    {'body_4':'value_1',
                    'body_4':'value_3',
                    'body_5':'value_2'}
                    },
            'link_2':
                {'header_4':
                    {'body_7':'value_8',
                    'body_8':'value_9'},
                 'header_2':
                    {'body_4':'value_6',
                    'body_4':'value_35',
                    'body_5':'value_25',
                    'body_6':'value_25'},
                 'header_3':
                    {}}},
            
        'kvk_2':
            {'link_1':
                {'header_1':
                    {'body_1':'value_1', 
                    'body_2':'value_2',
                    'body_3':'value_3'},
                 'header_2':
                    {'body_4':'value_1',
                    'body_4':'value_3',
                    'body_5':'value_2'},
                'header_9':
                    {'body_10':'value_2'}
                    },
            'link_2':
                {'header_1':
                    {'body_2':'value_8',
                    'body_3':'value_9'},
                 'header_2':
                    {'body_6':'value_6',
                    'body_6':'value_35',
                    'body_5':'value_25',
                    'body_6':'value_25'},
                 'header_3':
                    {'body_9':'value_800'}},
             'link_3': {}},
        'kvk_3':
            {'link_1':
                {'header_10':{}}}}
                
#Write data

df = pd.DataFrame(columns = ['kvk', 'link'])
row = -1
for kvk, link_dict in data.items():    
    for link, header_dict in link_dict.items():
        row = row+1
        df.loc[row, 'kvk'] = kvk
        df.loc[row, 'link'] = link
        for header, body_dict in header_dict.items():

            for body, value in body_dict.items():
                df.loc[row, body] = value

This produces the following pandas dataframe:

     kvk    link   body_1   body_2   body_3    body_4    body_5   body_7  
0  kvk_1  link_1  value_1  value_2  value_3   value_3   value_2      NaN   
1  kvk_1  link_2      NaN      NaN      NaN  value_35  value_25  value_8   
2  kvk_2  link_1  value_1  value_2  value_3   value_3   value_2      NaN   
3  kvk_2  link_2      NaN  value_8  value_9       NaN  value_25      NaN   
4  kvk_2  link_3      NaN      NaN      NaN       NaN       NaN      NaN   
5  kvk_3  link_1      NaN      NaN      NaN       NaN       NaN      NaN   

    body_8    body_6  body_10     body_9  
0      NaN       NaN      NaN        NaN  
1  value_9  value_25      NaN        NaN  
2      NaN       NaN  value_2        NaN  
3      NaN  value_25      NaN  value_800  
4      NaN       NaN      NaN        NaN  
5      NaN       NaN      NaN        NaN

This is very slow for the real case. I think the bottleneck is the last line, df.loc[row, body] = value, where pandas has to locate a cell in an ever-growing dataframe from a dict key and an incrementing row number. If the column the key points to already exists, the value is inserted into the current row. If the column does not exist, a new one is created and the value is inserted.
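To make the behaviour described above concrete, here is a minimal sketch of how label-based assignment with df.loc enlarges a frame. The tiny frame and its values are illustrative only, and the sketch assumes a pandas version in which setting with enlargement through .loc behaves as it does in the loop above.

```python
import pandas as pd

# Start with the same two fixed columns as the loop in the question.
df = pd.DataFrame(columns=["kvk", "link"])

# Row label 0 does not exist yet, so pandas appends it on assignment.
df.loc[0, "kvk"] = "kvk_1"
df.loc[0, "link"] = "link_1"

# Column "body_1" does not exist yet, so pandas creates it on assignment.
df.loc[0, "body_1"] = "value_1"

# A second new row label: the cells that are not set are left as NaN.
df.loc[1, "kvk"] = "kvk_2"

print(df)
```

Every one of these assignments can force pandas to rebuild part of the frame, which is the likely reason the loop gets slower as rows and columns accumulate.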

I really like this setup since it lets me locate columns by name, which is ideal for how the dict is structured. However, as I already mentioned, it grinds to a halt when the dataframe exceeds about 10000 rows. How can I tweak this to speed it up?

Asked By: user32882

Answer 1

First use loops to restructure the data into a list of dictionaries (one per row), then build the DataFrame from that list in a single call:

out = []
for k, v in data.items():            # k: kvk key, v: dict of links
    for k1, v1 in v.items():         # k1: link key, v1: dict of headers
        d = {}
        for k2, v2 in v1.items():    # v2: body -> value dict; the header key k2 is dropped
            d.update(v2)
        out.append({**d, 'kvk': k, 'link': k1})
# print(out)

df = pd.DataFrame(out)
cols = ['kvk', 'link']
# put cols first and sort the body columns by the number after the underscore
c = cols + sorted(df.columns.difference(cols), key=lambda x: int(x.split('_')[1]))

# if you only need ['kvk', 'link'] moved to the front:
# c = cols + df.columns.difference(cols).tolist()

df = df[c]
print(df)
     kvk    link   body_1   body_2   body_3    body_4    body_5    body_6  
0  kvk_1  link_1  value_1  value_2  value_3   value_3   value_2       NaN   
1  kvk_1  link_2      NaN      NaN      NaN  value_35  value_25  value_25   
2  kvk_2  link_1  value_1  value_2  value_3   value_3   value_2       NaN   
3  kvk_2  link_2      NaN  value_8  value_9       NaN  value_25  value_25   
4  kvk_2  link_3      NaN      NaN      NaN       NaN       NaN       NaN   
5  kvk_3  link_1      NaN      NaN      NaN       NaN       NaN       NaN   

    body_7   body_8     body_9  body_10  
0      NaN      NaN        NaN      NaN  
1  value_8  value_9        NaN      NaN  
2      NaN      NaN        NaN  value_2  
3      NaN      NaN  value_800      NaN  
4      NaN      NaN        NaN      NaN  
5      NaN      NaN        NaN      NaN
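
As a rough illustration of the same idea (not part of the original answer's code), the flattening step can be packaged as a generator that uses only the standard library. The function name iter_rows and the small sample dict are hypothetical, and the sketch assumes that no body key is itself named 'kvk' or 'link'.

```python
from typing import Any, Dict, Iterator


def iter_rows(nested: Dict[str, Dict[str, Dict[str, Dict[str, Any]]]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per (kvk, link) pair, merging all header dicts."""
    for kvk, links in nested.items():
        for link, headers in links.items():
            # Start each row with its two identifying fields.
            row = {"kvk": kvk, "link": link}
            # Header names are discarded; only their body -> value pairs are kept.
            # Assumption: body keys never collide with 'kvk' or 'link'.
            for body in headers.values():
                row.update(body)
            yield row


# Hypothetical sample with the same nesting as the question's dict.
sample = {
    "kvk_1": {
        "link_1": {"header_1": {"body_1": "value_1"}, "header_2": {"body_4": "value_3"}},
        "link_2": {"header_3": {}},
    },
    "kvk_3": {"link_1": {"header_10": {}}},
}

rows = list(iter_rows(sample))
print(rows)
```

The resulting list can be handed to pd.DataFrame(rows) in one call, so the frame is built once rather than grown cell by cell.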

Answered By: jezrael
