pandas combine nested dataframes into one single dataframe

Question:

I have a dataframe in which every cell of one column (we’ll call it info) contains another dataframe. I want to loop through all the rows in this column and stack the nested dataframes on top of each other; they all have the same columns.

How would I go about this?

Answers:

This is the solution I came up with. It is not the fastest, which is why I am leaving the question unanswered for now:

df1 = pd.DataFrame()
# column labels are case-sensitive; the column is called 'info' in the question
for frame in df['info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)

Our dataframe has three columns (col1, col2 and info).

In info, each row has a nested df as value.

import pandas as pd

nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)

d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)

Because the nested dfs all share the same schema, we can append each of them to a list and concatenate the list in a single call afterwards.

nested_dfs = []

for index, row in df.iterrows():
    nested_dfs.append(row['info'])

result = pd.concat(nested_dfs, sort=False).reset_index(drop=True) 

print(result)

This would be the result:

   coln1  coln2
0     11     13
1     12     14
2     15     17
3     16     18
Answered By: nandoquintana

You could try as follows:

import pandas as pd

length=5

# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
                            'b': [*range(length)]}) for x in range(length)]

print(nested_dfs[0])

   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})

# the actual solution: pull the nested dfs into a list and concatenate once
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs, axis=0, ignore_index=True)

df_final.tail()

    a  b
20  0  0
21  1  1
22  2  2
23  3  3
24  4  4

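This method should be a bit faster than the solution offered by nandoquintana (which also works), because it pulls the nested dfs out of the column directly instead of iterating over the rows with iterrows.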
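If you want to check the speed difference on your own data, a rough sketch like the following may help. It compares the loop from the question (one pd.concat per nested df) with a single pd.concat call. The test data and the function names concat_in_loop and concat_once are made up for illustration, and timings will vary by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical test data: 200 small nested frames with identical columns.
nested_dfs = [pd.DataFrame({'a': range(5), 'b': range(5)}) for _ in range(200)]
df = pd.DataFrame({'info_col': nested_dfs})


def concat_in_loop(frame):
    # Grows the result one nested frame at a time, copying on every step.
    out = pd.DataFrame()
    for nested in frame['info_col'].tolist():
        out = pd.concat([out, nested], axis=0).reset_index(drop=True)
    return out


def concat_once(frame):
    # Hands the whole list of nested frames to pd.concat in a single call.
    return pd.concat(frame['info_col'].tolist(), axis=0, ignore_index=True)


# Both approaches should yield the same rows in the same order.
loop_result = concat_in_loop(df)
once_result = concat_once(df)
print(loop_result.shape, once_result.shape)

# Rough timings over three runs each.
print('loop:', timeit.timeit(lambda: concat_in_loop(df), number=3))
print('once:', timeit.timeit(lambda: concat_once(df), number=3))
```

The loop version re-copies the accumulated rows on every iteration, so the gap can be expected to widen as the number of nested dfs grows.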


Incidentally, it is ill-advised to name a df column info, because df.info is already a DataFrame method. Normally, df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:

AttributeError: 'function' object has no attribute 'values'
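A small sketch of the workaround, assuming a toy df with a column named info: bracket access still reaches the column, and renaming the column makes attribute access usable again. The new name nested is only an example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'info': ['x', 'y']})

# Attribute access resolves to the DataFrame.info method, not the column.
print(callable(df.info))

# Bracket access always refers to the column.
print(df['info'].tolist())

# Renaming the column avoids the clash, so attribute access works again.
renamed = df.rename(columns={'info': 'nested'})
print(renamed.nested.tolist())
```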

You also run the risk of overwriting the method if you try to assign values to the column through attribute access, which is probably not what you want. E.g.:

print(type(df.info))
<class 'method'>

df.info = 1

# no column is touched; the method is now shadowed by an int attribute
print(type(df.info))
<class 'int'>

# but:
df['info']=1

# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
Answered By: ouroboros1