Creating a new pandas dataframe compiling information from many existing pandas dataframes

Question:

I have a list containing 12 Pandas dataframes. Each dataframe has 94 columns. How can I create a loop which creates 94 new dataframes? The first new dataframe (lets call it df1 for simplicity) should have dfs[1]['column 1'] in its first column, then dfs[2]['column 1'] in its second column, then dfs[3]['column 1'] in its third column. The second new dataframe (df2) should have dfs[1]['column 2'] in its first column, then dfs[2]['column 2'] in its second column etc..

This means I can then plot each column of df1 onto a single histogram, then each column of df2 onto a new single histogram etc, such that I would be able to plot 94 histograms at the end (I can do this myself by simply appending each new dataframe into a list, then doing a nested for-loop through the dataframes and column headers)

Right now, my code produces the right number of histogram plots (94) but instead of producing new pandas dataframes, it puts all the data in a list. This means that on each plot, I can’t differentiate between the different histograms on the same plot as they are all the same colour.

for i in range(94):
    data = [] # store all i'th column data across all dfs
    for df in dfs:
        data.extend(list(df.iloc[:,i])) # i'th column
    
    plt.hist(data, bins=50, histtype = 'step', label = dfs[0].iloc[:,i].name)
    plt.title(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.xlabel(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.ylabel('Frequency Density')
    plt.show()

dfs is the list containing all of the original 12 dataframes.

Asked By: probablysid

||

Answers:

You might think that to save memory it would be a good idea to avoid building all 94 new dataframes simultaneously. However, Pandas does not have to make copies of the underlying data (unless you want it to) so the new dataframes can simply contain views into the data from original 12. Here’s how to do that:

new_dfs = {}
columns = range(len(dfs))  # specify column names
for col in dfs[0].columns:
    new_df = pd.concat([df[col] for df in dfs], axis=1, 
                       keys=columns, copy=False)
    new_dfs[col] = new_df

Update

I added the keyword argument copy=False to pd.concat because by default it does make a copy of the data.

Here’s the code to plot the new data

for name, df in new_dfs.items():
    for col in df:
        plt.hist(df[col], bins=50, histtype='step', label=col)
    plt.title(name)  # name of column
    plt.xlabel(name)
    plt.ylabel('Frequency Density')
    plt.legend()
    plt.show()
Answered By: Bill

I came up with a better answer than my previous one. You can concatenate all the dataframes with a multi-index and use the multiindex to get a view of all the columns with the same name.

df_index = range(len(dfs))
df_combined = pd.concat(dfs, axis=1, keys= df_index, copy=False)
columns = df_combined.columns.levels[1].tolist()

idx = pd.IndexSlice
for col in columns:
    df = df_combined.loc[:, idx[:, col]]  # select all columns matching col
    df.columns = df.columns.droplevel(1)
    for name in df:
        plt.hist(df[name], bins=50, histtype='step', label=name)
    plt.title(col)  # name of column
    plt.xlabel(col)
    plt.ylabel('Frequency Density')
    plt.legend()
    plt.show()
Answered By: Bill