Best Practice for Adding Lots of Columns to Pandas DataFrame

Question:

I am trying to add many columns to a pandas dataframe as follows:

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = (df['foo_1'] + df['foo_2'] + df['foo_3'] +
                     df['foo_4'] + df['foo_5'])
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        col_name = f'{col_name_base}_{i}'  # e.g. 'foo_1'
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)

Here sum_cols_list is a list of ~200 base column names (e.g. "foo"), and df is a pandas DataFrame that contains each base column suffixed with _1 through _5 (e.g. "foo_1", "foo_2", ..., "foo_5").

I’m getting a performance warning when I run this snippet:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

I believe this is because creating a new column is actually calling an insert operation behind the scenes. What’s the right way to use pd.concat in this case?

Asked By: oliffur


Answers:

Simplify 🙂

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = (df['foo_1'] + df['foo_2'] + df['foo_3'] +
                     df['foo_4'] + df['foo_5'])
    '''
    out_name = 'sum_' + col_name_base
    # Match on the 'foo_' prefix so that e.g. 'foobar_1' is not summed into 'sum_foo'
    round_cols = [c for c in df.columns if c.startswith(col_name_base + '_')]
    df[out_name] = df[round_cols].sum(axis=1)
Answered By: Quazer

Would this get you the results you are expecting?

import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': ['A', 'B', 'C', 'D', 'E']
})

# Sum, row by row, every column whose name contains 'Foo_'
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
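
One thing to keep in mind with filter(like=...) is that it does a substring match, so it may pick up more columns than intended. Below is a small sketch of the difference; the BigFoo_1 column is made up purely for illustration, and the anchored regex is just one possible way to narrow the match.

```python
import pandas as pd

df = pd.DataFrame({
    "Foo_1": [1, 2, 3],
    "Foo_2": [10, 20, 30],
    "BigFoo_1": [100, 200, 300],  # hypothetical column that merely contains "Foo_"
    "Something": ["A", "B", "C"],
})

# like= tests whether the string occurs anywhere in the column name,
# so BigFoo_1 is selected as well.
like_cols = list(df.filter(like="Foo_").columns)

# An anchored regex keeps only names of the form Foo_<digits>.
regex_cols = list(df.filter(regex=r"^Foo_\d+$").columns)

# Row-wise sum over the narrower selection.
foo_sum = df.filter(regex=r"^Foo_\d+$").sum(axis=1)
```

If the column names never overlap like this, like= is perfectly fine and a bit easier to read.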
Answered By: ArchAngelPwn

You can keep the same approach, but instead of assigning to the DataFrame inside the loop, build each output as its own pd.Series. Once all of the computations are done, use pd.concat to attach everything to your original DataFrame in a single step.

(untested, but should work)

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Return a summed Series built from the base columns. For example,
    sum_foo = (df['foo_1'] + df['foo_2'] + df['foo_3'] +
               df['foo_4'] + df['foo_5'])
    '''
    # Start from float zeros, matching the original df[out_name] = 0.0
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = f'{col_name_base}_{i}'  # e.g. 'foo_1'
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df', col_name)
    return out

# Build every new column first, then attach them all with a single concat
col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)

Additionally, you can simplify your existing code if you’re willing to forgo the logging:

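import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Return a summed Series built from the base columns. For example,
    sum_foo = df['foo_1'] + df['foo_2'] + df['foo_3'] + df['foo_4'] + df['foo_5'] + ...
    '''
    # Anchored regex: selects foo_1, foo_2, ... but not e.g. sum_foo_1 or foo_total
    round_cols = df.filter(regex=rf'^{col_name_base}_\d+$')
    # Name the result so pd.concat yields a 'sum_foo' column rather than 0, 1, ...
    return round_cols.sum(axis=1).rename('sum_' + col_name_base)

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)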
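
To try the single-concat pattern end to end, here is a self-contained sketch on made-up data. The base names (col0 ... col199), the all-ones values, and the helper sum_rounds are assumptions for illustration, not part of the question. PerformanceWarning is promoted to an error inside the block, so the concat would fail loudly if it triggered the fragmentation warning.

```python
import warnings

import numpy as np
import pandas as pd

# Made-up data: 200 base names, each with rounds 1..5 (1000 columns of ones).
bases = [f"col{k}" for k in range(200)]
columns = [f"{b}_{i}" for b in bases for i in range(1, 6)]
df = pd.DataFrame(np.ones((3, len(columns))), columns=columns)

def sum_rounds(frame, base, n_rounds=5):
    """Hypothetical helper: sum base_1..base_n into a Series named sum_<base>.

    Round columns that are absent from the frame are skipped.
    """
    names = [f"{base}_{i}" for i in range(1, n_rounds + 1)]
    present = [name for name in names if name in frame]
    return frame[present].sum(axis=1).rename(f"sum_{base}")

with warnings.catch_warnings():
    # Turn the fragmentation warning into an exception for this block only.
    warnings.simplefilter("error", pd.errors.PerformanceWarning)
    # All 200 new columns are attached in one concat, not 200 separate inserts.
    new_df = pd.concat([df, *(sum_rounds(df, b) for b in bases)], axis=1)
```

The key point is that the original frame is never assigned to inside the loop; the new columns only meet it once, in the final pd.concat call.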

Answered By: Cameron Riddell