Apply the same block of formatting code to multiple dataframes at once

Question:

My raw data is in multiple data files that all have the same format. After importing the 10 CSV files using pd.read_csv(filename.csv), I have a series of dataframes: df1, df2, df3, etc.

I want to apply all of the code below to each of the dataframes.

I therefore created a function to do it:

import pandas as pd

def my_func(df):
    # Strip whitespace from column names and from every string cell
    df = df.rename(columns=lambda x: x.strip())
    # Note: pandas 2.1+ deprecates applymap in favour of DataFrame.map
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df['date'] = pd.to_datetime(df['date'])
    # Split 'long_margin' on the first space into two new columns, A and B
    df = df.join(df['long_margin'].str.split(' ', n=1, expand=True).rename(columns={0: 'A', 1: 'B'}))
    df = df.drop(columns=['long_margin', 'cash_interest'])
    # Rename columns 6-9 by position
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest', df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)

    return df

and then tried to call the function as follows:

list_of_datasets = [df1, df2, df3]

for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)

If I run this code manually, changing df to df1, df2, etc., it works, but it doesn’t seem to work in my function (or in the way I am calling it).

What am I missing?

Asked By: James S


Answers:

As I understand it, in

for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)

dataframe is a name that refers to an object in the list. It is neither a copy of that object nor the list slot itself. When for x in something: is executed, Python creates a variable x and binds it to each element of the list in turn. This variable is not deleted when the loop ends (it keeps referring to the last element), but it is usually ignored afterwards.

If the function only modifies this object in place ("by reference"), that’s fine. The changes are visible through the list, because the list still refers to the same object.

But as soon as the function creates a new object and binds the name df to it (not modifying the previous object, but building a new one with a new id) and then returns it, the assignment dataframe = my_func(dataframe) simply makes dataframe refer to the new object instead of the element of the list. The element in the list is not replaced. It only reflects whatever in-place changes were made before the function first created a new DataFrame. In my_func the very first line, df.rename(...), already returns a new DataFrame, so the originals in the list stay untouched.
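A minimal sketch of the difference between rebinding the loop variable and modifying the object in place. Plain Python lists stand in for DataFrames here, since the name-binding rules are the same, and the helper sorted_copy is made up purely for illustration:

```python
# Plain lists stand in for DataFrames; the name-binding rules are identical.
datasets = [[3, 1, 2], [6, 5, 4]]

def sorted_copy(values):
    # Hypothetical helper: builds and returns a NEW list,
    # much like df.rename(...) returns a new DataFrame.
    return sorted(values)

for item in datasets:
    item = sorted_copy(item)   # rebinds the name `item`; the list slot is untouched

after_rebinding = [list(values) for values in datasets]
print(after_rebinding)         # still [[3, 1, 2], [6, 5, 4]]

for item in datasets:
    item.sort()                # mutates the object the list slot refers to

print(datasets)                # now [[1, 2, 3], [4, 5, 6]]
```

The first loop mirrors the loop in the question: the returned object is assigned to the loop variable and then lost on the next iteration.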

To see exactly where this happens, I suggest adding print(id(df)) before and after each line of code in the function and in the loop. When the id changes, you are dealing with a new object, not with the element of the list.
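As a rough illustration of the id check, here is a small sketch on a toy DataFrame (the column names are invented for the example). A reference to the original object is kept so that the two ids can be compared safely:

```python
import pandas as pd

df = pd.DataFrame({" a ": [1, 2]})   # toy data, not the asker's files
original = df                        # keep a second reference to the original object

df["b"] = [3, 4]                     # in-place: adds a column to the same object
same_after_assignment = id(df) == id(original)

df = df.rename(columns=lambda x: x.strip())   # rename returns a new DataFrame
same_after_rename = id(df) == id(original)

print(same_after_assignment)   # True: still the original object
print(same_after_rename)       # False: df now names a different object
print(list(original.columns))  # [' a ', 'b']: the original keeps its unstripped name
```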

Answered By: Alex

Alex is correct.

To make this work you could use a list comprehension:

list_of_datasets = [my_func(df) for df in list_of_datasets]

or create a new list for the outputs:

formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))
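
A third variant, sketched here under the assumption that you want to keep using the names df1, df2, etc. afterwards, is to write each result back into the list by index and then rebind the names. The function tidy below is a hypothetical stand-in for my_func, and the data is made up:

```python
import pandas as pd

def tidy(df):
    # Hypothetical stand-in for my_func: returns a NEW DataFrame
    return df.rename(columns=lambda x: x.strip())

df1 = pd.DataFrame({" a ": [1]})
df2 = pd.DataFrame({" a ": [2]})
list_of_datasets = [df1, df2]

# Write each result back into its slot in the list
for i, frame in enumerate(list_of_datasets):
    list_of_datasets[i] = tidy(frame)

# df1 and df2 still name the unformatted originals, so rebind them if needed
df1, df2 = list_of_datasets
```

Note that with either approach above, the original names df1, df2, ... keep referring to the unformatted DataFrames unless you reassign them like this.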

Answered By: s_pike