Pandas rename columns according to my dict

Question:

I have many dataframes in pickle format, they have the same information, but columns name are not identical in all of them.
For example:

>     file name:    Columns_['Weryfikacja UK - 2022-04-14.xlsx_20uk-woocommerce-last-5yrs-expo.pkl'].pkl
>     columns_nam:  Index(['domain', 'phones', 'phones_data_source', 'company_name', 'SourceFile'], dtype='object')
> 
>     file name:    Columns_['US _ Canada - rechurn - 01.2022-04.2022.xlsx_Other.pkl'].pkl
>     columns_nam:  Index(['Phone', 'Domain', 'SourceFile'], dtype='object')
> 
>     file name:    Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
>     columns_nam:  Index(['Phone', 'Alternative phone 1', 'Alternative phone2', 'Alternative phone3', 'Alternative phone4', 'SourceFile'], dtype='object')

I have a dict like this to rename columns names in all files:

my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
"AlternativePhone1": ['Alternative phone1','Alternative phone 2', 'phones2'],
"AlternativePhone2": ['Alternative phone2','Alternative phone 3', 'phones3'],
"AlternativePhone3": ['Alternative phone3','Alternative phone 4', 'phones4'],
"AlternativePhone4": ['Alternative phone4','Alternative phone 5', 'phones5'],
"SourceFile": ['SourceFile']
}

I need a help with code, how should I do it?

for file in glob.glob("*.pkl"):
    df = pd.read_pickle(file)

On output I would like to have something like this

>     file name:    Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
>     columns_nam:  Index(['Phone', 'AlternativePhone1', 'AlternativePhone2', 'AlternativePhone3', 'AlternativePhone4', 'SourceFile'], dtype='object')
Asked By: Bartosz

||

Answers:

The easiest way

I don’t think there is a standard way to rename columns based on a list of possible names built-in to pandas, but its easy enough with a few for loops:

my_dict = {
    "Domain": ['Domain','domain', 'WWW', 'www'],
    "Phone": ['Phone','phone_number', 'phones', 'Tel'],
}

df = pd.DataFrame({
    "WWW": [1,2,3,4,5], 
    "phones": [1,2,3,4,5]
})

for (goal, names) in my_dict.items():
    for name in names:
        if name in df.columns:
            df.rename(columns={name:goal}, inplace=True)

A cleaner way

The pandas.rename function does not actually care if the dataframe contains all the columns you provide new names for in the columns dictonary. This means that if we convert my_dict from newname -> List[oldname] to the oldname -> newname fromat expected by columns we can rename the columns without any for loops.

columns_dict = dict(sum([[(old, new) for old in olds] for (new, olds) in my_dict.items()], []))

# columns_dict = {
#     'Domain': 'Domain', 'domain': 'Domain', 'WWW': 'Domain', 'www': 'Domain', 
#     'Phone': 'Phone', 'phone_number': 'Phone', 'phones': 'Phone', 'Tel': 'Phone'
# }

df.rename(columns=columns_dict, inplace=True)
Answered By: Matt
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.