Pandas rename columns according to my dict
Question:
I have many dataframes in pickle format, they have the same information, but columns name are not identical in all of them.
For example:
> file name: Columns_['Weryfikacja UK - 2022-04-14.xlsx_20uk-woocommerce-last-5yrs-expo.pkl'].pkl
> columns_nam: Index(['domain', 'phones', 'phones_data_source', 'company_name', 'SourceFile'], dtype='object')
>
> file name: Columns_['US _ Canada - rechurn - 01.2022-04.2022.xlsx_Other.pkl'].pkl
> columns_nam: Index(['Phone', 'Domain', 'SourceFile'], dtype='object')
>
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'Alternative phone 1', 'Alternative phone2', 'Alternative phone3', 'Alternative phone4', 'SourceFile'], dtype='object')
I have a dict like this to rename columns names in all files:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
"AlternativePhone1": ['Alternative phone1','Alternative phone 2', 'phones2'],
"AlternativePhone2": ['Alternative phone2','Alternative phone 3', 'phones3'],
"AlternativePhone3": ['Alternative phone3','Alternative phone 4', 'phones4'],
"AlternativePhone4": ['Alternative phone4','Alternative phone 5', 'phones5'],
"SourceFile": ['SourceFile']
}
I need a help with code, how should I do it?
for file in glob.glob("*.pkl"):
df = pd.read_pickle(file)
On output I would like to have something like this
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'AlternativePhone1', 'AlternativePhone2', 'AlternativePhone3', 'AlternativePhone4', 'SourceFile'], dtype='object')
Answers:
The easiest way
I don’t think there is a standard way to rename columns based on a list of possible names built-in to pandas, but its easy enough with a few for loops:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
}
df = pd.DataFrame({
"WWW": [1,2,3,4,5],
"phones": [1,2,3,4,5]
})
for (goal, names) in my_dict.items():
for name in names:
if name in df.columns:
df.rename(columns={name:goal}, inplace=True)
A cleaner way
The pandas.rename
function does not actually care if the dataframe contains all the columns you provide new names for in the columns
dictonary. This means that if we convert my_dict
from newname -> List[oldname]
to the oldname -> newname
fromat expected by columns
we can rename the columns without any for loops.
columns_dict = dict(sum([[(old, new) for old in olds] for (new, olds) in my_dict.items()], []))
# columns_dict = {
# 'Domain': 'Domain', 'domain': 'Domain', 'WWW': 'Domain', 'www': 'Domain',
# 'Phone': 'Phone', 'phone_number': 'Phone', 'phones': 'Phone', 'Tel': 'Phone'
# }
df.rename(columns=columns_dict, inplace=True)
I have many dataframes in pickle format, they have the same information, but columns name are not identical in all of them.
For example:
> file name: Columns_['Weryfikacja UK - 2022-04-14.xlsx_20uk-woocommerce-last-5yrs-expo.pkl'].pkl
> columns_nam: Index(['domain', 'phones', 'phones_data_source', 'company_name', 'SourceFile'], dtype='object')
>
> file name: Columns_['US _ Canada - rechurn - 01.2022-04.2022.xlsx_Other.pkl'].pkl
> columns_nam: Index(['Phone', 'Domain', 'SourceFile'], dtype='object')
>
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'Alternative phone 1', 'Alternative phone2', 'Alternative phone3', 'Alternative phone4', 'SourceFile'], dtype='object')
I have a dict like this to rename columns names in all files:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
"AlternativePhone1": ['Alternative phone1','Alternative phone 2', 'phones2'],
"AlternativePhone2": ['Alternative phone2','Alternative phone 3', 'phones3'],
"AlternativePhone3": ['Alternative phone3','Alternative phone 4', 'phones4'],
"AlternativePhone4": ['Alternative phone4','Alternative phone 5', 'phones5'],
"SourceFile": ['SourceFile']
}
I need a help with code, how should I do it?
for file in glob.glob("*.pkl"):
df = pd.read_pickle(file)
On output I would like to have something like this
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'AlternativePhone1', 'AlternativePhone2', 'AlternativePhone3', 'AlternativePhone4', 'SourceFile'], dtype='object')
The easiest way
I don’t think there is a standard way to rename columns based on a list of possible names built-in to pandas, but its easy enough with a few for loops:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
}
df = pd.DataFrame({
"WWW": [1,2,3,4,5],
"phones": [1,2,3,4,5]
})
for (goal, names) in my_dict.items():
for name in names:
if name in df.columns:
df.rename(columns={name:goal}, inplace=True)
A cleaner way
The pandas.rename
function does not actually care if the dataframe contains all the columns you provide new names for in the columns
dictonary. This means that if we convert my_dict
from newname -> List[oldname]
to the oldname -> newname
fromat expected by columns
we can rename the columns without any for loops.
columns_dict = dict(sum([[(old, new) for old in olds] for (new, olds) in my_dict.items()], []))
# columns_dict = {
# 'Domain': 'Domain', 'domain': 'Domain', 'WWW': 'Domain', 'www': 'Domain',
# 'Phone': 'Phone', 'phone_number': 'Phone', 'phones': 'Phone', 'Tel': 'Phone'
# }
df.rename(columns=columns_dict, inplace=True)