What is the most efficient way to normalize values in a single row in pandas?
Question:
I have two types of columns in a pandas dataframe, let’s say A and B.
How can I efficiently normalize the values in each row individually, using the row mean of each column type?
I can first calculate the mean for each column type and then divide each column by its respective column-type mean, but this takes too much time (more than 30 minutes). I have over 300 columns and 500K rows.
import pandas as pd
df = pd.DataFrame({'A1': [1,2,3],
                   'A2': [4,5,6],
                   'A3': [7,8,9],
                   'B1': [11,12,13],
                   'B2': [14,15,16],
                   'B3': [17,18,19]
                   })
# row-wise apply: one Python-level call per row, which is what makes this slow
df['A_mean'] = df.apply(lambda x: x.filter(regex='A').mean(), axis=1)
df['A1'] = df['A1']/df['A_mean']
I am expecting the following result.
Answers:
I pasted the same question into ChatGPT, and with a minor modification it gave a good answer.
import pandas as pd
# create a sample dataframe
df = pd.DataFrame({'A1': [1,2,3],
                   'A2': [4,5,6],
                   'A3': [7,8,9],
                   'B1': [11,12,13],
                   'B2': [14,15,16],
                   'B3': [17,18,19]
                   })
# select the columns of each type once (^ anchors the match to the start of the name)
A_cols = df.filter(regex='^A').columns
B_cols = df.filter(regex='^B').columns
# calculate the row-wise mean for each type of column, A and B
A_mean = df[A_cols].mean(axis=1)
B_mean = df[B_cols].mean(axis=1)
# normalize the values in each row of the A columns
df[A_cols] = df[A_cols].div(A_mean, axis=0)
# normalize the values in each row of the B columns
df[B_cols] = df[B_cols].div(B_mean, axis=0)
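One way to confirm that the vectorized mean gives the same result as the row-wise apply from the question is to compute both on the sample frame and compare them. This is only a sketch: the names a_cols, slow_mean, fast_mean, slow, and fast are made up for illustration, and only the A columns are checked.

```python
import pandas as pd

df = pd.DataFrame({'A1': [1, 2, 3],
                   'A2': [4, 5, 6],
                   'A3': [7, 8, 9],
                   'B1': [11, 12, 13],
                   'B2': [14, 15, 16],
                   'B3': [17, 18, 19]})

# Columns whose name starts with 'A' (illustrative naming assumption).
a_cols = df.filter(regex='^A').columns

# Row-wise apply, as in the question: one Python-level call per row.
slow_mean = df.apply(lambda row: row[a_cols].mean(), axis=1)

# Vectorized: a single reduction across the selected columns.
fast_mean = df[a_cols].mean(axis=1)

slow = df[a_cols].div(slow_mean, axis=0)
fast = df[a_cols].div(fast_mean, axis=0)

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(slow, fast)
```

The two results are identical; the difference is that the apply version runs the lambda once per row, which is likely where most of the time goes on 500K rows.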
Here’s a way to do what your question asks (note that I have used startswith instead of filter, but this can be tweaked for generality if needed):
prefixes = ['A','B']
colsByPrefix = [[col for col in df.columns if col.startswith(pref)] for pref in prefixes]
df = pd.concat([df[cols] / df[cols].mean(axis=1).to_frame().to_numpy() for cols in colsByPrefix], axis=1)
Output:
     A1   A2    A3        B1   B2        B3
0  0.25  1.0  1.75  0.785714  1.0  1.214286
1  0.40  1.0  1.60  0.800000  1.0  1.200000
2  0.50  1.0  1.50  0.812500  1.0  1.187500
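With many column types, the prefix list could be derived from the column names instead of being typed out. The sketch below assumes a hypothetical naming rule (the type is the column name with its trailing digits removed); the names groups, parts, and normalized are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A1': [1, 2, 3],
                   'A2': [4, 5, 6],
                   'A3': [7, 8, 9],
                   'B1': [11, 12, 13],
                   'B2': [14, 15, 16],
                   'B3': [17, 18, 19]})

# Assumed rule: strip trailing digits to get the column type ('A1' -> 'A').
groups = df.columns.str.replace(r'\d+$', '', regex=True)

parts = []
for g in groups.unique():
    cols = df.columns[groups == g]           # columns belonging to this type
    row_mean = df[cols].mean(axis=1)         # one mean per row
    parts.append(df[cols].div(row_mean, axis=0))

# Reassemble in the original column order.
normalized = pd.concat(parts, axis=1)[df.columns]
```

If the real column names follow a different pattern, only the line that builds groups needs to change.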
Run a groupby and unpack the dataframe within the assign function:
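# groupby(..., axis=1) is deprecated in recent pandas; grouping the transpose gives the same row-wise means
df.assign(**df.T.groupby(df.columns.str[0]).mean().T.add_suffix("_mean"))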
   A1  A2  A3  B1  B2  B3  A_mean  B_mean
0   1   4   7  11  14  17     4.0    14.0
1   2   5   8  12  15  18     5.0    15.0
2   3   6   9  13  16  19     6.0    16.0
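The output above contains only the group means, not the normalized values. A possible way to finish the job is to expand the means back to one column per original column and divide. This is a sketch under the same assumption that the first character of the column name identifies the type. It groups the transposed frame rather than using groupby with axis=1; the names groups, means, aligned, and normalized are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A1': [1, 2, 3],
                   'A2': [4, 5, 6],
                   'A3': [7, 8, 9],
                   'B1': [11, 12, 13],
                   'B2': [14, 15, 16],
                   'B3': [17, 18, 19]})

# Assumption: the first character of each column name is its type.
groups = df.columns.str[0]

# Row-wise mean per type: rows match df, columns are 'A' and 'B'.
means = df.T.groupby(groups).mean().T

# Repeat each mean column once per original column (A, A, A, B, B, B).
aligned = means[list(groups)]

# Divide position by position; to_numpy() avoids label alignment.
normalized = df / aligned.to_numpy()
```

Transposing is cheap for an all-numeric frame, but with mixed dtypes it would produce object columns, so this is best kept to the numeric columns being normalized.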