MinMaxScaler for a number of columns in a pandas DataFrame
Question:
I want to apply MinMaxScaler to a number of pandas DataFrame columns ‘together’, meaning that I want the scaler to operate on all the data in those columns at once, not on each column separately.
My DataFrame has 20 columns. I want to apply the scaler to 12 of the columns at the same time. I have already read this, but it does not solve my problem, since it acts on each column separately.
Answers:
You can extract a single "min" and "max" across all of those columns and perform the scaling yourself:
# columns of interest
cols = [...]
# get the minimum and maximum values in that region
vals = df[cols].to_numpy()
min_val = vals.min()
max_val = vals.max()
# scale the region using them
df[cols] = df[cols].sub(min_val).div(max_val - min_val)
(sub is the method form of "-" and div is the method form of "/".)
Above, df is your training dataframe. To scale the testing dataframe, substitute it for df in the last line while reusing the same min_val and max_val, e.g.,
test_df[cols] = test_df[cols].sub(min_val).div(max_val - min_val)
rather than extracting its min/max separately, which would leak information from the test set.
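As a self-contained illustration of the steps above, here is a minimal sketch. The frames train_df and test_df, the column names, and all values are made up for the example and are not from the question.

```python
import pandas as pd

# Hypothetical training and testing frames (illustrative values only)
train_df = pd.DataFrame({"a": [0.0, 5.0, 10.0],
                         "b": [2.0, 4.0, 20.0],
                         "label": ["x", "y", "z"]})
test_df = pd.DataFrame({"a": [5.0, 30.0],
                        "b": [0.0, 10.0],
                        "label": ["x", "y"]})

cols = ["a", "b"]

# One global min and max over the whole block of training columns
vals = train_df[cols].to_numpy()
min_val = vals.min()
max_val = vals.max()

# Scale copies of both frames with the statistics learned from training data only
train_scaled = train_df.copy()
test_scaled = test_df.copy()
train_scaled[cols] = train_df[cols].sub(min_val).div(max_val - min_val)
test_scaled[cols] = test_df[cols].sub(min_val).div(max_val - min_val)

print(train_scaled)
print(test_scaled)
```

Because the test frame is scaled with the training min and max, its values can fall outside [0, 1] (here the 30.0 in column a does). That is expected with this approach and is not a sign of a bug.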
IIUC, you want the sklearn scaler to fit and transform multiple columns with the same criteria (in this case, the same min and max). Here is one way you can do this:
- Save the initial shape of the columns, then reshape the 2D numpy array of those columns into a single column with reshape(-1, 1), so that the scaler treats every value as belonging to one feature.
- Next, fit your scaler on this single-column array and transform it.
- Finally, use the saved shape to reshape the result back into the n columns you need and assign them back.
The advantage of this approach is that it works with any of the sklearn scalers you need to use: MinMaxScaler, StandardScaler, etc.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']})
cols = ['A','B']
old_shape = dfTest[cols].shape #(5,2)
dfTest[cols] = scaler.fit_transform(dfTest[cols].to_numpy().reshape(-1,1)).reshape(old_shape)
print(dfTest)
A B C
0 0.000000 0.884188 big
1 0.756853 0.926301 small
2 0.764303 0.956992 big
3 0.817143 0.995530 small
4 0.766885 1.000000 small
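If you later need to scale new data with the same fitted scaler, the same reshape trick can be combined with transform instead of fit_transform. The sketch below is one possible way to package that. The helper names fit_joint and transform_joint and the frame called new are hypothetical and only serve the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def fit_joint(scaler, df, cols):
    """Fit `scaler` on all values of df[cols] pooled into a single column."""
    scaler.fit(df[cols].to_numpy().reshape(-1, 1))
    return scaler

def transform_joint(scaler, df, cols):
    """Return a copy of df with df[cols] scaled by the already-fitted scaler."""
    block = df[cols].to_numpy()
    out = df.copy()
    out[cols] = scaler.transform(block.reshape(-1, 1)).reshape(block.shape)
    return out

train = pd.DataFrame({"A": [14.00, 90.20, 90.95, 96.27, 91.21],
                      "B": [103.02, 107.26, 110.35, 114.23, 114.68]})
# Hypothetical unseen data, scaled with the statistics of `train`
new = pd.DataFrame({"A": [14.00, 64.34], "B": [114.68, 120.0]})

cols = ["A", "B"]
scaler = fit_joint(MinMaxScaler(), train, cols)
train_scaled = transform_joint(scaler, train, cols)
new_scaled = transform_joint(scaler, new, cols)

print(train_scaled)
print(new_scaled)
```

Fitting once and calling transform on the new frame keeps the min and max tied to the training data, which mirrors the leakage concern raised in the first answer.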