SettingWithCopyWarning won't go away regardless of the approach
Question:
Let me start by saying that I understand what the warning is and why it's there, and I've read a ton of answered questions. Using today's pandas (1.2.3) and scikit-learn (0.24.1), this warning simply won't go away:
I have a dataframe loaded from a pickle, nothing too complex:
print(df)
          Date  Sales      Labels
0   2013-01-01      0  5024.00000
1   2013-01-02   5024  5215.00000
2   2013-01-03   5215  5552.00000
3   2013-01-04   5552  5230.00000
4   2013-01-05   5230     0.00000
..         ...    ...         ...
747 2015-01-18      0  5018.00000
748 2015-01-19   5018  4339.00000
749 2015-01-20   4339  4786.00000
750 2015-01-21   4786  4606.00000
751 2015-01-22   4606  4944.00000
I'm using the accepted answer on how to min-max scale the columns Sales and Labels because I want to preserve order and keep the Dates:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Sales', 'Labels']] = scaler.fit_transform(df[['Sales', 'Labels']])
This gives me the following warning (as you can guess):
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I’ve tried:
df.loc[:, 'Sales'] = scaler.fit_transform(df[['Sales']])
And I still get the warning (even though now it won’t tell me which line it is coming from!).
Which makes me wonder if scikit-learn is internally calling it in the old-fashioned way, and that's where the warning is now coming from.
I've also tried using a .copy(), which I understand only masks the issue, but the warning is still present.
Is there another way to apply MinMaxScaler without the warning?
Answers:
Most likely df is a subset of another dataframe, for example:
import numpy as np
import pandas as pd

rawdata = pd.DataFrame({'Date': range(5),
                        'Sales': np.random.uniform(1000, 2000, 5),
                        'Labels': np.random.uniform(1000, 2000, 5),
                        'Var': np.random.uniform(0, 1, 5)})
And you subset df from this, but bear in mind that df is then a slice of the original dataframe rawdata. Hence, if we try to scale it, pandas raises a warning:
df = rawdata[['Date','Sales','Labels']]
df[['Sales', 'Labels']] = scaler.fit_transform(df[['Sales', 'Labels']])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you scale and transform the original dataframe, it works:
rawdata[['Sales','Labels']] = scaler.fit_transform(rawdata[['Sales', 'Labels']])
You have to think about whether you still need the original dataframe. If you do, you can take an explicit copy of the subset, at the cost of more memory:
df = rawdata[['Date','Sales','Labels']].copy()
df[['Sales', 'Labels']] = scaler.fit_transform(df[['Sales', 'Labels']])
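To make the copy approach above concrete, here is a minimal self-contained sketch. The data is made up (a seeded random frame standing in for rawdata), and the names original_sales and cols are only there for illustration. It shows that scaling the copy leaves the original dataframe untouched:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the original data; the values are made up.
rng = np.random.default_rng(0)
rawdata = pd.DataFrame({
    "Date": pd.date_range("2013-01-01", periods=5),
    "Sales": rng.uniform(1000, 2000, 5),
    "Labels": rng.uniform(1000, 2000, 5),
    "Var": rng.uniform(0, 1, 5),
})
original_sales = rawdata["Sales"].copy()

# Explicit copy: df owns its data, so assigning into it is unambiguous.
df = rawdata[["Date", "Sales", "Labels"]].copy()

cols = ["Sales", "Labels"]
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])

print(df)
```

After this runs, df holds Sales and Labels scaled to the range [0, 1], Date is unchanged, and rawdata still holds the unscaled values.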
Another possible cause of this warning is working on a subset whose indices are out of order. For example, if you split a dataset into subsets (e.g., the train and test splits common in ML), the indices of each subset are no longer in order.
A quick fix is to reset the indices of the subset df, provided they won't matter in a downstream task.
df = df.reset_index()
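As a small sketch of the reset above, assuming a shuffled subset taken with DataFrame.sample as a stand-in for a train split (the names data and train are hypothetical): reset_index returns a new DataFrame, and passing drop=True discards the old index rather than keeping it as an extra "index" column, which the plain call would add.

```python
import pandas as pd

# Made-up data; the values are only for illustration.
data = pd.DataFrame({"Sales": [10, 20, 30, 40, 50],
                     "Labels": [1, 2, 3, 4, 5]})

# Shuffled 60% subset standing in for a train split; it keeps the
# original row labels, so its index is no longer 0..n-1.
train = data.sample(frac=0.6, random_state=0)

# reset_index returns a new DataFrame with a fresh 0..n-1 index.
# drop=True throws the old index away instead of storing it as a column.
train = train.reset_index(drop=True)

print(train)
```

Use drop=True only if the old row labels really are not needed later, for example to join predictions back onto the full dataset.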