How to correct this code to not raise a SettingWithCopyWarning?

Question:

I’m following along with this: https://www.kdnuggets.com/2021/01/cleaner-data-analysis-pandas-pipes.html

About halfway down, the author creates a function that converts low-cardinality object columns to the category dtype:

def to_category(df):
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            df[col] = df[col].astype('category')
    return df

This raised a warning from Python:

Warning (from warnings module):
  File "D:/I7_Education/pandas_pipe_function1/pipes3.py", line 51
    df[col] = df[col].astype('category')
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

I’m not sure I understand what the problem is, though I’m working through the documentation explanation and some posts online to try to make sense of it.

I’m aware that I can suppress the warning, and the code runs fine if I do. What I would like to know is how to change the code in the article so that it doesn’t raise the warning in the first place.

I tried contacting the author, but haven’t heard back.

What I want is no suppression to be necessary. But I don’t understand what the problem is well enough to figure out how to change the code to not trip a SettingWithCopyWarning in the first place.

I was not expecting the warning. The documentation, as well as a few posts online, say to change df using loc, but I’m not changing values or elements in the dataframe. I’m changing the dtype of columns from object to category; astype('category') is how to do that, and I would assume that looping through the columns to do it should be fine.

A friend told me to create a copy of the df that’s passed to the function, manipulate that, and then return the copy. I don’t fully understand that suggestion either, and it doesn’t solve the problem: it still raises the same warning.

The dataframe I’m passing to the function is a copy. The article is only manipulating the dataset (directmarketing.csv); it reads the csv into a pandas dataframe and manipulates it directly. I had instead created two dataframes: the first is dataset = pd.read_csv(".directmarketing.csv") and the second is marketing = dataset.copy() and I’m only manipulating the marketing dataframe. That way I can go back and check against the dataset dataframe and make sure things have changed the way they’re supposed to, etc.

But when I call the function, I’m calling to_category(marketing) – I haven’t touched the dataset dataframe at all.

There is a thread on Stack Overflow, "Returning a copy versus a view warning when using Python pandas dataframe", that talks about this, but it says to make a copy to avoid the warning, which is what I have already done, so I’m very confused.

Is there a way to correct the code in the article so it does not trip this warning?

I’m using Python 3.10 and IDLE; I’m not using a separate IDE for this.

Asked By: TransitoryGouda


Answers:

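One idea is to rewrite the function to use DataFrame.astype: collect the names of the columns to convert in a list (final), turn that list into a dictionary with dict.fromkeys, and convert them all in one call instead of assigning column by column: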

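def to_category(df):
    # Names of the columns that should become categorical
    final = []
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        # Number of distinct (non-missing) values relative to the row count
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            final.append(col)
    # astype returns a new DataFrame, so df itself is never assigned to
    return df.astype(dict.fromkeys(final, 'category'))

Answered By: jezrael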
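
As a rough check of the approach above, here is a minimal sketch that runs the astype-based to_category on a filtered DataFrame and records any warnings raised along the way. The data is made up for illustration: the column names (Gender, CustomerId, Salary) and the filtering step are assumptions, not taken from directmarketing.csv or from the article's pipeline.

```python
import warnings

import pandas as pd


def to_category(df):
    # Same approach as the answer: collect column names, convert in one call.
    final = []
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            final.append(col)
    return df.astype(dict.fromkeys(final, 'category'))


# Hypothetical stand-in for the marketing data (names are assumptions).
marketing = pd.DataFrame({
    "Gender": pd.Series(["Female", "Male"] * 50, dtype=object),
    "CustomerId": pd.Series([f"id{i}" for i in range(100)], dtype=object),
    "Salary": list(range(100)),
})

# A DataFrame derived by filtering another one, as an earlier pipe step might do.
filtered = marketing[marketing["Salary"] > 9]

# Record every warning raised while the function runs.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = to_category(filtered)

print(result.dtypes)
print([w.category.__name__ for w in caught])
```

Because the function never assigns into the DataFrame it receives, the input (filtered here) keeps its object columns and only the returned DataFrame carries the category dtype. The caller therefore has to use the return value, which is what a .pipe() chain does anyway.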