How to manipulate a pandas Series without changing the original?
Question:
Context
I have a method that takes a pandas Series of categorical data and returns it as an indexed version. However, I think my implementation is also modifying the given Series, not just returning a modified new Series. I also get the following warnings:
A value is trying to be set on a copy of a slice from a DataFrame.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
series[series == value] = index
SettingWithCopyWarning: modifications to a property of a datetimelike object are not supported and are discarded. Change values on the original.
cacher_needs_updating = self._check_is_chained_assignment_possible()
Code
import pandas

def categorials(series: pandas.Series) -> pandas.Series:
    unique = series.unique()
    for index, value in enumerate(unique):
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
- How can I make this method return the modified Series without changing the original Series that was passed in?
Answers:
You need to .copy() the incoming argument. Normally, that warning wouldn’t have appeared; we’re at liberty to write to Series/DataFrames after all. However, in the code you didn’t share, it seems the argument you’re passing here was obtained as a subset of another Series/Frame (or maybe even of itself). FYI, if you’re planning to modify a subset, it’s better to chain .copy() at the end of the statement that creates it.
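As a minimal sketch of that advice: the DataFrame df, its columns, and the values below are made up for illustration, since the question does not show how its Series was created.

```python
import pandas as pd

# Hypothetical data; not taken from the question.
df = pd.DataFrame(
    {"color": ["red", "blue", "red", "green"], "size": [1, 2, 3, 4]}
)

# Chaining .copy() makes the subset an independent object, so later
# writes to it are not tied to df.
subset = df.loc[df["size"] > 1, "color"].copy()
subset[subset == "red"] = "crimson"
```

After this runs, subset holds the changed values while df["color"] still holds the original ones.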
Anyway, back to the question: series = series.copy() as the first line in the function should resolve the issue. However, your method is actually doing factorization, so
pd.Series(pd.factorize(series)[0], index=series.index)
is equivalent to what your function does. Since pd.factorize returns a 2-tuple of (codes, uniques), we take the 0th element. It also gives a NumPy array back, so we Series-ify it with the incoming index.
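Both suggestions can be sketched side by side. The function names categorials_copy and categorials_factorize are made up here, and the final .astype(pd.Int64Dtype()) in the second one is an addition to match the return dtype of the function in the question; treat this as a sketch rather than a definitive implementation.

```python
import pandas as pd


def categorials_copy(series: pd.Series) -> pd.Series:
    """Loop-based version from the question, working on a private copy."""
    # Rebinding the local name to a copy leaves the caller's Series alone.
    series = series.copy()
    for index, value in enumerate(series.unique()):
        series[series == value] = index
    return series.astype(pd.Int64Dtype())


def categorials_factorize(series: pd.Series) -> pd.Series:
    """Same idea via pd.factorize, which never writes to its input."""
    codes, _uniques = pd.factorize(series)
    # codes is a NumPy array, so re-attach the caller's index.
    return pd.Series(codes, index=series.index).astype(pd.Int64Dtype())
```

For a Series such as pd.Series(["a", "b", "a", "c"], dtype=object), both functions return the codes in order of first appearance and leave the input untouched. One difference to be aware of: pd.factorize assigns the code -1 to missing values by default, which the loop-based version does not do.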