Pandas updating a subset of rows multiple times leads to an unexpected result

Question:

I have one dataframe which requires multiple updates to one column using different subsets of rows per update. Each update corresponds to a set of rows which have a certain value for column A, where the B column should be given the values of the B column from another dataframe. A simple example is presented below, which works for the first update, but subsequent updates no longer change any of the values.

import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1, 1, 2],
                  'B': [np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
                  'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
                  'B': ["test3"]})

a.loc[a['A']==1, 'B'] = b['B']
a.loc[a['A']==2, 'B'] = c['B']

display(a)

The expected result would be {'A': [1, 1, 2], 'B': ['test1', 'test2', 'test3']}

The actual result is {'A': [1, 1, 2], 'B': ['test1', 'test2', NaN]}

Related post that I have tried: Efficient way to update column value for subset of rows on Pandas DataFrame?

With some further testing, I found unexpected behavior, demonstrated by the following code:

a = pd.DataFrame({'A': [1, 1, 2, 2],
                  'B': [np.nan, np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
                  'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [1, 1],
                  'B': ["test3", "test4"]})

a.loc[a['A']==1, 'B'] = b['B']
a.loc[a['A']==2, 'B'] = c['B']

display(a)

a.loc[a['A']==2, 'B'] = ["test3", "test4"]

display(a)

This gives {'A': [1, 1, 2, 2], 'B': ['test1', 'test2', NaN, NaN]} for the first display, and {'A': [1, 1, 2, 2], 'B': ['test1', 'test2', 'test3', 'test4']} for the second. I expected the same output both times.

Asked By: René Steeman


Answers:

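The issue is that the indexes don't align when you assign a Series. With .loc, pandas matches the values of the right-hand Series to the target rows by index label, not by position. In your first example, b['B'] has labels 0 and 1, which happen to match the rows where A == 1. c['B'] has only label 0, but the row where A == 2 has label 2, so nothing matches and NaN is assigned. That's why your final example, using a list, works as expected: a list has no index, so its values are assigned by position. You can use .to_list() or .values to make it work: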

import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1, 1, 2],
                  'B': [np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
                  'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
                  'B': ["test3"]})

# .values drops the index, so the values are assigned by position
a.loc[a['A']==1, 'B'] = b['B'].values
a.loc[a['A']==2, 'B'] = c['B'].values
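
To see why the Series assignment misses, here is a minimal sketch that compares the index labels of the target rows with the labels of each source. The use of reindex() is my own illustration, not part of the original answer; it only approximates what the label-aligned assignment receives for each target row.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1, 1, 2],
                  'B': [np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
                  'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
                  'B': ["test3"]})

# Index labels of the rows each assignment targets in `a`
labels_1 = a.index[a['A'] == 1]   # labels 0 and 1
labels_2 = a.index[a['A'] == 2]   # label 2

# reindex() looks values up by label, roughly what the aligned
# assignment sees for each target row
seen_by_first = b['B'].reindex(labels_1)    # labels 0 and 1 exist in b
seen_by_second = c['B'].reindex(labels_2)   # label 2 does not exist in c

print(seen_by_first.tolist())          # ['test1', 'test2']
print(seen_by_second.isna().tolist())  # [True]
```

The first update only appears to work because the labels of b happen to coincide with the positions of the rows being updated.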

Alternatively, set the index on dataframes b and c so that their labels match the rows being updated in a, but I'm not sure that will help with what you're trying to do.
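
A rough sketch of that index-based alternative, assuming each source dataframe has exactly as many rows as the subset it should fill. Creating column B with object dtype is my own choice here, so that strings can be stored without changing the column's dtype; it is not in the original example.

```python
import numpy as np
import pandas as pd

# Assumption: object dtype, so the column can hold strings directly
a = pd.DataFrame({'A': [1, 1, 2],
                  'B': pd.Series([np.nan, np.nan, np.nan], dtype=object)})
b = pd.DataFrame({'A': [1, 1],
                  'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
                  'B': ["test3"]})

# Relabel each source so its index matches the rows it should fill
# (this requires the row counts to match)
b.index = a.index[a['A'] == 1]
c.index = a.index[a['A'] == 2]

# The Series now align by label, so plain assignment works
a.loc[a['A'] == 1, 'B'] = b['B']
a.loc[a['A'] == 2, 'B'] = c['B']

print(a['B'].tolist())  # ['test1', 'test2', 'test3']
```

This keeps label-based alignment in play, which may be preferable when the row order of the source is not guaranteed to match the target.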

Answered By: s_pike