Pandas updating a subset of rows multiple times leads to an unexpected result
Question:
I have one dataframe which requires multiple updates to one column using different subsets of rows per update. Each update corresponds to a set of rows which have a certain value for column A, where the B column should be given the values of the B column from another dataframe. A simple example is presented below, which works for the first update, but subsequent updates no longer change any of the values.
import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1, 1, 2],
'B': [np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
'B': ["test3"]})
a.loc[a['A']==1, 'B'] = b['B']
a.loc[a['A']==2, 'B'] = c['B']
display(a)
The expected result would be {'A': [1, 1, 2], 'B': ['test1', 'test2', 'test3']}
The actual result is {'A': [1, 1, 2], 'B': ['test1', 'test2', NaN]}
Related post that I have tried: Efficient way to update column value for subset of rows on Pandas DataFrame?
Performing some further testing, I found some unexpected behavior, demonstrated by the following code:
a = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [np.nan, np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [1, 1],
'B': ["test3", "test4"]})
a.loc[a['A']==1, 'B'] = b['B']
a.loc[a['A']==2, 'B'] = c['B']
display(a)
a.loc[a['A']==2, 'B'] = ["test3", "test4"]
display(a)
which gives {'A': [1, 1, 2, 2], 'B': ['test1', 'test2', NaN, NaN]} for the first display, and {'A': [1, 1, 2, 2], 'B': ['test1', 'test2', 'test3', 'test4']} for the second. The expected result was to have the same output twice.
Answers:
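The issue is that the indexes don't align when you try to assign. When the right-hand side is a Series, .loc matches values by index label, not by position. b['B'] has labels 0 and 1, which happen to match the rows where A == 1, so the first update works. c['B'] starts at label 0 as well, but the rows where A == 2 have labels from 2 onwards, so nothing matches and NaN is assigned. That's why your final example, using a list, works as expected: a list has no index, so it is assigned by position. You can use .to_list() or .values to make it work: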
a = pd.DataFrame({'A': [1, 1, 2],
'B': [np.nan, np.nan, np.nan]})
b = pd.DataFrame({'A': [1, 1],
'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2],
'B': ["test3"]})
a.loc[a['A']==1, 'B'] = b['B'].values
a.loc[a['A']==2, 'B'] = c['B'].values
Alternatively, you can set the index on dataframes b and c so that their labels match the rows you are updating in a, but I'm not sure that will help with what you're trying to do.
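For illustration, here is a minimal sketch of the index-based route, assuming b and c list their rows in the same order as the matching rows of a. The helper aligned_to is a hypothetical name, not part of pandas, and column B is created with dtype=object (my choice, not in the question) so that strings can be stored without changing the column's dtype.

```python
import numpy as np
import pandas as pd

# B starts as an object column so strings can be stored without a dtype change
a = pd.DataFrame({'A': [1, 1, 2],
                  'B': pd.Series([np.nan, np.nan, np.nan], dtype=object)})
b = pd.DataFrame({'A': [1, 1], 'B': ["test1", "test2"]})
c = pd.DataFrame({'A': [2], 'B': ["test3"]})


def aligned_to(mask, source):
    """Hypothetical helper: relabel `source` with the index labels of the
    rows selected by the boolean Series `mask` (positional order is kept)."""
    return pd.Series(source.to_numpy(), index=mask.index[mask.to_numpy()])


mask1 = a['A'] == 1
mask2 = a['A'] == 2

# The row where A == 2 has label 2, while c['B'] only has label 0
print(a.index[mask2.to_numpy()].tolist(), c.index.tolist())  # prints [2] [0]

# After relabelling, the right-hand side aligns with the selected rows
a.loc[mask1, 'B'] = aligned_to(mask1, b['B'])
a.loc[mask2, 'B'] = aligned_to(mask2, c['B'])
print(a)
```

This gives the same result as .values or .to_list(); the only difference is that the alignment is made explicit, which may be useful if you later want to match rows by a real key rather than by position.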