Pandas update multiple columns at once
Question:
I’m trying to update a couple of fields at once. I have two data sources and I’m trying to reconcile them. I know I could do some ugly merging and then delete columns, but I was expecting the code below to work:
import numpy as np
import pandas as pd
df = pd.DataFrame([['A','B','C',np.nan,np.nan,np.nan],
                   ['D','E','F',np.nan,np.nan,np.nan],
                   [np.nan,np.nan,np.nan,'a','b','d'],
                   [np.nan,np.nan,np.nan,'d','e','f']],
                  columns=['Col1','Col2','Col3','col1_v2','col2_v2','col3_v2'])
print(df)
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 NaN NaN NaN a b d
3 NaN NaN NaN d e f
# update
df.loc[df['Col1'].isnull(),['Col1','Col2', 'Col3']] = df[['col1_v2','col2_v2','col3_v2']]
print(df)
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 NaN NaN NaN a b d
3 NaN NaN NaN d e f
My desired output would be:
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 a b d a b d
3 d e f d e f
I’m betting it has to do with updating/setting on a slice, but I always use .loc to update values, just not on multiple columns at once.
I feel like there’s an easy way to do this that I’m just missing. Any thoughts or suggestions would be welcome!
Edit to reflect solution below
Thanks for the comment on the indexes. However, I have a question about this as it relates to series. If I wanted to update an individual series in a similar manner, I could do something like this:
df.loc[df['Col1'].isnull(),['Col1']] = df['col1_v2']
print(df)
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 a NaN NaN a b d
3 d NaN NaN d e f
Note that I didn’t account for the indexes here: I filtered down to a 2×1 selection and set it equal to a 4×1 series, yet pandas handled it correctly. Thoughts? I’ve used this for a while and I’m trying to understand it better, but I guess I don’t have a full grasp of the underlying mechanism or rule.
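The single-column case can be checked in isolation. The sketch below uses a cut-down two-column frame (an assumption made for brevity, not the full data above) and suggests that the match between the two sides is made by row label, not by position or length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'D', np.nan, np.nan],
                   'col1_v2': [np.nan, np.nan, 'a', 'd']})

# The right-hand side has 4 rows and the selection only 2. Rows are matched
# by index label, so only the values labelled 2 and 3 are used.
df.loc[df['Col1'].isna(), 'Col1'] = df['col1_v2']

# Same assignment with the right-hand side in reversed row order. The result
# is unchanged because the labels, not the positions, decide what goes where.
other = pd.DataFrame({'Col1': ['A', 'D', np.nan, np.nan],
                      'col1_v2': [np.nan, np.nan, 'a', 'd']})
other.loc[other['Col1'].isna(), 'Col1'] = other['col1_v2'].iloc[::-1]

print(df)
print(other)
```

With a single Series there are no column labels to mismatch, which is the difference from the multi-column attempt.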
Answers:
You want to replace:
print(df.loc[df['Col1'].isnull(),['Col1','Col2', 'Col3']])
Col1 Col2 Col3
2 NaN NaN NaN
3 NaN NaN NaN
With:
replace_with_this = df.loc[df['Col1'].isnull(),['col1_v2','col2_v2', 'col3_v2']]
print(replace_with_this)
col1_v2 col2_v2 col3_v2
2 a b d
3 d e f
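Seems reasonable. However, when you do the assignment, pandas aligns the right-hand side with the target on both the row index and the column labels. The labels col1_v2, col2_v2, col3_v2 don’t match Col1, Col2, Col3, so nothing lines up and NaN is assigned.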
So, this should work:
df.loc[df['Col1'].isnull(),['Col1','Col2', 'Col3']] = replace_with_this.values
print(df)
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 a b d a b d
3 d e f d e f
I accounted for columns by using .values at the end. This stripped the index and column labels from the replace_with_this dataframe and just used the values in the appropriate positions.
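To make the alignment point concrete, here is a minimal sketch on the question's data. The names `target`, `source`, `misaligned`, `by_position` and `by_label` are made up for illustration. It shows the original assignment leaving NaN behind, then two ways around it: dropping the labels with `.to_numpy()` (the newer spelling of `.values`), or keeping the labels and renaming them so they line up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', 'B', 'C', np.nan, np.nan, np.nan],
     ['D', 'E', 'F', np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 'a', 'b', 'd'],
     [np.nan, np.nan, np.nan, 'd', 'e', 'f']],
    columns=['Col1', 'Col2', 'Col3', 'col1_v2', 'col2_v2', 'col3_v2'])

target = ['Col1', 'Col2', 'Col3']
source = ['col1_v2', 'col2_v2', 'col3_v2']
mask = df['Col1'].isna()

# The question's attempt: the column labels do not match, so NaN is written.
misaligned = df.copy()
misaligned.loc[mask, target] = misaligned[source]

# Option 1: drop all labels and assign purely by position.
by_position = df.copy()
by_position.loc[mask, target] = by_position.loc[mask, source].to_numpy()

# Option 2: keep the labels, but rename the source columns to the target names
# so that alignment finds a match.
by_label = df.copy()
renamed = by_label.loc[mask, source].rename(columns=dict(zip(source, target)))
by_label.loc[mask, target] = renamed

print(by_position)
```

Option 1 relies on the source and target columns being listed in the same order. Option 2 is a little longer but does not depend on that order.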
In the “take the hill” spirit, I offer the below solution which yields the requested result.
I realize this is not exactly what you are after, as I am not slicing the df (in the reasonable – but non-functional – way in which you propose).
# Does not work when indexing on np.nan, so I fill with some arbitrary value.
df = df.fillna('AAA')
# mask to determine which rows to update
mask = df['Col1'] == 'AAA'
# dict mapping each column to be updated to the column supplying its values
mp = {'Col1':'col1_v2','Col2':'col2_v2','Col3':'col3_v2'}
# update
for k in mp:
    df.loc[mask,k] = df[mp.get(k)]
# swap the arbitrary value back to np.nan
df = df.replace('AAA',np.nan)
Output:
Col1 Col2 Col3 col1_v2 col2_v2 col3_v2
0 A B C NaN NaN NaN
1 D E F NaN NaN NaN
2 a b d a b d
3 d e f d e f
The error I get if I do not replace the NaNs is below. I’m going to research exactly where that error stems from.
ValueError: array is not broadcastable to correct shape
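For comparison, here is a sketch of the same per-column loop with the mask built by `isna()`, so no placeholder value is needed. It assumes a reasonably recent pandas and may not avoid the ValueError above on older versions. The name `mapping` is just an illustrative stand-in for `mp`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', 'B', 'C', np.nan, np.nan, np.nan],
     ['D', 'E', 'F', np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 'a', 'b', 'd'],
     [np.nan, np.nan, np.nan, 'd', 'e', 'f']],
    columns=['Col1', 'Col2', 'Col3', 'col1_v2', 'col2_v2', 'col3_v2'])

# Target column -> column that supplies the replacement values.
mapping = {'Col1': 'col1_v2', 'Col2': 'col2_v2', 'Col3': 'col3_v2'}

# Rows to update: wherever Col1 is missing.
mask = df['Col1'].isna()

# One column at a time, so each right-hand side is a Series that only has to
# line up on the row index.
for target_col, source_col in mapping.items():
    df.loc[mask, target_col] = df[source_col]

print(df)
```

Because each assignment involves a single Series, the column-label mismatch from the multi-column version never comes into play.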