Augment DataFrame index
Question:
I want to copy a column ('b') from one DataFrame (df2) into another (df1). Both DataFrames use the same index column, but df2's index extends further and is missing some of df1's labels.
This is the current behaviour:
>>> import pandas as pd
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
   a  b
0  1  4
1  2  5
2  3  6
>>>
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df1 = df.set_index(['a'])
>>> df1
   b
a
1  4
2  5
3  6
>>> dg = pd.DataFrame({'a': [3, 4, 5], 'b': [7, 8, 9]})
>>> dg
   a  b
0  3  7
1  4  8
2  5  9
>>> df2 = dg.set_index('a')
>>> df2
   b
a
3  7
4  8
5  9
>>> df1['b'] = df2['b']
>>> df1
     b
a
1  NaN
2  NaN
3  7.0
When I call df1['b'] = df2['b'], the values at index labels that are not in df2 become NaN, and the labels of df2 that aren't in df1 are not carried over into df1.
Is there any way to change this behaviour so that the resulting DataFrame looks like this?
>>> df1
   b
a
1  4
2  5
3  7
4  8
5  9
Answers:
One option is to reindex() df2 on the union of both indices and then fill the missing values from df1:
df2 = df2.reindex(df1.index.union(df2.index))
df2['b'] = df2['b'].fillna(df1['b'])
df2
#      b
# a
# 1  4.0
# 2  5.0
# 3  7.0
# 4  8.0
# 5  9.0
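To see why the union is needed, it may help to compare the two alignments side by side. Column assignment aligns the right-hand Series to df1's existing index, which is what reindexing to df1.index does explicitly. The sketch below rebuilds the question's frames from scratch; the names aligned, full_index and filled are only illustrative.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).set_index('a')
df2 = pd.DataFrame({'a': [3, 4, 5], 'b': [7, 8, 9]}).set_index('a')

# What df1['b'] = df2['b'] does implicitly: align df2['b'] to df1's index.
# Labels 1 and 2 have no match in df2 (NaN); labels 4 and 5 are dropped.
aligned = df2['b'].reindex(df1.index)

# Reindexing to the union of both indices keeps every label instead,
# and fillna then pulls the missing values from df1 by label.
full_index = df1.index.union(df2.index)
filled = df2['b'].reindex(full_index).fillna(df1['b'])

print(aligned)
print(filled)
```

Unlike the snippet above, this version leaves df2 itself unchanged and works on the single column only.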
Alternatively, this is a use case for combine_first. It prioritizes the calling DataFrame and fills in any missing values from the second one. It also adds the rows of the second DataFrame whose labels are not present in the first.
df2.combine_first(df1)
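A minimal end-to-end sketch of this approach, assuming the same df1 and df2 as in the question (the name result is only illustrative). Note that combine_first returns a new DataFrame rather than modifying either input, so the result has to be assigned back if df1 should hold it.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).set_index('a')
df2 = pd.DataFrame({'a': [3, 4, 5], 'b': [7, 8, 9]}).set_index('a')

# df2 is the caller, so its value wins where both frames share a label (a=3);
# df1 supplies the labels that df2 lacks (a=1 and a=2).
result = df2.combine_first(df1)

# Neither input was modified; rebind df1 if it should hold the combined data.
df1 = result
print(df1)
```

Depending on the pandas version, column b may come back as float64 rather than an integer dtype; if that matters, result['b'].astype(int) should be safe here because no NaN values remain.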