Efficiently joining two dataframes based on multiple levels of a multiindex
Question:
I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.
Here is an imaginary example, where I construct two dataframes df1
and df2
import pandas as pd
import numpy as np
arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])
index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
names=['Body','Season'])
df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])
Giving the dataframes:
df1
A B
Body Season Item
sun summer one -0.409372 0.638502
winter one 1.448772 -1.460596
two -0.495634 -0.839063
moon summer one 1.296035 -1.439349
two -1.002667 0.508394
three -1.247748 -0.645782
winter one -1.848857 -0.858759
two 0.559172 2.202957
df2
Mood
Body Season
sun summer Good
winter Bad
moon summer Ugly
winter Confused
Now, suppose I want to add the columns from df2 to df1? This line is the only way I could find to do the job:
df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)
resulting in:
A B Mood
Body Season Item
sun summer one -0.121588 0.272774 Good
winter one 0.233562 -2.005623 Bad
two -1.034642 0.315065 Bad
moon summer one 0.184548 0.820873 Ugly
two 0.838290 0.495047 Ugly
three 0.450813 -2.040089 Ugly
winter one -1.149993 -0.498148 Confused
two 2.406824 -2.031849 Confused
[8 rows x 3 columns]
It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.
Is there a better way to do this?
Answers:
This is not implemented internally ATM, but your soln is the recommended one, see here as well the issue
You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index
do copy (though you can pass an inplace=True
argument if you want); it IS truly inplace as these are just changing the index attribute.
You could patch in a nice function like:
def merge_multi(self, df, on):
return self.reset_index().join(df,on=on).set_index(self.index.names)
DataFrame.merge_multi = merge_multi
df1.merge_multi(df2,on=['Body','Season'])
However, merging by definition creates new data, so not sure how much this will actually save you.
A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this
join
now allows merging of MultiIndex DataFrames with partially matching indices.
Following your example:
df1 = df1.join(df2, on=['Body','Season'])
make sure the on
columns are specified in exactly the order that match the Index of the other DataFrame as the on
argument matches the order you specify the labels of the calling DataFrame with the Index as it is in the other
DataFrame you join to.
or just join
without using on
and by default it will use the common index levels between the two DataFrames:
df1 = df1.join(df2)
Resulting df1
:
A B Mood
Body Season Item
sun summer one -0.483779 0.981052 Good
winter one -0.309939 0.803862 Bad
two -0.413732 0.025331 Bad
moon summer one -0.926068 -1.316808 Ugly
two 0.221627 -0.226154 Ugly
three 1.064856 0.402827 Ugly
winter one 0.526461 -0.932231 Confused
two -0.296415 -0.812374 Confused
I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.
Here is an imaginary example, where I construct two dataframes df1
and df2
import pandas as pd
import numpy as np
arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])
index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
names=['Body','Season'])
df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])
Giving the dataframes:
df1
A B
Body Season Item
sun summer one -0.409372 0.638502
winter one 1.448772 -1.460596
two -0.495634 -0.839063
moon summer one 1.296035 -1.439349
two -1.002667 0.508394
three -1.247748 -0.645782
winter one -1.848857 -0.858759
two 0.559172 2.202957
df2
Mood
Body Season
sun summer Good
winter Bad
moon summer Ugly
winter Confused
Now, suppose I want to add the columns from df2 to df1? This line is the only way I could find to do the job:
df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)
resulting in:
A B Mood
Body Season Item
sun summer one -0.121588 0.272774 Good
winter one 0.233562 -2.005623 Bad
two -1.034642 0.315065 Bad
moon summer one 0.184548 0.820873 Ugly
two 0.838290 0.495047 Ugly
three 0.450813 -2.040089 Ugly
winter one -1.149993 -0.498148 Confused
two 2.406824 -2.031849 Confused
[8 rows x 3 columns]
It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.
Is there a better way to do this?
This is not implemented internally ATM, but your soln is the recommended one, see here as well the issue
You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index
do copy (though you can pass an inplace=True
argument if you want); it IS truly inplace as these are just changing the index attribute.
You could patch in a nice function like:
def merge_multi(self, df, on):
return self.reset_index().join(df,on=on).set_index(self.index.names)
DataFrame.merge_multi = merge_multi
df1.merge_multi(df2,on=['Body','Season'])
However, merging by definition creates new data, so not sure how much this will actually save you.
A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this
join
now allows merging of MultiIndex DataFrames with partially matching indices.
Following your example:
df1 = df1.join(df2, on=['Body','Season'])
make sure the on
columns are specified in exactly the order that match the Index of the other DataFrame as the on
argument matches the order you specify the labels of the calling DataFrame with the Index as it is in the other
DataFrame you join to.
or just join
without using on
and by default it will use the common index levels between the two DataFrames:
df1 = df1.join(df2)
Resulting df1
:
A B Mood
Body Season Item
sun summer one -0.483779 0.981052 Good
winter one -0.309939 0.803862 Bad
two -0.413732 0.025331 Bad
moon summer one -0.926068 -1.316808 Ugly
two 0.221627 -0.226154 Ugly
three 1.064856 0.402827 Ugly
winter one 0.526461 -0.932231 Confused
two -0.296415 -0.812374 Confused