Efficiently joining two dataframes based on multiple levels of a multiindex

Question:

I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I construct two dataframes df1 and df2

import pandas as pd
import numpy as np

arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
          ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
          ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
                                  names=['Body','Season'])

df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

Giving the dataframes:

df1

                    A         B
Body Season Item                     
sun  summer one   -0.409372  0.638502
     winter one    1.448772 -1.460596
            two   -0.495634 -0.839063
moon summer one    1.296035 -1.439349
            two   -1.002667  0.508394
            three -1.247748 -0.645782
     winter one   -1.848857 -0.858759
            two    0.559172  2.202957

df2

                 Mood
Body Season          
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

Now, suppose I want to add the columns from df2 to df1? This line is the only way I could find to do the job:

df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

resulting in:

           A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.

Is there a better way to do this?

Asked By: Caleb

||

Answers:

This is not implemented internally ATM, but your soln is the recommended one, see here as well the issue

You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index do copy (though you can pass an inplace=True argument if you want); it IS truly inplace as these are just changing the index attribute.

You could patch in a nice function like:

def merge_multi(self, df, on):
    return self.reset_index().join(df,on=on).set_index(self.index.names)
DataFrame.merge_multi = merge_multi

df1.merge_multi(df2,on=['Body','Season'])

However, merging by definition creates new data, so not sure how much this will actually save you.

A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this

Answered By: Jeff

join now allows merging of MultiIndex DataFrames with partially matching indices.

Following your example:

df1 = df1.join(df2, on=['Body','Season'])

make sure the on columns are specified in exactly the order that match the Index of the other DataFrame as the on argument matches the order you specify the labels of the calling DataFrame with the Index as it is in the other DataFrame you join to.

or just join without using on and by default it will use the common index levels between the two DataFrames:

df1 = df1.join(df2)

Resulting df1:

                          A         B      Mood
Body Season Item                               
sun  summer one   -0.483779  0.981052      Good
     winter one   -0.309939  0.803862       Bad
            two   -0.413732  0.025331       Bad
moon summer one   -0.926068 -1.316808      Ugly
            two    0.221627 -0.226154      Ugly
            three  1.064856  0.402827      Ugly
     winter one    0.526461 -0.932231  Confused
            two   -0.296415 -0.812374  Confused
Answered By: CAPSLOCK
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.