Benefits of panda's multiindex?

Question:

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.

On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.

So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?

Asked By: K.-Michael Aye

||

Answers:

Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.

This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.

Imagine constructing a dataframe using MultiIndex like this:-

import pandas as pd
import numpy as np

np.arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]

df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(list(zip(*np.arrays))),columns=['A','B'])

df  # This is the dataframe we have generated

          A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983

This df is simply a data structure of two dimensions

df.ndim

2

But we can imagine it, looking at the output, as a 3 dimensional data structure.

  • one with 1 with data -0.732470 -0.313871.
  • one with 2 with data -0.031109 -2.068794.
  • one with 3 with data 1.520652 0.471764.

A.k.a.: “effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure”

This is not just a “pretty display”. It has the benefit of easy retrieval of data since we now have a hierarchal index.

For example.

In [44]: df.ix["one"]
Out[44]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

will give us a new data frame only for the group of data belonging to “one”.

And we can narrow down our data selection further by doing this:-

In [45]: df.ix["one"].ix[1]
Out[45]: 
A   -0.732470
B   -0.313871
Name: 1

And of course, if we want a specific value, here’s an example:-

In [46]: df.ix["one"].ix[1]["A"]
Out[46]: -0.73247029752040727

So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.

We can even grab a cross-section (either rows or columns) from our dataframe…

By rows:-

In [47]: df.xs('one')
Out[47]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

By columns:-

In [48]: df.xs('B', axis=1)
Out[48]: 
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B
Answered By: Calvin Cheng

Great post by @Calvin Cheng, but thought I’d take a stab at this as well.

When to use a MultiIndex:

  1. When a single column’s value isn’t enough to uniquely identify a row.
  2. When data is logically hierarchical – meaning that it has multiple dimensions or “levels.”

Why (your core question) – at least these are the biggest benefits IMO:

  1. Easy manipulation via stack() and unstack()
  2. Easy math when there are multiple column levels
  3. Syntactic sugar for slicing/filtering

Example:

                                                       Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024    154.77      7
           Store 2 Alcohol  Liqour      80480280024     82.08      4
           Store 3 Alcohol  Liqour      80480280024    259.38      9
           Store 1 Alcohol  Liquor      80432400630    477.68     14
                                        674545000001   139.68      4
           Store 2 Alcohol  Liquor      80432400630    203.88      6
                                        674545000001   377.13     13
           Store 3 Alcohol  Liquor      80432400630    239.19      7
                                        674545000001   432.32     14
           Store 1 Beer     Ales        94922755711     65.17      7
                                        702770082018   174.44     14
                                        736920111112    50.70      5
           Store 2 Beer     Ales        94922755711    129.60     12
                                        702770082018   107.40     10
                                        736920111112    59.65      5
           Store 3 Beer     Ales        94922755711    154.00     14
                                        702770082018   137.40     10
                                        736920111112   107.88     12
           Store 1 Beer     Lagers      702770081011   156.24     12
           Store 2 Beer     Lagers      702770081011   137.06     11
           Store 3 Beer     Lagers      702770081011   119.52      8    

1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:

                                             Dollars                   Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8

2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store’s dollars by its units, for every store without multiple operations:

Store                                         Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024     22.11    20.52    28.82
                    Liquor      80432400630     34.12    33.98    34.17
                                674545000001    34.92    29.01    30.88
           Beer     Ales        94922755711      9.31    10.80    11.00
                                702770082018    12.46    10.74    13.74
                                736920111112    10.14    11.93     8.99
                    Lagers      702770081011    13.02    12.46    14.94

3) If we then want to filter to just specific rows, instead of using the

df[(df[col1] == val1) and (df[col2] == val2) and (df[col3] == val3)]

format, we can instead .xs or .query (yes these work for regular dfs, but it’s not very useful). The syntax would instead be:

df.xs((val1, val2, val3), level=(col1, col2, col3))

More examples can be found in this tutorial notebook I put together.

Answered By: ZaxR

The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.

Timinigs

import numpy as np
import pandas as pd

np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
                     stock
store_id product_id       
0        2              48
         4              18
         5              58
         7             149
         8             158
...                    ...
9999     992           132
         995           121
         996           105
         998            99
         999            16

[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
         store_id  product_id  stock
0               0           2     48
1               0           4     18
2               0           5     58
3               0           7    149
4               0           8    158
...           ...         ...    ...
6321864      9999         992    132
6321865      9999         995    121
6321866      9999         996    105
6321867      9999         998     99
6321868      9999         999     16

[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop

%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop

%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop

%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop

%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop

%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop

Conclusion

The benefits from using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack() as mentioned in @ZaxR’s answer; Performance is not a benefit, which seems like a real missed opportunity.

Answered By: Ben

Based on the comment on this
answer
it seems the
experiment was flawed. Here is my attempt at a correct experiment.

Timings

import pandas as pd
import numpy as np
from timeit import timeit


random_data = np.random.randn(16, 4)

multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)

dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:n", dfm, "n")
print("dfn", df, "n")

dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))

df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "n")

print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(
          lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
          number=int(1e6)))
dfm:
             0         1         2         3
A 1 -1.055128 -0.845019 -2.853027  0.521738
  2  0.397804  0.385045 -0.121294 -0.696215
  3 -0.551836 -0.666953 -0.956578  1.929732
  4 -0.154780  1.778150  0.183104 -0.013989
B 1 -0.315476  0.564419  0.492496 -1.052432
  2 -0.695300  0.085265  0.701724 -0.974168
  3 -0.879915 -0.206499  1.597701  1.294885
  4  0.653261  0.279641 -0.800613  1.050241
C 1  1.004199 -1.377520 -0.672913  1.491793
  2 -0.453452  0.367264 -0.002362  0.411193
  3  2.271958  0.240864 -0.923934 -0.572957
  4  0.737893 -0.523488  0.485497 -2.371977
D 1  1.133661 -0.584973 -0.713320 -0.656315
  2 -1.173231 -0.490667  0.634677  1.711015
  3 -0.050371 -0.175644  0.124797  0.703672
  4  1.349595  0.122202 -1.498178  0.013391

df
    level_0  level_1         0         1         2         3
0        A        1 -1.055128 -0.845019 -2.853027  0.521738
1        A        2  0.397804  0.385045 -0.121294 -0.696215
2        A        3 -0.551836 -0.666953 -0.956578  1.929732
3        A        4 -0.154780  1.778150  0.183104 -0.013989
4        B        1 -0.315476  0.564419  0.492496 -1.052432
5        B        2 -0.695300  0.085265  0.701724 -0.974168
6        B        3 -0.879915 -0.206499  1.597701  1.294885
7        B        4  0.653261  0.279641 -0.800613  1.050241
8        C        1  1.004199 -1.377520 -0.672913  1.491793
9        C        2 -0.453452  0.367264 -0.002362  0.411193
10       C        3  2.271958  0.240864 -0.923934 -0.572957
11       C        4  0.737893 -0.523488  0.485497 -2.371977
12       D        1  1.133661 -0.584973 -0.713320 -0.656315
13       D        2 -1.173231 -0.490667  0.634677  1.711015
14       D        3 -0.050371 -0.175644  0.124797  0.703672
15       D        4  1.349595  0.122202 -1.498178  0.013391 
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection:  1.0502406808918188 <class 'numpy.float64'> 

dfm_selection timeit: 63.92458086000079
df_selection timeit:  450.4555013199997

Conclusion

MultiIndex single-value retrieval is over 7 times faster than conventional
dataframe single-value retrieval.

The syntax for MultiIndex retrieval is much cleaner.

Answered By: young_souvlaki
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.