How to slice a MultiIndex DataFrame with a list of labels on one level

Question:

MultiIndex DataFrames are very powerful, but I think the documentation on them is not clear enough, especially for the different kinds of slicing.
Here is my question:

How can I slice a multi-indexed DataFrame on just one level with a list of labels?
I am looking for a solution that does not reset the index and convert the DataFrame to a single-level index, which is the obvious approach but not an efficient one.

For example, take the following DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame(index=range(10))
df['id'] = pd.Series(range(10,20))
df['name'] = [f'name_{id}' for id in range(10,20)]
df['price'] = np.random.rand(df.index.size)
df['date'] = pd.date_range('20200310', '20200319')
df = df.set_index(['id', 'date'])
df


Slicing on a single label works fine:

df.xs('2020-03-10', level='date', drop_level=False)


But how can we slice on a list of labels at that level?

df.xs(('2020-03-10', '2020-03-11', '2020-03-12'), level='date', drop_level=False)

This raises an exception.

However, the pandas documentation says that the “key” parameter can also be a tuple:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html


Asked By: Sacha


Answers:

To filter by multiple values, use Index.get_level_values with Index.isin and boolean indexing:

a = df[df.index.get_level_values('date').isin(('2020-03-10', '2020-03-11', '2020-03-12'))]
print (a)
                  name     price
id date                         
10 2020-03-10  name_10  0.557772
11 2020-03-11  name_11  0.122315
12 2020-03-12  name_12  0.775976
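A self-contained sketch of the same mask-based approach follows. It rebuilds a frame shaped like the one in the question (the seeded generator and the names `rng`, `mask`, and `subset` are illustrative choices, not from the original post) and converts the labels with `pd.to_datetime` first, on the assumption that an explicit datetime-to-datetime comparison is easier to reason about than relying on string parsing.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": list(range(10, 20)),
    "name": [f"name_{i}" for i in range(10, 20)],
    "price": rng.random(10),
    "date": pd.date_range("2020-03-10", "2020-03-19"),
}).set_index(["id", "date"])

# Convert the labels explicitly so the comparison is datetime-to-datetime.
dates = pd.to_datetime(["2020-03-10", "2020-03-11", "2020-03-12"])

# Boolean array with one entry per row, computed from the 'date' level only.
mask = df.index.get_level_values("date").isin(dates)

# Boolean indexing keeps both index levels intact.
subset = df[mask]
print(subset)
```

Because the mask is an ordinary boolean array, it can be combined with other conditions using `&` and `|` before indexing.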

As for the documentation saying that the “key” parameter can be a tuple: a tuple is accepted, but it works differently. It selects one label per level, for example:

b = df.xs((10, '2020-03-10'), drop_level=False)
print (b)
name      name_10
price    0.348808
Name: (10, 2020-03-10 00:00:00), dtype: object

c = df.xs((10, '2020-03-10'), level=('id','date'), drop_level=False)
print (c)
                  name     price
id date                         
10 2020-03-10  name_10  0.239876

As @yatu mentioned, another solution uses IndexSlice, with the first : selecting everything on the first level and the last : selecting all columns:

df = df.loc[pd.IndexSlice[:, ['2020-03-10', '2020-03-11', '2020-03-12']], :]
print (df)
                  name     price
id date                         
10 2020-03-10  name_10  0.557488
11 2020-03-11  name_11  0.592082
12 2020-03-12  name_12  0.547747
Answered By: jezrael

Tuples used to access a MultiIndex address the different levels of the hierarchy. They are meant for that purpose, not for passing multiple items within the same level. For multiple selections within the same level you need another approach, such as the one in Jezrael's answer.

dates = ['2020-03-10', '2020-03-11', '2020-03-12']
filtered_df = df[df.index.get_level_values('date').isin(dates)]
Answered By: Celius Stingher

This is a slight variation from the answer provided by @jezrael.

You can use .loc combined with slice(None) like this:

dates = ['2020-03-10', '2020-03-11', '2020-03-12']

df.loc[(slice(None), dates), :]


                  name    price
id date                        
10 2020-03-10  name_10  0.36806
11 2020-03-11  name_11  0.20436
12 2020-03-12  name_12  0.00443

The first argument in .loc is a tuple that selects rows in the MultiIndex. slice(None) gets all the values from the first level id.
The list dates filters keys in the second level date. The second argument : selects all columns.
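As a rough check of the explanation above, the sketch below compares the slice(None) form, the pd.IndexSlice form, and the boolean-mask form on a rebuilt copy of the question's frame. The variable names (`via_tuple`, `via_index_slice`, `via_mask`) are illustrative, and the labels are converted to Timestamps up front as an assumption, so the example does not depend on how strings are parsed against a datetime level.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": list(range(10, 20)),
    "name": [f"name_{i}" for i in range(10, 20)],
    "price": rng.random(10),
    "date": pd.date_range("2020-03-10", "2020-03-19"),
}).set_index(["id", "date"])

# A plain list of Timestamps: a list means "several labels on this level".
dates = list(pd.to_datetime(["2020-03-10", "2020-03-11", "2020-03-12"]))

# Row indexer as an explicit tuple: every 'id', only the listed 'date' labels.
via_tuple = df.loc[(slice(None), dates), :]

# pd.IndexSlice builds the same tuple using the more readable ':' syntax.
via_index_slice = df.loc[pd.IndexSlice[:, dates], :]

# Boolean mask on the 'date' level, for comparison.
via_mask = df[df.index.get_level_values("date").isin(dates)]

print(via_tuple.equals(via_index_slice))
print(via_tuple.equals(via_mask))
```

Note that none of these calls reassigns `df`, so the original frame stays available for further slicing.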

In the Pandas Documentation – MultiIndex – Advanced Indexing you can find:

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
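The distinction in that passage can be sketched with the question's frame. This is a minimal illustration, assuming Timestamp labels for the date level; the names `d1`, `d2`, `one_row`, `two_rows`, and `by_date` are made up for the example.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": list(range(10, 20)),
    "name": [f"name_{i}" for i in range(10, 20)],
    "price": rng.random(10),
    "date": pd.date_range("2020-03-10", "2020-03-19"),
}).set_index(["id", "date"])

d1, d2 = pd.Timestamp("2020-03-10"), pd.Timestamp("2020-03-11")

# Tuple: ONE key that spans the levels (id=10, date=d1), so one row comes back.
one_row = df.loc[(10, d1), :]

# List of tuples: SEVERAL complete keys, one row per key.
two_rows = df.loc[[(10, d1), (11, d2)], :]

# List inside a tuple: several labels on the 'date' level, everything on 'id'.
by_date = df.loc[(slice(None), [d1, d2]), :]

print(one_row["name"])
print(two_rows.index.tolist())
print(by_date.index.tolist())
```

This is why passing a tuple of three dates to xs with level='date' fails: the tuple is read as one key with three parts, not as three alternative labels.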

Answered By: Martin Clausse