How to slice multiindex dataframe with list of labels on one level
Question:
MultiIndex dataframes are very powerful but personally I think there is no enough (clear) documentations on it, specially for different type of slicing…
Here is my question:
How to slice a multi-indexed dataframe just on one level with a list of labels?
Please help me if you have a solution (without reseting indexes and converting the dataframe to single level index! Which is obvious and not efficient)
For example, we have following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['id'] = pd.Series(range(10,20))
df['name'] = [f'name_{id}' for id in range(10,20)]
df['price'] = np.random.rand(df.index.size)
df['date'] = pd.date_range('20200310', '20200319')
df = df.set_index(['id', 'date'])
df
Slicing on one label is working just fine:
df.xs('2020-03-10', level='date', drop_level=False)
But how can we slice on a list of labels on that level?
df.xs(('2020-03-10', '2020-03-11', '2020-03-12'), level='date', drop_level=False)
This leads to an exception:
However Python doc says that “key” parameter could be a tuple as well:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
Answers:
For filter by multiple values use Index.get_level_values
with Index.isin
and boolean indexing
:
a = df[df.index.get_level_values('date').isin(('2020-03-10', '2020-03-11', '2020-03-12'))]
print (a)
name price
id date
10 2020-03-10 name_10 0.557772
11 2020-03-11 name_11 0.122315
12 2020-03-12 name_12 0.775976
However Python doc says that “key” parameter could be a tuple as well:
Tuple is possible use, but working differently – you can select by both labels like:
b = df.xs((10, '2020-03-10'), drop_level=False)
print (b)
name name_10
price 0.348808
Name: (10, 2020-03-10 00:00:00), dtype: object
c = df.xs((10, '2020-03-10'), level=('id','date'), drop_level=False)
print (c)
name price
id date
10 2020-03-10 name_10 0.239876
Like @yatu mentioned, another solution with IndexSlice
is with :
for all first levels and last :
for all columns:
df = df.loc[pd.IndexSlice[:, ['2020-03-10', '2020-03-11', '2020-03-12']], :]
print (df)
name price
id date
10 2020-03-10 name_10 0.557488
11 2020-03-11 name_11 0.592082
12 2020-03-12 name_12 0.547747
The use of tuples when accessing multiindex is meant to address the different levels/hierarchy. Tuples are meant for this use, not as a form of passing multiple items within the same hierarchy/level. For multiple selections within the same level you need to use some other functions such as the one Jezrael.
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
filtered_df = df[df.index.get_level_values('date').isin(dates)]
This is a slight variation from the answer provided by @jezrael.
You can use loc()
combined with slice(None)
like this:
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
df.loc[(slice(None), dates), :]
id date name price
10 2020-03-10 name_10 0.36806
11 2020-03-11 name_11 0.20436
12 2020-03-12 name_12 0.00443
The first argument in .loc
is a tuple that selects rows in the MultiIndex. slice(None)
gets all the values from the first level id
.
The list dates
filters keys in the second level date
. The second argument :
selects all columns.
In the Pandas Documentation – MultiIndex – Advanced Indexing you can find:
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
MultiIndex dataframes are very powerful but personally I think there is no enough (clear) documentations on it, specially for different type of slicing…
Here is my question:
How to slice a multi-indexed dataframe just on one level with a list of labels?
Please help me if you have a solution (without reseting indexes and converting the dataframe to single level index! Which is obvious and not efficient)
For example, we have following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['id'] = pd.Series(range(10,20))
df['name'] = [f'name_{id}' for id in range(10,20)]
df['price'] = np.random.rand(df.index.size)
df['date'] = pd.date_range('20200310', '20200319')
df = df.set_index(['id', 'date'])
df
Slicing on one label is working just fine:
df.xs('2020-03-10', level='date', drop_level=False)
But how can we slice on a list of labels on that level?
df.xs(('2020-03-10', '2020-03-11', '2020-03-12'), level='date', drop_level=False)
This leads to an exception:
However Python doc says that “key” parameter could be a tuple as well:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
For filter by multiple values use Index.get_level_values
with Index.isin
and boolean indexing
:
a = df[df.index.get_level_values('date').isin(('2020-03-10', '2020-03-11', '2020-03-12'))]
print (a)
name price
id date
10 2020-03-10 name_10 0.557772
11 2020-03-11 name_11 0.122315
12 2020-03-12 name_12 0.775976
However Python doc says that “key” parameter could be a tuple as well:
Tuple is possible use, but working differently – you can select by both labels like:
b = df.xs((10, '2020-03-10'), drop_level=False)
print (b)
name name_10
price 0.348808
Name: (10, 2020-03-10 00:00:00), dtype: object
c = df.xs((10, '2020-03-10'), level=('id','date'), drop_level=False)
print (c)
name price
id date
10 2020-03-10 name_10 0.239876
Like @yatu mentioned, another solution with IndexSlice
is with :
for all first levels and last :
for all columns:
df = df.loc[pd.IndexSlice[:, ['2020-03-10', '2020-03-11', '2020-03-12']], :]
print (df)
name price
id date
10 2020-03-10 name_10 0.557488
11 2020-03-11 name_11 0.592082
12 2020-03-12 name_12 0.547747
The use of tuples when accessing multiindex is meant to address the different levels/hierarchy. Tuples are meant for this use, not as a form of passing multiple items within the same hierarchy/level. For multiple selections within the same level you need to use some other functions such as the one Jezrael.
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
filtered_df = df[df.index.get_level_values('date').isin(dates)]
This is a slight variation from the answer provided by @jezrael.
You can use loc()
combined with slice(None)
like this:
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
df.loc[(slice(None), dates), :]
id date name price
10 2020-03-10 name_10 0.36806
11 2020-03-11 name_11 0.20436
12 2020-03-12 name_12 0.00443
The first argument in .loc
is a tuple that selects rows in the MultiIndex. slice(None)
gets all the values from the first level id
.
The list dates
filters keys in the second level date
. The second argument :
selects all columns.
In the Pandas Documentation – MultiIndex – Advanced Indexing you can find:
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).