What causes "indexing past lexsort depth" warning in Pandas?
Question:
I’m indexing a large multi-index Pandas df using df.loc[(key1, key2)]. Sometimes I get a series back (as expected), but other times I get a dataframe. I’m trying to isolate the cases which cause the latter, but so far all I can see is that it’s correlated with getting a "PerformanceWarning: indexing past lexsort depth may impact performance" warning.
I’d like to reproduce it to post here, but I can’t generate another case that gives me the same warning. Here’s my attempt:
def random_dates(start, end, n=10):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
np.random.seed(0)
df = pd.DataFrame(np.random.random(3255000).reshape(465000,7)) # same shape as my data
df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
df = df.set_index([0, 'date'])
df = df.sort_values(by=[3]) # unsort indices, just in case
df.index.lexsort_depth
> 0
df.index.is_monotonic
> False
df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]
# no warning
So my question is: what causes this warning? How do I artificially induce it?
Answers:
According to the pandas advanced indexing documentation (section "Sorting a MultiIndex"):
"On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex."
And also:
"Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view."
So, per the docs, you may need to make sure the index is sorted properly.
TL;DR: your index is unsorted and this severely impacts performance.
Sort your DataFrame’s index using df.sort_index() to address the warning and improve performance.
I’ve actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").
To reproduce,
mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)
         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15
You’ll notice that the second level is not properly sorted.
Now, try to index a specific cross section:
df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.
  # encoding: utf-8
         col
one two
c   u      9
You’ll see the same behaviour with xs:
df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.
  self.interact()
         col
one two
c   u      9
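To check for the warning programmatically (in a test, say) rather than by eye, it can be recorded with the standard warnings module. This is only a minimal sketch: it rebuilds the same mux frame, assumes pandas.errors.PerformanceWarning is the category being raised, and perf_warnings is a helper name made up for this example.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)


def perf_warnings(frame, key):
    """Hypothetical helper: look up `key`, return (result, PerformanceWarnings seen)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        result = frame.loc[key]
    hits = [w for w in caught if issubclass(w.category, PerformanceWarning)]
    return result, hits


result, hits = perf_warnings(df, ("c", "u"))
sorted_result, sorted_hits = perf_warnings(df.sort_index(), ("c", "u"))

print(len(hits), "PerformanceWarning(s) on the unsorted frame")
print(len(sorted_hits), "PerformanceWarning(s) on the sorted frame")
```

Whether the unsorted lookup actually warns may depend on the pandas version, so treat the first count as something to observe rather than rely on.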
The docs, backed by a timing test I once did, suggest that handling unsorted indexes imposes a slowdown: indexing is O(N) time when it could/should be O(1).
If you sort the index before slicing, you’ll notice the difference:
df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]
         col
one two
c   u      9
%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]
802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.
df.index.is_lexsorted()
# False
df2.index.is_lexsorted()
# True
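As far as I know, is_lexsorted has been deprecated in more recent pandas releases and may not be available at all in the newest ones. A hedged alternative, assuming MultiIndex.is_monotonic_increasing is an acceptable stand-in for this check (it compares the index values rather than the internal codes), rebuilt here on the same example frame:

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# The second level goes w -> t inside group 'b', so the raw index is not ordered.
unsorted_ok = df.index.is_monotonic_increasing
# After sort_index() the entries are in non-decreasing order (duplicates allowed).
sorted_ok = df2.index.is_monotonic_increasing

print(unsorted_ok, sorted_ok)
```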
As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
If your index is not unique, add a cumcounted level first (note that the result of set_index has to be assigned, since it does not modify df in place),
df2 = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
df2 = df2.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
df2 = df2.reset_index(level=-1, drop=True)
Series vs. dataframe output:
I also had the problem that the output of df.loc[(index1, index2)] was sometimes a series and sometimes a dataframe. I found that this was caused by duplicated indices: if the dataframe has any duplicated index entries, df.loc[(index1, index2)] returns a dataframe; otherwise it returns a series.
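A small sketch of that difference, using two made-up frames (the index values and the column names x and y are invented for illustration). Both indexes are already sorted, so the lookups should not involve the lexsort warning:

```python
import pandas as pd

names = ["one", "two"]
unique_idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=names
)
dup_idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 1)], names=names
)

df_unique = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=unique_idx)
df_dup = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]}, index=dup_idx)

# Unique index: a full key identifies exactly one row, returned as a Series.
from_unique = df_unique.loc[("a", 2)]
# Non-unique index: the duplicated key matches two rows, returned as a DataFrame.
from_dup = df_dup.loc[("b", 1)]

print(type(from_unique).__name__)
print(type(from_dup).__name__)
```

In the ('c', 'u') output earlier on this page, even a key that occurs only once came back as a one-row dataframe, which is consistent with the index as a whole being non-unique.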
In my case, the "PerformanceWarning: indexing past lexsort depth may impact performance" was caused by a duplicate index on the df.
Case:
I was reading an Excel file with pandas in a loop, once per sheet name.
The sheet with a duplicate index gives: PerformanceWarning: indexing past lexsort depth may impact performance