How to get the index of the ith item in a pandas.Series or pandas.DataFrame
Question:
I’m trying to get the index of the 6th item in a Series I have.
This is what the head looks like:
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
To get the 6th index value (the 6th country after sorting), I usually call s.head(6) and read the 6th index from there. s.head(6) gives me:
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
and from this Series, I get United Kingdom as the 6th index.
So, is there a better way to get the index than this? And for a DataFrame, is there a function that returns the 6th index based on a given column after sorting?
For a DataFrame, I usually sort, use reset_index to create a new column named index, and then use iloc to get the 6th row (since the index is a range after the reset).
Is there a better way to do this with pd.Series and pd.DataFrame?
Answers:
You could get it straight from the index
s.index[5]
Or
s.index.values[5]
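As a small illustration of this positional lookup, here is a sketch that rebuilds the six rows shown in the question by hand. The way the Series is constructed and the names sixth_label and sixth_label_np are assumptions made for the example.

```python
import pandas as pd

# The six rows shown in the question, rebuilt by hand for illustration.
s = pd.Series(
    [1.536434e13, 6.348609e12, 5.542208e12,
     3.493025e12, 2.681725e12, 2.487907e12],
    index=["United States", "China", "Japan",
           "Germany", "France", "United Kingdom"],
)

# An Index supports positional indexing, so position 5 holds the 6th label.
sixth_label = s.index[5]

# .values exposes the labels as a NumPy array; the lookup is the same.
sixth_label_np = s.index.values[5]

print(sixth_label, sixth_label_np)
```

Both lookups read the label at a position in the current order of the Series, so they only answer the question if the Series is already sorted.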
It all depends on what you consider "better". I can tell you that a numpy approach will probably be faster.
For example, numpy.argsort returns an array of positions: its first element is the position, in the array being sorted, of the value that should come first; its second element is the position of the value that should come second; and so on.
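To make the argsort description concrete, here is a minimal sketch on a three-element array. The array contents and the names values and order are made up for the example.

```python
import numpy as np

values = np.array([30, 10, 20])

# argsort returns positions, not values: order[0] is where the smallest
# value lives in `values`, order[1] where the second smallest lives, etc.
order = np.argsort(values)

print(order)          # [1 2 0]
print(values[order])  # [10 20 30]
```

Indexing the original array with the result reproduces the sorted array, which is why the positions can also be used to look up the matching labels in a Series index.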
So you could do this to get the index value of the 6th item after sorting in ascending order.
s.index.values[s.values.argsort()[5]]
Or more transparently
s.sort_values().index[5]
Or more creatively
s.nsmallest(6).idxmax()
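The three variants above can be compared side by side. The following is a sketch on a small made-up Series (the letter labels and values are assumptions, not the asker's data), chosen so that the 6th smallest value is easy to verify by eye.

```python
import pandas as pd

# Toy data: ascending order is d, b, f, h, a, g, e, c.
s = pd.Series([50, 20, 80, 10, 70, 30, 60, 40], index=list("abcdefgh"))

# NumPy route: position of the 6th smallest value, mapped back to its label.
via_argsort = s.index.values[s.values.argsort()[5]]

# Pandas route: sort the whole Series, then read the 6th label.
via_sort = s.sort_values().index[5]

# Keep only the 6 smallest values; the largest of those is the 6th smallest.
via_nsmallest = s.nsmallest(6).idxmax()

print(via_argsort, via_sort, via_nsmallest)  # g g g
```

All three return the label of the value 60, which is the 6th smallest of the eight values.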
If you are trying to get the index of the ith item, then as piRSquared mentioned, s.index[i-1] suffices.
If you want to get the index of the ith largest value, as in the OP, then instead of sorting the whole column / Series, a faster way is to combine nlargest and idxmin:
i = 6
s.nlargest(i).idxmin()
Or use argpartition and index the result. It is particularly fast because it only guarantees that the ith element is in its final sorted position (which is the only thing we care about here), so it is much faster than fully sorting the elements (a timeit test shows that it is about 15 times faster than a full sort and 3 times faster than nlargest followed by idxmin). Note that argpartition returns a position rather than a label, so for a Series with a non-default index, pass the result to s.index[...].
s.values.argpartition(len(s)-i)[-i]
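Here is a sketch of both ith-largest approaches on a small labelled Series. The data and variable names are made up for the example, and the argpartition result is passed through s.index because argpartition returns a position, not a label.

```python
import pandas as pd

# Toy data: descending order is c, e, g, a, h, f, b, d.
s = pd.Series([50, 20, 80, 10, 70, 30, 60, 40], index=list("abcdefgh"))
i = 3  # look for the 3rd largest value

# Keep the i largest values; the smallest of those is the ith largest.
via_nlargest = s.nlargest(i).idxmin()

# Partition so that slot len(s)-i holds the element that a full ascending
# sort would put there; counted from the end, that is slot -i.
pos = s.values.argpartition(len(s) - i)[-i]
via_argpartition = s.index[pos]

print(via_nlargest, via_argpartition)  # g g
```

With a default RangeIndex the position and the label coincide, so the s.index lookup can be dropped in that case.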
To get the index of the ith smallest value,
s.nsmallest(i).idxmax() # suggested by piRSquared
# or
s.values.argpartition(i-1)[i-1] # kth is zero-based, so i-1 puts the ith smallest at position i-1
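A sketch of the ith-smallest case on a small labelled Series follows (data and names are made up for the example). The kth argument of argpartition is zero-based, so the sketch passes i - 1 to guarantee that the slot it reads afterwards holds the ith smallest value.

```python
import pandas as pd

# Toy data: ascending order is d, b, f, h, a, g, e, c.
s = pd.Series([50, 20, 80, 10, 70, 30, 60, 40], index=list("abcdefgh"))
i = 3  # look for the 3rd smallest value

# Keep the i smallest values; the largest of those is the ith smallest.
via_nsmallest = s.nsmallest(i).idxmax()

# Only slot kth = i - 1 is guaranteed to be in its final sorted position.
pos = s.values.argpartition(i - 1)[i - 1]
via_argpartition = s.index[pos]

print(via_nsmallest, via_argpartition)  # f f
```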
A working example that gets the index of the 6th largest value in a Series:
import pandas as pd
s = pd.Series(range(1_000_000)).sample(frac=1).reset_index(drop=True)
x = s.sort_values(ascending=False).index[5]
y = s.values.argsort()[-6]
z = s.nlargest(6).idxmin()
w = s.values.argpartition(len(s)-6)[-6]
x == y == z == w # True
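The question also asks about a DataFrame, which the answers do not address directly. One possible approach, sketched here under the assumption that the same Series methods are simply applied to the column of interest, is shown below. The column names gdp and population and all values are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame; labels and numbers are made up for illustration.
df = pd.DataFrame(
    {
        "gdp": [50, 20, 80, 10, 70, 30, 60, 40],
        "population": [5, 2, 8, 1, 7, 3, 6, 4],
    },
    index=list("abcdefgh"),
)
i = 3  # look for the row with the 3rd largest gdp

# A single column is a Series, so nlargest/idxmin works without reset_index.
label = df["gdp"].nlargest(i).idxmin()

# Equivalent full-sort version, closer to the workflow in the question.
label_sorted = df.sort_values("gdp", ascending=False).index[i - 1]

# The label can then be used to fetch the whole row.
row = df.loc[label]
print(label, label_sorted, row["gdp"])
```

This avoids creating an extra index column, since the original labels stay attached to the values throughout.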