What is as_index in groupby in pandas?
Question:
What exactly is the function of as_index in groupby in Pandas?
Answers:
print() is your friend when you don’t understand something; it often clears up doubts.
Take a look:
import pandas as pd
df = pd.DataFrame(data={'books':['bk1','bk1','bk1','bk2','bk2','bk3'], 'price': [12,12,12,15,15,17]})
print(df)
print(df.groupby('books', as_index=True).sum())
print(df.groupby('books', as_index=False).sum())
Output:
  books  price
0   bk1     12
1   bk1     12
2   bk1     12
3   bk2     15
4   bk2     15
5   bk3     17

       price
books
bk1       36
bk2       30
bk3       17

  books  price
0   bk1     36
1   bk2     30
2   bk3     17
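When as_index=True (the default), the key(s) you pass to groupby() become the index of the new DataFrame. When as_index=False, they stay as ordinary columns and the result gets a default integer index.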
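The "key(s)" part matters when you group by more than one column. The following is a minimal sketch, assuming a hypothetical extra column named store that is not in the original example; it only checks where the keys end up.

```python
import pandas as pd

# Hypothetical data: the 'store' column is added purely for illustration.
df = pd.DataFrame({
    'books': ['bk1', 'bk1', 'bk2', 'bk2'],
    'store': ['a', 'b', 'a', 'a'],
    'price': [12, 12, 15, 15],
})

# Both keys become levels of a MultiIndex.
by_index = df.groupby(['books', 'store'], as_index=True).sum()

# Both keys stay as ordinary columns next to the aggregated 'price'.
by_column = df.groupby(['books', 'store'], as_index=False).sum()

print(list(by_index.index.names))
print(list(by_column.columns))
```

With several keys, as_index=True gives a MultiIndex, so a single row is addressed with a tuple such as by_index.loc[('bk2', 'a')].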
The benefits of having the grouped column as the index are (df below stands for the grouped result):
- Speed. When you filter by index label, e.g. df.loc['bk1'], the lookup is faster because the index is backed by a hash table. pandas doesn’t have to scan the entire books column to find 'bk1'; it hashes 'bk1' and goes straight to the matching row.
- Ease. With as_index=True you can write df.loc['bk1'], which is shorter than df.loc[df.books=='bk1'] and avoids the full column comparison.
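To make the two lookup styles concrete, here is a small sketch that reuses the books data from above. The variable names indexed and flat are only illustrative, and on a frame this small any speed difference is negligible; the point is the shape of each lookup.

```python
import pandas as pd

df = pd.DataFrame(data={'books': ['bk1', 'bk1', 'bk1', 'bk2', 'bk2', 'bk3'],
                        'price': [12, 12, 12, 15, 15, 17]})

indexed = df.groupby('books', as_index=True).sum()   # 'books' is the index
flat = df.groupby('books', as_index=False).sum()     # 'books' is a column

# Label lookup on the index: returns the matching row as a Series.
row_by_label = indexed.loc['bk1']

# Boolean mask on a column: compares every value, returns a DataFrame.
rows_by_mask = flat.loc[flat['books'] == 'bk1']

print(row_by_label['price'])
print(rows_by_mask['price'].iloc[0])
```

Both lookups reach the same aggregated price; they differ in the return type (a Series versus a one-row DataFrame) and in how the row is located.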
When using groupby(), as_index can be set to True or False depending on whether you want the column you grouped by to be the index of the output.
import pandas as pd
table_r = pd.DataFrame({
    'colors': ['orange', 'red', 'orange', 'red'],
    'price': [1000, 2000, 3000, 4000],
    'quantity': [500, 3000, 3000, 4000],
})
# DataFrame.sort() was removed from pandas; sort_values() is the current equivalent
new_group = table_r.groupby('colors', as_index=True).count().sort_values('price', ascending=False)
print(new_group)
Output:
        price  quantity
colors
orange      2         2
red         2         2

Now with as_index=False:
   colors  price  quantity
0  orange      2         2
1     red      2         2
Note how colors is no longer the index once we set as_index=False; it is an ordinary column again.
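Another way to look at this: for a simple aggregation like this one, as_index=False should give the same frame as grouping with the default and then calling reset_index(). The sketch below checks that on the same table; the names direct and via_reset are just illustrative.

```python
import pandas as pd

table_r = pd.DataFrame({
    'colors': ['orange', 'red', 'orange', 'red'],
    'price': [1000, 2000, 3000, 4000],
    'quantity': [500, 3000, 3000, 4000],
})

# Keys kept as a column straight away.
direct = table_r.groupby('colors', as_index=False).count()

# Keys moved into the index first, then turned back into a column.
via_reset = table_r.groupby('colors', as_index=True).count().reset_index()

print(direct.equals(via_reset))
```

So if you already have a result grouped with as_index=True, reset_index() is a way to get the flat layout without regrouping.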
One limitation of as_index=True is that the grouped keys end up in the index, so you can’t refer to them as columns in df.pivot(). You have to pass as_index=False (or call reset_index() on the result) before calling the pivot:
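# df is assumed to be a car dataset with 'drive-wheels', 'body-style' and 'price' columns
df_test = df[['drive-wheels', 'body-style', 'price']]
# as_index=False keeps the group keys as columns so pivot() can refer to them by name
df_group = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
df_pivot = df_group.pivot(index='drive-wheels', columns='body-style')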
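Since the car dataset itself is not shown, here is a self-contained sketch with a tiny made-up stand-in (the cars frame and its values are hypothetical). It assumes pivot() raises a KeyError when the named columns only exist as index levels, which is the behaviour I would expect from pandas but is worth confirming on your version.

```python
import pandas as pd

# Hypothetical stand-in for the car data; the values are invented.
cars = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd', 'rwd', 'rwd'],
    'body-style': ['sedan', 'hatchback', 'sedan', 'sedan', 'hatchback'],
    'price': [10000.0, 8000.0, 20000.0, 22000.0, 15000.0],
})
keys = ['drive-wheels', 'body-style']

# as_index=False: the keys are columns, so pivot() can find them.
flat = cars.groupby(keys, as_index=False).mean()
pivot_ok = flat.pivot(index='drive-wheels', columns='body-style', values='price')

# as_index=True: the keys live in the index, so the column lookup fails.
indexed = cars.groupby(keys, as_index=True).mean()
try:
    indexed.pivot(index='drive-wheels', columns='body-style', values='price')
    pivot_failed = False
except KeyError:
    pivot_failed = True

# reset_index() moves the keys back into columns, after which pivot() works.
pivot_after_reset = indexed.reset_index().pivot(
    index='drive-wheels', columns='body-style', values='price')

print(pivot_ok)
print(pivot_failed)
```

Passing values='price' here keeps the pivoted columns flat; the original snippet omits it, which yields a column MultiIndex with price as the top level.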