What is the use of reset_index() in pandas?
Question:
While reading this article, I came across this statement.
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
Everything except the reset_index() method call is clear to me.
My question is: what will happen if I don't call reset_index() in the sequence below?
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
I tried to understand this method from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html, but I couldn't work out what it means to reset the index of a DataFrame.
Answers:
A simplified explanation: reset_index() takes the current index and moves it into a column. That column is called 'index' when the index has no name; otherwise it takes the index's name. It then creates a new sequential index (0, 1, 2, ...) for the data set.
df = pd.DataFrame([20, 30, 40, 50], index=[2, 3, 4, 5])
    0
2  20
3  30
4  40
5  50
df.reset_index()
   index   0
0      2  20
1      3  30
2      4  40
3      5  50
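As a minimal sketch of the naming rule described above, the snippet below resets an unnamed and a named index. The column name "value" and the index name "key" are made up for illustration.

```python
import pandas as pd

# Unnamed index: the old index lands in a column called "index".
unnamed = pd.DataFrame({"value": [20, 30, 40, 50]}, index=[2, 3, 4, 5])
flat_unnamed = unnamed.reset_index()
print(list(flat_unnamed.columns))   # ['index', 'value']

# Named index: the new column takes the index name instead.
named = unnamed.rename_axis("key")  # "key" is an illustrative name
flat_named = named.reset_index()
print(list(flat_named.columns))     # ['key', 'value']

# reset_index() returns a new object; the original keeps its index.
print(list(unnamed.index))          # [2, 3, 4, 5]
```

Because the call is not in-place by default, the result has to be assigned to a name (as the statement in the question does) for the reset to have any effect.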
I think it is better here to use GroupBy.transform, which returns a new Series of the same size as the original DataFrame, filled with the aggregated values, so merge is not necessary:
df_1 = pd.DataFrame({
'A':list('abcdef'),
'ext price':[5,3,6,9,2,4],
'order':list('aaabbb')
})
order_total1 = df_1.groupby('order')["ext price"].transform('sum')
df_1["Percent_of_Order"] = df_1["ext price"] / order_total1
print (df_1)
   A  ext price order  Percent_of_Order
0  a          5     a          0.357143
1  b          3     a          0.214286
2  c          6     a          0.428571
3  d          9     b          0.600000
4  e          2     b          0.133333
5  f          4     b          0.266667
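To check that the transform route gives the same percentages as the groupby, reset_index, merge route from the question, here is a small comparison on the same sample data. It is only a sketch; the variable names pct_merge and pct_transform are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdef"),
    "ext price": [5, 3, 6, 9, 2, 4],
    "order": list("aaabbb"),
})

# Route 1: aggregate, reset_index, then merge back on the shared "order" column.
order_total = (
    df.groupby("order")["ext price"].sum().rename("Order_Total").reset_index()
)
merged = df.merge(order_total)
pct_merge = merged["ext price"] / merged["Order_Total"]

# Route 2: transform broadcasts each group's sum back onto the original rows.
pct_transform = df["ext price"] / df.groupby("order")["ext price"].transform("sum")

print(pct_merge.round(6).tolist())
print(pct_transform.round(6).tolist())
```

Both routes divide each row by its group total; transform simply skips the intermediate two-column DataFrame and keeps the original row index.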
My question is what will happen if I don’t call reset_index() considering the sequence?
Before reset_index() the result is a Series. Calling reset_index() converts that Series into a two-column DataFrame: the first column is named after the index and the second column is named after the Series.
order_total = df_1.groupby('order')["ext price"].sum().rename("Order_Total")
print (order_total)
order
a 14
b 15
Name: Order_Total, dtype: int64
print (type(order_total))
<class 'pandas.core.series.Series'>
print (order_total.name)
Order_Total
print (order_total.index.name)
order
print (order_total.reset_index())
order Order_Total
0 a 14
1 b 15
The reason your code needs a two-column DataFrame is that merge is called with no parameters. In that case merge joins on the intersection of the column names common to both DataFrames (as if they had been passed to the on parameter), which here is the order column.
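The following sketch shows what this means for the sequence in the question: without reset_index(), "order" is the index of the Series rather than a column, so a bare merge has nothing to join on. The left_on/right_index call is an alternative not taken from the article, and the sample data is the same as above.

```python
import pandas as pd

df = pd.DataFrame({
    "ext price": [5, 3, 6, 9, 2, 4],
    "order": list("aaabbb"),
})
order_total = df.groupby("order")["ext price"].sum().rename("Order_Total")

# Without reset_index(), "order" is the index of the Series, not a column,
# so merge() finds no common column names and raises MergeError.
try:
    df.merge(order_total)
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)  # True

# With reset_index(), "order" is a column in both frames.
via_reset = df.merge(order_total.reset_index())

# Alternative: keep the index and spell out the join keys explicitly.
via_index = df.merge(order_total, left_on="order", right_index=True)

print(via_reset["Order_Total"].tolist())
print(via_index["Order_Total"].tolist())
```

So reset_index() is one way to make the join key visible to merge; naming the keys explicitly is another.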
reset_index() creates a new index starting from 0 and moves any column that was set as the index back into the regular columns.
import pandas as pd
df = pd.DataFrame(
{
"ID": [1, 2, 3, 4, 5],
"name": [
"Hello Kitty",
"Hello Puppy",
"It is an Helloexample",
"for stackoverflow",
"Hello World",
],
}
)
newdf = df.set_index('ID')
print(newdf)                # before reset_index()
print(newdf.reset_index())  # after reset_index()
Output before reset_index():
name
ID
1 Hello Kitty
2 Hello Puppy
3 It is an Helloexample
4 for stackoverflow
5 Hello World
Output after reset_index():
ID name
0 1 Hello Kitty
1 2 Hello Puppy
2 3 It is an Helloexample
3 4 for stackoverflow
4 5 Hello World
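If the old index should be discarded rather than kept as a column, reset_index() accepts drop=True. A short sketch using a shortened version of the same frame (three rows instead of five, purely for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "name": ["Hello Kitty", "Hello Puppy", "Hello World"],
})
indexed = df.set_index("ID")

kept = indexed.reset_index()               # "ID" comes back as a column
dropped = indexed.reset_index(drop=True)   # "ID" is discarded

print(list(kept.columns))     # ['ID', 'name']
print(list(dropped.columns))  # ['name']
print(list(dropped.index))    # [0, 1, 2]
```

In the merge scenario from the question, drop=True would be the wrong choice, because the "order" values are exactly what the merge needs as a column.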
To answer your question:
My question is what will happen if I don’t call reset_index() considering the sequence?
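The result will be indexed by the key(s) you grouped on, e.g. 'order' in your case. With a single key this is an ordinary named index; it becomes a multi-index only when you group by several keys.
Specific to the article, the merge after the group-by is called without arguments, so it joins on shared column names. While 'order' sits in the index instead of in a column, the two objects have no column in common and the merge cannot be performed as intended.
Hence, reset_index() is needed to perform the correct merge.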
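To illustrate when a multi-index actually appears, the sketch below groups by one key and by two keys. The "sku" column is a hypothetical second key that is not in the article's data.

```python
import pandas as pd

df = pd.DataFrame({
    "order": list("aaabbb"),
    "sku": list("xxyxyy"),          # hypothetical second grouping key
    "ext price": [5, 3, 6, 9, 2, 4],
})

one_key = df.groupby("order")["ext price"].sum()
two_keys = df.groupby(["order", "sku"])["ext price"].sum()

# One key gives a plain named index; two keys give a MultiIndex.
print(isinstance(one_key.index, pd.MultiIndex))   # False
print(isinstance(two_keys.index, pd.MultiIndex))  # True

# reset_index() turns every index level into an ordinary column.
flat = two_keys.reset_index()
print(list(flat.columns))  # ['order', 'sku', 'ext price']

# as_index=False asks groupby to keep the keys as columns from the start.
same = df.groupby(["order", "sku"], as_index=False)["ext price"].sum()
print(list(same.columns))  # ['order', 'sku', 'ext price']
```

Either form produces a flat DataFrame whose key columns a bare merge can pick up.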