Pandas recalculate index after a concatenation
Question:
I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to “re-calculate the index, given the current order”, or “re-index” (or so I thought). It turns out that isn’t exactly what DataFrame.reindex does.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with “cannot reindex from a duplicate axis.” I don’t want to change the order of my data… just need to delete the old index and set up a new one, with the order of rows preserved.
Answers:
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop=True so that the old index is discarded instead of being added to your dataframe as an extra column.
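To make the effect of drop concrete, here is a minimal sketch. The two small frames are made up for illustration and are not the asker's data:

```python
import pandas as pd

# Hypothetical stand-ins for the concatenated class dataframes.
stacked = pd.concat([
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3, 4]}),
])

# Without drop=True the old labels are kept as a new column named "index".
kept = stacked.reset_index()
# With drop=True the old labels are simply thrown away.
dropped = stacked.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'a']
print(list(dropped.columns))  # ['a']
print(list(dropped.index))    # [0, 1, 2, 3]
```

In both cases the row order is unchanged; only the labels differ.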
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df = train_df.reset_index(drop=True)
(reset_index returns a new dataframe, so assign the result back; alternatively, you can do this in place by passing inplace=True.)
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
   a
0  1
1  2
2  1
3  2
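As for why the original reindex call failed: reindex looks rows up by their existing labels rather than renumbering them, and after the concatenation those labels are no longer unique. A small sketch of that behaviour, using made-up frames (the exact wording of the error message varies between pandas versions):

```python
import pandas as pd

# Hypothetical frames, each with its own default 0..n-1 index.
stacked = pd.concat([
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3, 4, 5]}),
])

print(list(stacked.index))      # [0, 1, 0, 1, 2]
print(stacked.index.is_unique)  # False

# reindex selects rows by label, which is ambiguous when labels repeat.
try:
    stacked.reindex(index=range(len(stacked)))
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print("reindex failed:", exc)
```

reset_index does not look anything up by label, which is why it works here.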
If your index is autogenerated and you don’t want to keep it, you can use the ignore_index option of pd.concat:
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
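A short sketch of what that looks like end to end. The contents of train_class_df_list below are invented for illustration, since the question does not show them:

```python
import pandas as pd

# Hypothetical per-class dataframes; the real ones come from the asker's code.
train_class_df_list = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3, 4, 5]}),
]

# ignore_index=True discards the input labels and numbers the rows 0..n-1.
train_df = pd.concat(train_class_df_list, ignore_index=True)

print(list(train_df.index))   # [0, 1, 2, 3, 4]
print(train_df["a"].tolist()) # [1, 2, 3, 4, 5]
```

This gives the same labels as concatenating first and then calling reset_index(drop=True), but in a single step.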