get index per ID – from example data using text input to pandas dataframe? easily reproducible
Question:
Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1
import pandas as pd
from io import StringIO
txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5
AB-CL-34,07/10/2022 11:15:10,7
AB-CL-34,09/10/2022 15:30:30,13
BX-RT-55,06/10/2022 11:30:22,0
BX-RT-55,10/10/2022 22:44:11,1
BX-RT-55,10/10/2022 23:30:22,6
"""
df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)
How do I index df
such that there is an integer index according to each ID, keeping the ID variable?
imagined output
ID datetime value
0 AB-CL-34 07/10/2022 10:00:00 5
0 AB-CL-34 07/10/2022 11:15:10 7
0 AB-CL-34 09/10/2022 15:30:30 13
1 BX-RT-55 06/10/2022 11:30:22 0
1 BX-RT-55 10/10/2022 22:44:11 1
1 BX-RT-55 10/10/2022 23:30:22 6
edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.
Answers:
You could try:
df = pd.read_csv(StringIO(txt),
parse_dates=[1],
dayfirst=True)
.assign(id_index= lambda df: df
.groupby('ID', sort=False).ngroup())
.set_index("id_index")
.rename_axis(index=None)
Output
+----+----------+---------------------+---------+
| | ID | datetime | value |
|----+----------+---------------------+---------|
| 0 | AB-CL-34 | 2022-10-07 10:00:00 | 5 |
| 0 | AB-CL-34 | 2022-10-07 11:15:10 | 7 |
| 0 | AB-CL-34 | 2022-10-09 15:30:30 | 13 |
| 1 | BX-RT-55 | 2022-10-06 11:30:22 | 0 |
| 1 | BX-RT-55 | 2022-10-10 22:44:11 | 1 |
| 1 | BX-RT-55 | 2022-10-10 23:30:22 | 6 |
+----+----------+---------------------+---------+
References:
- indexing, please have a look at the official reference. This is a way better explanation as I could do.
index=None
to remove the index name (try it without this statement)
- lambda is an anonymous function. You could also do something like
df["id_col"] = df["ID"] - 1
. lambda
makes it easier and faster.
- the brackets are so that you can do line breaks without
for linebreak (so it’s just for styling)
Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1
import pandas as pd
from io import StringIO
txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5
AB-CL-34,07/10/2022 11:15:10,7
AB-CL-34,09/10/2022 15:30:30,13
BX-RT-55,06/10/2022 11:30:22,0
BX-RT-55,10/10/2022 22:44:11,1
BX-RT-55,10/10/2022 23:30:22,6
"""
df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)
How do I index df
such that there is an integer index according to each ID, keeping the ID variable?
imagined output
ID datetime value
0 AB-CL-34 07/10/2022 10:00:00 5
0 AB-CL-34 07/10/2022 11:15:10 7
0 AB-CL-34 09/10/2022 15:30:30 13
1 BX-RT-55 06/10/2022 11:30:22 0
1 BX-RT-55 10/10/2022 22:44:11 1
1 BX-RT-55 10/10/2022 23:30:22 6
edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.
You could try:
df = pd.read_csv(StringIO(txt),
parse_dates=[1],
dayfirst=True)
.assign(id_index= lambda df: df
.groupby('ID', sort=False).ngroup())
.set_index("id_index")
.rename_axis(index=None)
Output
+----+----------+---------------------+---------+
| | ID | datetime | value |
|----+----------+---------------------+---------|
| 0 | AB-CL-34 | 2022-10-07 10:00:00 | 5 |
| 0 | AB-CL-34 | 2022-10-07 11:15:10 | 7 |
| 0 | AB-CL-34 | 2022-10-09 15:30:30 | 13 |
| 1 | BX-RT-55 | 2022-10-06 11:30:22 | 0 |
| 1 | BX-RT-55 | 2022-10-10 22:44:11 | 1 |
| 1 | BX-RT-55 | 2022-10-10 23:30:22 | 6 |
+----+----------+---------------------+---------+
References:
- indexing, please have a look at the official reference. This is a way better explanation as I could do.
index=None
to remove the index name (try it without this statement)- lambda is an anonymous function. You could also do something like
df["id_col"] = df["ID"] - 1
.lambda
makes it easier and faster. - the brackets are so that you can do line breaks without
for linebreak (so it’s just for styling)