get index per ID – from example data using text input to pandas dataframe? easily reproducible

Question:

Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1

import pandas as pd
from io import StringIO

txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5 
AB-CL-34,07/10/2022 11:15:10,7 
AB-CL-34,09/10/2022 15:30:30,13 
BX-RT-55,06/10/2022 11:30:22,0 
BX-RT-55,10/10/2022 22:44:11,1 
BX-RT-55,10/10/2022 23:30:22,6 
"""

df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)

How do I index df such that there is an integer index according to each ID, keeping the ID variable?

imagined output

        ID            datetime value
0 AB-CL-34 07/10/2022 10:00:00     5 
0 AB-CL-34 07/10/2022 11:15:10     7 
0 AB-CL-34 09/10/2022 15:30:30    13 
1 BX-RT-55 06/10/2022 11:30:22     0 
1 BX-RT-55 10/10/2022 22:44:11     1 
1 BX-RT-55 10/10/2022 23:30:22     6 

edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.

Asked By: Johan_A

||

Answers:

You could try:

df = pd.read_csv(StringIO(txt),
             parse_dates=[1],
             dayfirst=True)
             .assign(id_index= lambda df: df
             .groupby('ID', sort=False).ngroup())
             .set_index("id_index")
             .rename_axis(index=None)

Output

+----+----------+---------------------+---------+
|    |       ID |            datetime |   value |
|----+----------+---------------------+---------|
|  0 | AB-CL-34 | 2022-10-07 10:00:00 |       5 |
|  0 | AB-CL-34 | 2022-10-07 11:15:10 |       7 |
|  0 | AB-CL-34 | 2022-10-09 15:30:30 |      13 |
|  1 | BX-RT-55 | 2022-10-06 11:30:22 |       0 |
|  1 | BX-RT-55 | 2022-10-10 22:44:11 |       1 |
|  1 | BX-RT-55 | 2022-10-10 23:30:22 |       6 |
+----+----------+---------------------+---------+

References:

  1. indexing, please have a look at the official reference. This is a way better explanation as I could do.
  2. index=None to remove the index name (try it without this statement)
  3. lambda is an anonymous function. You could also do something like df["id_col"] = df["ID"] - 1. lambda makes it easier and faster.
  4. the brackets are so that you can do line breaks without for linebreak (so it’s just for styling)
Answered By: Marco_CH