How can I tag the smallest value in one group, the second smallest in another group and the third smallest in a third group in a pandas DataFrame?

Question:

I have the below data frame,

ID Group Date_Time_1 Date_Time_2 Difference New_Column
123 A 14-10-2021 15:19 14-10-2021 15:32 13 First
123 A 14-10-2021 15:19 14-10-2021 15:36 17 null
123 A 14-10-2021 15:19 14-10-2021 15:37 18 null
123 A 14-10-2021 15:19 14-10-2021 16:29 70 null
123 A 14-10-2021 15:19 14-10-2021 17:04 105 null
123 B 14-10-2021 15:21 14-10-2021 15:32 11 null
123 B 14-10-2021 15:21 14-10-2021 15:36 15 Second
123 B 14-10-2021 15:21 14-10-2021 15:37 16 null
123 B 14-10-2021 15:21 14-10-2021 16:29 68 null
123 B 14-10-2021 15:21 14-10-2021 17:04 103 null
123 C 14-10-2021 15:22 14-10-2021 15:32 10 null
123 C 14-10-2021 15:22 14-10-2021 15:36 14 null
123 C 14-10-2021 15:22 14-10-2021 15:37 15 Third
123 C 14-10-2021 15:23 14-10-2021 16:29 67 Third_A
123 C 14-10-2021 15:48 14-10-2021 17:04 102 Third_B
789 A 14-10-2021 15:19 14-10-2021 15:32 13 First
789 A 14-10-2021 15:19 14-10-2021 15:36 17 null
789 B 14-10-2021 15:21 14-10-2021 15:32 11 null
789 B 14-10-2021 15:21 14-10-2021 15:36 15 Second
789 C 14-10-2021 15:22 14-10-2021 15:32 10 null

I am trying to create a new column that assigns "First" to the smallest "Date_Time_2" in group "A" and "Second" to the second-smallest "Date_Time_2" in group "B".
Similarly, it should assign "Third" to the third-smallest "Date_Time_2" in group "C".

Once the loop reaches the last "Group" of an "ID", I want it to assign "Third" (or 3, as there are only three unique groups in the dataset) to the third-smallest "Date_Time_2" that was not used in the previous groups. If it then finds further "Date_Time_2" values for a new "Date_Time_1", it should assign "Third_A", "Third_B" and so on.

I have tried the below code but it is not working,

`df.drop('New_Column', axis = 1, inplace = True)
df['New_Column'] = pd.Series()
for i, v in df['Difference'].items():
    a = 0
    b = 1
    diff = df[df['Group'] == df['Group'].unique()[a]]['Difference'].nsmallest(b).min()
    if diff == v:
        df.loc[i, 'New_Column'] = "Yes"
        b = b + 1
    a = a + 1`

Any help here would be great!

Asked By: Kajal Singh


Answers:

First, make sure you read the CSV values correctly, i.e. the date-time columns must be parsed as datetimes, e.g.

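import pandas as pd

# Note: date_parser is deprecated since pandas 2.0; there, pass date_format="%d-%m-%Y %H:%M" instead.
date_parse = lambda x: pd.to_datetime(x, format="%d-%m-%Y %H:%M")
df = pd.read_csv('filename.csv', parse_dates=['Date_Time_1', 'Date_Time_2'], date_parser=date_parse)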

If you already have the dataframe, you can use the following code to parse the date-time columns inside it:

df['Date_Time_1'] = pd.to_datetime(df['Date_Time_1'], format="%d-%m-%Y %H:%M")
df['Date_Time_2'] = pd.to_datetime(df['Date_Time_2'], format="%d-%m-%Y %H:%M")

Now iterate over the groups, take the distinct Date_Time_2 values of each group in sorted order, and pick the value at the appropriate position: index 0 for group 'A', index 1 for group 'B', and so on.
Then select the matching rows and update the value in the new column:

df['New_Column'] = 'NA'
for index, group in enumerate(df['Group'].unique()):
    in_group = df['Group'] == group
    # Sort the distinct times so that position `index` is the (index + 1)-th smallest.
    unique_time = df.loc[in_group, 'Date_Time_2'].drop_duplicates().sort_values().iloc[index]
    df.loc[in_group & (df['Date_Time_2'] == unique_time), 'New_Column'] = index
print(df)

Note: assigning a number is a lot easier than assigning a word like 'first' or 'second'. If you want words, create a list of them and pick the value by index, like below:

df['New_Column'] = 'NA'
number_as_string = ['first', 'second', 'third']
for index, group in enumerate(df['Group'].unique()):
    in_group = df['Group'] == group
    # Sort the distinct times so that position `index` is the (index + 1)-th smallest.
    unique_time = df.loc[in_group, 'Date_Time_2'].drop_duplicates().sort_values().iloc[index]
    df.loc[in_group & (df['Date_Time_2'] == unique_time), 'New_Column'] = number_as_string[index]
print(df)
Answered By: Lokesh Kurre

You could try the following:

from string import ascii_uppercase as letters

# The sample dates are day-first, so state the format explicitly.
df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"], format="%d-%m-%Y %H:%M")
for n, (_, gdf) in enumerate(df.sort_values("Date_Time_2").groupby("Group")):
    nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
    df.loc[gdf[nths == n].index, "New"] = str(n + 1)
for i, c in zip(gdf[nths > n].index, letters):
    df.at[i, "New"] = f"{n + 1}_{c}"

  • First, make sure column Date_Time_2 contains datetimes.
  • Then sort df by Date_Time_2 and group it by Group.
  • In each group, number the distinct Date_Time_2 values from 0; for the nth group, write n + 1 into the New column for the rows holding its nth value.
  • Finally, take the last group and give its remaining, later rows the lettered values in the New column.
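
To make the numbering step concrete, here is a minimal sketch of what ngroup() returns for a single group. The toy frame and its index values are made up for illustration and are not taken from the code above:

```python
import pandas as pd

# Hypothetical toy group: three rows, two distinct timestamps.
gdf = pd.DataFrame(
    {
        "Date_Time_2": pd.to_datetime(
            ["2021-10-14 15:32", "2021-10-14 15:32", "2021-10-14 15:36"]
        )
    },
    index=[0, 15, 1],
)

# ngroup() numbers each distinct timestamp in sorted key order, starting at 0,
# and returns a Series aligned with gdf's index.
nths = gdf.groupby("Date_Time_2").ngroup()
print(nths.tolist())

# Rows holding the second-smallest timestamp (number 1):
print(list(gdf[nths == 1].index))
```

Because the result shares the group's index, a boolean mask such as nths == n can be used directly to pick the rows to tag.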

Maybe you have to replace the last part with

for k, c in zip(range(n + 1, nths.max() + 1), letters):
    df.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"

if the lettered values should be grouped too.

Result for the sample in the question:

     ID Group       Date_Time_1         Date_Time_2  Difference New_Column  New
0   123     A  14-10-2021 15:19 2021-10-14 15:32:00          13      First    1
1   123     A  14-10-2021 15:19 2021-10-14 15:36:00          17        NaN  NaN
2   123     A  14-10-2021 15:19 2021-10-14 15:37:00          18        NaN  NaN
3   123     A  14-10-2021 15:19 2021-10-14 16:29:00          70        NaN  NaN
4   123     A  14-10-2021 15:19 2021-10-14 17:04:00         105        NaN  NaN
5   123     B  14-10-2021 15:21 2021-10-14 15:32:00          11        NaN  NaN
6   123     B  14-10-2021 15:21 2021-10-14 15:36:00          15     Second    2
7   123     B  14-10-2021 15:21 2021-10-14 15:37:00          16        NaN  NaN
8   123     B  14-10-2021 15:21 2021-10-14 16:29:00          68        NaN  NaN
9   123     B  14-10-2021 15:21 2021-10-14 17:04:00         103        NaN  NaN
10  123     C  14-10-2021 15:22 2021-10-14 15:32:00          10        NaN  NaN
11  123     C  14-10-2021 15:22 2021-10-14 15:36:00          14        NaN  NaN
12  123     C  14-10-2021 15:22 2021-10-14 15:37:00          15      Third    3
13  123     C  14-10-2021 15:23 2021-10-14 16:29:00          67    Third_A  3_A
14  123     C  14-10-2021 15:48 2021-10-14 17:04:00         102    Third_B  3_B
15  789     A  14-10-2021 15:19 2021-10-14 15:32:00          13      First    1
16  789     A  14-10-2021 15:19 2021-10-14 15:36:00          17        NaN  NaN
17  789     B  14-10-2021 15:21 2021-10-14 15:32:00          11        NaN  NaN
18  789     B  14-10-2021 15:21 2021-10-14 15:36:00          15     Second    2
19  789     C  14-10-2021 15:22 2021-10-14 15:32:00          10        NaN  NaN

If the whole process has to be done for each ID group then you could try

...
for _, df_id in df.sort_values("Date_Time_2").groupby("ID"):
    for n, (_, gdf) in enumerate(df_id.groupby("Group")):
        nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
        df.loc[gdf[nths == n].index, "New"] = str(n + 1)
    for i, c in zip(gdf[nths > n].index, letters):
        df.at[i, "New"] = f"{n + 1}_{c}"

instead.

Answered By: Timus

It looks like you’re trying to do a "dense" ranking per group?

This could probably be simplified – but something like:

group = df.groupby(["ID", "Group"])

df1 = df.assign(
   group_id = group.ngroup(),
   rank = group["Date_Time_2"].rank(method="dense"),
)

# Get ranks per group
# A 1, A 2, A 3, B 4, B 5 -> A 1, A 2, A 3, B 1, B 2
df1 = df1.assign(group_id = df1.groupby("ID")["group_id"].rank(method="dense"))
df1 = df1.assign(last_group_id = df1.groupby("ID")["group_id"].transform("max"))

# Keep only 1st for 1st group - 2nd for 2nd group
# "OR" > Nth for last group
df1.loc[ 
   (df1["group_id"] == df1["rank"]) | 
   ((df1["group_id"] == df1["last_group_id"]) & (df1["rank"] > df1["group_id"])),
   "New_Column" 
] = df1["rank"]

     ID Group         Date_Time_1         Date_Time_2  Difference  group_id  rank  last_group_id  New_Column
0   123     A 2021-10-14 15:19:00 2021-10-14 15:32:00          13       1.0   1.0            3.0         1.0
1   123     A 2021-10-14 15:19:00 2021-10-14 15:36:00          17       1.0   2.0            3.0         NaN
2   123     A 2021-10-14 15:19:00 2021-10-14 15:37:00          18       1.0   3.0            3.0         NaN
3   123     A 2021-10-14 15:19:00 2021-10-14 16:29:00          70       1.0   4.0            3.0         NaN
4   123     A 2021-10-14 15:19:00 2021-10-14 17:04:00         105       1.0   5.0            3.0         NaN
5   123     B 2021-10-14 15:21:00 2021-10-14 15:32:00          11       2.0   1.0            3.0         NaN
6   123     B 2021-10-14 15:21:00 2021-10-14 15:36:00          15       2.0   2.0            3.0         2.0
7   123     B 2021-10-14 15:21:00 2021-10-14 15:37:00          16       2.0   3.0            3.0         NaN
8   123     B 2021-10-14 15:21:00 2021-10-14 16:29:00          68       2.0   4.0            3.0         NaN
9   123     B 2021-10-14 15:21:00 2021-10-14 17:04:00         103       2.0   5.0            3.0         NaN
10  123     C 2021-10-14 15:22:00 2021-10-14 15:32:00          10       3.0   1.0            3.0         NaN
11  123     C 2021-10-14 15:22:00 2021-10-14 15:36:00          14       3.0   2.0            3.0         NaN
12  123     C 2021-10-14 15:22:00 2021-10-14 15:37:00          15       3.0   3.0            3.0         3.0
13  123     C 2021-10-14 15:23:00 2021-10-14 16:29:00          67       3.0   4.0            3.0         4.0
14  123     C 2021-10-14 15:48:00 2021-10-14 17:04:00         102       3.0   5.0            3.0         5.0
15  789     A 2021-10-14 15:19:00 2021-10-14 15:32:00          13       1.0   1.0            3.0         1.0
16  789     A 2021-10-14 15:19:00 2021-10-14 15:36:00          17       1.0   2.0            3.0         NaN
17  789     B 2021-10-14 15:21:00 2021-10-14 15:32:00          11       2.0   1.0            3.0         NaN
18  789     B 2021-10-14 15:21:00 2021-10-14 15:36:00          15       2.0   2.0            3.0         2.0
19  789     C 2021-10-14 15:22:00 2021-10-14 15:32:00          10       3.0   1.0            3.0         NaN

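In the last group of each ID, the ranks above the group number correspond to the lettered tags: 4.0 = Third_A, 5.0 = Third_B, and so on.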

Is this what you’re trying to achieve?
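
If the word labels are needed rather than numbers, the dense-rank idea can be extended roughly as follows. This is only a sketch under assumptions: the function name label_rows and the ORDINALS list are made up here, the list has to contain at least as many entries as there are groups per ID, and the sample rows are a reduced version of the question's data with just the columns the logic needs.

```python
import pandas as pd
from string import ascii_uppercase

ORDINALS = ["First", "Second", "Third"]  # assumed labels; extend for more groups


def label_rows(df):
    """Tag the n-th smallest Date_Time_2 in the n-th group of every ID."""
    out = df.copy()
    out["Date_Time_2"] = pd.to_datetime(out["Date_Time_2"], format="%d-%m-%Y %H:%M")

    # 1-based position of each Group within its ID (A -> 1, B -> 2, ...).
    gid = out.groupby(["ID", "Group"]).ngroup()
    group_pos = gid.groupby(out["ID"]).rank(method="dense").astype(int)
    last_pos = group_pos.groupby(out["ID"]).transform("max")

    # Dense rank of Date_Time_2 inside each (ID, Group) pair.
    rank = out.groupby(["ID", "Group"])["Date_Time_2"].rank(method="dense").astype(int)

    labels = []
    for pos, last, r in zip(group_pos, last_pos, rank):
        if r == pos:
            # n-th smallest time in the n-th group
            labels.append(ORDINALS[pos - 1])
        elif pos == last and r > pos:
            # later times in the last group get a lettered suffix
            labels.append(f"{ORDINALS[pos - 1]}_{ascii_uppercase[r - pos - 1]}")
        else:
            labels.append(None)
    out["New_Column"] = labels
    return out


# Reduced sample: ID 123 has five times per group, ID 789 has fewer.
times = ["15:32", "15:36", "15:37", "16:29", "17:04"]
rows = [(123, g, f"14-10-2021 {t}") for g in "ABC" for t in times]
rows += [(789, "A", "14-10-2021 15:32"), (789, "A", "14-10-2021 15:36"),
         (789, "B", "14-10-2021 15:32"), (789, "B", "14-10-2021 15:36"),
         (789, "C", "14-10-2021 15:32")]
sample = pd.DataFrame(rows, columns=["ID", "Group", "Date_Time_2"])

tagged = label_rows(sample)
print(tagged.dropna(subset=["New_Column"]))
```

The explicit loop over the three aligned Series is slower than a fully vectorised version, but it keeps the two tagging rules easy to read and to adjust.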

Answered By: jqurious