How can I assign a tag to the smallest in one group, second smallest in another group and third smallest in the third group to a pandas dataframe?
Question:
I have the below data frame,
ID | Group | Date_Time_1 | Date_Time_2 | Difference | New_Column |
---|---|---|---|---|---|
123 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:37 | 18 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 16:29 | 70 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 17:04 | 105 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:37 | 16 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 16:29 | 68 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 17:04 | 103 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:36 | 14 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:37 | 15 | Third |
123 | C | 14-10-2021 15:23 | 14-10-2021 16:29 | 67 | Third_A |
123 | C | 14-10-2021 15:48 | 14-10-2021 17:04 | 102 | Third_B |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | null |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | null |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
789 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | null |
I am trying to create a new column that assigns "First" to the smallest "Date_Time_2" in group "A" and "Second" to the second smallest "Date_Time_2" in group "B".
Similarly, it should assign "Third" to the third smallest "Date_Time_2" in group "C".
Once the loop reaches the last "Group" of an "ID", I want it to assign "Third_A", "Third_B" and so on. That is, for the last group of an "ID" it should assign "Third" or 3 (as there are only three unique groups in the dataset) to the third lowest "Date_Time_2", which has not been used in the previous groups, and if it finds another "Date_Time_2" for a new "Date_Time_1" it should assign "Third_A", "Third_B" and so on.
I have tried the code below, but it is not working:
`df.drop('New_Column', axis = 1, inplace = True)
df['New_Column'] = pd.Series()
for i, v in df['Difference'].items():
    a = 0
    b = 1
    diff = df[df['Group'] == df['Group'].unique()[a]]['Difference'].nsmallest(b).min()
    if diff == v:
        df.loc[i, 'New_Column'] = "Yes"
    b = b + 1
    a = a + 1`
Any help here would be great!
Answers:
First, make sure the CSV values are read correctly, that is, the date-time columns are parsed as datetimes, e.g.
date_parse = lambda x: pd.to_datetime(x, format="%d-%m-%Y %H:%M")
df = pd.read_csv('filename.csv', parse_dates=['Date_Time_1', 'Date_Time_2'], date_parser=date_parse)
(In pandas 2.0 and later, date_parser is deprecated in favour of date_format, so passing date_format="%d-%m-%Y %H:%M" instead should have the same effect.)
If you already have a dataframe, you can parse the date-time columns in place with the following code:
df['Date_Time_1'] = pd.to_datetime(df['Date_Time_1'], format="%d-%m-%Y %H:%M")
df['Date_Time_2'] = pd.to_datetime(df['Date_Time_2'], format="%d-%m-%Y %H:%M")
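To see why the parsing step matters, here is a small sketch. The two sample values are made up for illustration and are not from the question; they only show that day-first strings compare alphabetically until they are parsed.

```python
import pandas as pd

# Hypothetical two-value sample (not from the question's data).
raw = pd.Series(["14-10-2021 15:32", "02-11-2021 09:00"])

# As plain strings the comparison is alphabetical, so the November
# value counts as the "smallest" although it is the later date.
smallest_as_text = raw.min()

# Parsed with an explicit day-first format, the comparison is chronological.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")
smallest_as_time = parsed.min()

print(smallest_as_text)
print(smallest_as_time)
```

Any "smallest" or "second smallest" logic on Date_Time_2 should therefore run on the parsed column.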
Now iterate over the groups: for each one, take the sorted unique Date_Time_2 values and pick the entry at the group's position, e.g. index 0 for group 'A', index 1 for group 'B', and so on.
Then select the matching rows and update the value in the new column:
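df['New_Column'] = 'NA'
for index, group in enumerate(df['Group'].unique()):
    # Sorted distinct times of this group; unique() alone keeps the order of appearance
    times = df.loc[df['Group'] == group, 'Date_Time_2'].drop_duplicates().sort_values()
    unique_time = times.iloc[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = index
print(df)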
Note: assigning a number is a lot easier than assigning a word like 'first' or 'second'. If you want words, create a list and assign the value at the matching index, like below:
df['New_Column'] = 'NA'
number_as_string = ['first', 'second', 'third']
for index, group in enumerate(df['Group'].unique()):
    # Sorted distinct times of this group; unique() alone keeps the order of appearance
    times = df.loc[df['Group'] == group, 'Date_Time_2'].drop_duplicates().sort_values()
    unique_time = times.iloc[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = number_as_string[index]
print(df)
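The loop above looks at each group across the whole frame. If the tagging should restart for every ID, as the expected output in the question suggests, one possible variation is sketched below. The shortened sample frame and the bounds check are assumptions added for illustration, and the lettered tags for the last group are not handled here.

```python
import pandas as pd

# Made-up, shortened sample in the question's layout (only the needed columns).
df = pd.DataFrame({
    "ID": [123, 123, 123, 123, 789, 789, 789],
    "Group": ["A", "A", "B", "B", "A", "B", "B"],
    "Date_Time_2": ["14-10-2021 15:36", "14-10-2021 15:32",
                    "14-10-2021 15:36", "14-10-2021 15:32",
                    "14-10-2021 15:32", "14-10-2021 15:32",
                    "14-10-2021 15:36"],
})
df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"], format="%d-%m-%Y %H:%M")

labels = ["first", "second", "third"]
df["New_Column"] = None

for _, id_rows in df.groupby("ID"):
    # Groups in order of appearance within this ID: position 0, 1, 2, ...
    for position, group in enumerate(id_rows["Group"].unique()):
        in_group = id_rows[id_rows["Group"] == group]
        times = in_group["Date_Time_2"].drop_duplicates().sort_values()
        # Skip groups with too few distinct times or no label left.
        if position >= len(times) or position >= len(labels):
            continue
        target = times.iloc[position]
        hits = in_group.index[in_group["Date_Time_2"] == target]
        df.loc[hits, "New_Column"] = labels[position]

print(df)
```

The bounds check avoids an IndexError for a group such as 'C' of ID 789 in the question, which has only one Date_Time_2 value.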
You could try the following:
from string import ascii_uppercase as letters
# The sample dates are day-first, so the format is given explicitly
df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"], format="%d-%m-%Y %H:%M")
for n, (_, gdf) in enumerate(df.sort_values("Date_Time_2").groupby("Group")):
    nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
    df.loc[gdf[nths == n].index, "New"] = str(n + 1)
# After the loop, n, gdf and nths still refer to the last group
for i, c in zip(gdf[nths > n].index, letters):
    df.at[i, "New"] = f"{n + 1}_{c}"
- First make sure column Date_Time_2 contains datetimes.
- Then group df by Group after sorting along Date_Time_2.
- Then in each group identify the indices belonging to the n-th Date_Time_2 sub-group (starting from 0) and set n + 1 on the respective New column rows.
- Then take the last group and add the lettered values to the New column.
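The sub-group step in the list above relies on ngroup(), which numbers the distinct Date_Time_2 values of one group from 0 in sorted order. A small sketch on a made-up group (the index labels 10 to 13 are arbitrary):

```python
import pandas as pd

# One hypothetical group with a repeated timestamp; index labels are arbitrary.
gdf = pd.DataFrame(
    {"Date_Time_2": pd.to_datetime([
        "2021-10-14 15:32", "2021-10-14 15:36",
        "2021-10-14 15:36", "2021-10-14 15:37",
    ])},
    index=[10, 11, 12, 13],
)

# Each row gets the number of its Date_Time_2 sub-group (0 = earliest time).
nths = gdf.groupby("Date_Time_2").ngroup()

# Index labels of the rows in sub-group 1, i.e. the second smallest time.
second = gdf[nths == 1].index

print(nths.tolist())
print(list(second))
```

Because rows with the same timestamp share a number, `gdf[nths == n].index` can select more than one row.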
Maybe you have to replace the last part with
for k, c in zip(range(n + 1, nths.max() + 1), letters):
    df.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"
if the lettered values should be grouped too (that is, if rows sharing the same Date_Time_2 should get the same letter).
Result for the sample in the question:
ID Group Date_Time_1 Date_Time_2 Difference New_Column New
0 123 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
1 123 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
2 123 A 14-10-2021 15:19 2021-10-14 15:37:00 18 NaN NaN
3 123 A 14-10-2021 15:19 2021-10-14 16:29:00 70 NaN NaN
4 123 A 14-10-2021 15:19 2021-10-14 17:04:00 105 NaN NaN
5 123 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
6 123 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
7 123 B 14-10-2021 15:21 2021-10-14 15:37:00 16 NaN NaN
8 123 B 14-10-2021 15:21 2021-10-14 16:29:00 68 NaN NaN
9 123 B 14-10-2021 15:21 2021-10-14 17:04:00 103 NaN NaN
10 123 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
11 123 C 14-10-2021 15:22 2021-10-14 15:36:00 14 NaN NaN
12 123 C 14-10-2021 15:22 2021-10-14 15:37:00 15 Third 3
13 123 C 14-10-2021 15:23 2021-10-14 16:29:00 67 Third_A 3_A
14 123 C 14-10-2021 15:48 2021-10-14 17:04:00 102 Third_B 3_B
15 789 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
16 789 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
17 789 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
18 789 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
19 789 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
If the whole process has to be done for each ID group, then you could try
...
for _, df_id in df.sort_values("Date_Time_2").groupby("ID"):
    for n, (_, gdf) in enumerate(df_id.groupby("Group")):
        nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
        df.loc[gdf[nths == n].index, "New"] = str(n + 1)
    for i, c in zip(gdf[nths > n].index, letters):
        df.at[i, "New"] = f"{n + 1}_{c}"
instead.
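If the worded tags from the question ("First", "Third_A", ...) are wanted instead of "1", "3_A", the per-ID loop could be wrapped roughly as follows. This is only a sketch: the ORDINALS list, the tag_groups helper and the shortened sample are assumptions for illustration, and it expects no more groups per ID than ORDINALS has entries.

```python
from string import ascii_uppercase

import pandas as pd

# Hypothetical lookup from group position to the wording used in the question.
ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth"]


def tag_groups(df):
    """Return a Series of tags aligned with df.index (None where untagged)."""
    tags = pd.Series(None, index=df.index, dtype=object)
    # A stable sort keeps the original row order for equal timestamps.
    ordered = df.sort_values("Date_Time_2", kind="stable")
    for _, df_id in ordered.groupby("ID"):
        for n, (_, gdf) in enumerate(df_id.groupby("Group")):
            nths = gdf.groupby("Date_Time_2").ngroup()
            tags.loc[nths[nths == n].index] = ORDINALS[n]
        # n and nths now describe the last group of this ID:
        # later timestamps get lettered tags in time order.
        for i, c in zip(nths[nths > n].index, ascii_uppercase):
            tags.at[i] = f"{ORDINALS[n]}_{c}"
    return tags


# Shortened, made-up version of the question's sample.
times = ["15:32", "15:36", "15:32", "15:36", "15:32", "15:36", "15:37",
         "16:29", "17:04", "15:32", "15:32", "15:36", "15:32"]
df = pd.DataFrame({
    "ID": [123] * 9 + [789] * 4,
    "Group": list("AABBCCCCC") + list("ABBC"),
    "Date_Time_2": pd.to_datetime(
        ["14-10-2021 " + t for t in times], format="%d-%m-%Y %H:%M"
    ),
})
df["New"] = tag_groups(df)
print(df)
```

Returning a separate Series keeps the helper free of side effects; the caller decides which column receives the tags.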
It looks like you’re trying to do a "dense" ranking per group?
This could probably be simplified – but something like:
group = df.groupby(["ID", "Group"])
df1 = df.assign(
    group_id = group.ngroup(),
    rank = group["Date_Time_2"].rank(method="dense"),
)
# Get ranks per group
# A 1, A 2, A 3, B 4, B 5 -> A 1, A 2, A 3, B 1, B 2
df1 = df1.assign(group_id = df1.groupby("ID")["group_id"].rank(method="dense"))
df1 = df1.assign(last_group_id = df1.groupby("ID")["group_id"].transform("max"))
# Keep only 1st for 1st group - 2nd for 2nd group
# "OR" > Nth for last group
df1.loc[
    (df1["group_id"] == df1["rank"]) |
    ((df1["group_id"] == df1["last_group_id"]) & (df1["rank"] > df1["group_id"])),
    "New_Column"
] = df1["rank"]
ID Group Date_Time_1 Date_Time_2 Difference group_id rank last_group_id New_Column
0 123 A 2021-10-14 15:19:00 2021-10-14 15:32:00 13 1.0 1.0 3.0 1.0
1 123 A 2021-10-14 15:19:00 2021-10-14 15:36:00 17 1.0 2.0 3.0 NaN
2 123 A 2021-10-14 15:19:00 2021-10-14 15:37:00 18 1.0 3.0 3.0 NaN
3 123 A 2021-10-14 15:19:00 2021-10-14 16:29:00 70 1.0 4.0 3.0 NaN
4 123 A 2021-10-14 15:19:00 2021-10-14 17:04:00 105 1.0 5.0 3.0 NaN
5 123 B 2021-10-14 15:21:00 2021-10-14 15:32:00 11 2.0 1.0 3.0 NaN
6 123 B 2021-10-14 15:21:00 2021-10-14 15:36:00 15 2.0 2.0 3.0 2.0
7 123 B 2021-10-14 15:21:00 2021-10-14 15:37:00 16 2.0 3.0 3.0 NaN
8 123 B 2021-10-14 15:21:00 2021-10-14 16:29:00 68 2.0 4.0 3.0 NaN
9 123 B 2021-10-14 15:21:00 2021-10-14 17:04:00 103 2.0 5.0 3.0 NaN
10 123 C 2021-10-14 15:22:00 2021-10-14 15:32:00 10 3.0 1.0 3.0 NaN
11 123 C 2021-10-14 15:22:00 2021-10-14 15:36:00 14 3.0 2.0 3.0 NaN
12 123 C 2021-10-14 15:22:00 2021-10-14 15:37:00 15 3.0 3.0 3.0 3.0
13 123 C 2021-10-14 15:23:00 2021-10-14 16:29:00 67 3.0 4.0 3.0 4.0
14 123 C 2021-10-14 15:48:00 2021-10-14 17:04:00 102 3.0 5.0 3.0 5.0
15 789 A 2021-10-14 15:19:00 2021-10-14 15:32:00 13 1.0 1.0 3.0 1.0
16 789 A 2021-10-14 15:19:00 2021-10-14 15:36:00 17 1.0 2.0 3.0 NaN
17 789 B 2021-10-14 15:21:00 2021-10-14 15:32:00 11 2.0 1.0 3.0 NaN
18 789 B 2021-10-14 15:21:00 2021-10-14 15:36:00 15 2.0 2.0 3.0 2.0
19 789 C 2021-10-14 15:22:00 2021-10-14 15:32:00 10 3.0 1.0 3.0 NaN
4.0 = Third_A and 5.0 = Third_B, …
Is this what you’re trying to achieve?
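If that is the intended result, the numeric ranks can be turned into the worded tags without a loop. The following is a sketch under assumptions: the ORDINALS mapping, the intermediate variable names and the shortened sample are made up for illustration, and it expects fewer than 26 lettered rows per ID.

```python
from string import ascii_uppercase

import pandas as pd

# Hypothetical mapping from group position to the question's wording.
ORDINALS = {1: "First", 2: "Second", 3: "Third"}

# Shortened, made-up sample in the question's layout.
times = ["15:32", "15:36", "15:32", "15:36",
         "15:32", "15:36", "15:37", "16:29", "15:32", "15:32"]
df = pd.DataFrame({
    "ID": [123] * 8 + [789] * 2,
    "Group": list("AABBCCCC") + list("AB"),
    "Date_Time_2": pd.to_datetime(
        ["14-10-2021 " + t for t in times], format="%d-%m-%Y %H:%M"
    ),
})

by_group = df.groupby(["ID", "Group"])
# Dense rank of each time within its (ID, Group): 1 = smallest.
rank = by_group["Date_Time_2"].rank(method="dense").astype(int)
# Position of each group within its ID: 1 for the first group, 2 for the next, ...
group_pos = by_group.ngroup().groupby(df["ID"]).rank(method="dense").astype(int)
last_pos = group_pos.groupby(df["ID"]).transform("max")

exact = group_pos == rank                             # n-th time in n-th group
extra = (group_pos == last_pos) & (rank > group_pos)  # later times, last group

base = group_pos.map(ORDINALS)
# 0 -> "A", 1 -> "B", ... (only meaningful on the rows selected by `extra`)
suffix = (rank - group_pos - 1).clip(lower=0).map(lambda k: ascii_uppercase[k])

df["New_Column"] = None
df.loc[exact, "New_Column"] = base[exact]
df.loc[extra, "New_Column"] = base[extra] + "_" + suffix[extra]
print(df)
```

Note that rows in the last group that share a timestamp get the same letter here, because the letter is derived from the dense rank.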