How to delete the first duplicate of each date

Question:

I have this df:

             time
0 2022-07-28 13:35:00
1 2022-07-28 13:35:15
2 2022-07-28 13:35:15
3 2022-07-28 13:35:30
4 2022-07-28 13:35:30

I need to remove the first occurrence of each duplicated time value, so the final result will look like this:

             time
0 2022-07-28 13:35:00

2 2022-07-28 13:35:15

4 2022-07-28 13:35:30

Sample code:

import pandas as pd

tbl2 = {"time" :["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
            "2022-07-28 13:35:30", "2022-07-28 13:35:30"]}

df2 = pd.DataFrame(tbl2)

# Convert the strings to datetime64 values
df2["time"] = pd.to_datetime(df2["time"])
Asked By: Pren Ven


Answers:

You can use drop_duplicates with keep='last', which keeps only the last row of each group of duplicates.

import pandas as pd

df = pd.DataFrame({"time" :["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15", "2022-07-28 13:35:30", "2022-07-28 13:35:30"]})
df["time"] = pd.to_datetime(df["time"])
# Drop every duplicated row except the last one in each group
df.drop_duplicates(keep='last', inplace=True)
print(df)

Output:

                 time
0 2022-07-28 13:35:00
2 2022-07-28 13:35:15
4 2022-07-28 13:35:30
Answered By: René
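
A note for frames with more than one column: called without arguments, drop_duplicates compares entire rows, so two rows with the same time but different values in another column are not treated as duplicates. The subset parameter restricts the comparison to the time column. A minimal sketch, assuming a hypothetical extra column named value that is not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
        "2022-07-28 13:35:30", "2022-07-28 13:35:30",
    ]),
    "value": [1, 2, 3, 4, 5],  # hypothetical column, for illustration only
})

# Compare whole rows: no row is an exact duplicate, so nothing is dropped
whole_row = df.drop_duplicates(keep="last")

# Compare only the time column: the last row per time is kept
by_time = df.drop_duplicates(subset="time", keep="last")

print(by_time)  # rows with index 0, 2 and 4 remain
```

Returning a new frame instead of using inplace=True leaves the original df available for comparison.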

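The solution above gives the requested result as long as each time occurs at most twice (i.e. there is at most 1 duplicate). If a time can occur three or more times, keep='last' leaves only one row per time. To drop only the first occurrence of each duplicated time and keep all later ones, you can use the approach below: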

# Make sure that time data is sorted; a stable sort keeps rows with equal times in their original order
df2.sort_values(by="time", kind="mergesort", inplace=True)

# Add a rownumber per time group and a count of rows per group
df2["time_rownum"] = df2.groupby("time").cumcount() + 1
df2["time_count"] = df2.groupby("time")["time_rownum"].transform('max')

# Filter: keep times that occur only once, plus every row after the first within a duplicated time
mask = (df2["time_count"] == 1) | ((df2["time_rownum"] > 1) & (df2["time_count"] > 1))
df2 = df2.loc[mask, ["time"]]
Answered By: ags29
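
The same filter as in the approach above can probably be written without helper columns by combining two calls to Series.duplicated. This is only a sketch, not a tested replacement: the sample data adds a third 13:35:15 row (an assumption, not in the question) to show the difference from keep='last'.

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime([
    "2022-07-28 13:35:00",
    "2022-07-28 13:35:15", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
    "2022-07-28 13:35:30", "2022-07-28 13:35:30",
])})

# True for every row whose time appears more than once
has_duplicates = df["time"].duplicated(keep=False)

# True for the first row of each distinct time
is_first = ~df["time"].duplicated(keep="first")

# Drop only the first occurrence of each duplicated time
first_removed = df[~(has_duplicates & is_first)]

# For comparison: keep='last' leaves a single row per time
last_only = df.drop_duplicates(keep="last")

print(first_removed.index.tolist())  # [0, 2, 3, 5]
print(last_only.index.tolist())      # [0, 3, 5]
```

Because duplicated works on row order rather than on sorted values, this version does not need the data to be sorted first.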