Pandas dataframe duplicates – Python

Question:

I have a simplified dataframe like this:

 ID  RecordingType Date                 Value
 1   FEVR          2019-05-22 18:37:10  1.36
 1   FEVR          2019-05-22 18:41:12  1.35
 1   FEVR          2019-05-22 18:45:16  1.35

I am trying to run this code:

df = df.sort_values(by=['ID', 'RecordingType', 'Date'], ascending=True).reset_index().drop(columns=["index"])
df["ToRemove"] = False
dict_temp_df = df.to_dict("records")

outputcolnames = {'FEVR':'Value'}

for i in range(df.shape[0]-1):

    curr_recording_type = dict_temp_df[i]["RecordingType"]
    next_recording_type = dict_temp_df[i+1]["RecordingType"]

    # Check if current and next row have the same ID and RecordingType and the difference in time between the current and next rows is less than 60 minutes

    if dict_temp_df[i]["ID"] == dict_temp_df[i+1]["ID"] and curr_recording_type == next_recording_type and abs((dict_temp_df[i]["Date"] - dict_temp_df[i+1]["Date"]).total_seconds() / 60) < 60:

        # For similar rows the first row is marked for deletion and the second row's value is updated depending on the Recording Type

        df.at[i,"ToRemove"] = True

        if curr_recording_type == 'FEVR':

            df.at[i+1,outputcolnames[next_recording_type]] = max(dict_temp_df[i][outputcolnames[curr_recording_type]], dict_temp_df[i+1][outputcolnames[next_recording_type]])

        else:

            df.at[i+1,outputcolnames[next_recording_type]] += dict_temp_df[i][outputcolnames[curr_recording_type]]

# Remove the rows marked for deletion
df = df[df["ToRemove"] == False].reset_index().drop(columns=["index"])

The desired output should keep only the last row of each run of consecutive duplicates, updated with the max value of those rows, like this:

 ID  RecordingType Date                 Value
 1   FEVR          2019-05-22 18:45:16  1.36

My code keeps giving me this result and I don’t know how to fix it:

ID  RecordingType Date                 Value
1   FEVR          2019-05-22 18:45:16  1.35

Can you please help me? 🙁

P.S. I would prefer to keep the for loop, as there are multiple other if/else conditions for other recording types (I tried the groupby method, but it messes up the downstream code).

Asked By: mariant

||

Answers:

For this specific recording type, if your goal is to get the maximum value and associate it with the latest date, you can do this:

df.groupby("RecordingType")[["Date", "Value"]].max()

You can optionally add a .reset_index(drop=False) if you want to get back the RecordingType as a column.
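
As a minimal sketch of the one-liner above on the sample data from the question: grouping by ID as well as RecordingType is my assumption (the loop in the question compares both), and note that each max is taken independently per column and that this ignores the 60-minute window.

```python
import pandas as pd

# Sample rows from the question, with Date parsed as real timestamps.
df = pd.DataFrame({
    "ID": [1, 1, 1],
    "RecordingType": ["FEVR", "FEVR", "FEVR"],
    "Date": pd.to_datetime([
        "2019-05-22 18:37:10",
        "2019-05-22 18:41:12",
        "2019-05-22 18:45:16",
    ]),
    "Value": [1.36, 1.35, 1.35],
})

# Latest Date and largest Value per (ID, RecordingType).
# The two maxima are computed separately, so they may come from different rows.
result = df.groupby(["ID", "RecordingType"])[["Date", "Value"]].max().reset_index()
print(result)
```

Here the single output row combines the last timestamp (18:45:16) with the largest value (1.36), which is the desired output in the question.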

For the other rules you need to implement, check whether they can be expressed in the same "groupby-aggregate" pattern, and remember that you can also pass a custom function to the apply method when grouping.
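
One possible way to fold the 60-minute rule and the per-type if/else into that pattern is sketched below. This is only an illustration under assumptions: the fourth row and the `Cluster`, `ValueMax` and `ValueSum` column names are hypothetical, and I use named aggregation rather than apply to pick max for FEVR and sum for any other type.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "RecordingType": ["FEVR", "FEVR", "FEVR", "FEVR"],
    "Date": pd.to_datetime([
        "2019-05-22 18:37:10",
        "2019-05-22 18:41:12",
        "2019-05-22 18:45:16",
        "2019-05-22 21:00:00",  # hypothetical extra row, more than 60 minutes later
    ]),
    "Value": [1.36, 1.35, 1.35, 1.20],
})

keys = ["ID", "RecordingType"]
df = df.sort_values(keys + ["Date"]).reset_index(drop=True)

# Gap to the previous row within the same ID/RecordingType (NaT for the first row).
gap = df.groupby(keys)["Date"].diff()
# A new cluster starts at the first row of a group or after a gap of 60 minutes or more.
df["Cluster"] = (gap.isna() | (gap >= pd.Timedelta(minutes=60))).cumsum()

# Aggregate each cluster: keep the last timestamp, compute both candidate values.
agg = df.groupby(keys + ["Cluster"], as_index=False).agg(
    Date=("Date", "last"),
    ValueMax=("Value", "max"),
    ValueSum=("Value", "sum"),
)
# FEVR keeps the max; every other recording type keeps the sum.
agg["Value"] = agg["ValueMax"].where(agg["RecordingType"] == "FEVR", agg["ValueSum"])
result = agg[["ID", "RecordingType", "Date", "Value"]]
print(result)
```

The first three rows collapse into one row dated 18:45:16 with value 1.36, while the later row stays on its own. Further recording types would need their own branch in the final selection step.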

Answered By: mattiatantardini

updated

I have updated the code to use your simplified data (thanks to PyCharm, which suggested using datetime in this code 😀):

import pandas as pd
from datetime import datetime

data = {
    'ID': [1, 1, 1],
    'RecordingType': ['FEVR', 'FEVR', 'FEVR'],
    'Date': ['2019-05-22 18:37:10', '2019-05-22 18:41:12', '2019-05-22 18:45:16'],
    'Value': [1.36, 1.35, 1.35]
}

df = pd.DataFrame(data)

df = df.sort_values(by=['ID', 'RecordingType', 'Date'], ascending=True).reset_index().drop(columns=["index"])
df["ToRemove"] = False

outputcolnames = {'FEVR': 'Value'}

for i in range(df.shape[0] - 1):
    curr_recording_type = df.loc[i, "RecordingType"]
    next_recording_type = df.loc[i + 1, "RecordingType"]

    # Check if current and next row have the same ID and RecordingType and the difference in time between the current
    # and next rows is less than 60 minutes
    if df.loc[i, "ID"] == df.loc[i + 1, "ID"] and curr_recording_type == next_recording_type and abs((datetime.strptime(
            df.loc[i, "Date"], "%Y-%m-%d %H:%M:%S") - datetime.strptime(df.loc[i + 1, "Date"],
                                                                        "%Y-%m-%d %H:%M:%S")).total_seconds() / 60) < 60:

        # For similar rows the first row is marked for deletion and the second row's value is updated depending on
        # the Recording Type
        df.at[i, "ToRemove"] = True

        if curr_recording_type == 'FEVR':
            df.iloc[i + 1, df.columns.get_loc(outputcolnames[next_recording_type])] = max(
                df.loc[i, outputcolnames[curr_recording_type]], df.loc[i + 1, outputcolnames[next_recording_type]])
        else:
            df.iloc[i + 1, df.columns.get_loc(outputcolnames[next_recording_type])] += df.loc[
                i, outputcolnames[curr_recording_type]]

# Remove the rows marked for deletion
df = df[df["ToRemove"] == False].reset_index().drop(columns=["index", "ToRemove"])

print(df)
#   ID RecordingType                 Date  Value
#0   1          FEVR  2019-05-22 18:45:16   1.36
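
As far as I can tell, the reason the code in the question returned 1.35 is that it read values from dict_temp_df, a snapshot taken before the loop, while writing the updates into df. The small sketch below (the names snapshot, stale and fresh are mine, only for illustration) reproduces that effect on the three sample values:

```python
import pandas as pd

df = pd.DataFrame({"Value": [1.36, 1.35, 1.35]})
# Plain Python dicts copied out of df at this moment; later writes to df do not reach them.
snapshot = df.to_dict("records")

# Rows 0 and 1: write the max into row 1 of the DataFrame.
df.at[1, "Value"] = max(snapshot[0]["Value"], snapshot[1]["Value"])

print(df.at[1, "Value"])     # the DataFrame holds the updated value
print(snapshot[1]["Value"])  # the snapshot still holds the original value

# Rows 1 and 2, reading from the snapshot: the carried-over maximum is lost.
stale = max(snapshot[1]["Value"], snapshot[2]["Value"])
# Reading from the DataFrame instead carries it forward.
fresh = max(df.at[1, "Value"], df.at[2, "Value"])
print(stale, fresh)
```

Reading from df inside the loop, as the code above does with df.loc, keeps each comparison in sync with the previous update.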

Hope this works for you.

Answered By: gerpaick