Pandas dataframe duplicates – Python
Question:
I have a simplified dataframe like this:
ID  RecordingType  Date                 Value
1   FEVR           2019-05-22 18:37:10  1.36
1   FEVR           2019-05-22 18:41:12  1.35
1   FEVR           2019-05-22 18:45:16  1.35
I am trying to run this code:
df = df.sort_values(by=['ID', 'RecordingType', 'Date'], ascending=True).reset_index().drop(columns=["index"])
df["ToRemove"] = False
dict_temp_df = df.to_dict("records")
outputcolnames = {'FEVR':'Value'}
for i in range(df.shape[0]-1):
    curr_recording_type = df[i]["RecordingType"]
    next_recording_type = df[i+1]["RecordingType"]
    # Check if current and next row have the same ID and RecordingType and the difference in time between the current and next rows is less than 60 minutes
    if dict_temp_df[i]["ID"] == dict_temp_df[i+1]["ID"] and curr_recording_type == next_recording_type and abs((dict_temp_df[i]["Date"] - dict_temp_df[i+1]["Date"]).total_seconds() / 60) < 60:
        # For similar rows the first row is marked for deletion and the second row's value is updated depending on the Recording Type
        df.at[i,"ToRemove"] = True
        if curr_recording_type == 'FEVR':
            df.at[i+1,outputcolnames[next_recording_type]] = max(dict_temp_df[i][outputcolnames[curr_recording_type]], dict_temp_df[i+1][outputcolnames[next_recording_type]])
        else:
            df.at[i+1,outputcolnames[next_recording_type]] += dict_temp_df[i][outputcolnames[curr_recording_type]]
# Remove the columns to be deleted
df = df[df["ToRemove"] == False].reset_index().drop(columns=["index"])
In the desired output, only the last row of each run of consecutive duplicates is kept, and its value is set to the maximum over the run, like this:
ID  RecordingType  Date                 Value
1   FEVR           2019-05-22 18:45:16  1.36
My code keeps giving me this result and I don’t know how to fix it:
ID  RecordingType  Date                 Value
1   FEVR           2019-05-22 18:45:16  1.35
Can you please help me? 🙁
P.S. I would prefer to keep the for loop, as there are multiple other if/else conditions for other recording types (I tried the groupby method, but it breaks the downstream code).
Answers:
For this specific recording type, if your goal is to take the maximum value and pair it with the latest date, you can do this:
df.groupby("RecordingType")[["Date", "Value"]].max()
You can optionally append .reset_index(drop=False) if you want RecordingType back as a column.
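As a minimal sketch of the one-liner above on the sample data from the question (the Date column is converted with pd.to_datetime here, which is an assumption about how the real data is typed):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1],
    "RecordingType": ["FEVR", "FEVR", "FEVR"],
    "Date": pd.to_datetime([
        "2019-05-22 18:37:10",
        "2019-05-22 18:41:12",
        "2019-05-22 18:45:16",
    ]),
    "Value": [1.36, 1.35, 1.35],
})

# Latest Date and largest Value per recording type, each computed independently.
result = df.groupby("RecordingType")[["Date", "Value"]].max().reset_index()
print(result)
```

Note that this groups on RecordingType only, so it ignores both the ID and the 60-minute window from the question; it is a starting point rather than a drop-in replacement for the loop.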
For the other rules you need to implement, check whether they can be expressed in the same groupby-aggregate pattern, and remember that you can also use the apply method with a custom function when grouping.
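One possible way to express the loop's rules with groupby and apply is sketched below. It assumes Date is a datetime column and that every recording type stores its value in the Value column. The "OTHER" recording type, the extra rows, and the names cluster and collapse are hypothetical, added only to show the sum branch and the 60-minute split.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1],
    # "OTHER" is a made-up type standing in for the non-FEVR rules.
    "RecordingType": ["FEVR", "FEVR", "FEVR", "FEVR", "OTHER", "OTHER"],
    "Date": pd.to_datetime([
        "2019-05-22 18:37:10", "2019-05-22 18:41:12", "2019-05-22 18:45:16",
        "2019-05-22 21:00:00", "2019-05-22 18:00:00", "2019-05-22 18:30:00",
    ]),
    "Value": [1.36, 1.35, 1.35, 1.20, 2.0, 3.0],
})
df = df.sort_values(["ID", "RecordingType", "Date"]).reset_index(drop=True)

# Time since the previous row with the same ID and RecordingType (NaT for the first row).
gap = df.groupby(["ID", "RecordingType"])["Date"].diff()
# A new run starts at the first row of each pair or after a gap of 60 minutes or more.
cluster = (gap.isna() | (gap >= pd.Timedelta(minutes=60))).cumsum().rename("cluster")

def collapse(group):
    """Keep the last row of a run; combine Value according to the recording type."""
    last = group.iloc[-1].copy()
    if last["RecordingType"] == "FEVR":
        last["Value"] = group["Value"].max()
    else:
        last["Value"] = group["Value"].sum()
    return last

result = df.groupby(cluster).apply(collapse).reset_index(drop=True)
print(result)
```

Grouping on a separate cluster Series rather than on columns of df keeps all columns available inside collapse, so the per-type branching can stay in one place, much like the if/else in the loop.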
Update:
I have updated the code to run on your simplified data (thanks to PyCharm, which suggested using datetime here 😀):
import pandas as pd
from datetime import datetime

data = {
    'ID': [1, 1, 1],
    'RecordingType': ['FEVR', 'FEVR', 'FEVR'],
    'Date': ['2019-05-22 18:37:10', '2019-05-22 18:41:12', '2019-05-22 18:45:16'],
    'Value': [1.36, 1.35, 1.35]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['ID', 'RecordingType', 'Date'], ascending=True).reset_index().drop(columns=["index"])
df["ToRemove"] = False
outputcolnames = {'FEVR': 'Value'}
for i in range(df.shape[0] - 1):
    curr_recording_type = df.loc[i, "RecordingType"]
    next_recording_type = df.loc[i + 1, "RecordingType"]
    # Check if current and next row have the same ID and RecordingType and the difference in time between the current
    # and next rows is less than 60 minutes
    if df.loc[i, "ID"] == df.loc[i + 1, "ID"] and curr_recording_type == next_recording_type and abs((datetime.strptime(
            df.loc[i, "Date"], "%Y-%m-%d %H:%M:%S") - datetime.strptime(df.loc[i + 1, "Date"],
                                                                        "%Y-%m-%d %H:%M:%S")).total_seconds() / 60) < 60:
        # For similar rows the first row is marked for deletion and the second row's value is updated depending on
        # the Recording Type
        df.at[i, "ToRemove"] = True
        if curr_recording_type == 'FEVR':
            df.iloc[i + 1, df.columns.get_loc(outputcolnames[next_recording_type])] = max(
                df.loc[i, outputcolnames[curr_recording_type]], df.loc[i + 1, outputcolnames[next_recording_type]])
        else:
            df.iloc[i + 1, df.columns.get_loc(outputcolnames[next_recording_type])] += df.loc[
                i, outputcolnames[curr_recording_type]]
# Remove the rows marked for deletion, then drop the helper columns
df = df[df["ToRemove"] == False].reset_index().drop(columns=["index", "ToRemove"])
print(df)
#    ID RecordingType                 Date  Value
# 0   1          FEVR  2019-05-22 18:45:16   1.36
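The main difference from the code in the question is that every read now goes through df rather than dict_temp_df. That list of records is a copy taken before the loop starts, so it never sees the values written with df.at. A stripped-down sketch of the effect, using only the Value column from the sample data (the names snapshot, stale_result and live_result are illustrative):

```python
import pandas as pd

# Version 1: compare against a snapshot taken before the loop.
df = pd.DataFrame({"Value": [1.36, 1.35, 1.35]})
snapshot = df.to_dict("records")  # plain Python copy; later writes to df do not reach it
for i in range(len(df) - 1):
    df.at[i + 1, "Value"] = max(snapshot[i]["Value"], snapshot[i + 1]["Value"])
stale_result = df["Value"].tolist()

# Version 2: compare against the live DataFrame, so the running max carries forward.
df = pd.DataFrame({"Value": [1.36, 1.35, 1.35]})
for i in range(len(df) - 1):
    df.at[i + 1, "Value"] = max(df.at[i, "Value"], df.at[i + 1, "Value"])
live_result = df["Value"].tolist()

print(stale_result)  # [1.36, 1.36, 1.35]
print(live_result)   # [1.36, 1.36, 1.36]
```

In the first version the last row is compared with the original 1.35 of the middle row, not with the 1.36 that was just written to it, which matches the 1.35 reported in the question.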
Hope this works for you.