How to assign values of a column based on two conditions for current and previous row values?

Question:

Here’s the data frame, original has a million rows so solution needs to be efficient:

enter image description here

Code:

import pandas as pd

df_temp = pd.DataFrame({'Download Button Clicked Time': {0: '2021-10-24 12:39:27.189629',
  1: '2021-10-24 12:42:06.346536',
  2: '2021-10-24 12:42:06.369056',
  3: '2021-10-24 12:42:11.551610',
  4: '2021-10-24 12:44:38.475047',
  5: '2021-10-24 12:46:33.331920',
  6: '2021-10-24 12:46:33.346536',
  7: '2021-10-24 12:46:33.369056',
  8: '2021-10-24 12:46:33.421520',
  9: '2021-10-24 12:46:33.404641'},
 'Install Verified Time': {0: '2021-10-24 12:41:04.669589',
  1: '2021-10-24 12:43:14.032023',
  2: '2021-10-24 12:43:14.033913',
  3: '2021-10-24 12:44:08.667666',
  4: '2021-10-24 12:46:11.161883',
  5: '2021-10-24 12:46:34.976129',
  6: '2021-10-24 12:46:35.032023',
  7: '2021-10-24 12:46:35.033913',
  8: '2021-10-24 12:46:35.065320',
  9: '2021-10-24 12:46:35.125156'},
 'App ID': {0: 'a',
  1: 'b',
  2: 'c',
  3: 'd',
  4: 'e',
  5: 'f',
  6: 'g',
  7: 'h',
  8: 'i',
  9: 'j'},
 'Email ID': {0: 'mandeep',
  1: 'lucky',
  2: 'mandeep',
  3: 'chettan',
  4: 'kalia',
  5: 'pinky',
  6: 'pinky',
  7: 'pinky',
  8: 'pinky',
  9: 'pinky'},
 'install_time': {0: 97.47996,
  1: 68.29827800000001,
  2: 120.708813,
  3: 117.116056,
  4: 92.686836,
  5: 1.644209,
  6: 1.6854870000000002,
  7: 1.664857,
  8: 1.6438000000000001,
  9: 1.720515},
 'fraud': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}})

df_temp

Output will have the first four columns exactly same but should only have the last FIVE ‘fraud’ rows as one but current output is this:

enter image description here

The code I’m using to detect fraud and get this output is this:

df_temp['Download Button Clicked Time'] = df_temp['Download Button Clicked Time'].astype('datetime64[ns]')
df_temp['Install Verified Time'] = df_temp['Install Verified Time'].astype('datetime64[ns]')

df_temp['install_time'] = df_temp['Install Verified Time'] - df_temp['Download Button Clicked Time']
df_temp['install_time'] = df_temp['install_time'].dt.total_seconds()

df_temp['diff'] = df_temp.install_time.diff().abs()

def fraud_time(row):
  fraud = 0
  if row['install_time'] < 0.5:
    fraud = 1
  elif row['diff'] < 0.1:
    fraud = 1
  return fraud

df_temp['fraud'] = df_temp.apply(fraud_time, axis=1)

df_temp

I’m using Install Verified Time, seems more sensible than Download Button Clicked Time. As you can clearly see, the third row should not be marked as one as second and third row emails were different. Also, last five, not four, rows should be also marked 1.

TL;DR

Detect fraud (maybe by) using pandas.DataFrame.diff only if last two email addresses were different.

Frauds will have a very small value of time difference (say 0.02 seconds), for the SAME email, not different ones. Two different users installing two different apps in under 2 milliseconds makes sense, same user doing so doesn’t makes sense.

Asked By: sonar

||

Answers:

The key is to group entries by email for successive installations:

# I assume you have done this already
df['Install Verified Time'] = pd.to_datetime(df['Install Verified Time'])

df['fraud'] = (df['install_time'] < 0.5) | (
    df.groupby('Email ID', as_index=False)['Install Verified Time'].diff()['Install Verified Time'] < pd.Timedelta(seconds=0.1)
)
Answered By: Marat
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.