Dropping all rows of a dataframe where discontinuous data occurs
Question:
Consider the following part of a Pandas dataframe:
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12338 1000 45201 0.024
12339 1000 45204 0.024
12340 1000 45205 0.024
12341 1000 45206 0.024
12342 1000 45207 0.024
12343 1000 45208 0.024
12344 1000 45209 0.045
12345 1000 45210 0.033
12346 1000 45211 0.025
12347 1000 45212 0.024
12348 1000 45213 0.024
12349 1000 45214 0.024
12350 1000 45215 0.024
12351 1000 45216 0.108
12352 1000 45217 1.109
12353 1000 45218 2.025
12354 1000 45219 2.918
12355 1000 45220 4.130
12356 1000 45221 0.601
12357 1000 45222 0.330
12358 1000 45223 0.400
12359 1000 45224 0.200
12360 1000 45225 0.093
12361 1000 45226 0.023
12362 1000 45227 0.023
12363 1000 45228 0.023
12364 1000 45229 0.024
12365 1000 45230 0.024
12366 1000 45231 0.118
12367 1000 45232 0.064
12368 1000 45233 0.023
12369 1000 45234 0.023
12370 1000 45235 0.023
12371 1000 45236 0.022
12372 1000 45237 0.022
12373 1000 45238 0.022
12374 1000 45239 0.106
12375 1000 45240 0.074
12376 1000 45241 0.105
12377 1000 45242 1.231
12378 1000 45243 0.500
12379 1000 45244 0.382
12380 1000 45245 0.405
12381 1000 45246 0.469
12382 1000 45247 0.173
12383 1000 45248 0.035
12384 1000 45301 0.026
12385 1000 45302 0.027
Column 1 holds a code that encodes when each measurement (the value in column 2) was taken. The first three digits of the code represent a day, and the last two digits represent a half-hour slot within that day. We start at day 450 (first two rows), and by the third row we are already in day 451. From index 12290 to 12337 you can see that we have 48 values, which represent the 48 half-hourly measurements of a single day. So the last two digits 01 mean a measurement between 00:00:00 and 00:29:59, 02 means a measurement between 00:30:00 and 00:59:59, 03 means a measurement between 01:00:00 and 01:29:59, and so on.
For example, a discontinuity happens in column 1 between index 12289 and index 12290, but it happened between 450 and 451 in the first three digits (a discontinuity between two days, since we moved from one day to the next; the last two digits 48 in 45048 represent the measurement between 23:30:00 and 23:59:59 of day 450), so those rows should not be dropped.
But now, if you look at index 12338 and index 12339, there is a discontinuity within the same day, 452: we are missing the measurements for slots 02 and 03 (we have a measurement at 45201 and then the next one at 45204). So ALL rows from day 452 should be dropped.
A discontinuity also happens between index 12383 and index 12384, but since it occurs between two different days (452 and 453), nothing should be dropped.
All the values in column 1 are int64.
Sorry if this is long and/or confusing, but any ideas on how I can solve this?
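The day/slot encoding described above can be pulled apart with integer division and modulo; a minimal sketch on a few toy codes (not the real data):

```python
import pandas as pd

# Toy codes in the format described above: first three digits = day,
# last two digits = half-hour slot (01..48).
codes = pd.Series([45047, 45048, 45101, 45102], name=1)

day = codes // 100   # e.g. 45047 -> 450
slot = codes % 100   # e.g. 45047 -> 47

print(day.tolist())   # [450, 450, 451, 451]
print(slot.tolist())  # [47, 48, 1, 2]
```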
Answers:
The first mask flags rows where the code is not continuous. The second mask then removes every row belonging to a bad day:
# Continuous steps: 1 -> 2 -> 3 -> ... within a day, or 48 -> next day's 01 (diff 53)
m1 = ~df[1].diff().fillna(1).isin([1, 53])
m2 = ~df[1].floordiv(100).isin(df.loc[m1, 1].floordiv(100).tolist())
out = df[m2]
Output:
>>> out
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12384 1000 45301 0.026
12385 1000 45302 0.027
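To see why the diff values 1 and 53 mark a continuous sequence, here is a self-contained sketch of the same two-mask idea on a small made-up frame (day 451 has a gap, the other days are clean):

```python
import pandas as pd

# Toy frame: day 450 ends normally at slot 48, day 451 has a gap
# (slot 01 jumps to slot 04), day 452 starts normally.
df = pd.DataFrame({1: [45047, 45048, 45101, 45104, 45148, 45201, 45202]})

# A continuous step is +1 (next half-hour) or +53 (slot 48 -> next day's
# slot 01, since (d+1)*100 + 1 - (d*100 + 48) == 53).
m1 = ~df[1].diff().fillna(1).isin([1, 53])

# Drop every row whose day contains at least one discontinuous step.
bad_days = df.loc[m1, 1].floordiv(100)
out = df[~df[1].floordiv(100).isin(bad_days.tolist())]

print(out[1].tolist())  # [45047, 45048, 45201, 45202]
```

Only day 451 is removed; the day-boundary jumps produce a diff of 53 and are left alone.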
Split the timestamp column into its day and half-hour parts, then group by day and keep only the days that have all 48 readings:
df["day"] = df[1] // 100
df["half_hour"] = df[1] % 100
df["num_readings_per_day"] = df.groupby("day")["half_hour"].transform('count')
df = df[df.num_readings_per_day == 48]
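A quick check of this groupby-count approach on toy data (not the asker's frame), where day 451 is complete and day 452 is missing slot 02:

```python
import pandas as pd

# Day 451 has all 48 slots; day 452 has slot 01 and then slots 03..48.
codes = [45100 + s for s in range(1, 49)] \
      + [45201] + [45200 + s for s in range(3, 49)]
df = pd.DataFrame({1: codes})

df["day"] = df[1] // 100
df["half_hour"] = df[1] % 100
df["n"] = df.groupby("day")["half_hour"].transform("count")
out = df[df["n"] == 48]

print(sorted(out["day"].unique()))  # [451]
```

Note this counts rows per day rather than checking the slot values themselves, so it assumes there are no duplicate slots; it cleanly drops days with missing readings like day 452 here.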