Dropping all rows of a dataframe where discontinuous data occurs
Question:
Consider the following part of a Pandas dataframe:
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12338 1000 45201 0.024
12339 1000 45204 0.024
12340 1000 45205 0.024
12341 1000 45206 0.024
12342 1000 45207 0.024
12343 1000 45208 0.024
12344 1000 45209 0.045
12345 1000 45210 0.033
12346 1000 45211 0.025
12347 1000 45212 0.024
12348 1000 45213 0.024
12349 1000 45214 0.024
12350 1000 45215 0.024
12351 1000 45216 0.108
12352 1000 45217 1.109
12353 1000 45218 2.025
12354 1000 45219 2.918
12355 1000 45220 4.130
12356 1000 45221 0.601
12357 1000 45222 0.330
12358 1000 45223 0.400
12359 1000 45224 0.200
12360 1000 45225 0.093
12361 1000 45226 0.023
12362 1000 45227 0.023
12363 1000 45228 0.023
12364 1000 45229 0.024
12365 1000 45230 0.024
12366 1000 45231 0.118
12367 1000 45232 0.064
12368 1000 45233 0.023
12369 1000 45234 0.023
12370 1000 45235 0.023
12371 1000 45236 0.022
12372 1000 45237 0.022
12373 1000 45238 0.022
12374 1000 45239 0.106
12375 1000 45240 0.074
12376 1000 45241 0.105
12377 1000 45242 1.231
12378 1000 45243 0.500
12379 1000 45244 0.382
12380 1000 45245 0.405
12381 1000 45246 0.469
12382 1000 45247 0.173
12383 1000 45248 0.035
12384 1000 45301 0.026
12385 1000 45302 0.027
Column 1 holds a code that encodes when each measurement (the value in column 2) was taken. The first three digits of the code represent a day, and the last two digits represent a half-hour slot within that day. We start at day 450 (first two rows), and by the third row we are already in day 451. From index 12290 to 12337 you can see that we have 48 values, which represent the 48 half-hourly measurements of a single day. So the last two digits 01 mean a measurement between 00:00:00 and 00:29:59, 02 means a measurement between 00:30:00 and 00:59:59, 03 means a measurement between 01:00:00 and 01:29:59, and so on.
For example, a discontinuity happens in column 1 between index 12289 and index 12290, but it happened between 450 and 451 in the first three digits (a discontinuity between two days, since we moved from one day to the next; the last two digits 48 in 45048 represent the measurement between 23:30:00 and 23:59:59 of day 450), so those rows should not be dropped.
But now, if you look at index 12338 and index 12339, there is a discontinuity within the same day, 452: we are missing the measurements for slots 02 and 03 (we have a measurement at 45201 and then the next one at 45204). So ALL rows from day 452 should be dropped.
A discontinuity also happens between index 12383 and index 12384, but since it occurs between two different days (452 and 453), nothing should be dropped.
All the values in column 1 are int64.
Sorry if this is long and/or confusing, but any ideas on how I can solve this?
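The day/slot encoding described above can be pulled apart with integer division and modulo; a minimal sketch on a few toy codes (not the real data):

```python
import pandas as pd

# Toy codes in the format described above: first three digits = day,
# last two digits = half-hour slot (01..48).
codes = pd.Series([45047, 45048, 45101, 45102], name=1)

day = codes // 100   # e.g. 45047 -> 450
slot = codes % 100   # e.g. 45047 -> 47

print(day.tolist())   # [450, 450, 451, 451]
print(slot.tolist())  # [47, 48, 1, 2]
```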
Answers:
The first mask flags rows where the code is not continuous. The second mask then removes every row belonging to a bad day:
# Continuous steps: 1 -> 2 -> 3 -> ... within a day, or 48 -> next day's 01 (diff 53)
m1 = ~df[1].diff().fillna(1).isin([1, 53])
m2 = ~df[1].floordiv(100).isin(df.loc[m1, 1].floordiv(100).tolist())
out = df[m2]
Output:
>>> out
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12384 1000 45301 0.026
12385 1000 45302 0.027
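To see why the diff values 1 and 53 mark a continuous sequence, here is a self-contained sketch of the same two-mask idea on a small made-up frame (day 451 has a gap, the other days are clean):

```python
import pandas as pd

# Toy frame: day 450 ends normally at slot 48, day 451 has a gap
# (slot 01 jumps to slot 04), day 452 starts normally.
df = pd.DataFrame({1: [45047, 45048, 45101, 45104, 45148, 45201, 45202]})

# A continuous step is +1 (next half-hour) or +53 (slot 48 -> next day's
# slot 01, since (d+1)*100 + 1 - (d*100 + 48) == 53).
m1 = ~df[1].diff().fillna(1).isin([1, 53])

# Drop every row whose day contains at least one discontinuous step.
bad_days = df.loc[m1, 1].floordiv(100)
out = df[~df[1].floordiv(100).isin(bad_days.tolist())]

print(out[1].tolist())  # [45047, 45048, 45201, 45202]
```

Only day 451 is removed; the day-boundary jumps produce a diff of 53 and are left alone.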
Split the timestamp column into its day and half-hour parts, then group by day and keep only the days that have all 48 readings:
df["day"] = df[1] // 100
df["half_hour"] = df[1] % 100
df["num_readings_per_day"] = df.groupby("day")["half_hour"].transform('count')
df = df[df.num_readings_per_day == 48]
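A quick check of this groupby-count approach on toy data (not the asker's frame), where day 451 is complete and day 452 is missing slot 02:

```python
import pandas as pd

# Day 451 has all 48 slots; day 452 has slot 01 and then slots 03..48.
codes = [45100 + s for s in range(1, 49)] \
      + [45201] + [45200 + s for s in range(3, 49)]
df = pd.DataFrame({1: codes})

df["day"] = df[1] // 100
df["half_hour"] = df[1] % 100
df["n"] = df.groupby("day")["half_hour"].transform("count")
out = df[df["n"] == 48]

print(sorted(out["day"].unique()))  # [451]
```

Note this counts rows per day rather than checking the slot values themselves, so it assumes there are no duplicate slots; it cleanly drops days with missing readings like day 452 here.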