How to improve my performance in filling gaps in time series and data lists with Python

Question:

I have time series data sets consisting of 10 Hz data over several years. One year of data has around 3.1*10^8 rows (each row has a time stamp and 8 float values). The data has gaps which I need to identify and fill with NaN. My Python code below does this, but its performance is far too poor for a problem of this size: I cannot get through my data set in anything close to a reasonable time.

Below is a minimal working example.
Suppose I have series (the time series data) and data as lists of the same length:

series      = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a      = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b      = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

I would like series to advance in intervals of 1, so the gaps in series are at 4.1, 5.1, 6.1, 11.1, 12.1, 13.1, 17.1, 18.1, 19.1. The data_a and data_b lists should be filled with float('nan') at those positions,
so data_b, for example, should become:

[1.2, 1.2, 1.2, nan, nan, nan, 2.2, 2.2, 2.2, 2.2, nan, nan, nan, 3.2, 3.2, 3.2, nan, nan, nan, 4.2]

I achieved this using:

d_max = 1.0    # Normal increment in series where no gaps shall be filled
shift = 0

for i in range(len(series)-1):
    diff = series[i+1] - series[i]
    if diff > d_max:
        num_fills = int(round(diff/d_max))-1    # Number of fills within one gap
        for it in range(num_fills):
            data_a.insert(i+1+it+shift, float('nan'))
            data_b.insert(i+1+it+shift, float('nan'))
        shift = shift + num_fills          # Shift the index by the number of inserts from the previous gap filling

I searched for other solutions to this problem but only came across the use of the find() function, which yields the indices of the gaps. Is find() faster than my solution? And if so, how would I then insert NaNs into data_a and data_b in a more efficient way?

Asked By: Betrieb

||

Answers:

First, realize that your innermost loop is not necessary:

for it in range(num_fills):
    data_a.insert(i+1+it+shift, float('nan'))

is the same as

data_a[i+1+shift:i+1+shift] = [float('nan')] * int(num_fills)

That might make it slightly faster because the list is resized once per gap and the items after the gap are moved once, instead of once per inserted value.
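As a quick check of that equivalence, here is a minimal sketch. The helper names fill_with_inserts and fill_with_slice are made up for this illustration and are not part of the original code.

```python
nan = float('nan')

def fill_with_inserts(values, position, count):
    """Insert `count` NaNs one at a time, starting at `position`."""
    out = list(values)
    for offset in range(count):
        out.insert(position + offset, nan)
    return out

def fill_with_slice(values, position, count):
    """Insert `count` NaNs in a single slice assignment at `position`."""
    out = list(values)
    out[position:position] = [nan] * count
    return out

data = [1.2, 1.2, 1.2, 2.2]
by_inserts = fill_with_inserts(data, 3, 3)
by_slice = fill_with_slice(data, 3, 3)
```

Both calls should produce a list of seven items with the three NaNs sitting between the third 1.2 and the 2.2.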

Then, for large numerical problems, always use NumPy. It may take some effort to learn, but the performance is likely to go up by orders of magnitude. Start with something like:

import numpy as np

series = np.array([1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1])
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

d_max = 1.0    # Normal increment in series where no gaps shall be filled
shift = 0

# the following two statements use NumPy's vectorized operations
# to implicitly run the loops at the C level
diff = series[1:] - series[:-1]
num_fills = (np.round(diff / d_max) - 1).astype(int)   # integer counts, needed to size a list
for i in np.where(diff > d_max)[0]:
    nf = int(num_fills[i])
    nans = [np.nan] * nf
    data_a[i+1+shift:i+1+shift] = nans
    data_b[i+1+shift:i+1+shift] = nans
    shift += nf                        # account for the NaNs inserted so far
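
The loop above still inserts into Python lists. A different technique that avoids inserts altogether is to compute each sample's slot on the regular grid and scatter the data into a NaN-filled array. The sketch below is not the code from this answer. It assumes the time stamps are sorted, contain no duplicates, and lie on a grid with spacing step (up to rounding). The function name fill_gaps and its parameters are hypothetical.

```python
import numpy as np

def fill_gaps(series, columns, step=1.0):
    """Place each column on a regular time grid, with NaN where samples are missing.

    Assumes `series` is sorted, has no duplicates, and lies on a grid
    with spacing `step` (up to rounding error).
    """
    series = np.asarray(series, dtype=float)
    # Grid slot of every sample, counted from the first time stamp
    slots = np.round((series - series[0]) / step).astype(np.int64)
    n_slots = int(slots[-1]) + 1
    # Regenerated, gap-free time axis
    filled_series = series[0] + step * np.arange(n_slots)
    filled_columns = []
    for col in columns:
        out = np.full(n_slots, np.nan)   # start with NaN everywhere
        out[slots] = col                 # scatter the known samples into place
        filled_columns.append(out)
    return filled_series, filled_columns

series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

new_series, (new_a, new_b) = fill_gaps(series, [data_a, data_b], step=1.0)
```

Because the output arrays are floating point, integer columns such as data_a come back as floats. The work per column is one allocation and one indexed assignment, so nothing is shifted around as it is with list inserts.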
Answered By: Fred Foo

IIRC, inserts into Python lists are expensive, and the cost grows with the size of the list, because every item after the insertion point has to be moved.

I’d recommend not loading your huge data sets into memory, but iterating through them with a generator function, something like:

try:
    from itertools import izip   # Python 2
except ImportError:
    izip = zip                   # Python 3: the built-in zip is already lazy

series      = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a      = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b      = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

def fillGaps(series,data_a,data_b,d_max=1.0):
  prev = None   # previous time stamp; None before the first sample
  for s, a, b in izip(series,data_a,data_b):
    if prev is not None:
      diff = s - prev
      if diff > d_max:
        # emit one NaN pair per missing time step
        for x in range(int(round(diff/d_max))-1):
          yield (float('nan'),float('nan'))
    prev = s
    yield (a,b)

newA = []
newB = []
for a,b in fillGaps(series,data_a,data_b):
  newA.append(a)
  newB.append(b)

E.g. feed the izip with iterators that read the data from disk, and write each yielded pair straight to the output instead of appending it to a list.
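
A minimal sketch of that streaming idea, written for Python 3, could look like the following. It assumes the data is stored as CSV with the time stamp in the first column. The names read_rows and fill_gaps_stream are hypothetical, and io.StringIO objects stand in for the real input and output files. Unlike fillGaps above, this variant also yields a reconstructed time stamp for each filled row.

```python
import csv
import io

def read_rows(handle):
    """Yield each CSV record as a list of floats, one row at a time."""
    for record in csv.reader(handle):
        yield [float(field) for field in record]

def fill_gaps_stream(rows, d_max=1.0):
    """Yield rows unchanged, inserting NaN rows where time steps are missing.

    Assumes the first item of each row is the time stamp.
    """
    prev = None
    for row in rows:
        t = row[0]
        if prev is not None:
            missing = int(round((t - prev) / d_max)) - 1
            for k in range(1, missing + 1):
                # reconstructed time stamp followed by NaN for every value
                yield [prev + k * d_max] + [float('nan')] * (len(row) - 1)
        prev = t
        yield row

# In-memory stand-ins for the input and output files
source = io.StringIO("1.1,1,1.2\n2.1,1,1.2\n5.1,1,2.2\n")
target = io.StringIO()

writer = csv.writer(target)
for row in fill_gaps_stream(read_rows(source)):
    writer.writerow(row)

lines = target.getvalue().splitlines()
```

Only one row is held in memory at a time, so the same loop should work on a file with hundreds of millions of rows if source and target are replaced by real file handles.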

Answered By: MattH