Pandas : Calculate difference between max and min over time period

Question:

I need to calculate the difference between the max and min values over 1-second periods. The data frame looks like this; Epoch is in milliseconds.

Column A          Epoch
      10  1373981385937
      11  1373981386140
      13  1373981386312
       8  1373981386968
       7  1373981387187
       7  1373981387421

I have to create a new column, diff, that holds the difference between the min and max of 'Column A' within each 1-second interval. Note that these intervals are all relative to the min value of 'Epoch' (the first value, 1373981385937, in the example above). First I take the 1-second interval starting at 1373981385937 (i.e. up to that value plus 1 second), select the rows in that range, compute the max-min difference, and set diff to that value for the entire range, keeping the original index; then I move on to the next interval.

The desired result is:

Column A          Epoch  diff
      10  1373981385937     3
      11  1373981386140     3
      13  1373981386312     3
       8  1373981386968     1
       7  1373981387187     1
       7  1373981387421     1

Below I show how I currently do it:

import numpy as np
import pandas as pd

# `series` is the DataFrame shown above
current_index = 0
list_indexes = []
list_values = []
interval = 1000  # ms
while current_index < series.shape[0]:
    start = series["Epoch"].iloc[current_index]
    left = series.loc[(series["Epoch"] >= start) & (series["Epoch"] < start + interval)]
    value = left["Column A"].max() - left["Column A"].min()
    list_indexes.extend(left.index.values)
    list_values.extend(np.full(left.shape[0], value))
    current_index += left.shape[0]
result = pd.Series(data=list_values, index=list_indexes, name="diff", dtype=np.float64)

I get the expected result, but the performance is poor.

Is there a way I can do it faster/better?

Edit:

Thank you for the support, but I cannot seem to integrate the solution into my code, partly because I have to take into account two more columns:

Column A  Column B  Column C          Epoch  diff
      25        10        15  1373973055796     5
      25        10        10  1373973055828     5
      ..        ..        ..  .............    ..
      25        12        18  1373973092296     2
      25        12        16  1373973092328     2
      ..        ..        ..  .............    ..
      26        10        15  1373973055875     4
      26        10        11  1373973055906     4
      ..        ..        ..  .............    ..
      26        12        13  1373973092359     3
      26        12        10  1373973092406     3
      ..        ..        ..  .............    ..
      27        10        23  1373973055953     6
      27        10        17  1373973056000     6
      ..        ..        ..  .............    ..
      27        12        17  1373973092921     7
      27        12        10  1373973092953     7

The way I do it now is:

# `label` is the name of the new column
for a in df["Column A"].unique():
    for b in df["Column B"].unique():
        sub = df[(df["Column A"] == a) & (df["Column B"] == b)]
        gb = sub.groupby((sub["Epoch"] - sub["Epoch"].min()) // 1000)["Column C"]
        kwargs = {label: gb.transform("max") - gb.transform("min")}
        newdf = sub.assign(**kwargs)
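The nested loops above can be collapsed into a single groupby by grouping on Column A, Column B, and the 1-second Epoch bin all at once (a sketch using a small slice of the example data; the bin is taken relative to the global Epoch minimum):

```python
import pandas as pd

# Small slice of the example data from the edit above
df = pd.DataFrame({
    'Column A': [25, 25, 25, 25],
    'Column B': [10, 10, 12, 12],
    'Column C': [15, 10, 18, 16],
    'Epoch': [1373973055796, 1373973055828, 1373973092296, 1373973092328],
})

# 1-second bin relative to the global Epoch minimum
bins = ((df['Epoch'] - df['Epoch'].min()) // 1000).rename('bin')

# One groupby over all three keys replaces the two Python loops
gb = df.groupby(['Column A', 'Column B', bins])['Column C']
newdf = df.assign(diff=gb.transform('max') - gb.transform('min'))
```

This keeps the whole computation vectorized, so it scales the same way as the single-column solution below.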

Sorry for the long edit.
Do you think there is a better way?

Asked By: catalin_345323


Answers:

The code below processes 1 million rows in about 151 ms (on a generic Intel Xeon Platinum 8175M CPU).

Using your example:

gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
newdf = df.assign(diff=gb.transform(max) - gb.transform(min))

>>> newdf
   Column A          Epoch  diff
0        10  1373981385937     3
1        11  1373981386140     3
2        13  1373981386312     3
3         8  1373981386968     1
4         7  1373981387187     1
5         7  1373981387421     1
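An equivalent variant (a sketch) aggregates each group once and maps the result back through the bin numbers, instead of calling transform twice; with many rows per group this saves a pass over the data:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'Column A': [10, 11, 13, 8, 7, 7],
    'Epoch': [1373981385937, 1373981386140, 1373981386312,
              1373981386968, 1373981387187, 1373981387421],
})

# 1-second bin number for each row
bins = (df['Epoch'] - df['Epoch'].min()) // 1000

# Aggregate once per bin, then broadcast back via map
agg = df.groupby(bins)['Column A'].agg(['max', 'min'])
newdf = df.assign(diff=bins.map(agg['max'] - agg['min']))
```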

Quick inspection: none of the following is necessary for the solution above; it is just to convince ourselves that the result is correct. We assign t as the actual datetime, and delta_t as the difference in seconds from t.min():

t = pd.to_datetime(df['Epoch'], unit='ms')
tmp = df.assign(
    t=t,
    delta_t=(t - t.min()).dt.total_seconds(),
    groupno=gb.ngroup(),
)
>>> tmp
   Column A          Epoch                       t  delta_t  groupno
0        10  1373981385937 2013-07-16 13:29:45.937    0.000        0
1        11  1373981386140 2013-07-16 13:29:46.140    0.203        0
2        13  1373981386312 2013-07-16 13:29:46.312    0.375        0
3         8  1373981386968 2013-07-16 13:29:46.968    1.031        1
4         7  1373981387187 2013-07-16 13:29:47.187    1.250        1
5         7  1373981387421 2013-07-16 13:29:47.421    1.484        1
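The same bins can also be expressed with timestamps (a sketch): flooring the elapsed time since t.min() to whole seconds reproduces (Epoch - Epoch.min()) // 1000, which some may find more readable:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'Column A': [10, 11, 13, 8, 7, 7],
    'Epoch': [1373981385937, 1373981386140, 1373981386312,
              1373981386968, 1373981387187, 1373981387421],
})

t = pd.to_datetime(df['Epoch'], unit='ms')
# Elapsed time since the first sample, floored to whole seconds
bins = (t - t.min()).dt.floor('1s')
gb = df.groupby(bins)['Column A']
newdf = df.assign(diff=gb.transform('max') - gb.transform('min'))
```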

Speed

n = 1_000_000
t0 = 1373981385937
df = pd.DataFrame({
    'Column A': np.random.randint(0, 100, n),
    'Epoch': np.random.randint(t0, t0 + 300 * n, n),
})

def f(df):
    gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
    return df.assign(diff=gb.transform(max) - gb.transform(min))

%timeit f(df)
# 151 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By: Pierre D