Pandas : Calculate difference between max and min over time period
Question:
I need to calculate the difference between the max and min values over a 1-second period. The data frame looks like this (Epoch is in milliseconds):
Column A | Epoch |
---|---|
10 | 1373981385937 |
11 | 1373981386140 |
13 | 1373981386312 |
8 | 1373981386968 |
7 | 1373981387187 |
7 | 1373981387421 |
I have to create a new column diff that is the difference between the min and max of 'Column A' in each 1-second interval. Note that these intervals are all relative to the min value of 'Epoch' (the first value, 1373981385937, in the example above). First I take the 1-second interval starting at 1373981385937, get the values in that range, calculate the max-min difference, and set diff to that value for the entire range, keeping the original index.
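The bucketing described above can be sketched on the sample epochs: subtracting the first epoch and integer-dividing by 1000 assigns each row its 1-second interval number.

```python
import numpy as np

epochs = np.array([1373981385937, 1373981386140, 1373981386312,
                   1373981386968, 1373981387187, 1373981387421])
# interval index of each row, relative to the first epoch (milliseconds // 1000)
buckets = (epochs - epochs.min()) // 1000
print(buckets.tolist())  # [0, 0, 0, 1, 1, 1]
```

The first three rows fall in interval 0 and the last three in interval 1, matching the two diff values (3 and 1) in the desired result.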
The desired result is:
Column A | Epoch | diff |
---|---|---|
10 | 1373981385937 | 3 |
11 | 1373981386140 | 3 |
13 | 1373981386312 | 3 |
8 | 1373981386968 | 1 |
7 | 1373981387187 | 1 |
7 | 1373981387421 | 1 |
Below I show how I currently do it:
import numpy as np
import pandas as pds

current_index = 0
list_indexes = []
list_values = []
interval = 1000  # ms
while current_index < series.shape[0]:
    # all rows within 1 second of the current row's Epoch
    left = series.loc[(series["Epoch"] >= series["Epoch"].iloc[current_index]) & (series["Epoch"] < series["Epoch"].iloc[current_index] + interval)]
    value = left["Column A"].max() - left["Column A"].min()
    list_indexes.extend(list(left.index.values))
    list_values.extend(np.full(left.shape[0], value))
    current_index += left.shape[0]
result = pds.Series(data=list_values, index=list_indexes, name=label, dtype=np.float64)
I get the expected result, but the performance is poor.
Is there a way I can do it faster/better?
Edit:
Thank you for the support, but I cannot seem to integrate the solution into my code, partly because I have to take into account two more columns:
Column A | Column B | Column C | Epoch | diff |
---|---|---|---|---|
25 | 10 | 15 | 1373973055796 | 5 |
25 | 10 | 10 | 1373973055828 | 5 |
.. | .. | .. | …………. | . |
25 | 12 | 18 | 1373973092296 | 2 |
25 | 12 | 16 | 1373973092328 | 2 |
.. | .. | .. | …………. | . |
26 | 10 | 15 | 1373973055875 | 4 |
26 | 10 | 11 | 1373973055906 | 4 |
.. | .. | .. | …………. | . |
26 | 12 | 13 | 1373973092359 | 3 |
26 | 12 | 10 | 1373973092406 | 3 |
.. | .. | .. | …………. | . |
27 | 10 | 23 | 1373973055953 | 6 |
27 | 10 | 17 | 1373973056000 | 6 |
.. | .. | .. | …………. | . |
27 | 12 | 17 | 1373973092921 | 7 |
27 | 12 | 10 | 1373973092953 | 7 |
The way I do it now is:
for each unique value in Column A:
    for each unique value in Column B:
        gb = df.groupby((df["Epoch"] - df["Epoch"].min()) // 1000)["Column C"]
        kwargs = {label: gb.transform(max) - gb.transform(min)}
        newdf = df.assign(**kwargs)
Sorry for the long edit.
Do you think there is a better way?
Answers:
The code below processes 1 million rows in about 151 ms (on a generic Intel Xeon Platinum 8175M CPU).
Using your example:
gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
newdf = df.assign(diff=gb.transform(max) - gb.transform(min))
>>> newdf
Column A Epoch diff
0 10 1373981385937 3
1 11 1373981386140 3
2 13 1373981386312 3
3 8 1373981386968 1
4 7 1373981387187 1
5 7 1373981387421 1
Quick inspection: none of the below is necessary for the solution above; it is just to convince ourselves that the result is correct. We assign t, the actual datetime, and delta_t, the difference in seconds from t.min():
t = pd.to_datetime(df['Epoch'], unit='ms')
tmp = df.assign(
    t=t,
    delta_t=(t - t.min()).dt.total_seconds(),
    groupno=gb.ngroup(),
)
>>> tmp
Column A Epoch t delta_t groupno
0 10 1373981385937 2013-07-16 13:29:45.937 0.000 0
1 11 1373981386140 2013-07-16 13:29:46.140 0.203 0
2 13 1373981386312 2013-07-16 13:29:46.312 0.375 0
3 8 1373981386968 2013-07-16 13:29:46.968 1.031 1
4 7 1373981387187 2013-07-16 13:29:47.187 1.250 1
5 7 1373981387421 2013-07-16 13:29:47.421 1.484 1
Speed
n = 1_000_000
t0 = 1373981385937
df = pd.DataFrame({
    'Column A': np.random.randint(0, 100, n),
    'Epoch': np.random.randint(t0, t0 + 300 * n, n),
})
def f(df):
    gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
    return df.assign(diff=gb.transform(max) - gb.transform(min))
%timeit f(df)
# 151 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
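Regarding the edit with Column B and Column C: the same idea should extend to a single groupby over Column A, Column B, and the interval key together, which avoids the nested Python loops entirely. A minimal sketch on a fragment of your example data, assuming the 1-second intervals are relative to the global Epoch minimum (if they should instead be relative to each (A, B) pair's own minimum, derive the key from a per-group minimum via transform('min')):

```python
import pandas as pd

df = pd.DataFrame({
    'Column A': [25, 25, 25, 25, 26, 26],
    'Column B': [10, 10, 12, 12, 10, 10],
    'Column C': [15, 10, 18, 16, 15, 11],
    'Epoch':    [1373973055796, 1373973055828, 1373973092296,
                 1373973092328, 1373973055875, 1373973055906],
})

# 1-second interval index relative to the global Epoch minimum
interval = ((df['Epoch'] - df['Epoch'].min()) // 1000).rename('interval')

# one groupby over both key columns plus the interval, in a single pass
gb = df.groupby(['Column A', 'Column B', interval])['Column C']
newdf = df.assign(diff=gb.transform('max') - gb.transform('min'))
print(newdf['diff'].tolist())  # [5, 5, 2, 2, 4, 4]
```

This reproduces the diff values from the edited example (5, 2, and 4 for the three (A, B) groups shown) without any explicit iteration over the unique values of Column A and Column B.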