Pandas : Calculate difference between max and min over time period
Question:
I need to calculate the difference between the max and min values over a 1-second period. The data frame looks like this (Epoch is in milliseconds):
Column A | Epoch |
---|---|
10 | 1373981385937 |
11 | 1373981386140 |
13 | 1373981386312 |
8 | 1373981386968 |
7 | 1373981387187 |
7 | 1373981387421 |
I have to create a new column diff that is the difference between the min and max of 'Column A' in each 1-second interval. Note that these intervals are all relative to the min value of 'Epoch' (the first value, 1373981385937, in the example above). First I take the 1-second interval starting at 1373981385937, get the values in that range, calculate the max-min difference, and set diff to that value for the entire range, keeping the original index.
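The bucketing described above can be sketched on the sample epochs: subtracting the first epoch and integer-dividing by 1000 assigns each row its 1-second interval number.

```python
import numpy as np

epochs = np.array([1373981385937, 1373981386140, 1373981386312,
                   1373981386968, 1373981387187, 1373981387421])
# interval index of each row, relative to the first epoch (milliseconds // 1000)
buckets = (epochs - epochs.min()) // 1000
print(buckets.tolist())  # [0, 0, 0, 1, 1, 1]
```

The first three rows fall in interval 0 and the last three in interval 1, matching the two diff values (3 and 1) in the desired result.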
The desired result is:
Column A | Epoch | diff |
---|---|---|
10 | 1373981385937 | 3 |
11 | 1373981386140 | 3 |
13 | 1373981386312 | 3 |
8 | 1373981386968 | 1 |
7 | 1373981387187 | 1 |
7 | 1373981387421 | 1 |
Below I show how I currently do it:
import numpy as np
import pandas as pds

current_index = 0
list_indexes = []
list_values = []
interval = 1000  # ms
while current_index < series.shape[0]:
    # all rows within 1 second of the current row's Epoch
    left = series.loc[(series["Epoch"] >= series["Epoch"].iloc[current_index]) & (series["Epoch"] < series["Epoch"].iloc[current_index] + interval)]
    value = left["Column A"].max() - left["Column A"].min()
    list_indexes.extend(list(left.index.values))
    list_values.extend(np.full(left.shape[0], value))
    current_index += left.shape[0]
result = pds.Series(data=list_values, index=list_indexes, name=label, dtype=np.float64)
I get the expected result, but the performance is poor.
Is there a way I can do it faster/better?
Edit:
Thank you for the support, but I cannot seem to integrate the solution into my code, partly because I have to take into account two more columns:
Column A | Column B | Column C | Epoch | diff |
---|---|---|---|---|
25 | 10 | 15 | 1373973055796 | 5 |
25 | 10 | 10 | 1373973055828 | 5 |
.. | .. | .. | …………. | . |
25 | 12 | 18 | 1373973092296 | 2 |
25 | 12 | 16 | 1373973092328 | 2 |
.. | .. | .. | …………. | . |
26 | 10 | 15 | 1373973055875 | 4 |
26 | 10 | 11 | 1373973055906 | 4 |
.. | .. | .. | …………. | . |
26 | 12 | 13 | 1373973092359 | 3 |
26 | 12 | 10 | 1373973092406 | 3 |
.. | .. | .. | …………. | . |
27 | 10 | 23 | 1373973055953 | 6 |
27 | 10 | 17 | 1373973056000 | 6 |
.. | .. | .. | …………. | . |
27 | 12 | 17 | 1373973092921 | 7 |
27 | 12 | 10 | 1373973092953 | 7 |
The way I do it now is:
for each unique value in Column A:
    for each unique value in Column B:
        gb = df.groupby((df["Epoch"] - df["Epoch"].min()) // 1000)["Column C"]
        kwargs = {label: gb.transform(max) - gb.transform(min)}
        newdf = df.assign(**kwargs)
Sorry for the long edit.
Do you think there is a better way?
Answers:
The code below processes 1 million rows in about 151 ms (on a generic Intel Xeon Platinum 8175M CPU).
Using your example:
gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
newdf = df.assign(diff=gb.transform(max) - gb.transform(min))
>>> newdf
Column A Epoch diff
0 10 1373981385937 3
1 11 1373981386140 3
2 13 1373981386312 3
3 8 1373981386968 1
4 7 1373981387187 1
5 7 1373981387421 1
Quick inspection: none of the below is necessary for the solution above; it is just to convince ourselves that the result is correct. We assign t, the actual datetime, and delta_t, the difference in seconds from t.min():
t = pd.to_datetime(df['Epoch'], unit='ms')
tmp = df.assign(
    t=t,
    delta_t=(t - t.min()).dt.total_seconds(),
    groupno=gb.ngroup(),
)
>>> tmp
Column A Epoch t delta_t groupno
0 10 1373981385937 2013-07-16 13:29:45.937 0.000 0
1 11 1373981386140 2013-07-16 13:29:46.140 0.203 0
2 13 1373981386312 2013-07-16 13:29:46.312 0.375 0
3 8 1373981386968 2013-07-16 13:29:46.968 1.031 1
4 7 1373981387187 2013-07-16 13:29:47.187 1.250 1
5 7 1373981387421 2013-07-16 13:29:47.421 1.484 1
Speed
n = 1_000_000
t0 = 1373981385937
df = pd.DataFrame({
    'Column A': np.random.randint(0, 100, n),
    'Epoch': np.random.randint(t0, t0 + 300 * n, n),
})
def f(df):
    gb = df.groupby((df['Epoch'] - df['Epoch'].min()) // 1000)['Column A']
    return df.assign(diff=gb.transform(max) - gb.transform(min))
%timeit f(df)
# 151 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
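Regarding the edit with Column B and Column C: the same idea should extend to a single groupby over Column A, Column B, and the interval key together, which avoids the nested Python loops entirely. A minimal sketch on a fragment of your example data, assuming the 1-second intervals are relative to the global Epoch minimum (if they should instead be relative to each (A, B) pair's own minimum, derive the key from a per-group minimum via transform('min')):

```python
import pandas as pd

df = pd.DataFrame({
    'Column A': [25, 25, 25, 25, 26, 26],
    'Column B': [10, 10, 12, 12, 10, 10],
    'Column C': [15, 10, 18, 16, 15, 11],
    'Epoch':    [1373973055796, 1373973055828, 1373973092296,
                 1373973092328, 1373973055875, 1373973055906],
})

# 1-second interval index relative to the global Epoch minimum
interval = ((df['Epoch'] - df['Epoch'].min()) // 1000).rename('interval')

# one groupby over both key columns plus the interval, in a single pass
gb = df.groupby(['Column A', 'Column B', interval])['Column C']
newdf = df.assign(diff=gb.transform('max') - gb.transform('min'))
print(newdf['diff'].tolist())  # [5, 5, 2, 2, 4, 4]
```

This reproduces the diff values from the edited example (5, 2, and 4 for the three (A, B) groups shown) without any explicit iteration over the unique values of Column A and Column B.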