Is there a way to use the previously calculated row value with the sum of a different column in a Pandas DataFrame?

Question:

I have the following dataframe:

                     A          B
2021-05-19 07:00:00  Nan        Nan
2021-05-19 07:30:00  0.00       Nan
2021-05-19 08:00:00  0.00       Nan
2021-05-19 08:30:00  0.00       Nan
2021-05-19 09:00:00  19.91      Nan
2021-05-19 09:30:00  0.11       Nan
2021-05-19 10:00:00  0.00       Nan
2021-05-19 10:30:00  22.99      Nan
2021-05-19 11:00:00  0.00       Nan

Required output:

                     A          B
2021-05-19 07:00:00  Nan        0.00
2021-05-19 07:30:00  0.00       0.00
2021-05-19 08:00:00  0.00       0.00
2021-05-19 08:30:00  0.00       0.00
2021-05-19 09:00:00  19.91      3.32
2021-05-19 09:30:00  0.11       2.78
2021-05-19 10:00:00  0.00       2.32
2021-05-19 10:30:00  22.99      5.76
2021-05-19 11:00:00  0.00       4.80

Calculation for column B:

B1 = A1
B2 = ((B1*5)+A2)/6
B3 = ((B2*5)+A3)/6
B4 = ((B3*5)+A4)/6
etc.

I have already attempted a calculation in Python with the shift function, but this doesn't work. It would be great if someone could point me in the right direction.

Asked By: Tenzin

||

Answers:

If you want to express the non-recursive function

B[i] = (A[i-1] * 5 + A[i])/6

then you are on the right track with shift:

B = ((A.shift(1)*5)+A)/6
B.iat[0] = A.iat[0]

However, if you want to express the recursive function

B[i] = (B[i-1] * 5 + A[i])/6

then a single shift-based expression cannot compute it, because each B[i] depends on a B[i-1] that is produced in the same pass. The simplest option is to calculate it with a plain Python loop, as another answer points out.
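
For this particular recurrence, pandas' exponentially weighted mean appears to fit: with adjust=False it is documented as y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t], which for alpha = 1/6 is the recursive formula above. A minimal sketch, assuming the sample data from the question and assuming that NaN in A should count as 0 (as the required output suggests):

```python
import numpy as np
import pandas as pd

# Sample data from the question; the leading NaN is kept on purpose.
idx = pd.date_range("2021-05-19 07:00", periods=9, freq="30min")
df = pd.DataFrame(
    {"A": [np.nan, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0]}, index=idx
)

# adjust=False gives y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
# With alpha = 1/6 this is B[i] = (B[i-1] * 5 + A[i]) / 6.
# fillna(0) is an assumption: it treats the missing first value of A as 0.
df["B"] = df["A"].fillna(0).ewm(alpha=1 / 6, adjust=False).mean()
```

This only works because the recurrence is a fixed-weight blend of the previous result and the new value; a recurrence of a different shape would still need a loop.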

Answered By: Micah Smith

Because each value of B depends on the previous value of B, you can loop through the DataFrame and fill in column B row by row:

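for i, date in enumerate(df.index):
    if i == 0:
        # Seed value for the first row
        df.at[date, "B"] = 0
    else:
        # B[i] = (B[i-1] * 5 + A[i]) / 6, using the B written one step earlier
        df.at[date, "B"] = (df["B"].iat[i - 1] * 5 + df.at[date, "A"]) / 6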
df
>>
                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
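
The same running calculation can also be written over plain Python values, which avoids a DataFrame lookup per row. A minimal sketch using itertools.accumulate from the standard library, assuming the NaN in A has already been replaced by 0 and that the first B is seeded with the first A:

```python
from itertools import accumulate

# Values of A from the question, with the leading NaN already replaced by 0
# (an assumption that matches the expected output).
a = [0.0, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0]

# accumulate yields the first element unchanged (so B[0] = A[0]) and then
# calls the function with the previous result and the next element of `a`.
b = list(accumulate(a, lambda prev, x: (prev * 5 + x) / 6))
```

With a DataFrame, the list could be built from df["A"].fillna(0) and assigned back with df["B"] = b.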
Answered By: not_speshal

With some math, we can turn this recursive formula into a geometric-series-like one:

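import numpy as np

# s is the expanding window: all values of A up to and including the current row
df["B"] = (df.A
             .fillna(0)
             .expanding()
             .apply(lambda s: (1/6)*(s * ((5/6) ** np.arange(len(s))[::-1])).sum() + (5/6)**s.size*s.iloc[0]))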

It amounts to

N := window.size

B_N = (5/6)^(N-1) A_1 + (1/6) sum_{j=2}^{N} (5/6)^(N-j) A_j

where the window is expanding and corresponds to s in the code. In the code, however, A_1 is summed together with the other terms, so only 1/6 of it is counted there; we add back the remaining 5/6 of it, hence the (5/6)^N (rather than N-1) in front of s.iloc[0]. The output is equivalent. We also turn NaNs in A into 0 to prevent them from propagating.
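
As a sanity check on the derivation, the closed form can be compared against the plain recurrence on the question's data. This is a minimal sketch in pure Python; the helper names recursive and closed_form are made up for illustration, and NaN in A is assumed to have been replaced by 0:

```python
# Values of A from the question, NaN replaced by 0 as in the code above.
a = [0.0, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0]
r = 5 / 6

def recursive(a):
    """B_1 = A_1, B_N = (5 * B_{N-1} + A_N) / 6."""
    b = [a[0]]
    for x in a[1:]:
        b.append((b[-1] * 5 + x) / 6)
    return b

def closed_form(a, n):
    """B_N = r^(N-1) * A_1 + (1/6) * sum_{j=2..N} r^(N-j) * A_j (N is 1-based)."""
    tail = sum(r ** (n - j) * a[j - 1] for j in range(2, n + 1))
    return r ** (n - 1) * a[0] + tail / 6

b_rec = recursive(a)
b_closed = [closed_form(a, n) for n in range(1, len(a) + 1)]
```

Both lists agree up to floating-point rounding. Note that the closed form recomputes the whole sum for every row, so its cost grows quadratically with the number of rows.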

to get

                         A         B
2021-05-19 07:00:00    NaN  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
Answered By: Mustafa Aydın

We can define a function fast_sum to perform the required calculation and then use just-in-time compilation (via numba) to compile it to machine code, so that it runs at C-like speed:

import numba
import numpy as np

@numba.jit(nopython=True)
def fast_sum(a):
    b = np.zeros_like(a)
    b[0] = a[0]
    for i in range(1, len(a)):
        b[i] = (b[i - 1] * 5 + a[i]) / 6 
    return b

df['B'] = fast_sum(df['A'].fillna(0).to_numpy())

                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942

Performance test on a sample DataFrame with 90,000 rows:

df = pd.concat([df] * 10000, ignore_index=True)

%%timeit
df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
# 1.62 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answered By: Shubham Sharma

It is really odd that shift doesn't work here. As far as I can tell, you are not recalculating the previous value; it is already there, so shift should be able to get it. Have I missed something?
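
One way to see why shift alone is not enough: the right-hand side of the assignment is evaluated in full before anything is written to B, so shift only sees the values B held before the statement ran (all NaN in the question). A minimal sketch with a shortened, hypothetical version of the question's data:

```python
import numpy as np
import pandas as pd

# Shortened sample: A has values, B starts out entirely NaN as in the question.
df = pd.DataFrame({"A": [0.0, 19.91, 0.11, 0.0], "B": np.nan})

# The whole right-hand side is computed first, from the old (all-NaN) B.
# No row can see a "previous" B produced by this same statement.
df["B"] = (df["B"].shift(1) * 5 + df["A"]) / 6
```

Every entry of B is still NaN afterwards, because NaN * 5 + A is NaN in every row. The previous value of B does not exist yet at the moment shift reads it.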

Answered By: Andrew Martin