specific sumproduct list comprehension in pandas

Question:

suppose i have a dataframe

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})

  age  income
0    0       5
1    5      13
2   10      23
3   15      18
4   20      12

i want to iterate through df["income"] and calculate the sumproduct as follows (example for age 15): 18+23*(15-10)+13*(15-5)+5*(15-0) = 338.

more generic: income[3] + income[2] * ( age[3] – age[2] ) + income[1] * ( age[3] – age[1] ) + income[0] * (age[3] – age[0] )

I am struggling to formulate the age relative to the current iteration of age ( age[x] – age[y] ) in a generic way to use in a list comprehension or formula.

edit: the actual operation I want to apply is

income[3 ] + income[2]* interest ** ( age[3] - age[2] ) + income[1]*interest ** (age[3] - age[1] ...

exampe from above: 18+23*1.03 ** (15-10)+13*1.03 ** (15-5)+5*1.03 **(15-0) = 69,92

interest = 1.03

ANSWERED thanks to jezrael & mozway

Asked By: krawall

||

Answers:

Here is one way you could use a list comprehension to calculate the sumproduct for each value of age in your DataFrame:

sumproducts = [income[x] + income[x+1] * (age[x] - age[x+1]) + income[x+2] * (age[x] - age[x+2]) + income[x+3] * (age[x] - age[x+3]) for x in range(len(df["age"]))]

In this list comprehension, we are using the current value of x in the range (which corresponds to the current value of age) to determine the values of age and income to use in the calculation.

Alternatively, you could use the apply() method to apply a custom function to each row in the DataFrame, which would allow you to perform the calculation without using a list comprehension. The function would take a row of the DataFrame as input and return the sumproduct for that row. Here is an example:

def calculate_sumproduct(row):
  x = row["age"]
  income = row["income"]
  return income[x] + income[x+1] * (age[x] - age[x+1]) + income[x+2] * (age[x] - age[x+2]) + income[x+3] * (age[x] - age[x+3])

sumproducts = df.apply(calculate_sumproduct, axis=1)

You can then use the resulting series of sumproducts to add a new column to your DataFrame if you want.

df["sumproduct"] = sumproducts
Answered By: AschenAI

Numpy solution – you can use broadcasting for avoid loops for improve performance:

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})

interest = 1.03
age = df['age'].to_numpy()

Use power with subtracted values of mask:

arr = interest ** (age[:, None] - age ) 
print (arr)
[[1.         0.86260878 0.74409391 0.64186195 0.55367575]
 [1.15927407 1.         0.86260878 0.74409391 0.64186195]
 [1.34391638 1.15927407 1.         0.86260878 0.74409391]
 [1.55796742 1.34391638 1.15927407 1.         0.86260878]
 [1.80611123 1.55796742 1.34391638 1.15927407 1.        ]]

Then set 0 to upper triangle:

arr = np.where(np.triu(np.ones(arr.shape, dtype=bool)), 0, arr)
print (arr)
[[0.         0.         0.         0.         0.        ]
 [1.15927407 0.         0.         0.         0.        ]
 [1.34391638 1.15927407 0.         0.         0.        ]
 [1.55796742 1.34391638 1.15927407 0.         0.        ]
 [1.80611123 1.55796742 1.34391638 1.15927407 0.        ]]

Set 1 to diagonal:

np.fill_diagonal(arr, 1)
print (arr)
[[1.         0.         0.         0.         0.        ]
 [1.15927407 1.         0.         0.         0.        ]
 [1.34391638 1.15927407 1.         0.         0.        ]
 [1.55796742 1.34391638 1.15927407 1.         0.        ]
 [1.80611123 1.55796742 1.34391638 1.15927407 1.        ]]

Multiple by column income and sum per rows:

print (arr * df['income'].to_numpy())
[[ 5.          0.          0.          0.          0.        ]
 [ 5.79637037 13.          0.          0.          0.        ]
 [ 6.7195819  15.07056297 23.          0.          0.        ]
 [ 7.78983708 17.47091293 26.66330371 18.          0.        ]
 [ 9.03055617 20.25357642 30.91007672 20.86693334 12.        ]]


df['new'] = (arr * df['income'].to_numpy()).sum(axis=1)
print (df)   age  income        new
0    0       5   5.000000
1    5      13  18.796370
2   10      23  44.790145
3   15      18  69.924054
4   20      12  93.061143

Performance: For 5k rows, applyare loops under the hood, so slow (best avoid it)

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})
df = pd.concat([df] * 1000, ignore_index=True)


In [292]: %%timeit
     ...: age = df['age'].to_numpy()
     ...: 
     ...: arr = interest ** (age[:, None] - age ) 
     ...: arr = np.where(np.triu(np.ones(arr.shape, dtype=bool)), 0, arr)
     ...: np.fill_diagonal(arr, 1)
     ...: df['new'] = (arr * df['income'].to_numpy()).sum(axis=1)
     ...: 
1.39 s ± 69.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [293]: %%timeit
     ...: df['sumproduct'] = (df['age'].expanding().apply(lambda x: sum(df.loc[:x.index[-1], 'income'] * interest**(x.iloc[-1]-x))))
     ...: 
5.13 s ± 411 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered By: jezrael

You can rewrite 18+(18+23+13+5)*15-(23+13+5)*(10+5+0) to be 18+(18+23+13+5)*15-(23+13+5)*(10+5+0):

The general formula is thus:

sumproduct(n) = (income
                + (n-1)*sum(age[:n-1]*income[:n-1])
                - sum(age[:n-1]*income[:n-1])
               )

As code:

df['sumproduct'] = (df['income']
 .add(df['age'].mul(df['income'].cumsum().shift(fill_value=0)))
 .sub(df['age'].mul(df['income']).cumsum().shift(fill_value=0))
)

output:

   age  income  sumproduct
0    0       5           5
1    5      13          38
2   10      23         138
3   15      18         338
4   20      12         627

power

powers are more complex as you cannot directly factorize, you can however rewrite the operation with expanding:

df['sumproduct'] = (df['age'].expanding()
 .apply(lambda x: sum(df.loc[:x.index[-1], 'income'] * interest**(x.iloc[-1]-x)))
)

Output:

   age  income  sumproduct
0    0       5    5.000000
1    5      13   18.796370
2   10      23   44.790145
3   15      18   69.924054
4   20      12   93.061143
Answered By: mozway