Tricky calculation that pulls in prior values using Pandas
Question:
I have a dataset where I would like to create a new column called 'aa_cumul'. For a specific city and ID, the first value is the sum of the first value in the column 'new_r_aa' (which is 2) and the first value in the column 'cml_aa_bx' (which is 1), giving 3.
From there, each later value is the previous value of 'aa_cumul' plus the current value of 'new_r_aa'
(3 + 8 = 11, 11 + 9 = 20, etc.)
Data
import pandas as pd
data = {
    'city': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA'],
    'ID': ['AA', 'AA', 'AA', 'AA', 'AA', 'AA'],
    'cml_aa_bx': [1, 3, 6, 10, 12, 2],
    'new_r_aa': [2, 6, 9, 8, 6, 5]
}
df = pd.DataFrame(data)
Desired
data = {
    'city': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA'],
    'ID': ['AA', 'AA', 'AA', 'AA', 'AA', 'AA'],
    'cml_aa_bx': [1, 3, 6, 10, 12, 2],
    'new_r_aa': [2, 6, 9, 8, 6, 5],
    'aa_cumul': [3, 11, 20, 28, 34, 6]
}
Doing
# Initialize the 'new cuml aa' column
new_cuml_aa = []
# Initialize the first value in 'new cuml aa' with the sum of the first value in 'new r aa' and 'cml_aa_bx'
new_cuml_aa.append(df['new_r_aa'][0] + df['cml_aa_bx'][0])
# Loop through the DataFrame to calculate 'new cuml aa' values
for i in range(1, len(df)):
    new_cuml_aa_value = new_cuml_aa[i - 1] + df['new_r_aa'][i]
    new_cuml_aa.append(new_cuml_aa_value)
However, this is giving me the wrong values/output. Any suggestions are appreciated.
Answers:
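One option is pd.Series.mask: build a condition that marks the first row, replace that row's value with cml_aa_bx + new r aa, and then run the cumulative sum (the column is spelled 'new r aa' in this snippet; use 'new_r_aa' with the data in the question):
(df
 .assign(aa_cumul=df['new r aa']
         .mask(df.index == 0, df.cml_aa_bx + df['new r aa'])
         .cumsum())
)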
   city  ID  quarter  cml_bb_bx  r_aa_bx  cml_aa_bx  BB_AA_Bx_Ratio  expected_aa_bx_delta  total aa  total round aa  new r aa  aa_cumul
0    NY  AA   2024Q1          6        0          1        6.000000                 1.810       1.8               2         2         3
1    NY  AA   2024Q2         13        2          3        4.333333                 2.857       4.9               6         8        11
2    NY  AA   2024Q3         18        3          6        3.000000                 2.395       5.4               6         9        20
3    NY  AA   2024Q4         20        4         10        2.000000                 0.000       4.0               4         8        28
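The mask above keys on df.index == 0, so it seeds only the very first row of the frame. If the seed should restart for every city and ID, one possible adaptation is sketched below. It assumes the six-row sample data from the question; the names keys, is_first and seeded are made up for this illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA'],
    'ID': ['AA', 'AA', 'AA', 'AA', 'AA', 'AA'],
    'cml_aa_bx': [1, 3, 6, 10, 12, 2],
    'new_r_aa': [2, 6, 9, 8, 6, 5],
})

keys = ['city', 'ID']  # assumed grouping columns

# True on the first row of each (city, ID) combination, False elsewhere.
is_first = ~df.duplicated(subset=keys)

# Replace the first row of each group with cml_aa_bx + new_r_aa,
# leaving every other row as plain new_r_aa.
seeded = df['new_r_aa'].mask(is_first, df['cml_aa_bx'] + df['new_r_aa'])

# Accumulate the seeded values separately within each group.
result = df.assign(aa_cumul=seeded.groupby([df[k] for k in keys]).cumsum())
print(result['aa_cumul'].tolist())  # [3, 9, 18, 26, 32, 7]
```

DataFrame.duplicated marks repeats of a key combination, so its negation picks out the first occurrence regardless of what the index labels are.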
Update: if you want it grouped, you can use:
df['aa_cumul'] = df.groupby(['city', 'ID'])['new_r_aa'].cumsum() + df.groupby(['city', 'ID'])['cml_aa_bx'].transform('first')
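To see what the grouped one-liner does, here is a minimal self-contained sketch that runs it on the sample data from the question and splits it into its two pieces. The intermediate names grouped, running and offset are only for illustration. Note that with the data exactly as posted this gives 3, 9, 18, 26, 32 for NY and 7 for CA, which does not match the 'Desired' listing; that listing seems to follow the 3 + 8 = 11 example rather than the posted 'new_r_aa' values.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA'],
    'ID': ['AA', 'AA', 'AA', 'AA', 'AA', 'AA'],
    'cml_aa_bx': [1, 3, 6, 10, 12, 2],
    'new_r_aa': [2, 6, 9, 8, 6, 5],
})

# Group once and reuse the same grouper for both pieces.
grouped = df.groupby(['city', 'ID'], sort=False)

# Running total of new_r_aa inside each (city, ID) group.
running = grouped['new_r_aa'].cumsum()

# First cml_aa_bx of each group, repeated on every row of that group.
offset = grouped['cml_aa_bx'].transform('first')

df['aa_cumul'] = running + offset
print(df['aa_cumul'].tolist())  # [3, 9, 18, 26, 32, 7]
```

Both cumsum and transform return results aligned to the original index, so the two Series can be added row by row without any merge.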
Original:
It's unclear whether you want a DataFrame answer or a dictionary answer. Here's a DataFrame answer:
import pandas as pd
df = pd.DataFrame(data)
df['aa_cumul'] = df['new_r_aa'].cumsum() + df['cml_aa_bx'][0]
Output:
   city  ID  quarter  cml_bb_bx  ...  total aa  total round aa  new_r_aa  aa_cumul
0    NY  AA   2024Q1          6  ...       1.8               2         2         3
1    NY  AA   2024Q2         13  ...       4.9               6         8        11
2    NY  AA   2024Q3         18  ...       5.4               6         9        20
3    NY  AA   2024Q4         20  ...       4.0               4         8        28
[4 rows x 12 columns]
…and here's a dictionary answer (using numpy):
import numpy as np
# Read the offset from the dictionary itself so this does not depend on df
data['aa_cumul'] = np.cumsum(data['new_r_aa']) + data['cml_aa_bx'][0]
Output:
{'city': ['NY', 'NY', 'NY', 'NY'], 'ID': ['AA', 'AA', 'AA', 'AA'], 'quarter': ['2024Q1', '2024Q2', '2024Q3', '2024Q4'], 'cml_bb_bx': [6, 13, 18, 20], 'r_aa_bx': [0, 2, 3, 4], 'cml_aa_bx': [1, 3, 6, 10], 'BB_AA_Bx_Ratio': [6, 4.333333333, 3, 2], 'expected_aa_bx_delta': [1.81, 2.857, 2.395, 0], 'total aa': [1.8, 4.9, 5.4, 4.0], 'total round aa': [2, 6, 6, 4], 'new_r_aa': [2, 8, 9, 8], 'aa_cumul': array([ 3, 11, 20, 28])}
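The dictionary answer above treats the data as a single group. If the running total should restart for each city and ID, a plain-Python sketch along these lines might do, again assuming the six-row sample data from the question. The helper dictionary running and the loop variable names are hypothetical, chosen only for this example.

```python
data = {
    'city': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA'],
    'ID': ['AA', 'AA', 'AA', 'AA', 'AA', 'AA'],
    'cml_aa_bx': [1, 3, 6, 10, 12, 2],
    'new_r_aa': [2, 6, 9, 8, 6, 5],
}

running = {}   # (city, ID) -> latest aa_cumul seen for that group
aa_cumul = []

for city, id_, cml, new_r in zip(data['city'], data['ID'],
                                 data['cml_aa_bx'], data['new_r_aa']):
    key = (city, id_)
    # On the first row of a group there is no previous total yet,
    # so start from that row's cml_aa_bx.
    previous = running.get(key, cml)
    running[key] = previous + new_r
    aa_cumul.append(running[key])

data['aa_cumul'] = aa_cumul
print(data['aa_cumul'])  # [3, 9, 18, 26, 32, 7]
```

Because the totals are keyed by (city, ID), rows of the same group do not need to be adjacent for this to work.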