Among several investment histories, how do I find the cumulative-sum series that is most reliable for the long term?
Question:
In fact, the chart that reaches the highest cumulative-sum peak is not always the most reliable for long-term investing: a single trade may have generated a very large profit, after which the series reverts to losses or even enters an unbroken decline.
Relying on the highest ROI (return on investment) alone is risky for the same reason.
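To illustrate the point numerically (a sketch with made-up return series, not data from the question): a series with the higher cumulative peak can still end far lower than a steadier one:

```python
import numpy as np

spiky = np.array([10, 2, -3, -4, -4, -3, -2])   # one big win, then steady losses
steady = np.array([1, 1, 1, 1, 1, 1, 1])        # small but consistent gains

for name, returns in [("spiky", spiky), ("steady", steady)]:
    cs = np.cumsum(returns)
    print(name, "peak:", cs.max(), "final:", cs[-1])
# spiky peaks at 12 but finishes at -4; steady peaks and finishes at 7
```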
That said, the cumulative sum graphs generated by these test values are:
ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""
ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""
ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""
If I wanted to find the one with the biggest peak, I would do it this way:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import io
ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""
ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""
ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""
def save_fig(cs):
    values = np.cumsum(cs[2])
    fig = plt.figure()
    plt.plot(values)
    fig.savefig('a_graph.png', dpi=fig.dpi)
    fig.clf()
    plt.close('all')

options = []
for i, strio in enumerate([ex_csv_1, ex_csv_2, ex_csv_3]):
    df = pd.read_csv(io.StringIO(strio), sep=",")
    df['invest'] = df.groupby(['Col 1', 'Col 2', 'Col 3'])['return'].cumsum().gt(df['return'])
    pl = df[df['invest']]['return']
    total_sum = pl.sum()
    roi = total_sum / len(pl)
    options.append([total_sum, roi, pl])

max_list = max(options, key=lambda sublist: sublist[0])
save_fig(max_list)
But how should I go about finding which track record among the three keeps fluctuations smallest and thus offers the greatest long-term reliability?
I will put two charts below; the second chart, which oscillates less, is the most reliable of them for the long term, because its variations are smaller and it grows steadily in an established pattern:
Answers:
One simple measure of "reliability" of a graph is how well it matches linear behavior. To quantify this, we can perform a linear regression on the data; the scipy.stats package has a built-in function for this. A "good" result should have a high R-value, meaning the data behave nearly linearly, and a positive slope, meaning the series increases over time.
import scipy.stats

results = {}
for i, strio in enumerate([ex_csv_1, ex_csv_2, ex_csv_3]):
    df = pd.read_csv(io.StringIO(strio), sep=",")
    df['cumsum'] = df.groupby(['Col 1', 'Col 2', 'Col 3'])['return'].cumsum()
    # Perform the linear regression
    linreg = scipy.stats.linregress(df.index, df['cumsum'])
    # Save the results for comparison later
    results[i] = linreg
    # Plot to see how the regression matches the data
    plt.plot(df.index, df['cumsum'])
    xmin, xmax = min(df.index), max(df.index)
    plt.plot(
        [xmin, xmax],
        [xmin * linreg.slope + linreg.intercept, xmax * linreg.slope + linreg.intercept],
        label="slope: {:g}\nR-value: {:g}".format(linreg.slope, linreg.rvalue),
    )
    plt.legend()
    plt.show()

results
The output results are:
{0: LinregressResult(slope=0.2857142857142857, intercept=1.1428571428571428, rvalue=0.7559289460184545, pvalue=0.04931308767365261, stderr=0.11065666703449761),
1: LinregressResult(slope=3.0, intercept=-4.714285714285714, rvalue=0.8373248339703451, pvalue=0.01874218974109145, stderr=0.8759834123860507),
2: LinregressResult(slope=1.2857142857142856, intercept=3.0, rvalue=0.9185586535436918, pvalue=0.0034781651152865026, stderr=0.24743582965269673)}
I would interpret this as:
- Plot one, low R-value: a lot of variability. Low slope: poor return on investment.
- Plot two, OK R-value: more consistent. High slope: good return on investment.
- Plot three, high R-value: very consistent. High slope: good return on investment.
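To turn that interpretation into a selection rule, one possible sketch (using the same three return series, reduced to plain lists; the filter-then-rank rule is my own reading of the answer) is to keep only positive-slope fits and take the highest R-value:

```python
import pandas as pd
import scipy.stats

# The three return series from the question's example CSVs.
series = {
    0: [1, 1, -1, 1, 1, -1, 1],
    1: [1, -2, -3, 4, 5, 6, 7],
    2: [2, 2, 2, 2, 2, -2, 2],
}

results = {}
for i, returns in series.items():
    cumsum = pd.Series(returns).cumsum()
    results[i] = scipy.stats.linregress(cumsum.index, cumsum)

# Keep only upward-sloping fits, then pick the one closest to a straight line.
candidates = {i: r for i, r in results.items() if r.slope > 0}
best = max(candidates, key=lambda i: candidates[i].rvalue)
print(best)  # -> 2
```

This reproduces the conclusion above: series 2 has the highest R-value (about 0.92) with a positive slope, so it is chosen as the most reliable.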
I used a similar approach to the one SNygard suggested, with some additions to what I said earlier. As data I took several segments of the 'MA' stock; the closing prices act as your 'return' data here. If you want to use your own data, uncomment the line x['invest'] = x['return'].cumsum() and replace 'Close' with 'invest' everywhere.
In the plot legend: df = dataframe number, coef = slope, R = the regression's coefficient of determination (R², from lr.score), dif = mean absolute percentage difference between the values and the regression line.
With the percentage difference alone, the picture is not so clear-cut: the first dataframe has a low value, but so does the fourth, which also has a small R score. I think the two metrics should be used together. The dif value gives a sense of the hypothetical drawdown of the balance as a percentage.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import yfinance as yf

df1 = yf.download('MA', start='2019-01-05', end='2019-04-23').reset_index()
df2 = yf.download('MA', start='2018-01-01', end='2018-04-21').reset_index()
df3 = yf.download('MA', start='2020-02-01', end='2020-05-21').reset_index()
df4 = yf.download('MA', start='2017-01-05', end='2017-04-23').reset_index()
df = [df1, df2, df3, df4]
hist = len(df)
aaa = np.ones((hist, 4), dtype=float)
for ind, x in enumerate(df):
    lr = LinearRegression()
    index = x.index.values.reshape((-1, 1))
    # x['invest'] = x['return'].cumsum()  # uncomment to use your own return data
    lr.fit(index, x['Close'])
    x['lr'] = lr.predict(index)
    # Absolute deviation from the regression line, as a percentage
    x['dif'] = np.abs((x['Close'] - x['lr']) / (x['lr'] / 100.0))
    R = lr.score(index, x['Close'])
    aaa[ind][0] = ind
    aaa[ind][1] = lr.coef_[0]
    aaa[ind][2] = R
    aaa[ind][3] = x['dif'].mean()

rows = int(hist / 2)
if hist % 2 > 0:
    rows += 1
for i in range(0, hist):
    ax = plt.subplot(rows, 2, i + 1)
    ur = df[i]['Close'].min() + (df[i]['Close'].max() - df[i]['Close'].min()) / 10
    ax.text(df[i].index[-40], ur, 'df=' + str(int(aaa[i][0]) + 1) + ', coef='
            + str(round(aaa[i][1], 2)) + ' R=' + str(round(aaa[i][2], 2))
            + ' dif=' + str(round(aaa[i][3], 2)), fontsize=10, ha='center')
    ax.plot(df[i].index, df[i]['Close'])
    ax.plot(df[i].index, df[i]['lr'])
plt.show()
If you want to see the difference separately:
for i in range(0, hist):
    ax = plt.subplot(rows, 2, i + 1)
    ur = df[i]['dif'].min() + (df[i]['dif'].max() - df[i]['dif'].min()) / 10
    ax.text(df[i].index[-40], ur, 'df=' + str(int(aaa[i][0]) + 1) + ', coef='
            + str(round(aaa[i][1], 2)) + ' R=' + str(round(aaa[i][2], 2))
            + ' dif=' + str(round(aaa[i][3], 2)), fontsize=10, ha='center')
    ax.plot(df[i].index, df[i]['dif'])
plt.show()
Or get the best dataframe by the R score:
aaa = aaa[aaa[:, 1] > 0]          # keep only positive slopes
aaa = aaa[(-aaa[:, 2]).argsort()]  # sort by R, descending
fig, ax = plt.subplots()
ax.plot(df[int(aaa[0][0])].index, df[int(aaa[0][0])]['Close'])
ax.plot(df[int(aaa[0][0])].index, df[int(aaa[0][0])]['lr'])
fig.autofmt_xdate()
plt.show()
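Since the answer argues R and dif should be used together, one way to sketch a combined ranking (with synthetic stand-in series instead of yfinance downloads, and a tie-breaking rule that is my own assumption, not the author's code) is to filter on slope, then rank by high R with low dif breaking ties:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the price histories:
# a smooth uptrend, a noisy uptrend, and a downtrend.
n = 80
t = np.arange(n).reshape(-1, 1)
histories = [
    100 + 0.5 * np.arange(n) + rng.normal(0, 0.5, n),  # smooth uptrend
    100 + 0.5 * np.arange(n) + rng.normal(0, 5.0, n),  # noisy uptrend
    100 - 0.5 * np.arange(n) + rng.normal(0, 0.5, n),  # downtrend
]

stats = []
for i, close in enumerate(histories):
    lr = LinearRegression().fit(t, close)
    fitted = lr.predict(t)
    R = lr.score(t, close)                                    # R² of the fit
    dif = np.abs((close - fitted) / (fitted / 100.0)).mean()  # mean % deviation
    stats.append((i, lr.coef_[0], R, dif))

# Keep upward trends only, then rank: high R first, low dif as tie-breaker.
candidates = [s for s in stats if s[1] > 0]
best = max(candidates, key=lambda s: (s[2], -s[3]))
print("best history:", best[0])  # -> best history: 0
```

The downtrend is excluded by the slope filter, and the smooth uptrend wins over the noisy one because it has both a higher R and a smaller dif.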