Among several investment histories, how do I find the cumulative-sum series that proves most reliable for the long term?

Question:

In fact, the chart that reaches the highest positive peak in its cumulative sum is not always the most reliable for long-term investment: a single trade may have generated a very high profit, but afterwards the series can return to being negative and turn into an endless fall.

Relying on the highest ROI (return on investment) alone is risky for the same reasons.

That said, the cumulative sum graphs generated by these test values are:

ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""

[cumulative-sum chart for ex_csv_1]

ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""

[cumulative-sum chart for ex_csv_2]

ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""

[cumulative-sum chart for ex_csv_3]

If I wanted to find the one with the biggest peak, I would do it this way:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import io

ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""

ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""

ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""

def save_fig(cs):
    # cs is one entry of `options`: [total_sum, roi, pl]; cs[2] holds the filtered returns
    values = np.cumsum(cs[2])
    fig = plt.figure()
    plt.plot(values)
    fig.savefig('a_graph.png', dpi=fig.dpi)
    fig.clf()
    plt.close('all')

options = []

for i, strio in enumerate([ex_csv_1, ex_csv_2, ex_csv_3]):
    df = pd.read_csv(io.StringIO(strio), sep=",")
    # True where the running total before this row was already positive
    df['invest'] = df.groupby(['Col 1', 'Col 2', 'Col 3'])['return'].cumsum().gt(df['return'])
    pl = df[df['invest']]['return']
    total_sum = pl.sum()
    roi = total_sum / len(pl)
    options.append([total_sum, roi, pl])

# Keep the option with the largest total sum
max_list = max(options, key=lambda sublist: sublist[0])
save_fig(max_list)

But how should I go about finding which of the three track records shows the smallest fluctuation and therefore the greatest long-term reliability?

I will put two charts below; the second chart, which has fewer oscillations, is the more reliable of the two for the long term, since its variations are smaller and it keeps growing in an established pattern:

[chart 1: cumulative sum with large oscillations]
[chart 2: smoother cumulative sum rising in a steady pattern]
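
For illustration, I imagine the fluctuation could be quantified as the spread of the cumulative sum around its own straight-line trend, though I am not sure this is the right measure for long-term reliability. A minimal sketch of that idea (the detrend-and-measure approach here is just my guess, not an established method):

import io
import numpy as np
import pandas as pd

def fluctuation(csv_text):
    """Standard deviation of the cumulative sum around its linear trend."""
    df = pd.read_csv(io.StringIO(csv_text), sep=",")
    cumsum = df['return'].cumsum()
    x = np.arange(len(cumsum))
    slope, intercept = np.polyfit(x, cumsum, 1)   # straight-line trend
    residuals = cumsum - (slope * x + intercept)  # distance from the trend
    return residuals.std()

# Lower values mean smaller oscillation around the trend
for name, csv_text in [('ex_csv_1', ex_csv_1), ('ex_csv_2', ex_csv_2), ('ex_csv_3', ex_csv_3)]:
    print(name, round(fluctuation(csv_text), 3))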

Asked By: Digital Farmer


Answers:

One simple measure of "reliability" in a graph is how well it matches linear behavior. To calculate this, we can perform a linear regression on the data. The scipy.stats package has a nice built-in function for this. A "good" result should have a high R-value, meaning the data behave nearly linearly. The slope of the result should also be positive, meaning the total increases over time.

import scipy.stats

results = {}
for i, strio in enumerate([ex_csv_1, ex_csv_2, ex_csv_3]):
    df = pd.read_csv(io.StringIO(strio), sep=",")
    df['cumsum'] = df.groupby(['Col 1', 'Col 2', 'Col 3'])['return'].cumsum()

    # Perform the linear regression
    linreg = scipy.stats.linregress(df.index, df['cumsum'])

    # Save the results for comparison later
    results[i] = linreg

    # Plot to see how the regression matches the data
    plt.plot(df.index, df['cumsum'])
    xmin, xmax = min(df.index), max(df.index)
    plt.plot(
        [xmin, xmax],
        [xmin * linreg.slope + linreg.intercept, xmax * linreg.slope + linreg.intercept],
        label="slope: {:g}\nR-value: {:g}".format(linreg.slope, linreg.rvalue)
    )
    plt.legend()
    plt.show()
results  # inspect the collected regression results

The output results are:

{0: LinregressResult(slope=0.2857142857142857, intercept=1.1428571428571428, rvalue=0.7559289460184545, pvalue=0.04931308767365261, stderr=0.11065666703449761),
 1: LinregressResult(slope=3.0, intercept=-4.714285714285714, rvalue=0.8373248339703451, pvalue=0.01874218974109145, stderr=0.8759834123860507),
 2: LinregressResult(slope=1.2857142857142856, intercept=3.0, rvalue=0.9185586535436918, pvalue=0.0034781651152865026, stderr=0.24743582965269673)}

I would interpret this as follows (a selection sketch comes after the list):

  1. Low R-value: plot one has a lot of variability. Low slope: poor return on investment.
  2. OK R-value: the plot is more consistent. High slope: good return on investment.
  3. High R-value: the plot is very consistent. High slope: good return on investment.
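
Based on that interpretation, here is a minimal sketch for picking the "most reliable" series programmatically: keep only the results with a positive slope, then maximize the R-value. This selection rule is my suggestion, not the only possibility:

# Keep only upward-trending results, then pick the most linear one
candidates = {i: r for i, r in results.items() if r.slope > 0}
best = max(candidates, key=lambda i: candidates[i].rvalue)
print("Most reliable: ex_csv_{} (slope={:g}, R={:g})".format(
    best + 1, candidates[best].slope, candidates[best].rvalue))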
Answered By: SNygard

I used a similar approach to the one suggested by SNygard, with some additions to what I said earlier. As data I took several segments of the ‘MA’ stock; the closing prices act as your ‘return’ data here. If you want to use your own data, uncomment the line x['invest'] = x['return'].cumsum() and replace 'Close' with 'invest' everywhere.

In the plot labels: df = dataframe number, coef = slope of the regression line, R = the score of the fit (lr.score returns R², the coefficient of determination between the values and the regression line), dif = mean absolute percentage difference between the values and the regression line.

With the percentage difference, things turned out to be less clear-cut: the first dataframe has a low value, but so does the fourth, which also has a small R. I think the two metrics should be used in conjunction (a sketch of that is at the end of this answer); the value of dif gives an idea of the hypothetical possible drawdown of the balance as a percentage.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import yfinance as yf

# """
df1 = yf.download('MA', start='2019-01-05', end='2019-04-23').reset_index()
df2 = yf.download('MA', start="2018-01-01", end="2018-04-21").reset_index()
df3 = yf.download('MA', start="2020-02-01", end="2020-05-21").reset_index()
df4 = yf.download('MA', start='2017-01-05', end='2017-04-23').reset_index()

df = [df1, df2, df3, df4]
hist = len(df)

aaa = np.ones((hist, 4), dtype=float)  # rows: [dataframe index, coef (slope), R, dif]

for ind, x in enumerate(df):
    lr = LinearRegression()
    index = x.index.values.reshape((-1, 1))
    # x['invest'] = x['return'].cumsum()  # uncomment to use your own 'return' data
    lr.fit(index, x['Close'])
    x['lr'] = lr.predict(index)  # fitted regression line
    x['dif'] = np.abs((x['Close'] - x['lr']) / (x['lr'] / 100.0))  # % distance from the line
    R = lr.score(index, x['Close'])  # R², the coefficient of determination
    aaa[ind][0] = ind
    aaa[ind][1] = lr.coef_[0]  # slope; coef_ is a length-1 array
    aaa[ind][2] = R
    aaa[ind][3] = x['dif'].mean()

# Lay the charts out two per row
rows = int(hist / 2)
if hist % 2 > 0:
    rows += 1

for i in range(0, hist):
    ax = plt.subplot(rows, 2, i + 1)
    # Place the label slightly above the minimum of the series
    ur = df[i]['Close'].min() + (df[i]['Close'].max() - df[i]['Close'].min()) / 10
    ax.text(df[i].index[-40], ur, 'df =' + str(int(aaa[i][0]) + 1) + ',' + ' coef='
            + str(round(aaa[i][1], 2)) + ' R=' + str(round(aaa[i][2],2))
            + ' dif = ' + str(round(aaa[i][3],2)), fontsize=10, ha='center')

    ax.plot(df[i].index, df[i]['Close'])
    ax.plot(df[i].index, df[i]['lr'])

plt.show()

[charts: closing prices with fitted regression lines for each dataframe]

If you want to see the percentage difference separately:

for i in range(0, hist):
    ax = plt.subplot(rows, 2, i + 1)
    # Same layout, but plot the percentage difference instead of the price
    ur = df[i]['dif'].min() + (df[i]['dif'].max() - df[i]['dif'].min()) / 10
    ax.text(df[i].index[-40], ur, 'df =' + str(int(aaa[i][0]) + 1) + ',' + ' coef='
            + str(round(aaa[i][1], 2)) + ' R=' + str(round(aaa[i][2], 2))
            + ' dif = ' + str(round(aaa[i][3], 2)), fontsize=10, ha='center')

    ax.plot(df[i].index, df[i]['dif'])

plt.show()

Or get the best dataframe by R:

aaa = aaa[aaa[:, 1] > 0]            # keep only dataframes with a positive slope
aaa = aaa[(-aaa[:, 2]).argsort()]   # sort by R, highest first

fig, ax = plt.subplots()
ax.plot(df[int(aaa[0][0])].index, df[int(aaa[0][0])]['Close'])
ax.plot(df[int(aaa[0][0])].index, df[int(aaa[0][0])]['lr'])
fig.autofmt_xdate()
plt.show()
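
And, as suggested above, a minimal sketch of using the two metrics in conjunction: rank primarily by R and use dif as a secondary, ascending key. This particular ranking rule is my own choice, not an established method:

# aaa columns: [dataframe index, coef (slope), R, dif];
# the positive-slope filter was already applied above
order = np.lexsort((aaa[:, 3], -aaa[:, 2]))  # primary key: R desc, secondary: dif asc
for row in aaa[order]:
    print('df{}: coef={:.2f}  R={:.2f}  dif={:.2f}'.format(
        int(row[0]) + 1, row[1], row[2], row[3]))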
Answered By: inquirer