Calculating slope of non-null points for a row of observations in Python

Question:

My dataframe looks something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8], 'price':[4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
pd.set_option('display.max_columns', None)
# Build the six monthly lag columns
for lag in range(1,7):
  df[f'price_lag{lag}M'] = df['price'].shift(lag)
print(df)

>>
    date  price  price_lag1M  price_lag2M  price_lag3M  price_lag4M  
0      1   4.95          NaN          NaN          NaN          NaN   
1      2   5.04         4.95          NaN          NaN          NaN   
2      3   4.88         5.04         4.95          NaN          NaN   
3      4   4.22         4.88         5.04         4.95          NaN   
4      5   5.67         4.22         4.88         5.04         4.95   
5      6   5.89         5.67         4.22         4.88         5.04   
6      7   5.50         5.89         5.67         4.22         4.88   
7      8   5.12         5.50         5.89         5.67         4.22   

   price_lag5M  price_lag6M  
0          NaN          NaN  
1          NaN          NaN  
2          NaN          NaN  
3          NaN          NaN  
4          NaN          NaN  
5         4.95          NaN  
6         5.04         4.95  
7         4.88         5.04  

I would like to calculate the slope of the lags for each month. I have mostly been using np.polyfit, and while it is quite fast, it gives me NaN if there’s at least one NaN in the row.

X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T  # shape (6, n_rows): np.polyfit fits each column
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)  # [0] is the slope coefficient
df = df.drop(vars_to_consider, axis=1)
print(df)

>>
    date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88             NaN
3      4   4.22             NaN
4      5   5.67             NaN
5      6   5.89             NaN
6      7   5.50         -0.1694
7      8   5.12         -0.1937

I’d like to calculate the slopes from just the non-null values, ignoring the nulls, but for all of the rows. For a small dataset such as this one, I’d do something like this:

vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
for i in range(len(df)):
  Y = df.loc[i, vars_to_consider].values
  idx = np.where(~np.isnan(Y))[0]  # positions of the non-null lags
  if len(idx) < 2:                 # a line needs at least two points
    df.loc[i, 'price_trend_6M'] = np.nan
  else:
    df.loc[i, 'price_trend_6M'] = np.polyfit(np.arange(len(idx)), Y[idx], 1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)

>>
    date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88         -0.0900
3      4   4.22          0.0350
4      5   5.67          0.2350
5      6   5.89         -0.0620
6      7   5.50         -0.1694
7      8   5.12         -0.1937

However, the original dataframe is around 300k rows long, and there are around 80 variables like ‘price’ that I want to calculate trends for. So the second method is taking too long. Is there a faster way to achieve the second output?

Asked By: Tejas


Answers:

Recognize that since your largest shift is 6 rows, np.polyfit will return NaN only for the first six rows. You can keep the vectorized np.polyfit call for the entire dataframe and then iterate over just those first six rows to correct them. Since that loop runs over a fixed, small number of rows, it will be much faster than iterating over all rows as in your second snippet of code.

# Vectorized call for the entire DF

# Note that X needs to be an array for the mask in the loop below to work
X = np.array([1,2,3,4,5,6])

vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)

# Fix first six rows
for i, row in df.head(len(X)).iterrows():
    ydata = row.loc[vars_to_consider].values
    mask = ~np.isnan(ydata) # Don't need `np.where` if we use boolean indexing

    if mask.sum() >= 2: # If >= 2 points, make a polyfit
        df.loc[i, 'price_trend_6M'] = np.polyfit(X[mask],ydata[mask],1)[0].round(4)

df = df.drop(vars_to_consider, axis=1)

Which gives your desired result:

   date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88         -0.0900
3     4   4.22          0.0350
4     5   5.67          0.2350
5     6   5.89         -0.0620
6     7   5.50         -0.1694
7     8   5.12         -0.1937
Answered By: Pranav Hosangadi

@Pranav’s answer is great for solving the question as I framed it. My original data has multiple IDs, each with its own dates and prices, so the NaN rows won’t always be the top 6 rows (see the per-ID lag sketch after the output below). However, the number of rows whose slopes have to be computed manually from the non-null values is much smaller than the total number of rows. This is what I ended up using:

X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y.T,1)[0].round(4)

# Boolean mask of the non-null entries in Y
idx = ~np.isnan(Y)
# Rows with 2 to 5 non-null values need mending: fully populated rows were
# already fitted above, and rows with fewer than 2 points must stay NaN
idx2 = (idx.sum(axis=1) >= 2) & (idx.sum(axis=1) <= 5)
# Run a for loop with these rows, and calculate slopes with non-null values
for i in np.where(idx2)[0]:
  y = Y[i][~np.isnan(Y[i])]
  x = np.arange(len(y))
  df.loc[i, 'price_trend_6M'] = np.polyfit(x,y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)

>>
    date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88         -0.0900
3      4   4.22          0.0350
4      5   5.67          0.2350
5      6   5.89         -0.0620
6      7   5.50         -0.1694
7      8   5.12         -0.1937
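
Since the real data has multiple IDs, the lag columns themselves would normally be built within each group, so that one ID's prices don't shift into another ID's rows. A minimal sketch, assuming a hypothetical 'id' column that the toy example above doesn't have:

# Hypothetical: 'id' is assumed; building each lag within its own group
# keeps shift() from leaking values across IDs
for lag in range(1,7):
  df[f'price_lag{lag}M'] = df.groupby('id')['price'].shift(lag)
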
Answered By: Tejas
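
As a closing aside that goes beyond both answers: the per-row loop can be removed entirely by computing the least-squares slope in closed form from masked sums, so NaNs are skipped without calling np.polyfit row by row. Below is a minimal sketch; masked_slopes is a hypothetical helper, not from either answer. For rows whose non-null lags are contiguous (as with the leading NaNs that shift produces), it reproduces the loop-based slopes above.

import numpy as np

def masked_slopes(X, Y):
  # Least-squares slope of each row of Y against X, ignoring NaNs.
  # X: (m,) x positions; Y: (n, m) array with NaNs allowed.
  mask = ~np.isnan(Y)              # True where Y is observed
  n = mask.sum(axis=1)             # points available per row
  Y0 = np.where(mask, Y, 0.0)      # zero-fill so the sums skip NaNs
  sx = (mask * X).sum(axis=1)      # sum of x over observed points
  sy = Y0.sum(axis=1)              # sum of y
  sxx = (mask * X**2).sum(axis=1)  # sum of x^2
  sxy = (Y0 * X).sum(axis=1)       # sum of x*y
  denom = n * sxx - sx**2          # zero when fewer than 2 points
  with np.errstate(divide='ignore', invalid='ignore'):
    slopes = (n * sxy - sx * sy) / denom
  slopes[n < 2] = np.nan           # can't fit a line through < 2 points
  return slopes

# Usage with the example above (run before the lag columns are dropped):
X = np.array([1,2,3,4,5,6])
Y = df[vars_to_consider].to_numpy()  # shape (n_rows, 6)
df['price_trend_6M'] = masked_slopes(X, Y).round(4)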