Calculating slope of non-null points for a row of observations in Python
Question:
My dataframe looks something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8], 'price':[4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
pd.set_option('display.max_columns', None)
# Build lagged copies of price: price_lag1M ... price_lag6M
for lag in range(1,7):
    df[f'price_lag{lag}M'] = df['price'].shift(lag)
print(df)
>>
   date  price  price_lag1M  price_lag2M  price_lag3M  price_lag4M
0     1   4.95          NaN          NaN          NaN          NaN
1     2   5.04         4.95          NaN          NaN          NaN
2     3   4.88         5.04         4.95          NaN          NaN
3     4   4.22         4.88         5.04         4.95          NaN
4     5   5.67         4.22         4.88         5.04         4.95
5     6   5.89         5.67         4.22         4.88         5.04
6     7   5.50         5.89         5.67         4.22         4.88
7     8   5.12         5.50         5.89         5.67         4.22

   price_lag5M  price_lag6M
0          NaN          NaN
1          NaN          NaN
2          NaN          NaN
3          NaN          NaN
4          NaN          NaN
5         4.95          NaN
6         5.04         4.95
7         4.88         5.04
I would like to calculate the slope of the lags for each month. I have mostly been using np.polyfit, and while it is quite fast, it gives me NaN if there’s at least one NaN in the row.
X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date price price_trend_6M
0 1 4.95 NaN
1 2 5.04 NaN
2 3 4.88 NaN
3 4 4.22 NaN
4 5 5.67 NaN
5 6 5.89 NaN
6 7 5.50 -0.1694
7 8 5.12 -0.1937
I’d like to calculate the slope from whatever non-null values each row has, ignoring the nulls, and do this for every row. For a small dataset such as this one, I’d do something like this:
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
for i in range(len(df)):
    Y = df.loc[i, vars_to_consider].values
    idx = np.where(~np.isnan(Y))[0]  # positions of the non-null lags
    if len(idx) < 2:
        # A line needs at least two points
        df.loc[i, 'price_trend_6M'] = np.nan
    else:
        df.loc[i, 'price_trend_6M'] = np.polyfit(np.arange(len(idx)), Y[idx], 1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date price price_trend_6M
0 1 4.95 NaN
1 2 5.04 NaN
2 3 4.88 -0.0900
3 4 4.22 0.0350
4 5 5.67 0.2350
5 6 5.89 -0.0620
6 7 5.50 -0.1694
7 8 5.12 -0.1937
However, the original dataframe is around 300k rows long, and there are around 80 variables like ‘price’ that I want to calculate trends for. So the second method is taking too long. Is there a faster way to achieve the second output?
Answers:
Recognize that since your largest shift is 6 rows, np.polyfit will return nan only for the first six rows. You could continue using np.polyfit for the entire dataframe and then simply iterate over the first six rows to correct those. Since you know you’ll only iterate over a fixed, small number of rows, this will be much faster than iterating over all rows like you show in your second snippet of code.
# Vectorized call for the entire DF
# Note that X needs to be an array for the mask in the loop below to work
X = np.array([1,2,3,4,5,6])
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)
# Fix first six rows
for i, row in df.head(len(X)).iterrows():
    ydata = row.loc[vars_to_consider].values.astype(float) # float dtype so np.isnan works
    mask = ~np.isnan(ydata) # Don't need `np.where` if we use boolean indexing
    if mask.sum() >= 2: # If >= 2 points, make a polyfit
        df.loc[i, 'price_trend_6M'] = np.polyfit(X[mask],ydata[mask],1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
Which gives your desired result:
date price price_trend_6M
0 1 4.95 NaN
1 2 5.04 NaN
2 3 4.88 -0.0900
3 4 4.22 0.0350
4 5 5.67 0.2350
5 6 5.89 -0.0620
6 7 5.50 -0.1694
7 8 5.12 -0.1937
@Pranav’s answer is a great solution to the question as I framed it. However, my original data has multiple IDs, each with multiple dates and prices, so the rows with nulls won’t always be the top six. Still, the rows whose slope has to be recalculated from the non-null values are far fewer than the total number of rows. This is what I ended up using:
X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y.T,1)[0].round(4)
# Boolean mask of the non-null entries of Y
idx = ~np.isnan(Y)
# Flag rows with 2 to 5 non-null values, since these rows need mending
idx2 = (idx.sum(axis=1) >= 2) & (idx.sum(axis=1) <= 5)
# Run a for loop with these rows, and calculate slopes with non-null values
# (i is a positional index, so df.loc[i] assumes the default RangeIndex)
for i in np.where(idx2)[0]:
    y = Y[i][~np.isnan(Y[i])]
    x = np.arange(len(y))
    df.loc[i, 'price_trend_6M'] = np.polyfit(x,y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date price price_trend_6M
0 1 4.95 NaN
1 2 5.04 NaN
2 3 4.88 -0.0900
3 4 4.22 0.0350
4 5 5.67 0.2350
5 6 5.89 -0.0620
6 7 5.50 -0.1694
7 8 5.12 -0.1937
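If even the short fix-up loop becomes a bottleneck, the slope can also be computed without np.polyfit at all. The sketch below is not from either answer. It uses the closed-form least-squares slope, (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), with every sum taken only over the non-null entries of each row. The helper name nan_aware_slopes is hypothetical. It uses the lag number as x (like X[mask] above), which gives the same slope as np.arange(len(y)) as long as the non-null lags in a row are consecutive, as they are when the lags come from shift() on a gap-free price column.

```python
import numpy as np
import pandas as pd

def nan_aware_slopes(Y, x=None):
    """Per-row least-squares slope of Y against x, ignoring NaN entries.

    Hypothetical helper for illustration. Rows with fewer than two
    non-null points get NaN.
    """
    Y = np.asarray(Y, dtype=float)
    if x is None:
        x = np.arange(1, Y.shape[1] + 1)   # assumption: x is the lag number 1..k
    x = np.asarray(x, dtype=float)

    mask = ~np.isnan(Y)                     # True where a value is present
    m = mask.astype(float)
    Y0 = np.where(mask, Y, 0.0)             # NaNs contribute 0 to every sum

    n = m.sum(axis=1)                       # number of points per row
    sx = m @ x                              # sum of x over present points
    sxx = m @ (x * x)                       # sum of x^2 over present points
    sy = Y0.sum(axis=1)                     # sum of y
    sxy = Y0 @ x                            # sum of x*y

    with np.errstate(divide='ignore', invalid='ignore'):
        slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    slope[n < 2] = np.nan                   # a line needs at least two points
    return slope

# Usage on the example data from the question
df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6, 7, 8],
                   'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
lag_cols = [f'price_lag{i}M' for i in range(1, 7)]
for lag in range(1, 7):
    df[f'price_lag{lag}M'] = df['price'].shift(lag)

df['price_trend_6M'] = nan_aware_slopes(df[lag_cols].to_numpy()).round(4)
df = df.drop(lag_cols, axis=1)
print(df)
```

Because this works on the whole array at once and never uses the row labels, it does not depend on the dataframe index, so it should also apply when the lags are built per ID. If a row can have nulls in the middle of its lags, decide first whether x should be the lag number (as here) or the rank among the non-null values, since the two then give different slopes.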