Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)  # prepend a column of ones for the intercept
beta = [5.0, 0.5, 0.5, -0.02]  # true coefficients: const, x1, sin(x1), (x1 - 5)**2
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
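
Note that the noise draw above is unseeded, so the exact numbers in the outputs below will differ from run to run; calling np.random.seed with any fixed value (e.g., np.random.seed(12345)) before the draw would make them reproducible.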

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.989
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     1318.
Date:                Tue, 04 Feb 2020   Prob (F-statistic):           1.35e-44
Time:                        00:34:03   Log-Likelihood:                 9.5783
No. Observations:                  50   AIC:                            -11.16
Df Residuals:                      46   BIC:                            -3.508
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9467      0.071     69.677      0.000       4.804       5.090
x1             0.4973      0.011     45.415      0.000       0.475       0.519
x2             0.4838      0.043     11.241      0.000       0.397       0.570
x3            -0.0195      0.001    -20.261      0.000      -0.021      -0.018
==============================================================================
Omnibus:                        0.129   Durbin-Watson:                   2.227
Prob(Omnibus):                  0.937   Jarque-Bera (JB):                0.103
Skew:                           0.092   Prob(JB):                        0.950
Kurtosis:                       2.875   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
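
The quantities in this table are also available directly as attributes of the results object, for example:

In [ ]:
print(olsres.params)    # estimated coefficients
print(olsres.bse)       # their standard errors
print(olsres.rsquared)  # R-squared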

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.45976769  4.93103551  5.36426047  5.73307379  6.02062305  6.22234099
  6.34669588  6.41380025  6.45210642  6.49373185  6.5691824   6.70234023
  6.90654008  7.18237838  7.51761526  7.88918556  8.26698832  8.61883231
  8.91572498  9.13663579  9.27195154  9.32505575  9.31177275  9.25776783
  9.19433047  9.15323355  9.16151323  9.23702689  9.38551929  9.59967869
  9.86033814 10.13962457 10.40553825 10.62721026 10.77997418 10.84941866
 10.83375525 10.74411214 10.60270591 10.43919136 10.28578934 10.17199294
 10.11972185 10.13972105 10.22979539 10.3751703  10.55091918 10.72605877
 10.86863962 10.95099529]
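
For the training data, these predictions coincide with the fitted values stored on the results object, which gives a quick sanity check:

In [ ]:
np.allclose(ypred, olsres.fittedvalues)  # predict(X) on the estimation sample equals fittedvalues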

Create a new sample of explanatory variables Xnew, predict and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.94316736 10.80751529 10.56301339 10.25290175  9.93409951  9.66326909
  9.48294327  9.41111166  9.43681624  9.52283421]
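
predict returns point predictions only. To quantify their uncertainty, recent statsmodels versions provide get_prediction, whose summary_frame contains confidence intervals for the mean as well as prediction intervals for new observations:

In [ ]:
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # columns include mean, mean_ci_*, obs_ci_*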

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any expansion magic from using **2, which the formula language would otherwise treat as an operator on terms rather than as arithmetic.
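
One way to see the difference is to build the design matrices directly with patsy, the formula engine that statsmodels uses. Without I(), the ** is interpreted as a formula operator acting on terms, so the squaring silently disappears (exact column labels may vary across patsy versions):

In [ ]:
import patsy
# ** as a formula operator: x1**2 reduces to just x1, typically ['Intercept', 'x1']
print(patsy.dmatrix("x1**2", data, return_type='dataframe').columns.tolist())
# I(...) evaluates the enclosed Python expression, so the column really is squared
print(patsy.dmatrix("I(x1**2)", data, return_type='dataframe').columns.tolist())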

In [9]:
res.params
Out[9]:
Intercept           4.946724
x1                  0.497258
np.sin(x1)          0.483839
I((x1 - 5) ** 2)   -0.019478
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.943167
1    10.807515
2    10.563013
3    10.252902
4     9.934100
5     9.663269
6     9.482943
7     9.411112
8     9.436816
9     9.522834
dtype: float64
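
The same interval machinery works with the formula interface; exog is passed through the formula's design information automatically (again assuming a statsmodels version with get_prediction):

In [ ]:
pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))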