Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
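
The random draws above are not seeded, so re-running this cell produces slightly different numbers. For a reproducible run, one could seed NumPy's global RNG before sampling (an addition here; the original run is unseeded):

np.random.seed(12345)  # illustrative seed; any fixed integer makes the draws repeatable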

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1072.
Date:                Wed, 27 Jun 2018   Prob (F-statistic):           1.48e-42
Time:                        17:47:00   Log-Likelihood:                 5.9082
No. Observations:                  50   AIC:                            -3.816
Df Residuals:                      46   BIC:                             3.832
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1103      0.076     66.887      0.000       4.957       5.264
x1             0.4840      0.012     41.074      0.000       0.460       0.508
x2             0.4892      0.046     10.561      0.000       0.396       0.582
x3            -0.0191      0.001    -18.439      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        1.462   Durbin-Watson:                   2.727
Prob(Omnibus):                  0.482   Jarque-Bera (JB):                1.080
Skew:                          -0.048   Prob(JB):                        0.583
Kurtosis:                       2.286   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
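
The fitted coefficients are close to the true values beta = [5., 0.5, 0.5, -0.02] used to generate the data. They are also available programmatically on the results object:

print(olsres.params)  # estimated coefficients, as shown in the table above
print(olsres.bse)     # their standard errors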

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.63339811  5.09979303  5.5279304   5.89114997  6.17241305  6.36710189
  6.48377837  6.54277733  6.57286556  6.60651557  6.67457036  6.80117611
  6.99981492  7.27108982  7.60262566  7.97110233  8.34608615  8.69503008
  8.98862108  9.2055962   9.33623594  9.38396124  9.36477134  9.30461521
  9.23512817  9.18843433  9.19186866  9.26348593  9.40909441  9.62130152
  9.88072846 10.15919407 10.42434493 10.64497091 10.79613293 10.86326037
 10.84454499 10.75123801 10.60580155 10.43821821 10.28106498 10.16416094
 10.10966808 10.12845013 10.21828769 10.36424287 10.54111356 10.71757403
 10.86132169 10.94438428]
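
Calling predict with the original design matrix simply reproduces the in-sample fit, which is also stored on the results object as fittedvalues:

print(np.allclose(ypred, olsres.fittedvalues))  # True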

Create a new sample of explanatory variables Xnew, predict out of sample, and plot the results

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.93650115 10.79968984 10.5531344  10.24055283  9.91949344  9.64724495
  9.46681027  9.39637786  9.42486856  9.51464816]
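
The point predictions above carry no measure of uncertainty. Recent statsmodels versions also provide get_prediction, which returns standard errors and intervals for out-of-sample predictions; a minimal sketch:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)
# columns include mean, mean_se, mean_ci_lower/upper (confidence interval for
# the mean) and obs_ci_lower/upper (prediction interval for new observations)
print(frame)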

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform, i.e., we don't want any expansion magic from using **2; the expression inside I() is evaluated as plain Python arithmetic.
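
Under the hood the formula is handled by patsy, the formula engine used by statsmodels. A small standalone illustration of I() using patsy directly:

from patsy import dmatrix

# I(...) protects the expression from formula operators: (x1 - 5)**2 is
# computed element-wise, and dmatrix adds the intercept column.
dm = dmatrix("I((x1 - 5)**2)", {"x1": x1})
print(np.asarray(dm)[:3])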

In [8]:
res.params
Out[8]:
Intercept           5.110296
x1                  0.483978
np.sin(x1)          0.489187
I((x1 - 5) ** 2)   -0.019076
dtype: float64

Now we only have to pass in the single variable x1, and the transformed right-hand-side variables are constructed for us automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.936501
1    10.799690
2    10.553134
3    10.240553
4     9.919493
5     9.647245
6     9.466810
7     9.396378
8     9.424869
9     9.514648
dtype: float64
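
The same mechanism works for any new data: only x1 needs to be supplied, and the sine and polynomial terms are rebuilt from the recorded formula. For example, predicting at a single (illustrative) point:

res.predict(exog=dict(x1=[21.0]))  # returns a length-1 Series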