Variable selection for multiple regression models

Posted on Tue 08 September 2015 in blog

Here, I want to look at using R to perform variable selection for a linear model. Let's consider forward and reverse selection, statistical techniques to keep only variables that maximize the variance explained. The dataset I'm using is the Boston housing price dataset from the MASS library.

Note, that there are some drawbacks/limitations to consider when using variable selection: http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/

In [2]:
# Load dependencies
library(MASS)
attach(Boston)
# Define predictors
nm = names(Boston)
nm = nm[nm!='medv']

First, let's perform a set of linear models to see which predictor is the most significant

In [3]:
statsum<-data.frame(pred=rep(NA,length(nm)),R2=numeric(length(nm)))
for (i in 1:length(nm))
{
  fit   <- lm(medv ~ eval(parse(text=nm[i])))
  statsum$pred[i]=nm[i]
  statsum$R2[i]=summary(fit)$r.squared
}

highR2 = statsum$pred[statsum$R2==max(statsum$R2)]
print(
  paste(
    'The best predictor is',
    statsum$pred[statsum$R2==max(statsum$R2)],
    'with and R2 value of',
    round(statsum$R2[statsum$R2==max(statsum$R2)],2)
    )
  )
[1] "The best predictor is lstat with and R2 value of 0.54"

Okay, looks like lstat is the most predictive. So, we could iteratively add each significant variable until we explain the most variance. To reduce potential over-fitting, an impovement on this would be to penalize models for degrees for excess degrees of freedom. The Aikaike information criteria (AIC) metric accomplishes this, and the model with the lowest AIC should be kept.

R actually has functionality to do forward selection, as described, and also reverse selection. http://stackoverflow.com/questions/22913774/forward-stepwise-regression

First, let's start with forward selection

In [4]:
#Define the minimum model as having only a constant term and no predictors
min.model=lm(medv ~ 1)
#Define the maximum model as containing all the predictors
max.model=lm(medv~.,data=Boston)

fwd.model=step(object=min.model,scope=formula(max.model),direction='forward')
Start:  AIC=2246.51
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1851.0
+ rm       1   20654.4 22062 1914.2
+ ptratio  1   11014.3 31702 2097.6
+ indus    1    9995.2 32721 2113.6
+ tax      1    9377.3 33339 2123.1
+ nox      1    7800.1 34916 2146.5
+ crim     1    6440.8 36276 2165.8
+ rad      1    6221.1 36495 2168.9
+ age      1    6069.8 36647 2171.0
+ zn       1    5549.7 37167 2178.1
+ black    1    4749.9 37966 2188.9
+ dis      1    2668.2 40048 2215.9
+ chas     1    1312.1 41404 2232.7
<none>                 42716 2246.5

Step:  AIC=1851.01
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1735.6
+ ptratio  1    2670.1 16802 1778.4
+ chas     1     786.3 18686 1832.2
+ dis      1     772.4 18700 1832.5
+ age      1     304.3 19168 1845.0
+ tax      1     274.4 19198 1845.8
+ black    1     198.3 19274 1847.8
+ zn       1     160.3 19312 1848.8
+ crim     1     146.9 19325 1849.2
+ indus    1      98.7 19374 1850.4
<none>                 19472 1851.0
+ rad      1      25.1 19447 1852.4
+ nox      1       4.8 19468 1852.9

Step:  AIC=1735.58
medv ~ lstat + rm

          Df Sum of Sq   RSS    AIC
+ ptratio  1   1711.32 13728 1678.1
+ chas     1    548.53 14891 1719.3
+ black    1    512.31 14927 1720.5
+ tax      1    425.16 15014 1723.5
+ dis      1    351.15 15088 1725.9
+ crim     1    311.42 15128 1727.3
+ rad      1    180.45 15259 1731.6
+ indus    1     61.09 15378 1735.6
<none>                 15439 1735.6
+ zn       1     56.56 15383 1735.7
+ age      1     20.18 15419 1736.9
+ nox      1     14.90 15424 1737.1

Step:  AIC=1678.13
medv ~ lstat + rm + ptratio

        Df Sum of Sq   RSS    AIC
+ dis    1    499.08 13229 1661.4
+ black  1    389.68 13338 1665.6
+ chas   1    377.96 13350 1666.0
+ crim   1    122.52 13606 1675.6
+ age    1     66.24 13662 1677.7
<none>               13728 1678.1
+ tax    1     44.36 13684 1678.5
+ nox    1     24.81 13703 1679.2
+ zn     1     14.96 13713 1679.6
+ rad    1      6.07 13722 1679.9
+ indus  1      0.83 13727 1680.1

Step:  AIC=1661.39
medv ~ lstat + rm + ptratio + dis

        Df Sum of Sq   RSS    AIC
+ nox    1    759.56 12469 1633.5
+ black  1    502.64 12726 1643.8
+ chas   1    267.43 12962 1653.1
+ indus  1    242.65 12986 1654.0
+ tax    1    240.34 12989 1654.1
+ crim   1    233.54 12995 1654.4
+ zn     1    144.81 13084 1657.8
+ age    1     61.36 13168 1661.0
<none>               13229 1661.4
+ rad    1     22.40 13206 1662.5

Step:  AIC=1633.47
medv ~ lstat + rm + ptratio + dis + nox

        Df Sum of Sq   RSS    AIC
+ chas   1    328.27 12141 1622.0
+ black  1    311.83 12158 1622.7
+ zn     1    151.71 12318 1629.3
+ crim   1    141.43 12328 1629.7
+ rad    1     53.48 12416 1633.3
<none>               12469 1633.5
+ indus  1     17.10 12452 1634.8
+ tax    1     10.50 12459 1635.0
+ age    1      0.25 12469 1635.5

Step:  AIC=1621.97
medv ~ lstat + rm + ptratio + dis + nox + chas

        Df Sum of Sq   RSS    AIC
+ black  1   272.837 11868 1612.5
+ zn     1   164.406 11977 1617.1
+ crim   1   116.330 12025 1619.1
+ rad    1    58.556 12082 1621.5
<none>               12141 1622.0
+ indus  1    26.274 12115 1622.9
+ tax    1     4.187 12137 1623.8
+ age    1     2.331 12139 1623.9

Step:  AIC=1612.47
medv ~ lstat + rm + ptratio + dis + nox + chas + black

        Df Sum of Sq   RSS    AIC
+ zn     1   189.936 11678 1606.3
+ rad    1   144.320 11724 1608.3
+ crim   1    55.633 11813 1612.1
<none>               11868 1612.5
+ indus  1    15.584 11853 1613.8
+ age    1     9.446 11859 1614.1
+ tax    1     2.703 11866 1614.4

Step:  AIC=1606.31
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn

        Df Sum of Sq   RSS    AIC
+ crim   1    94.712 11584 1604.2
+ rad    1    93.614 11585 1604.2
<none>               11678 1606.3
+ indus  1    16.048 11662 1607.6
+ tax    1     3.952 11674 1608.1
+ age    1     1.491 11677 1608.2

Step:  AIC=1604.19
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim

        Df Sum of Sq   RSS    AIC
+ rad    1   228.604 11355 1596.1
<none>               11584 1604.2
+ indus  1    15.773 11568 1605.5
+ age    1     2.470 11581 1606.1
+ tax    1     1.305 11582 1606.1

Step:  AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad

        Df Sum of Sq   RSS    AIC
+ tax    1   273.619 11081 1585.8
<none>               11355 1596.1
+ indus  1    33.894 11321 1596.6
+ age    1     0.096 11355 1598.1

Step:  AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad + tax

        Df Sum of Sq   RSS    AIC
<none>               11081 1585.8
+ indus  1   2.51754 11079 1587.7
+ age    1   0.06271 11081 1587.8

So, if we look above, we can see the model iterate through and progressively add more terms to make the linear model more predictive. Notice that the AIC is decreasing at each step as a new variable is added to the model, until the final step, where any of the remaining variables give higher AIC than the previous model.

In [9]:
rev.model=step(object=max.model,scope=formula(min.model),direction='backward')
Start:  AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1587.7
- indus    1      2.52 11081 1587.8
<none>                 11079 1589.6
- chas     1    218.97 11298 1597.5
- tax      1    242.26 11321 1598.6
- crim     1    243.22 11322 1598.6
- zn       1    257.49 11336 1599.3
- black    1    270.63 11349 1599.8
- rad      1    479.15 11558 1609.1
- nox      1    487.16 11566 1609.4
- ptratio  1   1194.23 12273 1639.4
- dis      1   1232.41 12311 1641.0
- rm       1   1871.32 12950 1666.6
- lstat    1   2410.84 13490 1687.3

Step:  AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1585.8
<none>                 11079 1587.7
- chas     1    219.91 11299 1595.6
- tax      1    242.24 11321 1596.6
- crim     1    243.20 11322 1596.6
- zn       1    260.32 11339 1597.4
- black    1    272.26 11351 1597.9
- rad      1    481.09 11560 1607.2
- nox      1    520.87 11600 1608.9
- ptratio  1   1200.23 12279 1637.7
- dis      1   1352.26 12431 1643.9
- rm       1   1959.55 13038 1668.0
- lstat    1   2718.88 13798 1696.7

Step:  AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0

Doing reverse selection operates on the same principle and gives the same result, in this case. So that is how variable selection works in R.

One more thing I'd like to try later is to test the predictive power of each of these models by splitting the dataset into training- and test sets.