Variable selection for multiple regression models
Posted on Tue 08 September 2015 in blog
Here, I want to look at using R to perform variable selection for a linear model. Let's consider forward and reverse (backward) selection, stepwise techniques that retain only the variables that add the most explained variance. The dataset I'm using is the Boston housing price dataset from the MASS library.
Note that there are some drawbacks/limitations to consider when using stepwise variable selection: http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/
# Load dependencies
library(MASS)
attach(Boston)
# Define predictors: every column except the response, medv
nm <- names(Boston)
nm <- nm[nm != 'medv']
First, let's fit a simple linear model for each predictor to see which one is the most significant on its own.
statsum <- data.frame(pred = rep(NA, length(nm)), R2 = numeric(length(nm)))
for (i in 1:length(nm))
{
  # Build the one-variable formula programmatically, e.g. medv ~ lstat
  fit <- lm(reformulate(nm[i], response = 'medv'), data = Boston)
  statsum$pred[i] <- nm[i]
  statsum$R2[i] <- summary(fit)$r.squared
}
highR2 <- statsum$pred[statsum$R2 == max(statsum$R2)]
print(
  paste(
    'The best predictor is',
    highR2,
    'with an R2 value of',
    round(max(statsum$R2), 2)
  )
)
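For context, it's also worth looking at the full ranking rather than just the top hit (plain base R, nothing beyond what we've already computed):
# Rank all predictors by their single-variable R2, best first
statsum[order(-statsum$R2), ]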
Okay, looks like lstat is the most predictive. So, we could iteratively add each significant variable until we explain the most variance. To reduce potential over-fitting, an improvement on this would be to penalize models for excess degrees of freedom. The Akaike information criterion (AIC) accomplishes this, and the model with the lowest AIC should be kept.
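For a linear model, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L is the maximized likelihood. As a quick sanity check (just a sketch using base R's logLik() and AIC()), we can compute it by hand for the lstat model:
# AIC = 2k - 2*log-likelihood; for lm, k counts the coefficients
# plus the estimated residual variance
fit <- lm(medv ~ lstat, data = Boston)
k <- length(coef(fit)) + 1
2 * k - 2 * as.numeric(logLik(fit))  # should match AIC(fit)
AIC(fit)
One caveat: step() computes AIC only up to an additive constant, so the values it prints below won't match AIC() exactly, though the rankings agree.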
R actually has built-in functionality to do forward selection, as described, and also reverse selection: http://stackoverflow.com/questions/22913774/forward-stepwise-regression
First, let's start with forward selection.
# Define the minimum model as having only a constant term and no predictors
min.model <- lm(medv ~ 1, data = Boston)
# Define the maximum model as containing all the predictors
max.model <- lm(medv ~ ., data = Boston)
# Step forward from the minimum model, adding whichever variable lowers AIC most
fwd.model <- step(object = min.model, scope = formula(max.model), direction = 'forward')
So, if we look above, we can see the procedure iterate through and progressively add terms to make the linear model more predictive. Notice that the AIC decreases at each step as a new variable is added, until the final step, where adding any of the remaining variables would give a higher AIC than the current model.
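Rather than scrolling through the printed trace, the AIC trajectory can also be pulled from the anova component that step() attaches to the returned model:
# Step-by-step record of the search: the term added and the resulting AIC
fwd.model$anova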
# Backward (reverse) selection: start from the full model and drop terms
rev.model <- step(object = max.model, scope = list(lower = formula(min.model)), direction = 'backward')
Reverse selection operates on the same principle and, in this case, gives the same result. So that is how variable selection works in R.
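We can confirm the two searches agree by comparing the final formulas:
# Selected model from each direction; these match for the Boston data
formula(fwd.model)
formula(rev.model)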
One more thing I'd like to try later is to test the predictive power of each of these models by splitting the dataset into training and test sets.
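As a rough sketch of what that could look like (the 80/20 split below is an arbitrary choice), we could refit the forward-selected formula on a training subset and score it on the holdout:
# Hold out 20% of the rows, refit the selected model on the rest,
# and score it on the held-out data
set.seed(1)
train_idx <- sample(seq_len(nrow(Boston)), size = floor(0.8 * nrow(Boston)))
fit <- lm(formula(fwd.model), data = Boston[train_idx, ])
pred <- predict(fit, newdata = Boston[-train_idx, ])
mean((Boston$medv[-train_idx] - pred)^2)  # test-set mean squared error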