Using Python to get an intuition for multiple regression

Posted on Fri 02 October 2015 in blog

I want to build some intuition about regression models with multiple independent variables. More precisely, I am unsure whether the relevant predictors are better uncovered by multiple regression or by pairwise analysis of each predictor against the response variable. So I'd like to use a dataset where I know the precise contribution of each predictor to the response variable.

Here I'm creating a response variable as the sum of several random variables with different variances. Then I'd like to see how well pairwise correlation analysis assesses the significance of each predictor compared to a multiple regression model.

In [12]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from scipy import stats

Create the dataset

First, initialize some random variables with progressively smaller variance.

In [19]:
p = [10*np.power(float(1)/2,i) for i in range(6)]
print 'Relative magnitude of variables x1-x6: ' + str(p)
#Set a randomization seed for reproducibility
np.random.seed(2)
d = 100
df = pd.DataFrame({'x1': np.random.randn(d)*p[0],
                   'x2': np.random.randn(d)*p[1],
                   'x3': np.random.randn(d)*p[2],
                   'x4': np.random.randn(d)*p[3],
                   'x5': np.random.randn(d)*p[4],
                   'x6': np.random.randn(d)*p[5],
                  })
Relative magnitude of variables x1-x6: [10.0, 5.0, 2.5, 1.25, 0.625, 0.3125]

Build the model

Now, sum up the variables into the response variable, Y

In [20]:
df['Y']=df.sum(axis=1)
df.head()
Out[20]:
x1 x2 x3 x4 x5 x6 Y
0 -4.167578 5.807607 -2.877507 1.088551 -0.551953 0.077143 -0.623738
1 -0.562668 1.930390 -1.066306 0.741008 0.965867 -0.458992 1.549299
2 -21.361961 -5.665666 -0.370368 -1.374122 0.091150 0.358469 -28.322498
3 16.402708 2.165463 3.753592 -0.851913 -0.250066 0.029862 21.249645
4 -17.934356 -1.520432 2.173995 0.225083 0.509504 -0.346075 -16.892280

Examine the predictor distribution

Okay, now let's look at the data. As shown below, the predictors are zero-centered distributions with progressively smaller variance.

In [15]:
#Plot overlaid histograms of x1-x6 with increasing alpha values
xList = ['x'+str(i+1) for i in range(6)]
fig = plt.figure(figsize=(5,4), dpi=1600)
ax = fig.add_subplot(111)

rollAlpha = float(1)/6
for i in xList:
    plt.hist(df[i],rwidth=1,alpha=rollAlpha,linewidth=0,bins=25,label=i)
    rollAlpha = rollAlpha + float(1)/8

fig.tight_layout()
plt.legend(loc='upper right',framealpha=0)
ax.set_xlabel('Magnitude',fontsize=16,fontweight='bold')
ax.set_ylabel('Frequency',fontsize=16,fontweight='bold')
plt.show()

Examining single predictor-response correlations

So we know that, theoretically, x1-x6 should each contribute to Y. Let's look at the scatterplots and report the correlation coefficient and the associated p-value for each predictor.

In [16]:
xList = ['x'+str(i+1) for i in range(6)]
fig = plt.figure(figsize=(10,6), dpi=1600)
fig.suptitle('Correlations of Y with single variables',fontsize=15,y=1.08,fontweight='bold')

#Function to make a single panel: scatter of one predictor (y-axis) against Y (x-axis)
def multiPlot(col):
    ax = plt.subplot2grid((nRows,nCols),(yPanel,xPanel))
    ax.scatter(df['Y'],df[col],alpha=0.2)
    ax.set_ylim(-4*p[legIdx],4*p[legIdx])
    rVal, pVal = stats.pearsonr(df['Y'],df[col])
    ax.text(y=3*p[legIdx],x=-35,s='p = '+str(np.round(pVal,3)),fontsize=12)
    ax.text(y=2.3*p[legIdx],x=-35,s='R = '+str(np.round(rVal,3)),fontsize=12)
    ax.set_xlabel('Y',fontsize=14,fontweight='bold')
    ax.set_ylabel(col,fontsize=14,fontweight='bold',rotation='horizontal')
    ax.tick_params(labelsize=14)

#Add the six panels to the multipanel figure
xPanel = 0; yPanel = 0; nRows = 2; nCols = 3; legIdx = 0
for i in xList:
    multiPlot(i)
    xPanel = xPanel+1; legIdx = legIdx+1
    if xPanel > (nCols-1):
        xPanel = 0
        yPanel = yPanel+1

fig.tight_layout()

The correlation coefficient (R) decays quickly, and the p-values agree: only x1 and x2 come out significantly correlated with Y, even though all six variables contribute to it by construction. So the pairwise approach doesn't look very sensitive.
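As a quick numeric companion to the scatterplots (a convenience sketch added here, not part of the original notebook), the same pairwise statistics can be tabulated in a short loop using the scipy.stats.pearsonr call already imported above:

#Tabulate the pairwise Pearson R and p-value of each predictor against Y
for col in ['x'+str(i) for i in range(1,7)]:
    r, pv = stats.pearsonr(df[col], df['Y'])
    print col + ': R = ' + str(np.round(r,3)) + ', p = ' + str(np.round(pv,3))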

Multiple regression model

Next, let's instead fit a multiple linear regression model and see how much of the variance it can capture. Before we do that, let's throw in some dummy predictors to see whether the linear model will ignore them.

In [17]:
df['x7'] = np.random.randn(d)*p[0]
df['x8'] = np.random.randn(d)*p[2]
df['x9'] = np.random.randn(d)*p[4]

df.head()
Out[17]:
x1 x2 x3 x4 x5 x6 Y x7 x8 x9
0 -4.167578 5.807607 -2.877507 1.088551 -0.551953 0.077143 -0.623738 6.036715 3.942225 -0.192502
1 -0.562668 1.930390 -1.066306 0.741008 0.965867 -0.458992 1.549299 -0.796135 -3.516749 -0.830572
2 -21.361961 -5.665666 -0.370368 -1.374122 0.091150 0.358469 -28.322498 -2.112727 0.632365 -1.370604
3 16.402708 2.165463 3.753592 -0.851913 -0.250066 0.029862 21.249645 -16.577347 -3.263584 0.374666
4 -17.934356 -1.520432 2.173995 0.225083 0.509504 -0.346075 -16.892280 -15.010401 -5.324673 0.360172

Now we have 6 variables (x1-x6) that contribute to Y, and 3 variables (x7-x9) that do not.

First, let's start with all the data and then see how well a feature-removal algorithm retains only the relevant variables.

In [18]:
allVar = ['x' + str(i) for i in range(1,10)]
x = df[allVar]
y = df['Y']

#Fit an ordinary least-squares model on all nine predictors
est = linear_model.LinearRegression()
est.fit(x,y)
print est.coef_
[  1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   0.00000000e+00  -1.55431223e-15
  -5.55111512e-17]
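As a further sanity check (a small sketch added here, not from the original run), the fitted intercept and R-squared can also be inspected. Because Y is constructed as a noise-free sum of x1-x6, we'd expect an intercept near zero and an R-squared of essentially 1.

#Sanity check: with a noise-free Y, the intercept should be ~0 and R^2 essentially 1
print est.intercept_
print est.score(x, y)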

Conclusion

The multiple regression model did fantastically well! It recovered the exact coefficients used in the theoretical model (a coefficient of 1 for each of x1-x6, since Y is their noise-free sum) and assigned essentially zero weight to the variables that do not contribute. Going beyond pairwise correlations is really worth it!

Some further analysis could be done using forward/backward stepwise model selection. I did this using R in another post; the same kind of analysis could be done in Python.
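As one readily available option in Python (a sketch, using scikit-learn's recursive feature elimination as a stand-in for stepwise selection rather than the exact procedure from the R post), RFE can rank the predictors by repeatedly refitting the model and dropping the weakest coefficient:

from sklearn.feature_selection import RFE

#Recursively eliminate the weakest predictors, keeping the 6 best
selector = RFE(linear_model.LinearRegression(), n_features_to_select=6)
selector.fit(x, y)
print zip(allVar, selector.support_, selector.ranking_)

With this dataset, the contributing variables x1-x6 should be retained and the dummy predictors x7-x9 eliminated, mirroring the coefficient pattern above.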