Statistical learning on NBA shot data

Posted on Sun 11 October 2015 in blog

In the last post, I pulled some NBA shot data for Andrew Wiggins and put that into a dataframe. Here, we will apply some supervised learning techniques from sklearn to build predictive models and then use visualizations to better understand the data.

Some topics we'll explore are prediction error, regularization, and the tradeoff between prediction accuracy and model interpretability.

Getting started

Load the dependencies

In [173]:
%matplotlib inline
# For data analysis
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import cm

# For learning algorithms
from sklearn.linear_model import SGDClassifier, RidgeClassifier, LogisticRegression
from sklearn import cross_validation
from sklearn.preprocessing import scale, StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

Load the data

In [174]:
df = pd.read_csv('../../dataSandbox/forPelican/nbaWiggins.csv')
allShots = df[np.invert(np.isnan(df.SHOT_CLOCK))]
cMin,cSec = allShots.loc[:,'GAME_CLOCK'].str.split(':').str.get(0),allShots.loc[:,'GAME_CLOCK'].str.split(':').str.get(1)
allShots.loc[:,'secGameClock'] = cMin.astype('int') * 60 + cSec.astype('int')

Dummify variables

In [175]:
pd.get_dummies(allShots[['W','LOCATION','FGM']])
allShots.loc[:,'gameWon'] = pd.get_dummies(allShots['W'])['W']
allShots.loc[:,'homeGame'] = pd.get_dummies(allShots['LOCATION'])['H']
allShots.loc[:,'is3Pointer'] = pd.get_dummies(allShots['PTS_TYPE'])[3]
q = pd.get_dummies(allShots['PERIOD'])[[2,3,4,5]]
q.columns = ['q' + str(i) for i in q.columns]
allShots = pd.concat([allShots,q],axis=1)

Clean up

In [176]:
dropThese = ['GAME_ID','MATCHUP','CLOSEST_DEFENDER','CLOSEST_DEFENDER_PLAYER_ID','GAME_CLOCK','SHOT_RESULT','LOCATION','W','PERIOD','PTS_TYPE','PTS']
[ allShots.drop(i, axis=1, inplace=True) for i in dropThese ] 
Out[176]:
[None, None, None, None, None, None, None, None, None, None, None]

Split the data into a training and test set

In [177]:
trainShots = allShots[1::2] #Odd rows
testShots = allShots[::2]   #Even rows

Pluribus machina

Now, we will build a series of function to test machine learning algorithms head-to-head. Since these are all implemented in sklearn, it is feasible to create a generalized function to run each learning algorithm.

In [178]:
# Generalized function to implement sklearn learning algorithms

def runLearning(myData,myMod,trainIdx,testIdx,respVar,scaleIt=False,printMe=False): 
    predCol = [col for col in myData.columns if col != respVar] # Get predictor variables
    if scaleIt == True:
        myData = pd.DataFrame(scale(myData), index=myData.index, columns=myData.columns)
    trainSet = eval('myData' + trainIdx).astype(float)
    testSet = eval('myData' + testIdx).astype(float)
    myEst = eval(myMod)
    trainResp = trainSet[respVar].astype(float)
    trainResp.ix[trainResp > 0] = 1    # Need to binarize data or it will break
    trainResp.ix[trainResp <= 0] = 0
    # Train the model
    myEst.fit(
        trainSet[predCol],
        trainResp
    )
    # Test the model
    testResp = testSet[respVar].astype(float)
    testResp.ix[testResp > 0] = 1    # Need to binarize data or it will break
    testResp.ix[testResp <= 0] = 0
    matches = np.count_nonzero(myEst.predict(testSet[predCol]) == testResp)
    testLen = testSet.shape[0]
    if(printMe==True):
        print 'Algorithm used:\n' + myMod + '    scaled=' + str(scaleIt)
        print "Correctly classified:\n" + str(round(float(matches) / testLen * 100,2)) + '%\n\n'
    else:
        return round(float(matches) / testLen * 100,2)

Let's try it out for a couple algorithms

In [210]:
runLearning(myData=allShots,myMod='RandomForestClassifier(n_estimators=100)',trainIdx='[1::2]',testIdx='[::2]',respVar='FGM',scaleIt=False,printMe=True) 
runLearning(myData=allShots,myMod='BaggingClassifier()',trainIdx='[1::2]',testIdx='[::2]',respVar='FGM',scaleIt=False,printMe=True) 
Algorithm used:
RandomForestClassifier(n_estimators=100)    scaled=False
Correctly classified:
60.44%


Algorithm used:
BaggingClassifier()    scaled=False
Correctly classified:
59.35%


Now, calculate the accuracy for each algorithm

In [180]:
dfPred = pd.DataFrame(columns=['algorithm','accuracy','scaled'])

algs = ['RandomForestClassifier(n_estimators=100)','BaggingClassifier()','SGDClassifier(alpha=2)','RidgeClassifier(alpha=2)','SVC()','LogisticRegression()']
myScale = ['False','False','False','False','True','False']

for i in range(len(algs)):
    # Calculate prediction algorithm for scaled/non-scaled data in each algorithm
    temp1 = runLearning(myData=allShots,myMod=algs[i],trainIdx='[1::2]',testIdx='[::2]',respVar='FGM',scaleIt=myScale[i])
    algName = algs[i]
    algName = algName.split('(')[0]
    dfPred = dfPred.append({'algorithm': algName, 'accuracy': temp1, 'scaled': myScale[i]}, ignore_index = True)
    
#dfPred
Out[180]:
algorithm accuracy scaled
0 RandomForestClassifier 60.62 False
1 BaggingClassifier 60.98 False
2 SGDClassifier 54.81 False
3 RidgeClassifier 60.62 False
4 SVC 54.63 True
5 LogisticRegression 60.25 False

A first pass

Out of the box, it looks like the random forest classifier does the best at predicting whether a shot was made/missed. But consider that random selection would give us an expected accuracy of 0.5, so keep in mind these predictions are far from perfect.

A brute-force feel for noise

We have the accuracy, but how stable are these predictions? Let's re-run the analyses to get confidence intervals on our accuracy metrics

In [181]:
# Run each learning algorithm 200 times and store in data frame

dfPred = pd.DataFrame(columns=['algorithm','accuracy','scaled'])

algs = ['RandomForestClassifier(n_estimators=100)','BaggingClassifier()','SGDClassifier(alpha=2)','RidgeClassifier(alpha=2)','SVC()','LogisticRegression()']
myScale = ['False','False','False','False','True','False']

# Calculate prediction accuracy for each algorithm
for i in range(len(algs)):
    for j in range(200):    # Sample 200 times to get 95%CI
        temp1 = runLearning(myData=allShots,myMod=algs[i],trainIdx='[1::2]',testIdx='[::2]',respVar='FGM',scaleIt=myScale[i])
        algName = algs[i]
        algName = algName.split('(')[0]
        dfPred = dfPred.append({'algorithm': algName, 'accuracy': temp1, 'scaled': myScale[i]}, ignore_index = True)
    
dfPred.head()
Out[181]:
algorithm accuracy scaled
0 RandomForestClassifier 63.52 False
1 RandomForestClassifier 62.25 False
2 RandomForestClassifier 64.43 False
3 RandomForestClassifier 62.61 False
4 RandomForestClassifier 63.34 False

Now some functions to calculate statistics for each learning algorithm

In [211]:
def get95CI(df,alg):
    temp = df.loc[dfPred.algorithm==alg]
    temp = temp.sort_values(by='accuracy').reset_index(drop=True)
    temp
    mean = np.mean(temp.accuracy)  # mean
    low95 = mean - temp.ix[4,'accuracy']   # low95
    hi95 = temp.ix[194,'accuracy'] - mean # high95
    return mean, low95, hi95

def getAlgStatSummary(df, alg):
    algStats = get95CI(df, alg)
    df = pd.DataFrame(columns=['algorithm', 'mean', 'low95', 'high95'])
    df = df.append({'algorithm': alg, 'mean': algStats[0], 'low95': algStats[1], 'high95': algStats[2]}, ignore_index = True)
    return df
In [212]:
dfStats = pd.DataFrame(columns=['algorithm', 'mean', 'low95', 'high95'])
for i in  list(set(dfPred.algorithm)):
    dfStats = dfStats.append(getAlgStatSummary(dfPred,i),ignore_index=True)
dfStats = dfStats.sort_values('mean')
dfStats
Out[212]:
algorithm mean low95 high95
4 SGDClassifier 52.59725 7.047250e+00 7.472750e+00
3 SVC 54.63000 -1.421085e-13 1.421085e-13
2 BaggingClassifier 59.79290 3.352900e+00 2.457100e+00
0 LogisticRegression 60.25000 0.000000e+00 0.000000e+00
5 RidgeClassifier 60.62000 1.918465e-13 -1.918465e-13
1 RandomForestClassifier 62.54060 2.100600e+00 2.069400e+00

Now plot the accuracy metrics with 95% confidence intervals

In [186]:
fig = plt.figure(figsize=(3,5), dpi=1600)
ax = plt.subplot(111)

barPlot = plt.bar(range(6),dfStats.loc[:,'mean'],yerr=[list(dfStats.loc[:,'low95']),list(dfStats.loc[:,'high95'])],alpha=0.5)

xticks_pos = [0.65*patch.get_width() + patch.get_xy()[0] for patch in barPlot]
plt.xticks([i+0.5 for i in range(6)],dfStats.loc[:,'algorithm'],rotation=45,ha='right')
ax.set_ylabel('Prediction accuracy',fontsize=16,fontweight='bold')
plt.show()

The interpretability/accuracy trade-off

There you have it. We now have a sense of the accuracy and error of the different learning algorithms (prior to tweaking/optimization).

While the Random Forest Classifier is best if you want to predict whether a shot would go in or not, it is a black box. That is, you don't know what each individual predictor is doing in your model.

By contrast, you can choose to use a logistic regression instead. While this algorithm has a reduced prediction accuracy, you can look at the predictor strength to assertain its relative importance.

To make this more concrete, if you want to know if Wiggins' shot will go in, you should use the high accuracy, black box approach (Random Forest). But if you are playing against Wiggins and want to know what you should do to make him miss, you should use the interpretable approach (logistic regression).

Overfitting? We don't need no stinking overfitting!

All predictors are not created equal. If we use too many predictors, our model will be too complex and we will overfit the data. As a result, we lose prediction accuracy and miss out on some of the underlying relationships.

To account for overfitting, we will utilize regularization, which serves to penalize the model for complexity. This subsequently reduces the contribution of unimportant variables.

Luckily, regularization is built into the logistic regression function in sklearn, we just need to tune coeffecient value.

In [187]:
#Function to find the best C value for regularization

def findC(cVal):
    tempData = allShots
    trainSet = eval('tempData' + '[1::2]').astype(float)
    testSet = eval('tempData' + '[::2]').astype(float)
    myEst = LogisticRegression(C=cVal)
    predCol = [col for col in allShots.columns if col != 'FGM']
    predVar = trainSet['FGM'].astype(float)
    predVar.ix[predVar>0]=1
    predVar.ix[predVar<=0]=0
    # Train the model
    myEst.fit(
        X=trainSet[predCol],
        y=predVar
    )
    # Test the model
    testVar = testSet['FGM'].astype(float)
    testVar.ix[testVar>0]=1
    testVar.ix[testVar<=0]=0
    matches = np.count_nonzero(myEst.predict(testSet[predCol]) == testVar)
    return matches/float(len(testSet))*100
In [189]:
fig = plt.figure(figsize=(3,4), dpi=1600)
ax = plt.subplot(111)

cList=[]
for i in [np.power(10,float(i)) for i in np.arange(0,-9,-1)]:
    cList.append(findC(i))
    
cList
plt.plot([np.power(10,float(i)) for i in np.arange(0,-9,-1)],cList)
ax.set(xscale="log")
ax.set_ylim([0,70])
plt.plot((0.001, 0.001), (0, 70), 'g--')
ax.set_xlabel('"C" regulxn coef (small=stringent)',fontsize=16,fontweight='bold')
ax.set_ylabel('Prediction accuracy',fontsize=16,fontweight='bold')
plt.show()

Just right...

By optimizing the regularization coefficient (using green dashed line), we were able to increase the prediction accuracy by ~8%.

Now, let's take a look at how each predictor contributes to whether a shot is made or missed.

In [213]:
# Divide the data into training/test set and then perform the logistic regression

trainSet = eval('allShots' + '[1::2]').astype(float)
testSet = eval('allShots' + '[::2]').astype(float)

myEst = LogisticRegression(C=0.001)
predCol = [col for col in allShots.columns if col != 'FGM']
predVar = trainSet['FGM'].astype(float)
predVar.ix[predVar>0]=1
predVar.ix[predVar<=0]=0
# Train the model
myEst.fit(
    X=trainSet[predCol],
    y=predVar
)

# Test the model
testVar = testSet['FGM'].astype(float)
testVar.ix[testVar>0]=1
testVar.ix[testVar<=0]=0
In [216]:
# Plot coefficients for logistic regression (logR)

logR = pd.DataFrame(myEst.coef_)
logR.columns=predCol

fig = plt.figure(figsize=(8,5), dpi=1600)
ax = plt.subplot(111)

logR.ix[0].argsort()
newCol = logR.columns[logR.ix[0].argsort()]
logR = logR[newCol]

sns.set_style("dark")
plt.bar(range(len(predCol)),logR.iloc[0],alpha=0.8,width=1)

plt.xticks([i+0.5 for i in range(len(predCol))],logR.columns,rotation=45,ha='right',fontsize=12)
plt.yticks(fontsize=12)
ax.set_ylabel('Coefficient values',fontsize=18,fontweight='bold')
plt.plot((0, 15), (0, 0), 'k-')
plt.show()

Zeroing in

So how to read the graph? Positive values are correlated with made shots and negative values are correlated with missed shots. The magnitude indicates the strength of the correlation.

Reading this graph we see that high shot distance (SHOT_DIST) has a large negative value and therefore predicts a miss. By contrast, the closest defender distance (CLOSE_DEF_DIST) is a large positive value, indicating that far defenders are associated with made shots. Finally, the final game score margin (FINAL_MARGIN) is essentially zero, and we can use this as a negative control.

But these are algorithms and coefficients. I want to see the data! Let's use these coefficients to guide our data exploration.

In [207]:
# Function to compare kernel denisty distributions of made/missed shots
def compareDist(myVar,myTitle,df=allShots,myHist=False):
    made = df.loc[df['FGM'] == True]
    missed = df.loc[df['FGM']  == False]
    sns.distplot(made[myVar],hist=myHist,label="made shots")
    sns.distplot(missed[myVar],hist=myHist,label="missed shots")
    plt.legend();
    plt.title(myTitle,fontsize=16)

Made-missed denisty analysis

First, let's take a look at the data density for the made missed shots. To ease direct comparison, I'll overlay the kernel density estimates for both made and missed shots.

In [223]:
fig = plt.figure(figsize=(9,3.5), dpi=1600)
#plt.subplot(211)
plt.subplot(1,3,1)
compareDist('FINAL_MARGIN','Negative control')
plt.subplot(1,3,2)
compareDist('SHOT_DIST','Long shots predict miss')
plt.subplot(1,3,3)
#compareDist('SHOT_CLOCK','Hi shot clock pred. make')
#plt.subplot(3,1,3)
compareDist('CLOSE_DEF_DIST','Dfnd dist inconclusive??',myHist=False)

#compareDist('SHOT_DIST')
plt.tight_layout()

Reading the data

In the negative control (left), the curves nearly overlay, which we should expect.

In the middle graph, we can see, by looking at the differences between the green/blue curves, that for longer shots we would expect a higher amount of misses (the green curve has more density); and vice versa for close shots.

On the far right graph, it looks like the data is basically overlaid, like the negative control. Maybe we need another way of looking at the data, since it doesn't look so nicely bimodal like the middle graph.

Let's look instead at shooting percentage as a function of binned defender distance

In [255]:
# Function to display kernel density estimates of made/missed shots

def binShotPerPlot(myVar,stepSize):

    fig = plt.figure(figsize=(10,4), dpi=1600)
    maxVal = int(np.ceil(max(allShots[myVar])))
    dfDefDist = pd.DataFrame(columns=['madePerc'])

    myBins = list(np.arange(0,float(maxVal)-stepSize,stepSize))
    for i in myBins:
        madePerc = calcShootPercDist(i,myVar,stepSize=stepSize)
        dfDefDist = dfDefDist.append({'madePerc': madePerc},ignore_index=True)
    
    histHeights, histBins, c = plt.hist(np.array(allShots[myVar]),normed=True,bins=len(myBins))
    dfDefDist['density'] = histHeights

    myCols = cm.RdBu(dfDefDist['density']/max(dfDefDist['density']))

    ax = plt.subplot(111)
    cax = ax.imshow(dfDefDist[['density']], interpolation='nearest', cmap=cm.RdBu)
    fig.clf() #clear the histogram and the `imshow` image

    ax = plt.subplot(111)
    ax.set_ylim(0,100)
    ax.set_xlim(0,maxVal*(1/stepSize)-1)

    spData = dfDefDist['madePerc']

    for i in range(len(spData)-1):
        # if no shots taken, ignore interval
        if dfDefDist.loc[i,'density'] != 0 :
            plt.scatter(i,spData[i],color=([j for  j in myCols[i+1]][0:3]))
    
    ax.set_ylabel('Shooting %',fontsize=16)
    ax.set_xlabel(myVar,fontsize=16)
    cbar = plt.colorbar(cax,label='density')
    cbar.set_label('Data density',size=16)
    plt.show()
    
binShotPerPlot('CLOSE_DEF_DIST',1)

There it is!

For the data points with higher shot percentage, if is more likely that the closest defender is far away. This is far from a striking correlation, but it helps to predict if a shot went in or not.

Wrapping up

In this post we looked at using NBA shot data to compare learning algorithms, quantify prediction errors, and examine model accuracy/interpretability.