Scraping NBA shot data using Python

Posted on Sat 10 October 2015 in blog

My goal is to learn how to scrape data using Python and do some quick data analysis.

This is my first time scraping from the web. I found this documentation extremely helpful. Here, I'm pulling in the shot log for Andrew Wiggins, the NBA Rookie of the Year for the 2014-2015 season.

In [1]:
%matplotlib inline
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Talking to the API

To get the data, first navigate to the page on the website that displays it. With Chrome developer tools open, you can watch the network requests the page makes, find the API call that returns the data, and preview the response to see how the request and its result are structured.

In [2]:
shotsUrl = 'http://stats.nba.com/stats/playerdashptshotlog?' + \
    'DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00' + \
    '&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0' + \
    '&PlayerID=203952&Season=2014-15&SeasonSegment=&SeasonType=' + \
    'Regular+Season&TeamID=0&VsConference=&VsDivision='
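
The same query string can also be built from a list of parameters, which makes each one easier to read and change. This is a minimal sketch using only the standard library (Python 3's urllib.parse). The names base_url, params and shots_url are mine, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

base_url = 'http://stats.nba.com/stats/playerdashptshotlog'

# Parameters copied from the hand-built URL; empty strings stay as empty values
params = [
    ('DateFrom', ''), ('DateTo', ''), ('GameSegment', ''), ('LastNGames', 0),
    ('LeagueID', '00'), ('Location', ''), ('Month', 0), ('OpponentTeamID', 0),
    ('Outcome', ''), ('Period', 0), ('PlayerID', 203952), ('Season', '2014-15'),
    ('SeasonSegment', ''), ('SeasonType', 'Regular Season'), ('TeamID', 0),
    ('VsConference', ''), ('VsDivision', ''),
]

# urlencode escapes the space in 'Regular Season' as '+'
shots_url = base_url + '?' + urlencode(params)
print(shots_url)
```

requests also accepts such a list directly, as in requests.get(base_url, params=params), so the URL does not have to be assembled by hand at all.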

Now, use requests to get the data back in JSON format.

In [3]:
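response = requests.get(shotsUrl)
response.raise_for_status()
# Parse the JSON body once; the first result set holds the shot log
resultSet = response.json()['resultSets'][0]
shots = resultSet['rowSet']      # one list of values per shot
colNames = resultSet['headers']  # column names, in the same order as each row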
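
To make the shape of the response concrete, here is a small sketch with a trimmed payload that mimics the resultSets / headers / rowSet layout used above. The payload is a hand-written stand-in cut down to three columns (the values are taken from the first two rows of the shot log), and the names payload and records are mine.

```python
import json

# Hand-written stand-in for the API response: same key layout, only three columns
payload = json.loads('''
{"resultSets": [{"headers": ["SHOT_NUMBER", "SHOT_DIST", "SHOT_RESULT"],
                 "rowSet": [[1, 10.5, "missed"], [2, 12.2, "made"]]}]}
''')

result_set = payload['resultSets'][0]
headers = result_set['headers']

# Pair each row with the headers to get one dict per shot
records = [dict(zip(headers, row)) for row in result_set['rowSet']]
print(records[1])
```

Each row only makes sense next to the headers list, which is why both pieces are pulled out of the response.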

Organize the data

Now that all the data is in memory, put it in a dataframe. Let's also save it to a CSV file for later use.

In [11]:
pwd
Out[11]:
u'/Users/Peter/git/pelicanSite/content'
In [ ]:
ls '../../dataSandbox/forPelican/nbaWiggins.csv'
In [13]:
dfShots = pd.DataFrame(shots,columns=colNames)
dfShots.to_csv('../../dataSandbox/forPelican/nbaWiggins.csv',index_label=False)
dfShots.head()
Out[13]:
GAME_ID MATCHUP LOCATION W FINAL_MARGIN SHOT_NUMBER PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS
0 0021401222 APR 15, 2015 - MIN vs. OKC H L -25 1 1 11:23 11.9 1 3.6 10.5 2 missed Roberson, Andre 203460 3.0 0 0
1 0021401222 APR 15, 2015 - MIN vs. OKC H L -25 2 1 5:34 9.5 2 4.2 12.2 2 made Morrow, Anthony 201627 2.3 1 2
2 0021401222 APR 15, 2015 - MIN vs. OKC H L -25 3 1 1:58 6.0 0 1.0 24.5 3 missed Kanter, Enes 202683 7.0 0 0
3 0021401222 APR 15, 2015 - MIN vs. OKC H L -25 4 2 10:46 14.9 0 0.9 24.9 3 missed Collison, Nick 2555 7.4 0 0
4 0021401222 APR 15, 2015 - MIN vs. OKC H L -25 5 2 7:12 14.8 1 2.2 4.2 2 missed Adams, Steven 203500 3.3 0 0

Skimming the surface

There's a lot of data here. Let's just start with a quick analysis of how much time was on the shot clock when Wiggins shot the ball.

In [5]:
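# Drop shots that have no recorded shot clock value
allShots = dfShots[dfShots.SHOT_CLOCK.notnull()]
fig = plt.figure(figsize=(5,4), dpi=1600)
ax = fig.add_subplot(111)
# density=True normalizes the histogram to unit area (replaces the older `normed` argument)
allShots['SHOT_CLOCK'].hist(density=True,bins=20,alpha=0.4,linewidth=0.4)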
ax.set_xlabel('Shot clock (s)',fontsize=16,fontweight='bold')
ax.set_ylabel('Density',fontsize=16,fontweight='bold')
fig.suptitle('Andrew Wiggins',fontsize=20,fontweight='bold')
plt.show()
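
A note on the "Density" axis: a density histogram scales each bar so that the total area of the bars is 1, which lets distributions with different numbers of shots be compared on the same plot. Below is a minimal pure-Python sketch of that normalization. The function name density_hist and the sample values are illustrative and made up for this sketch.

```python
def density_hist(values, bins, lo, hi):
    """Return bar heights scaled so that sum(height * bin_width) is 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the top edge into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / (total * width) for c in counts], width

# Illustrative shot clock values in seconds (made up for this sketch)
sample = [2.0, 5.5, 9.5, 11.9, 14.8, 14.9, 18.0, 23.1]
heights, width = density_hist(sample, bins=4, lo=0.0, hi=24.0)
area = sum(h * width for h in heights)
print(heights, area)
```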

Is the shot clock distribution similar for made/missed shots?

We can continue to dig into the data. Next let's just compare the same data divided into made shots and missed shots.

In [6]:
fig = plt.figure(figsize=(5,4), dpi=1600)
ax = fig.add_subplot(111)

# Keep only shots with a recorded shot clock, then split by outcome
hasClock = dfShots.SHOT_CLOCK.notnull()
madeShots = dfShots[(dfShots.SHOT_RESULT=='made') & hasClock]
missedShots = dfShots[(dfShots.SHOT_RESULT=='missed') & hasClock]

# density=True normalizes each histogram to unit area (replaces the older `normed` argument)
madeShots['SHOT_CLOCK'].hist(density=True,bins=20,alpha=0.3,linewidth=0.4,label='made')
missedShots['SHOT_CLOCK'].hist(density=True,bins=20,alpha=0.3,linewidth=0.4,label='missed')
plt.legend(loc='upper right',framealpha=0)

ax.set_xlabel('Shot clock (s)',fontsize=16,fontweight='bold')
ax.set_ylabel('Density',fontsize=16,fontweight='bold')
fig.suptitle('Andrew Wiggins',fontsize=20,fontweight='bold')
plt.show()

The made and missed distributions largely overlap, but two differences stand out. The density of made shots is relatively higher when more time is left on the shot clock, probably because of quick put-back shots. With little time left, the density of missed shots is slightly higher, which suggests forced, low-quality shots as the shot clock expires.

Nike bootstraps

We have these two distributions, and they look different by eye, but can we say with confidence that their means are not the same? To translate that to statistics mumbo jumbo, can we reject the null hypothesis that the two underlying distributions have equal means?

If the data were nice pretty Gaussians, then a t-test would be perfect. I'm not sure what distribution describes the time remaining on the shot clock, so we'll instead take a non-parametric approach, that is, one that does not assume knowledge of the underlying distributions.

So... let's use bootstrapping to estimate the error of the estimates of the means for the two distributions. This will tell us if we can distinguish these distributions with confidence.

In [7]:
from sklearn.utils import resample

Now, let's write a function to resample the sample data.

Wait, you are seriously going to use only your data to make up more theoretical data and use this to draw conclusions about the real-world error of the original data?

Okay, so this hits a nerve. The experimentalist in me has had the luxury of resampling from the population by just doing another experiment. This is obviously the ideal case, so I've always kept a healthy skepticism whenever I thought about bootstrap resampling.

Apparently it is called bootstrapping precisely because it is physically impossible to pull yourself up off the ground by your own bootstraps. It happens to work in statistics, though. Think of it this way: as your sample increases in size, it more closely approximates the underlying distribution of the data.

Bootstrapping relies upon the assumption that your sample closely resembles the unobserved/theoretical population distribution. If it does, sampling from your sample is approximately equivalent to sampling from the population. This makes sense for larger sample sizes, but I would stay cautious about using bootstrapping with small samples, because a small sample may not resemble the total population.
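
One way to build some trust in this idea is to check it in a case where the answer is known. For the mean, the standard error can also be estimated analytically as s/sqrt(n), so the spread of the bootstrap means should land close to that value. The sketch below does this with synthetic data drawn from a uniform distribution (an arbitrary stand-in, not the shot log), using only the standard library.

```python
import random
import statistics

random.seed(0)

# Synthetic "sample": 500 draws from a uniform distribution on [0, 24]
sample = [random.uniform(0, 24) for _ in range(500)]
n = len(sample)

# Bootstrap: resample with replacement at the original size, record each mean
boot_means = []
for _ in range(2000):
    resampled = random.choices(sample, k=n)
    boot_means.append(sum(resampled) / n)

boot_se = statistics.stdev(boot_means)             # spread of the bootstrap means
analytic_se = statistics.stdev(sample) / n ** 0.5  # textbook standard error of the mean
print(boot_se, analytic_se)
```

The two numbers should agree to within a few percent, and the agreement gets worse as the sample gets smaller, which is the caution raised above.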

Okay, to the bootstrapping. Here's a function to return the 95% confidence interval for the estimate of the mean.

In [8]:
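def getCI(dist,n=1000):
    # Resample with replacement n times and record the mean of each resample
    bootMean = []
    for i in range(n):
        newDist = resample(dist)
        bootMean.append(newDist.mean())
    bootMean = sorted(bootMean)
    # The 2.5th and 97.5th percentiles bound the 95% CI;
    # 'mean' is the median of the bootstrap means
    ci = {'lower': bootMean[int(n*0.025)], 'upper': bootMean[int(n*0.975)], 'mean': bootMean[int(n*0.5)]}
    return ci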

Now, resample 1,000 times to calculate the 95% CI of the mean shot clock time remaining for both the made and missed shots.

In [9]:
made = getCI(madeShots['SHOT_CLOCK'],1000)
missed = getCI(missedShots['SHOT_CLOCK'],1000)

# Distances from the center of each interval to its bounds (used as error bars)
madeUp = abs(made['upper'] - made['mean'])
madeLow = abs(made['lower'] - made['mean'])
print("Made shots abs. error intervals:\n" + str(madeLow) +', ' + str(madeUp) + '\n')
missedUp = abs(missed['upper'] - missed['mean'])
missedLow = abs(missed['lower'] - missed['mean'])
print("Missed shots abs. error intervals:\n" + str(missedLow) + ', ' + str(missedUp))
Made shots abs. error intervals:
0.509053497942, 0.464609053498

Missed shots abs. error intervals:
0.396910569106, 0.426178861789
In [10]:
fig = plt.figure(figsize=(8,6), dpi=1600)
ax = fig.add_subplot(111)

madeShots = dfShots[(dfShots.SHOT_RESULT=='made') & (np.invert(np.isnan(dfShots.SHOT_CLOCK)))]
missedShots = dfShots[(dfShots.SHOT_RESULT=='missed') & (np.invert(np.isnan(dfShots.SHOT_CLOCK)))]

madeShots['SHOT_CLOCK'].hist(density=True,bins=20,alpha=0.3,linewidth=0.4,label='made')
missedShots['SHOT_CLOCK'].hist(density=True,bins=20,alpha=0.3,linewidth=0.4,label='missed')
plt.legend(loc='upper right',framealpha=0)

ax.set_xlabel('Shot clock (s)',fontsize=16,fontweight='bold')
ax.set_ylabel('Density',fontsize=16,fontweight='bold')
fig.suptitle('Andrew Wiggins',fontsize=20,fontweight='bold')

plt.errorbar(x=[made['mean']],y=[0.05],xerr=[[madeLow],[madeUp]],linewidth=3,capthick=3,ecolor='b')
plt.errorbar(x=[missed['mean']],y=[0.05],xerr=[[missedLow],[missedUp]],linewidth=3,capthick=3,ecolor='g')

plt.show()
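
Comparing two separate intervals by eye is a little indirect. A more direct check is to bootstrap the difference between the two means and see whether its 95% interval excludes zero. The sketch below is an alternative to the approach used in this post, not what the notebook does. The two groups are synthetic stand-ins generated with random.gauss, and boot_diff_ci is a hypothetical helper name.

```python
import random

def boot_diff_ci(a, b, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    return diffs[int(n_boot * 0.025)], diffs[int(n_boot * 0.975)]

# Synthetic stand-ins for the two groups (not the real shot clock data)
rng = random.Random(1)
group_a = [rng.gauss(14.0, 5.0) for _ in range(400)]
group_b = [rng.gauss(11.0, 5.0) for _ in range(400)]

lower, upper = boot_diff_ci(group_a, group_b)
print(lower, upper)
print('interval excludes zero:', lower > 0 or upper < 0)
```

With the real data, the two groups would be madeShots['SHOT_CLOCK'] and missedShots['SHOT_CLOCK'] converted to lists.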

Conclusions

There you have it. Using bootstrap resampling of these funny-looking distributions, we can say with confidence that, on average, Wiggins took his made shots earlier (with more time remaining on the shot clock) than his missed shots.