Scraping NBA shot data using Python
Posted on Sat 10 October 2015 in blog
My goal is to learn how to scrape data using Python and do some quick data analysis.
This is my first time scraping from the web, and I found this documentation extremely helpful. Here, I pull in the shot log for Andrew Wiggins, the NBA Rookie of the Year for the 2014-2015 season.
%matplotlib inline
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Talking to the API
To find the endpoint you need, navigate to the page that displays the data you want. Using the Chrome developer tools (Network tab), you can locate the underlying API call and preview the structure of its response.
shotsUrl = 'http://stats.nba.com/stats/playerdashptshotlog?' + \
'DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00' + \
'&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0' + \
'&PlayerID=203952&Season=2014-15&SeasonSegment=&SeasonType=' + \
'Regular+Season&TeamID=0&VsConference=&VsDivision='
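The same query can be built more readably by letting `requests` encode the parameters from a dict instead of hand-concatenating the URL. This sketch only *prepares* the request (no network call), so the resulting URL can be inspected first; the endpoint and parameter names are exactly those from the string above.

```python
import requests

# Same endpoint and parameters as the hand-built shotsUrl above,
# expressed as a dict so requests handles the URL encoding.
params = {
    'DateFrom': '', 'DateTo': '', 'GameSegment': '', 'LastNGames': 0,
    'LeagueID': '00', 'Location': '', 'Month': 0, 'OpponentTeamID': 0,
    'Outcome': '', 'Period': 0, 'PlayerID': 203952, 'Season': '2014-15',
    'SeasonSegment': '', 'SeasonType': 'Regular Season', 'TeamID': 0,
    'VsConference': '', 'VsDivision': '',
}
req = requests.Request(
    'GET', 'http://stats.nba.com/stats/playerdashptshotlog',
    params=params,
).prepare()
print(req.url)  # inspect the encoded URL before sending
```

To actually send it, pass the prepared request to `requests.Session().send(req)`; a plain `requests.get(url, params=params)` does the same thing in one step.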
Now, use requests to pull the data back in JSON format.
response = requests.get(shotsUrl)
response.raise_for_status()
result = response.json()['resultSets'][0]
shots = result['rowSet']
colNames = result['headers']
Organize the data
Now that all the data is in memory, arrange it into a DataFrame. Let's also save it to CSV for later use.
dfShots = pd.DataFrame(shots, columns=colNames)
dfShots.to_csv('../../dataSandbox/forPelican/nbaWiggins.csv', index=False)
dfShots.head()
Skimming the surface
There's a lot of data here. Let's just start with a quick analysis of how much time was on the shot clock when Wiggins shot the ball.
allShots = dfShots[dfShots.SHOT_CLOCK.notnull()]
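For reference, pandas has built-in missing-data helpers that do the same filtering as the NaN mask above. A small sketch on a toy frame (the column names mirror the real shot log; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dfShots: one shot has no SHOT_CLOCK value.
toy = pd.DataFrame({'SHOT_CLOCK': [12.1, np.nan, 3.4],
                    'SHOT_RESULT': ['made', 'missed', 'missed']})

# Equivalent to toy[np.invert(np.isnan(toy.SHOT_CLOCK))]:
kept = toy[toy['SHOT_CLOCK'].notnull()]
# or, equivalently: toy.dropna(subset=['SHOT_CLOCK'])
print(len(kept))  # 2 rows survive the filter
```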
fig = plt.figure(figsize=(5,4), dpi=1600)
ax = fig.add_subplot(111)
allShots['SHOT_CLOCK'].hist(ax=ax, density=True, bins=20, alpha=0.4, linewidth=0.4)
ax.set_xlabel('Shot clock (s)', fontsize=16, fontweight='bold')
ax.set_ylabel('Density', fontsize=16, fontweight='bold')
fig.suptitle('Andrew Wiggins', fontsize=20, fontweight='bold')
plt.show()
Is the shot clock distribution similar for made/missed shots?
We can keep digging into the data. Next, let's compare the same distribution split into made and missed shots.
fig = plt.figure(figsize=(5,4), dpi=1600)
ax = fig.add_subplot(111)
madeShots = allShots[allShots.SHOT_RESULT == 'made']
missedShots = allShots[allShots.SHOT_RESULT == 'missed']
madeShots['SHOT_CLOCK'].hist(ax=ax, density=True, bins=20, alpha=0.3, linewidth=0.4, label='made')
missedShots['SHOT_CLOCK'].hist(ax=ax, density=True, bins=20, alpha=0.3, linewidth=0.4, label='missed')
plt.legend(loc='upper right',framealpha=0)
ax.set_xlabel('Shot clock (s)',fontsize=16,fontweight='bold')
ax.set_ylabel('Density',fontsize=16,fontweight='bold')
fig.suptitle('Andrew Wiggins',fontsize=20,fontweight='bold')
plt.show()
It looks like the made and missed distributions largely overlap, but there are a couple of interesting differences. The made-shot density is relatively higher when more time is left on the shot clock, probably reflecting quick put-back shots. With little time left, the missed-shot density is slightly higher, suggesting forced, low-quality attempts as the shot clock expires.
Nike bootstraps
We have these two distributions, and they look a bit different by eye, but can we say with confidence that their means differ? To translate that into statistics mumbo jumbo: can we reject the null hypothesis that the two distributions have the same mean?
If the data were nice, pretty Gaussians, a t-test would be perfect. I'm not sure what distribution describes the time remaining on the shot clock, so we'll take a non-parametric approach instead, that is, assume no knowledge of the underlying distributions.
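As an aside, scipy ships an off-the-shelf non-parametric test for exactly this situation: the Mann-Whitney U test compares two samples without any Gaussian assumption. A quick sketch on synthetic "shot clock" samples (the exponential shapes and scales here are invented for illustration, not fit to the real data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
# Two synthetic samples clipped to the 0-24 s shot-clock range,
# drawn with slightly different centers.
a = rng.exponential(scale=10.0, size=500).clip(0, 24)
b = rng.exponential(scale=11.5, size=500).clip(0, 24)

# Null hypothesis: the two samples come from the same distribution.
stat, p = mannwhitneyu(a, b, alternative='two-sided')
print(p)
```

A small p-value would argue against the null; bootstrapping, which we'll do by hand next, has the added advantage of producing error bars we can draw on the plot.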
So... let's use bootstrapping to estimate the uncertainty in the mean of each distribution. That will tell us whether we can distinguish the two with confidence.
from sklearn.utils import resample
Now, let's write a function to resample the sample data.
Wait, you are seriously going to use only your data to make up more theoretical data and use this to draw conclusions about the real-world error of the original data?
Okay, so this hits a nerve. The experimentalist in me has had the luxury of resampling from the population by simply running another experiment. This is obviously the ideal case. So I've always kept a healthy skepticism every time I thought about bootstrap resampling.
Apparently it is called bootstrapping precisely because it is physically impossible to pull yourself off the ground by your own bootstraps. In statistics, though, it happens to work. Think of it this way: as your sample grows, it more closely approximates the underlying distribution of the data.
Bootstrapping relies on the assumption that your sample closely resembles the unobserved, theoretical population distribution, so sampling from your sample is equivalent to sampling from the population. This makes sense for larger sample sizes, but I would stay cautious using bootstrapping with small samples, because the sample may not resemble the total population.
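As a quick sanity check on the idea, here is a minimal numpy-only sketch: bootstrap a 95% confidence interval for the mean of a skewed sample drawn from a known distribution (exponential with mean 10), and see that the interval brackets the sample mean. The seed and sizes are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.RandomState(42)
# A skewed sample from a known population (exponential, true mean = 10).
sample = rng.exponential(scale=10.0, size=2000)

# Resample with replacement n times, recording each resample's mean.
n = 1000
boot_means = np.sort([rng.choice(sample, size=sample.size, replace=True).mean()
                      for _ in range(n)])
# The 2.5th and 97.5th percentiles of the bootstrap means give the 95% CI.
lo, hi = boot_means[int(n * 0.025)], boot_means[int(n * 0.975)]
print(lo, hi)
```

The interval is centered on the sample mean, and (for a sample this size) usually contains the true mean as well.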
Okay, to the bootstrapping. Here's a function to return the 95% confidence interval for the estimate of the mean.
def getCI(dist, n=1000):
    """Bootstrap the 95% confidence interval for the mean of dist."""
    bootMean = []
    for i in range(n):
        newDist = resample(dist)
        bootMean.append(newDist.mean())
    bootMean = sorted(bootMean)
    # 'mean' here is the median of the bootstrapped means
    ci = {'lower': bootMean[int(n * 0.025)],
          'upper': bootMean[int(n * 0.975)],
          'mean': bootMean[int(n * 0.5)]}
    return ci
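The index arithmetic on the sorted list can also be delegated to `np.percentile`, which interpolates between order statistics rather than picking the nearest one. An equivalent sketch (the `getCI_pct` name is mine, not from the post):

```python
import numpy as np
from sklearn.utils import resample

def getCI_pct(dist, n=1000):
    """Same bootstrap as getCI, with np.percentile doing the slicing."""
    boot_means = [resample(dist).mean() for _ in range(n)]
    lower, mid, upper = np.percentile(boot_means, [2.5, 50, 97.5])
    return {'lower': lower, 'mean': mid, 'upper': upper}

# Sanity check: a constant sample bootstraps to a zero-width interval.
print(getCI_pct(np.ones(50), n=200))
```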
Now, resample 1,000 times to calculate the 95% CI of the mean shot clock time remaining for both the made and missed shots.
made = getCI(madeShots['SHOT_CLOCK'], 1000)
missed = getCI(missedShots['SHOT_CLOCK'], 1000)
madeUp = abs(made['upper'] - made['mean'])
madeLow = abs(made['lower'] - made['mean'])
print("Made shots abs. error intervals:\n{}, {}\n".format(madeLow, madeUp))
missedUp = abs(missed['upper'] - missed['mean'])
missedLow = abs(missed['lower'] - missed['mean'])
print("Missed shots abs. error intervals:\n{}, {}".format(missedLow, missedUp))
fig = plt.figure(figsize=(8,6), dpi=1600)
ax = fig.add_subplot(111)
madeShots['SHOT_CLOCK'].hist(ax=ax, density=True, bins=20, alpha=0.3, linewidth=0.4, label='made')
missedShots['SHOT_CLOCK'].hist(ax=ax, density=True, bins=20, alpha=0.3, linewidth=0.4, label='missed')
plt.legend(loc='upper right', framealpha=0)
ax.set_xlabel('Shot clock (s)', fontsize=16, fontweight='bold')
ax.set_ylabel('Density', fontsize=16, fontweight='bold')
fig.suptitle('Andrew Wiggins', fontsize=20, fontweight='bold')
plt.errorbar(x=[made['mean']], y=[0.05], xerr=[[madeLow], [madeUp]], linewidth=3, capthick=3, ecolor='b')
plt.errorbar(x=[missed['mean']], y=[0.05], xerr=[[missedLow], [missedUp]], linewidth=3, capthick=3, ecolor='g')
plt.show()
Conclusions
There you have it. Using bootstrap resampling on these funny-looking distributions, we can say with confidence that, on average, Wiggins took his made shots earlier in the possession (with more time remaining on the shot clock) than his missed shots.