Principal Component Analysis for a five year old
Posted on Thu 15 October 2015 in blog • Tagged with R, data science, stats, PCA
I went to a talk a couple of weeks ago at Stanford on using machine learning to understand complex biological data. At one point in the talk the speaker made an offhand comment about data so simple "that a five year old could cluster it". Wow, were you that smart at five?
Continue reading
Statistical learning on NBA shot data
Posted on Sun 11 October 2015 in blog • Tagged with python, NBA, api, machine learning, regression, logistic regression, regularization
In the last post, I pulled some NBA shot data for Andrew Wiggins and put that into a dataframe. Here, we will apply some supervised learning techniques from sklearn
to build predictive models and then use visualizations to better understand the data.
Some topics we'll explore are prediction error, regularization, and the tradeoff between prediction accuracy and model interpretability.
Continue reading
Scraping NBA shot data using python
Posted on Sat 10 October 2015 in blog • Tagged with python, NBA, api
My goal is to learn how to scrape data using python and do some quick data analysis.
This is my first time scraping from the web. I found this documentation extremely helpful. Here, I'm pulling in the shot log for Andrew Wiggins, the NBA Rookie of the Year for the 2014-2015 season.
Continue reading
Measuring cancer cell dynamics in response to therapy
Posted on Fri 09 October 2015 in blog • Tagged with R, research, systems
Example R code to process highly multiplexed cancer drug responses. Formatted so that you should be able to run it on a Mac.
Continue reading
Optimizing k in k-means clustering
Posted on Tue 06 October 2015 in blog • Tagged with R, data science, clustering
I want to get my hands dirty with clustering after seeing a great lecture at Stanford. Here I'm looking at k-means clustering, an algorithm to identify groups in multidimensional data.
I'm using a built-in dataset in R, "ruspini". Also, I found this site to be a helpful template to start from.
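As a rough illustration of what "optimizing k" can mean, here is a minimal sketch of the elbow heuristic. It is written in Python rather than R, and it uses made-up data (three synthetic blobs), not the ruspini dataset. The helper names `sq_dist` and `kmeans_wss` are my own for this example, not from any library.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_wss(points, k, rng, n_iter=20):
    """Run Lloyd's algorithm once; return the within-cluster sum of squares."""
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Assumed toy data: three well-separated blobs of 50 points each.
rng = random.Random(0)
blob_centers = [(0, 0), (10, 0), (5, 10)]
points = [(rng.gauss(cx, 1), rng.gauss(cy, 1)) for cx, cy in blob_centers for _ in range(50)]

# Best of 30 random restarts for each k, since a single run can get stuck.
wss = {k: min(kmeans_wss(points, k, rng) for _ in range(30)) for k in range(1, 6)}
for k in sorted(wss):
    print(k, round(wss[k], 1))
```

On data like this, the within-cluster sum of squares should drop sharply up to k = 3 and only slightly afterwards; that bend is the "elbow" that suggests a value for k.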
Using python to get an intuition for multiple regression
Posted on Fri 02 October 2015 in blog • Tagged with python, data science, regression, stats
I want to get some intuition about regression models using multiple independent variables. More precisely, I am unsure if the relevant predictors would be better uncovered by multiple regression, or by pairwise analysis of all predictors against the response variable. So I'd like to use a dataset where I know the precise contribution of each predictor to the response variable.
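To make the question concrete, here is a minimal sketch on synthetic data (the variable names, coefficients, and noise levels are all assumptions chosen for illustration). The response depends only on `x1`, but `x2` is correlated with `x1`, so a pairwise fit of the response on `x2` alone can look convincing even though `x2` contributes nothing.

```python
import random

def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 jointly (2x2 normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2

rng = random.Random(0)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.5) for a in x1]     # x2 is correlated with x1
y = [3.0 * a + rng.gauss(0, 1) for a in x1]  # y truly depends only on x1

pairwise_x2 = simple_slope(x2, y)            # pairwise fit: x2 looks important
b1, b2 = two_predictor_slopes(x1, x2, y)     # joint fit: credit goes to x1

print("pairwise slope on x2:", round(pairwise_x2, 2))
print("joint slopes (x1, x2):", round(b1, 2), round(b2, 2))
```

In this setup the pairwise slope on `x2` should come out well above zero, while the joint fit should put a coefficient near 3 on `x1` and near 0 on `x2`.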
Continue reading
MyFIRST MySQL
Posted on Thu 01 October 2015 in blog • Tagged with MySQL, data science
Intro to SQL. I've looked at a couple of tutorials on SQL, but the best way to learn is to play around, right? Let's get started.
Continue reading
Plotting and error analysis illustrating R notebooks (jupyter)
Posted on Wed 30 September 2015 in blog • Tagged with R, jupyter, data science
The goal of this exercise is to try R in jupyter and compare simple model fits. I spent way too much time trying to get plots embedded using RStudio. For now, I just want a happy, functional black box. Luckily, I've been using ipython notebook (now jupyter), and it now has R functionality. Time to try it out!
Continue reading
Variable selection for multiple regression models
Posted on Tue 08 September 2015 in blog • Tagged with R, jupyter, regression, stats
Here, I want to look at using R to perform variable selection for a linear model. Let's consider forward and backward (reverse) selection, stepwise techniques that try to keep only the variables that contribute most to the variance explained. The dataset I'm using is the Boston housing price dataset from the MASS library.
Note that there are some drawbacks and limitations to consider when using variable selection: http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/
Continue reading
Trial python blog using jupyter
Posted on Thu 03 September 2015 in blog • Tagged with ipython, blogging
Purpose: This is the first IPython notebook I'm adding online. The goal of this notebook is mostly to learn the syntax and to migrate content online. For example, one asterisk on each side makes text italic. Two asterisks make text bold. Putting two dollar signs before and after an expression gives an equation:
$$y=10*x$$
An equation can be displayed inline with the text by adding a single dollar sign on either side of the expression, like so: $y=\sin(x)$
Continue reading