What color is your paycheck?
Posted on Fri 04 December 2015 in blog • Tagged with python, data science, visualization, statistics, pca, api, bokeh, stats
One of the data science skills I want to play around with is deriving insights from data that is publicly available. Here, let's use some data on SF employee compensation and see what we can learn from it.
First, per usual, load the dependencies.
Principal Component Analysis for a five year old
Posted on Thu 15 October 2015 in blog • Tagged with R, data science, stats, PCA
I went to a talk a couple of weeks ago at Stanford on using machine learning to understand complex biological data. At one point in the talk the speaker made an offhand comment about data so simple "that a five year old could cluster it". Wow, were you that smart at five?
Using python to get an intuition for multiple regression
Posted on Fri 02 October 2015 in blog • Tagged with python, data science, regression, stats
I want to get some intuition about regression models using multiple independent variables. More precisely, I am unsure if the relevant predictors would be better uncovered by multiple regression, or by pairwise analysis of all predictors against the response variable. So I'd like to use a dataset where I know the precise contribution of each predictor to the response variable.
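To make the question concrete, here is a minimal sketch of that kind of experiment. Everything in it is an assumption for illustration: the synthetic data, the true coefficients (2, 0 and -1), and the helper names `solve` and `ols`. It sticks to the standard library so it runs anywhere; in practice numpy or statsmodels would do the fitting. The predictor `x2` is built to be correlated with `x1` but contributes nothing to `y`.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares fit of y on the given predictor columns plus an intercept.

    Returns [intercept, coef_1, coef_2, ...] via the normal equations.
    """
    design = [[1.0] * len(y)] + columns
    k = len(design)
    xtx = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(design[i], y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data with known contributions (all values are assumptions).
rng = random.Random(0)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x1]  # correlated with x1
x3 = [rng.gauss(0, 1) for _ in range(n)]
# True model: y = 2*x1 + 0*x2 - 1*x3 + noise
y = [2.0 * a + 0.0 * b - 1.0 * c + rng.gauss(0, 0.5)
     for a, b, c in zip(x1, x2, x3)]

# One model with all predictors vs. three separate one-predictor models.
multiple = ols([x1, x2, x3], y)[1:]
pairwise = [ols([x], y)[1] for x in (x1, x2, x3)]

print("multiple:", [round(c, 2) for c in multiple])
print("pairwise:", [round(c, 2) for c in pairwise])
```

In this setup the multiple regression should put the coefficient for `x2` near zero, while the pairwise fit gives `x2` a clearly positive slope that it only borrows from its correlation with `x1`.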
Variable selection for multiple regression models
Posted on Tue 08 September 2015 in blog • Tagged with R, jupyter, regression, stats
Here, I want to look at using R to perform variable selection for a linear model. Let's consider forward and backward selection, stepwise techniques that keep only the variables that contribute most to the variance explained. The dataset I'm using is the Boston housing price dataset from the MASS library.
Note that there are some drawbacks and limitations to consider when using variable selection: http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/
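As a rough illustration of the forward half of the idea, here is a minimal sketch in Python rather than R. It does not use the Boston dataset or the code from the post: the data is synthetic, the helper names (`fit_rss`, `aic`, `forward_select`) are made up for this example, and the stopping rule is an assumption (AIC computed as n·log(RSS/n) + 2k). Starting from an intercept-only model, it adds whichever predictor lowers the AIC most and stops when no addition helps.

```python
import math
import random


def fit_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on `columns` plus an intercept."""
    n = len(y)
    design = [[1.0] * n] + columns
    k = len(design)
    # Augmented normal equations [X'X | X'y], solved by Gaussian elimination.
    m = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(design[i], y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (m[i][k] - tail) / m[i][i]
    fitted = [sum(beta[j] * design[j][t] for j in range(k)) for t in range(n)]
    return sum((a - b) ** 2 for a, b in zip(y, fitted))


def aic(columns, y):
    """AIC of a Gaussian linear model, up to an additive constant."""
    n = len(y)
    return n * math.log(fit_rss(columns, y) / n) + 2 * (len(columns) + 1)


def forward_select(predictors, y):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    selected = []
    best = aic([], y)  # intercept-only model
    while len(selected) < len(predictors):
        scores = {
            name: aic([predictors[s] for s in selected + [name]], y)
            for name in predictors if name not in selected
        }
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        selected.append(name)
        best = scores[name]
    return selected


# Synthetic data (an assumption): two real signals and two pure-noise columns.
rng = random.Random(1)
n = 300
predictors = {name: [rng.gauss(0, 1) for _ in range(n)]
              for name in ("a", "b", "noise1", "noise2")}
y = [3.0 * a + 1.5 * b + rng.gauss(0, 1)
     for a, b in zip(predictors["a"], predictors["b"])]

selected = forward_select(predictors, y)
print(selected)
```

The two real signals should be picked first, strongest first. A noise column can still slip in by chance, which is one of the limitations the link above discusses.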