word2vec part 2: graph building and training

Posted on Tue 03 April 2018 in blog • Tagged with python, machine learning, tensorflow, nlp, prediction, word2vec

In the last post we built the data preprocessing required for word2vec training. In this post we will build the network and perform the training on the text8 dataset (source), a Wikipedia dump of ~17 million tokens.

Note that we are implementing the skip-gram version of word2vec since it has superior performance
Continue reading


word2vec part 1: exploration and defining data flow

Posted on Mon 02 April 2018 in blog • Tagged with python, machine learning, tensorflow, neural networks, nlp

One of the trends I've been seeing is the use of embeddings for similarity/recommendation in non-NLP domains (this talk and many others). While I've used/thought about word2vec for a while, I wanted to implement this to really get a sense of how it works and ways it can be extended to other domains. Intuitively, word2vec makes sense, and there are a lot of packages that will let you compute it without thinking much about the implementation (e.g. gensim
Continue reading


Using tensorflow multilayer perceptron to predict housing prices

Posted on Thu 29 March 2018 in blog • Tagged with python, machine learning, tensorflow, deep learning, prediction

I've used keras a bit for deep learning recently. In general it is an excellent tool, but I know there are limitations in using a higher-level tool. In trying to implement an initial model in tensorflow, I found that you have to think about fundamentals of data flow and building computation graphs -- as opposed to machine learning a la sklearn
Continue reading


Exploring neural networks for text classification

Posted on Fri 17 November 2017 in blog • Tagged with python, machine learning, keras, nlp, deep learning, classification

I've been working on text classification recently. I've found keras to be a good high-level library and great for learning different neural network architectures. In this notebook I will examine Tweet classification using CNN and LSTM model architectures. While CNNs are widely used in Computer Vision, I saw a paper
Continue reading


Gradient descent by matrix multiplication

Posted on Thu 23 February 2017 in blog • Tagged with python, data science, machine learning, math

Deep learning is getting so popular that even Mark Cuban is urging folks to learn it to avoid becoming a "dinosaur". Okay Mark, message heard, I'm addressing this guilt trip now. I originally tried starting in tensorflow (tensors are multidimensional arrays), but I quickly realized that I don't think in terms of tensors/matrices. For example, I drew a blank when thinking about how to take a partial derivative using matrix multiplication. So, as an exercise to understand concepts such as notation and matrix computations, my goal is to implement gradient descent on a multiple regression model.
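As a rough illustration of what "a partial derivative using matrix multiplication" can look like, here is a minimal sketch, not the code from the post itself. It assumes a cost of half the mean squared error, for which all partial derivatives can be computed at once as `X.T @ (X @ theta - y) / m`. The function name `gradient_descent`, the learning rate, the iteration count, and the toy data are all made up for the example.

```python
import numpy as np


def gradient_descent(X, y, lr=0.1, n_iters=5000):
    """Fit y ~ X @ theta by batch gradient descent (hypothetical helper)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residuals = X @ theta - y         # prediction errors, shape (m,)
        gradient = X.T @ residuals / m    # every partial derivative in one product
        theta -= lr * gradient            # step against the gradient
    return theta


# Toy data, invented for illustration: y = 1 + 2*x1 - 3*x2 with no noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(len(features)), features])  # leading column of 1s = intercept
y = X @ np.array([1.0, 2.0, -3.0])

theta = gradient_descent(X, y)
print(theta)
```

Because the toy data is noise-free, the fitted `theta` should land very close to the coefficients used to generate `y`. On real data the learning rate usually needs tuning, and features are often scaled first.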


Continue reading

Finding trends in dynamic data: a data science exploration of baby names

Posted on Tue 03 January 2017 in blog • Tagged with python, machine learning, similarity, matplotlib

I've been thinking recently about finding growth trends in time-series data. This reminded me of a paper I read way back about discovering gene expression networks based on temporal expression patterns, something similar to this. The goal here is to explore the idea of similarity based on temporal data alone.


Continue reading

Topic modeling and visualization of tweets

Posted on Sun 31 January 2016 in blog • Tagged with python, data science, topic modeling, machine learning, twitter

As more people tweet to companies, it is imperative for companies to parse through the many tweets that are coming in, to figure out what people want and to quickly deal with upset customers. Machine learning can help to facilitate this. In this notebook, I'll examine a dataset of ~14,000 tweets directed at various airlines. The algorithm I'm choosing to use is Latent Dirichlet Allocation
Continue reading


What color is your paycheck?

Posted on Fri 04 December 2015 in blog • Tagged with python, data science, visualization, statistics, pca, api, bokeh, stats

One of the data science skills I want to play around with is deriving insights from data that is publicly available. Here, let's use some data on SF employee compensation and see what we can learn from it.

First, per usual, load the dependencies.

What's in a name? That which we call a data scientist... (part 2)

Posted on Mon 19 October 2015 in blog • Tagged with python, data science, scraping

What is a data scientist? To answer this, we will scrape "data scientist" job posts from Stack Overflow. In the last post, we looked at how to scrape a single job posting. Here, we will iterate that same script over hundreds of posts. Let's get started.

What's in a name? That which we call a data scientist... (part 1)

Posted on Sat 17 October 2015 in blog • Tagged with python, data science, scraping, text

What is a data scientist? Seems like it means different things to different people. Well, what if we let the companies who need a data scientist tell us?

To do this, let's look at jobs on Stack Overflow. The advantage here is that every posting follows the same HTML format, as opposed to other sites like Indeed.com or Monster.com, where job descriptions vary by company. We will need a cursory knowledge of HTML and a python package that helps us with the heavy lifting.


Continue reading