In the last post we built the data preprocessing pipeline required for word2vec training. In this post we will build the network and train it on the text8 dataset (source), a Wikipedia dump of ~17 million tokens.
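As a preview of where we're headed, here is a minimal numpy sketch of the skip-gram objective with negative sampling (the dimensions, learning rate, and variable names here are illustrative, not the exact ones used later in the post):

```python
import numpy as np

vocab_size, embed_dim = 10000, 128
rng = np.random.default_rng(0)

# Two embedding matrices: one for center words, one for context words.
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def sgd_step(center, context, negatives, lr=0.05):
    """One skip-gram update for a (center, context) pair plus negative samples."""
    v = W_in[center]                           # center word vector
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0                            # the true context word is the positive
    u = W_out[targets]                         # context + negative word vectors
    scores = 1.0 / (1.0 + np.exp(-u @ v))      # sigmoid of the dot products
    grad = scores - labels                     # gradient of the logistic loss
    W_in[center] -= lr * (grad @ u)            # update the center embedding
    W_out[targets] -= lr * np.outer(grad, v)   # update context/negative embeddings
```

After training, the rows of W_in are the word vectors we actually care about.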
One of the trends I've been seeing is the use of embeddings for similarity/recommendation in non-NLP domains (this talk and many others). While I've used/thought about word2vec for a while, I wanted to implement this to really get a sense of how it works and ways it can be extended to other domains. Intuitively, word2vec makes sense, and there are a lot of packages that will let you compute it without thinking much about the implementation (e.g.
I've used keras a bit for deep learning recently. In general it's an excellent tool, but I know a higher-level library comes with limitations. In trying to implement an initial model in tensorflow, I found that you have to think about the fundamentals of data flow and building computation graphs -- as opposed to machine learning a la
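To make that concrete, here is a tiny example of what "building a computation graph" means (this assumes the TensorFlow 1.x graph/session API, which is what I was working with): you define the graph first, and data only flows through it inside a session.

```python
import tensorflow as tf  # TensorFlow 1.x graph/session API assumed

# Define the graph -- nothing is computed yet.
x = tf.placeholder(tf.float32, shape=[None, 3])  # input fed in at run time
W = tf.Variable(tf.zeros([3, 1]))                # learnable parameter
y = tf.matmul(x, W)                              # a node in the graph, not a value

# Data only flows when the graph is executed in a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```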
I've been working on text classification recently. I've found keras to be a quite good high-level library and great for learning different neural network architectures. In this notebook I will examine Tweet classification using CNN and LSTM model architectures. While CNNs are widely used in Computer Vision, I saw a paper
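As a taste of the two architectures, here is roughly what they look like in Keras (a sketch using the Keras 2-era API; the layer sizes and the binary output are illustrative, not the exact configuration used in the notebook):

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, LSTM, Dense

max_words, seq_len = 20000, 50  # vocabulary size and padded tweet length

# 1D convolution over the embedded token sequence
cnn = Sequential([
    Embedding(max_words, 128, input_length=seq_len),
    Conv1D(64, 5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),
])

# LSTM over the same embedded input
lstm = Sequential([
    Embedding(max_words, 128, input_length=seq_len),
    LSTM(64),
    Dense(1, activation='sigmoid'),
])

for model in (cnn, lstm):
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```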
Deep learning is getting so popular that even Mark Cuban is urging folks to learn it to avoid becoming a "dinosaur". Okay Mark, message heard, I'm addressing this guilt trip now. I originally tried starting in tensorflow (tensors are multidimensional arrays), but I quickly realized that I don't think in terms of tensors/matrices. For example, I drew a blank when thinking about how to take a partial derivative using matrix multiplication. So, as an exercise to understand concepts such as notation and matrix computations, my goal is to implement gradient descent on a multiple regression model.
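Here is the destination, so the derivation has something concrete to aim at: a minimal numpy implementation of gradient descent on multiple regression (the synthetic data and learning rate are made up for illustration).

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iter=1000):
    """Fit y ~ X @ w + b by minimizing mean squared error."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        resid = X @ w + b - y                # predictions minus targets
        w -= lr * (2.0 / n) * (X.T @ resid)  # partial derivatives w.r.t. w
        b -= lr * (2.0 / n) * resid.sum()    # partial derivative w.r.t. b
    return w, b

# Sanity check on synthetic data: we should recover the known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 0.5
w, b = gradient_descent(X, y, lr=0.1, n_iter=2000)  # w close to [3, -1], b close to 0.5
```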
I've been thinking recently about finding growth trends in time-series data. This reminded me of a paper I read way back about discovering gene expression networks based on temporal expression patterns, something similar to this. The goal here is to explore the idea of similarity based on temporal data alone.
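As a first cut, pairwise correlation already captures "similar temporal pattern regardless of scale"; here is a small synthetic sketch of that idea (the sine/cosine series stand in for real expression profiles):

```python
import numpy as np

# Rows are entities (e.g. genes), columns are time points.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 24)
series = np.vstack([
    np.sin(t),        # pattern A
    np.sin(t + 0.2),  # slightly shifted copy of A
    np.cos(t),        # a different pattern
]) + rng.normal(scale=0.1, size=(3, 24))

# Pearson correlation between every pair of series; values near 1
# flag similar temporal shapes independent of absolute magnitude.
similarity = np.corrcoef(series)
```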
As more people tweet at companies, it is imperative for companies to parse through the many incoming tweets, to figure out what people want and to quickly deal with upset customers. Machine learning can help facilitate this. In this notebook, I'll examine a dataset of ~14,000 tweets directed at various airlines. The algorithm I'm choosing to use is Latent Dirichlet Allocation.
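For orientation, the overall shape of that pipeline in scikit-learn looks like this (a sketch assuming a recent scikit-learn, where the LDA parameter is n_components; the two stand-in tweets just make it runnable):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["my flight was delayed again", "great crew and a smooth landing"]

# LDA models raw term counts, so use CountVectorizer rather than tf-idf.
counts = CountVectorizer(stop_words='english').fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ is one topic's (unnormalized) word distribution.
topic_word = lda.components_
```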
One of the data science skills I want to play around with is deriving insights from publicly available data. Here, let's use some data on SF employee compensation and see what we can learn.
First, per usual, load the dependencies.
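Something like the following (the exact imports in the original notebook may differ, but this is the usual wrangling-and-plotting stack):

```python
# standard data-wrangling and plotting dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```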
What is a data scientist? To answer this, we will scrape "data scientist" job posts from Stack Overflow. In the last post, we looked at how to scrape a single job posting. Here, we will iterate that same script over hundreds of posts. Let's get started.
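The heart of that iteration is just a loop over posting URLs, with the per-page parsing factored into a function (a sketch: parse_posting stands in for the script from the last post, and the URL list is left empty here):

```python
import time
import requests
from bs4 import BeautifulSoup

def parse_posting(html):
    """Stand-in for the single-posting parser from the last post."""
    soup = BeautifulSoup(html, 'html.parser')
    return {'title': soup.title.string if soup.title else None}

urls = []  # fill with the hundreds of posting URLs gathered from the listing pages

records = []
for url in urls:
    resp = requests.get(url)
    records.append(parse_posting(resp.text))
    time.sleep(1)  # be polite to the server between requests
```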
What is a data scientist? Seems like it means different things to different people. Well, what if we let the companies who need a data scientist tell us?
To do this, let's look at jobs on Stack Overflow. The advantage here is that each posting follows the same HTML format, as opposed to other sites like Indeed.com or Monster.com, where job descriptions vary by company. For this, we need a cursory knowledge of HTML and a python package that helps us with the heavy lifting.
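One common choice for that heavy lifting is requests plus BeautifulSoup (my assumption here; the URL, tag, and class in the example are hypothetical placeholders for the real ones worked out below):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical posting URL; the real ones come from Stack Overflow's listings.
html = requests.get('https://stackoverflow.com/jobs/12345/example').text
soup = BeautifulSoup(html, 'html.parser')

# Once the HTML structure is known, extracting fields is a one-liner each.
title = soup.find('h1').text if soup.find('h1') else None
tags = [a.text for a in soup.find_all('a', class_='post-tag')]
```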