What's in a name? That which we call a data scientist... (part 2)

Posted on Mon 19 October 2015 in blog

What is a data scientist? To answer this, we will scrape "data scientist" job posts from Stack Overflow. In the last post, we looked at how to scrape a single job posting. Here, we will iterate that same script over hundreds of posts. Let's get started.

In [2]:
%matplotlib inline
import pandas as pd
from bs4 import BeautifulSoup
from urllib2 import urlopen
import sys
sys.path.append('../../dataSandbox/forPelican/') #for loading custom script
from scrapeSO import soJobs
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

Starting off

First, pull in the html generated from searching "data scientist" on StackOverflow careers.

In [3]:
BASE_URL = "http://careers.stackoverflow.com"
jobQ = '/jobs?searchTerm=data+scientist'

html = urlopen(BASE_URL + jobQ)
soup = BeautifulSoup(html , 'lxml')

First, get all the links to the job posts from the job search result. By a couple combinatorial "strains" in Beautiful Soup (<h3>,<a>), we can get them. Here's the first few.

In [4]:
for allElem in soup.findAll('h3',{'class':'-title'})[0:10]:
    for elem in allElem.findAll('a'):
        print elem.get('href')
/jobs/100529/data-tooling-engineer-enigma?a=xIhXLaEWErC&searchTerm=data+scientist&so=s
/jobs/97983/driven-javascript-nodejs-developer-clevertech?a=wRlBxyRMqKA&searchTerm=data+scientist&so=s
/jobs/89632/data-scientist-palo-alto-or-seattle-coupang?a=u3I0jiXvEty&searchTerm=data+scientist&so=s
/jobs/93842/data-engineer-qadium?a=vtfcrjPcBGg&searchTerm=data+scientist&so=s
/jobs/96522/senior-security-engineer-clover-health?a=wmY8ZOqM3o4&searchTerm=data+scientist&so=s
/jobs/89149/java-developer-gap-inc?a=tTFbbnEkyWI&searchTerm=data+scientist&so=s
/jobs/96126/data-engineer-namely?a=weJDzZgMsCY&searchTerm=data+scientist&so=s
/jobs/90084/senior-principal-software-data-engineer-rally-health?a=ud6xH4WL9Nm&searchTerm=data+scientist&so=s
/jobs/99448/software-engineer-jw-player?a=xlOdNpshurC&searchTerm=data+scientist&so=s
/jobs/93723/front-end-software-engineer-tgs-management-company?a=vqLX1j5oWpq&searchTerm=data+scientist&so=s

Multiplexify!

Here's the good part. If we look closely at the html elements, we can scrape the subsequent pages of the search results from the html footer. Meaning... we can scale up our scrape five-fold!

In [5]:
for footer in soup.findAll('div',{'class':'pagination'}):
    for morePages in footer.findAll('a',{'class':'job-link'})[0:5]:
        print morePages.get('href')
/jobs?searchTerm=data+scientist
/jobs?searchTerm=data+scientist&pg=2
/jobs?searchTerm=data+scientist&pg=3
/jobs?searchTerm=data+scientist&pg=4
/jobs?searchTerm=data+scientist&pg=50

The dirty work

OK, brace yourself, there are lots of for loops! Read it from the outside in and it makes sense.

  1. Use the footer links to bring up a unique set of postings
  2. Get all the valid job links from these elements
  3. Scrape the data, using a function based on our previous post.

Hang on, this takes a while...

In [6]:
d = pd.DataFrame()
for footer in soup.findAll('div',{'class':'pagination'}):
    for morePages in footer.findAll('a',{'class':'job-link'})[0:5]:
        for allElem in soup.findAll('h3',{'class':'-title'}):
            for elem in allElem.findAll('a'):
                d = d.append(soJobs(BASE_URL + elem.get('href')))
d['words'] = d['words'].str.lower()    #Make lowercase
d.head()
Out[6]:
words title url data
0 expertise DataToolingEngineer 100529 True
1 python DataToolingEngineer 100529 True
2 similar DataToolingEngineer 100529 True
3 scripting DataToolingEngineer 100529 True
4 language DataToolingEngineer 100529 True
In [7]:
len(d)
Out[7]:
11045

Taking stock

Now, from our scraping, we have a data frame of 11,045 words. But not all of them are what we want. So, we will discard postings where "data" isn't present in the first three words of the job title. Also, there's over-representation for a few uninteresting words, "you", "skills", "experience". So, clean those up, then let's count what we have.

In [10]:
ds = d.loc[d['data']==True]
blocked = ['experience','skills','you']

ds = ds[~ds['words'].isin(blocked)]
print len(ds)
ds.head()
6275
Out[10]:
words title url data
0 expertise DataToolingEngineer 100529 True
1 python DataToolingEngineer 100529 True
2 similar DataToolingEngineer 100529 True
3 scripting DataToolingEngineer 100529 True
4 language DataToolingEngineer 100529 True

6,275 words passing filter. Not bad!

In [8]:
ds['words'].value_counts()[0:15]
Out[8]:
data          215
strong         55
work           55
ability        50
business       45
science        45
language       40
expert         40
knowledge      40
design         40
python         40
models         40
learning       35
client         35
analytical     35
dtype: int64

Show me the (>6,000 scraped words) data!

Drumroll...
Now, let's visualize our word count and reap the fruits of our labor. To do this I'm making a word cloud. If you haven't seen one, it displays the most common words, and the size of the font is scaled to the frequency it occurs within the text you entered.

In [19]:
wordcloud = WordCloud(max_font_size=120,scale=8,stopwords=['s','ve','e','g','c','an','etc']).generate(' '.join(ds['words']))

fig = plt.figure(figsize=(24,18), dpi=1600)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Let the data speak

Looks about right, huh? Obviously, we see "data" and "scientist" standing out, which, if not tautological, looks pretty cool. Look a little closer and you will find your favorite skills. I see R, my gateway drug, and python too. Obviously this will work for other job names if you are interested