What's in a name? That which we call a data scientist... (part 2)
Posted on Mon 19 October 2015 in blog
What is a data scientist? To answer this, we will scrape "data scientist" job posts from Stack Overflow. In the last post, we looked at how to scrape a single job posting. Here, we will iterate that same script over hundreds of posts. Let's get started.
%matplotlib inline
import pandas as pd
from bs4 import BeautifulSoup
from urllib2 import urlopen
import sys
sys.path.append('../../dataSandbox/forPelican/') #for loading custom script
from scrapeSO import soJobs
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
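A quick note on soJobs: it's the per-posting scraper from the previous post, imported from scrapeSO.py, and its implementation isn't shown here. Below is a minimal sketch of its contract, inferred from how it's used later in this post: one row per scraped word, plus a boolean data column flagging whether "data" appears in the first three words of the job title. The parsing details (soJobs_sketch, the <h1> title) are hypothetical.
# Hypothetical sketch of the soJobs contract; the real version lives in scrapeSO.py.
def soJobs_sketch(url):
    soup = BeautifulSoup(urlopen(url), 'lxml')
    title = soup.find('h1').get_text()  # assumed: job title sits in the first <h1>
    words = soup.get_text().split()     # assumed: every word on the posting page
    return pd.DataFrame({
        'words': words,
        # True if "data" is among the first three words of the job title
        'data': 'data' in [w.lower() for w in title.split()[0:3]]
    })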
Starting off
First, pull in the HTML generated from searching "data scientist" on Stack Overflow Careers.
BASE_URL = "http://careers.stackoverflow.com"
jobQ = '/jobs?searchTerm=data+scientist'
html = urlopen(BASE_URL + jobQ)
soup = BeautifulSoup(html, 'lxml')
Strain out the links
First, get all the links to the job posts from the search results. With a couple of combined "strains" in Beautiful Soup (<h3>, then <a>), we can pull them out. Here are the first few.
for allElem in soup.findAll('h3',{'class':'-title'})[0:10]:
    for elem in allElem.findAll('a'):
        print elem.get('href')
Multiplexify!
Here's the good part. If we look closely at the HTML, we can pull the links to the subsequent pages of the search results from the page footer. Meaning... we can scale up our scrape five-fold!
for footer in soup.findAll('div',{'class':'pagination'}):
    for morePages in footer.findAll('a',{'class':'job-link'})[0:5]:
        print morePages.get('href')
The dirty work
OK, brace yourself: there are lots of for loops! Read it from the outside in and it makes sense.
- Use the footer links to bring up a unique set of postings
- Get all the valid job links from these elements
- Scrape the data, using a function based on our previous post.
Hang on, this takes a while...
d = pd.DataFrame()
for footer in soup.findAll('div',{'class':'pagination'}):
    for morePages in footer.findAll('a',{'class':'job-link'})[0:5]:
        # fetch each page of search results linked from the footer
        pageSoup = BeautifulSoup(urlopen(BASE_URL + morePages.get('href')), 'lxml')
        for allElem in pageSoup.findAll('h3',{'class':'-title'}):
            for elem in allElem.findAll('a'):
                d = d.append(soJobs(BASE_URL + elem.get('href')))
d['words'] = d['words'].str.lower() #Make lowercase
d.head()
len(d)
Taking stock
Now, from our scrape, we have a data frame of 11,045 words, but not all of them are what we want. So, we will discard postings where "data" isn't among the first three words of the job title (the boolean data column built by soJobs). There's also over-representation of a few uninteresting words: "you", "skills", "experience". Clean those up, then let's count what we have.
ds = d.loc[d['data']==True]  # keep postings flagged as "data" jobs
blocked = ['experience','skills','you']  # over-represented, uninteresting words
ds = ds[~ds['words'].isin(blocked)]
print len(ds)
ds.head()
6,275 words pass the filter. Not bad!
ds['words'].value_counts()[0:15]
Show me the (>6,000 scraped words) data!
Drumroll...
Now, let's visualize our word count and reap the fruits of our labor. To do this, I'm making a word cloud. If you haven't seen one, it displays the most common words in a text, with each word's font size scaled to how often it occurs.
wordcloud = WordCloud(max_font_size=120, scale=8,
                      stopwords=['s','ve','e','g','c','an','etc']).generate(' '.join(ds['words']))
fig = plt.figure(figsize=(24,18), dpi=1600)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Let the data speak
Looks about right, huh? Naturally, "data" and "scientist" stand out, which, if a bit tautological, looks pretty cool. Look a little closer and you will find your favorite skills. I see R, my gateway drug, and Python too. Of course, this will work for other job titles too, if you are interested.
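Swapping in another job title is just a change to the query string. A minimal sketch (the search term below is made up):
jobQ = '/jobs?searchTerm=' + 'machine+learning'  # any '+'-separated job title
soup = BeautifulSoup(urlopen(BASE_URL + jobQ), 'lxml')
# ...then rerun the scraping and filtering steps above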