What's in a name? That which we call a data scientist... (part 1)

Posted on Sat 17 October 2015 in blog

What is a data scientist? Seems like it means different things to different people. Well, what if we let the companies who need a data scientist tell us?

To do this, let's look at job postings on Stack Overflow. The advantage here is that each posting follows the same HTML format, as opposed to other sites like Indeed.com or Monster.com, where job descriptions vary by company. All we need is a cursory knowledge of HTML and a Python package that helps us with the heavy lifting.

In [1]:
import pandas as pd                    # data frames
import re                              # regular expressions for text cleanup
from bs4 import BeautifulSoup          # HTML parsing
from urllib2 import urlopen            # fetching the page (Python 2)
import string
import unicodedata
from IPython.display import Image      # embedding screenshots in the notebook
from nltk.corpus import stopwords      # list of common English words to drop

Getting started

Here I'm using a job posting that I found by doing a simple search for "data scientist" and choosing the first result. This is fed in as BASE_URL; the posting will likely expire, but it can easily be switched out for another.

We'll read in the posting using urlopen and then parse the page with BeautifulSoup.

In [2]:
BASE_URL = "http://careers.stackoverflow.com/jobs/89632/data-scientist-palo-alto-or-seattle-coupang?searchTerm=data+scientist"
html = urlopen(BASE_URL)
soup = BeautifulSoup(html, "lxml")

A first look

Let's take a look at the website rendered locally on a browser.

P.S. How was I able to get a picture of the whole webpage? I used the command-line tool webkit2png :)

In [3]:
Image(filename='../../dataSandbox/forPelican/dataSciencePost2.jpg') 
Out[3]:
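As an aside on that screenshot: webkit2png is a command-line tool, so if you'd rather trigger it from the notebook than from a terminal, a minimal sketch (assuming webkit2png is installed and on your PATH) looks like this:

# Sketch only -- basic webkit2png usage; it writes its PNGs to the current directory
import subprocess
subprocess.call(["webkit2png", BASE_URL])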

Finding our target

There's our target, the box shown in blue above. We can figure out how to scrape it by looking directly at the elements in the extracted HTML. This can be done using View > Developer > Developer Tools in Chrome. See below.

In [4]:
Image(filename = '../../dataSandbox/forPelican/elementsHtml.png')
Out[4]:
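By the way, if you'd rather stay inside the notebook than flip over to the browser's developer tools, you can peek at the parsed tree directly. This is just a convenience sketch using BeautifulSoup's prettify(), which indents nested tags so the structure is easy to scan:

# Sketch: print the start of each description block, nicely indented
for desc in soup.findAll("div", {"class": "description"}):
    print desc.prettify()[0:300]   # first few hundred characters of each block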

Scraping the lists

What you can see is that the elements we want are list <li> elements, which are children of <div class="description">. So here is what we will do:

  1. Use Beautiful Soup to find every <div class="description"> (the second instance is the one containing the Skills and Requirements).
  2. Within a for loop, find all list <li> elements inside each <div class="description">.
  3. Clean up the text.
In [5]:
skillList = ''
for desc in soup.findAll("div", { "class" : "description" }):
    skills =  desc.findAll('li')
    skills = unicode.join(u'\n',map(unicode,skills))   #Switch away from ResultSet
    skills = re.sub('<[^>]*>', '', skills)             #Remove elements from text
    skills = re.sub('[()/!@#$,]', ' ', skills)         #Get rid of spurious characters
    skillList += skills

print skillList[0:554] + '\n...\n...'
Build cutting-edge machine learning models to proactively provide strategic recommendations to business partners that can drive effective impact
Conduct in-depth data analyses to drive and deliver insights to help business partners make strategic decision
Partner with different functional teams  leveraging various tools to automate models and reports
Cares deeply about data quality and demonstrates this commitment by attention to quality of data and code generatedMasters PhDs in CS machine learning  Statistics  Applied Math or a quantitative field

...
...

Getting a clean output

Looks good: just text, with the punctuation removed. One more thing: let's get rid of stop words, i.e. common words like "the", "in", etc., that aren't really interesting.

In [6]:
jobName = soup.find("a", {"class" : "title job-link"}).text.split()[0:3]
jobName = re.sub(r'\W+', '',  unicode.join(u'',map(unicode,jobName)))

# Get rid of common words
filtWords = [i for i in skillList.split() if i not in stopwords.words('english')]

# Export words as a data frame
df = pd.DataFrame({"words": filtWords})
df['title'] = jobName
df['url'] = BASE_URL.rsplit('/',3)[2]
if jobName.upper().find('DATA') == -1:
    df['data'] = False
else:
    df['data'] = True
df['words'] = df['words'].str.lower()
df.head(10)
Out[6]:
             words              title    url  data
0            build  DataScientistPalo  89632  True
1     cutting-edge  DataScientistPalo  89632  True
2          machine  DataScientistPalo  89632  True
3         learning  DataScientistPalo  89632  True
4           models  DataScientistPalo  89632  True
5      proactively  DataScientistPalo  89632  True
6          provide  DataScientistPalo  89632  True
7        strategic  DataScientistPalo  89632  True
8  recommendations  DataScientistPalo  89632  True
9         business  DataScientistPalo  89632  True

Wrapping up and future ideas

There we go: now we have a nice data frame we can spit out, containing the cleaned-up skill words along with the job title (first three words) and the posting ID pulled from the URL. Also, some of the postings turn out to be unrelated to data science, so I added a fourth column with a logical value: True if "data" appears in the first three words of the job title.
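Spitting the frame out to disk is a one-liner; here's a sketch, where the filename is just a placeholder:

# Sketch: save the cleaned words for later aggregation (filename is arbitrary)
df.to_csv('skills_' + jobName + '.csv', index=False, encoding='utf-8')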

Now, we can make this into a function and iterate across many postings. Stay tuned!
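As a teaser, here's a rough sketch of what that function might look like, just repackaging the snippets above into one place. The URL list at the bottom is hypothetical; gathering real URLs (for example, from the search results page) is left for later.

# Rough sketch only: wrap the scraping and cleaning above into one function
# so it can be mapped over many posting URLs.
def scrape_posting(url):
    soup = BeautifulSoup(urlopen(url), "lxml")

    # Pull the <li> elements out of the description divs and clean them up
    skillList = ''
    for desc in soup.findAll("div", {"class": "description"}):
        skills = desc.findAll('li')
        skills = unicode.join(u'\n', map(unicode, skills))
        skills = re.sub('<[^>]*>', '', skills)
        skills = re.sub('[()/!@#$,]', ' ', skills)
        skillList += skills

    # First three words of the job title, squashed into one token
    jobName = soup.find("a", {"class": "title job-link"}).text.split()[0:3]
    jobName = re.sub(r'\W+', '', unicode.join(u'', map(unicode, jobName)))

    # Drop stop words and assemble the data frame
    filtWords = [i for i in skillList.split() if i not in stopwords.words('english')]
    df = pd.DataFrame({"words": filtWords})
    df['words'] = df['words'].str.lower()
    df['title'] = jobName
    df['url'] = url.rsplit('/', 3)[2]
    df['data'] = jobName.upper().find('DATA') != -1
    return df

# Hypothetical usage: urls would come from scraping a search results page
# urls = [...]
# allJobs = pd.concat([scrape_posting(u) for u in urls], ignore_index=True)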