Finding trends in dynamic data: a data science exploration of baby names

Posted on Tue 03 January 2017 in blog

I've been thinking recently about finding growth trends in time-series data. This reminded me of a paper I read way back about discovering gene expression networks based on temporal expression patterns, something similar to this. The goal here is to explore the idea of similarity based on temporal data alone.

To play around with this idea, I downloaded a dataset of recorded U.S. baby names dating back to 1880. If you treat each name as a vector with one dimension per year (136 years, from 1880 to 2015), then we can apply similarity metrics to it. So if you are expecting a baby, and your friend just took the name you wanted, what other names would be good suggestions?

One consideration is that I'd like to find names that grow together. A distance metric like Euclidean distance on raw counts favors names with similar absolute counts. Using log-normalized counts instead favors names with similar growth rates.

Take a look at the graph below. The goal is to find names that grow together (green and orange). With no normalization, both names are most similar to the grey name, which sits in the middle but has a low growth rate. So as a first step, log2-transform the data to make exponential growth processes linear. Then normalize by subtracting the value at the prior time step (not shown). If you also remove the offset by shifting every series so that its final value is 0 (end-at-zero normalization), it becomes very natural to compare names with respect to the most recent time point.
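
As a minimal sketch of why this normalization matters, the toy series below compare Euclidean distance before and after a log2 transform with end-at-zero normalization. The numbers are invented for illustration (they are not from the dataset), and the names `fast_small`, `fast_big`, `flat_mid`, `euclidean` and `end_at_zero` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

# Toy yearly counts (invented for illustration, not from the dataset).
fast_small = np.array([10.0, 20.0, 40.0, 80.0])        # doubles every year
fast_big = np.array([1000.0, 2000.0, 4000.0, 8000.0])  # doubles every year, 100x larger
flat_mid = np.array([40.0, 40.0, 40.0, 40.0])          # no growth

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def end_at_zero(counts):
    """log2-transform, then subtract the final value so the series ends at 0."""
    logged = np.log2(counts)
    return logged - logged[-1]

# Raw counts: the small fast-growing series looks closest to the flat one.
raw_to_flat = euclidean(fast_small, flat_mid)
raw_to_big = euclidean(fast_small, fast_big)

# After normalization: the two doubling series (almost exactly) coincide.
norm_to_flat = euclidean(end_at_zero(fast_small), end_at_zero(flat_mid))
norm_to_big = euclidean(end_at_zero(fast_small), end_at_zero(fast_big))

print(raw_to_flat < raw_to_big)    # True
print(norm_to_big < norm_to_flat)  # True
```

On raw counts the nearest neighbour is decided by magnitude; after the transform it is decided by the shape of the growth curve.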

In [1]:
from IPython.display import Image
Image(filename='baby_names_slide.png')
Out[1]:

This is our goal (in reality we also normalize each time point by subtracting the prior time point). Let's get started.

In [2]:
from os import listdir
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
import matplotlib.cm as cm
import matplotlib.pylab as plt
import seaborn as sns
plt.style.use('/Users/pf494t/.matplotlib/stylelib/plf.mplstyle')

%matplotlib inline
In [5]:
file_list = sorted(listdir('ss_names/'),reverse=True) # listdir order is arbitrary, so sort explicitly into descending chronology
In [6]:
file_list[0:3]
Out[6]:
['yob2015.txt', 'yob2014.txt', 'yob2013.txt']

First, assemble names into a dataframe

and keep only names that have more than 100 occurrences in 2015 (most recent available)

In [8]:
df = pd.read_csv('ss_names/' + file_list[0],header=None)
df.columns = ['name','gender','count_2015']
df['count_2015'] = np.float64(df['count_2015'])

for file_year in file_list[1:]:
    df_new = pd.read_csv('ss_names/' + file_year,header=None)
    df_new.columns = ['name','gender','count_'+file_year[3:7]]
    df = df.merge(df_new,on=['name','gender'],how='left')
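
The left merge keeps every 2015 name and leaves NaN wherever an earlier year has no record for that name, which is why missing values have to be handled afterwards. A minimal sketch with two invented frames (the names and counts are placeholders, not real data; only the name/gender/count layout follows the code above):

```python
import pandas as pd

# Two invented "year" tables with the same name/gender/count layout.
latest = pd.DataFrame({'name': ['Ann', 'Bo'],
                       'gender': ['F', 'M'],
                       'count_2015': [120.0, 300.0]})
earlier = pd.DataFrame({'name': ['Bo', 'Cy'],
                        'gender': ['M', 'M'],
                        'count_2014': [280, 50]})

# how='left' keeps all rows of `latest`; names absent from `earlier` get NaN,
# and names that only appear in `earlier` (here 'Cy') are dropped.
merged = latest.merge(earlier, on=['name', 'gender'], how='left')
print(merged)
```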
In [9]:
df_filt = df[df['count_2015']>100].copy() # a threshold to filter out rare names; .copy() avoids pandas' SettingWithCopyWarning
df_filt.fillna(value=2.5,inplace=True) # impute missing data
df_filt = df_filt[df_filt['gender']=='M'].copy() # only keep boys, for now
df_filt.reset_index(inplace=True,drop=True)
df_filt.loc[:,'slice_num'] = list(df_filt.index)
df_filt.index=[df_filt.name,df_filt.slice_num]
df_filt = df_filt[df_filt.columns[::-1]] # make columns forward-chronological
df_filt.drop(['name','gender','slice_num'],inplace=True,axis=1)
n = df_filt.shape[0]

log transform and normalize so 2015 value = 0

In [10]:
df_log2 = np.log2(df_filt)
df_log2 = df_log2.subtract(df_log2['count_2015'].values,axis='index')
In [11]:
df_filt.head(2)
Out[11]:
name  slice_num  count_1880  count_1881  count_1882  count_1883  count_1884  count_1885  count_1886  count_1887  count_1888  count_1889  ...  count_2006  count_2007  count_2008  count_2009  count_2010  count_2011  count_2012  count_2013  count_2014  count_2015
Noah  0          103.0       81.0        108.0       81.0        94.0        76.0        90.0        94.0        83.0        85.0        ...  16324.0     16585.0     15778.0     17231.0     16438.0     16845.0     17323.0     18206.0     19229.0     19511.0
Liam  1          2.5         2.5         2.5         2.5         2.5         2.5         2.5         2.5         2.5         2.5         ...  4511.0      5135.0      5976.0      8560.0      10924.0     13431.0     16785.0     18111.0     18421.0     18281.0

2 rows × 136 columns

In [12]:
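# change in log2 count relative to the prior year; shift(periods=1) moves each column one step later
df_log_delta = np.log2(df_filt)-np.log2(df_filt.shift(periods=1,axis=1))
df_log_delta.drop('count_1880',axis=1,inplace=True) # the first year has no prior year, so it is all NaN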
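
For reference, the per-year log2 change can be checked on a toy array. This sketch uses `np.diff` rather than the `shift` call above, with invented counts: each output column is the log2 count minus the log2 count of the previous year, so a doubling shows up as 1 and a halving as -1.

```python
import numpy as np

# Invented counts for two names over four consecutive years (rows = names).
counts = np.array([
    [10.0, 20.0, 40.0, 20.0],  # doubles twice, then halves
    [8.0, 8.0, 16.0, 64.0],    # flat, doubles, then quadruples
])

# Difference between each year and the year before it, on the log2 scale.
# The result has one column fewer than the input: the first year has no predecessor.
log_delta = np.diff(np.log2(counts), axis=1)
print(np.round(log_delta, 6))
```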

compute pairwise distances amongst all boys

  • cosine distance on the log2-normalized counts
  • cosine distance on the raw counts
In [13]:
log2_normed_dist = pairwise_distances(df_log2,metric='cosine')
log2_normed_dist[log2_normed_dist<0] = 0

abs_dist = pairwise_distances(df_filt,metric='cosine')
abs_dist[abs_dist<0] = 0
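
Cosine distance compares the direction of two vectors and ignores their length, so a curve and a scaled copy of it are at distance zero. Here is a small NumPy sketch of the definition, written by hand for illustration (the code above relies on scikit-learn's `pairwise_distances`) and using invented vectors. Floating-point rounding can leave a value a hair below zero, which is presumably why the negatives are clipped to 0 above.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

curve = np.array([-3.0, -2.0, -1.0, -0.5])
scaled = 2.0 * curve                       # same shape, twice the amplitude
other = np.array([0.5, 1.0, -2.0, -0.5])   # a different shape

# A scaled copy is at distance ~0 (up to floating-point rounding).
print(abs(cosine_distance(curve, scaled)) < 1e-12)  # True
# A differently shaped curve is clearly further away.
print(cosine_distance(curve, other) > 1.0)          # True
```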

Write functions to get similar names

In [14]:
def getSimilarNames(name,list_len=5,my_df=df_filt,my_distances=log2_normed_dist):
    """given a name, return the indices of the most similar names (the name itself comes first)"""
    current_slice = my_df.xs(name, level='name').index[0]
    # argsort orders row indices by increasing distance; unlike list.index(),
    # it returns distinct indices even when two names have the same distance
    similar_idx = np.argsort(my_distances[current_slice], kind='stable')
    return similar_idx[0:list_len].tolist()
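
Ranking the neighbours of one item from a row of a distance matrix can be sketched with `np.argsort`. The matrix below is invented, and `nearest` is a hypothetical helper used only for this illustration. Note that tied distances still map to distinct indices.

```python
import numpy as np

# Invented symmetric distance matrix for four items (0 on the diagonal).
dist = np.array([
    [0.0, 0.2, 0.9, 0.2],
    [0.2, 0.0, 0.5, 0.7],
    [0.9, 0.5, 0.0, 0.1],
    [0.2, 0.7, 0.1, 0.0],
])

def nearest(row, k):
    """Indices of the k smallest distances in `row`; ties keep their original order."""
    return np.argsort(dist[row], kind='stable')[:k].tolist()

# Item 0 is closest to itself, then to items 1 and 3 (tied at 0.2).
print(nearest(0, 3))  # [0, 1, 3]
```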
In [15]:
def plotSimilarSlices(name_slices,my_df = df_log_delta,my_ax=None):
    """
    plot similar names
    
    name_slices: indices of names to plot
    my_df: pandas DataFrame of data to plot
    my_ax: option for choosing to add to existing plot
    """
    if my_ax is None:
        fig = plt.figure()
        ax = plt.subplot(111)
        for name_slice in name_slices:
            current_slice = my_df.xs(name_slice,level='slice_num')
            ax.plot(range(my_df.shape[1]),current_slice.values[0],label=current_slice.index[0])
        box = ax.get_position()
        ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
        ax.legend(loc='center left', bbox_to_anchor=(1, 0.5),fontsize=12)
        ax.set_ylabel('Births',fontsize=24)
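        ax.set_xlim([0,138])
        # place a tick every 25 years and label it with the year from the 'count_YYYY' column name
        tick_pos = list(range(0,my_df.shape[1],25))
        ax.set_xticks(tick_pos)
        ax.set_xticklabels([my_df.columns[i][-4:] for i in tick_pos],rotation=90)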
        plt.show()
        
    else:
        for name_slice in name_slices:
            my_ax.plot(range(my_df.shape[1]),my_df.iloc[name_slice,:],label=my_df.iloc[name_slice,:].name[0])
        my_ax.legend()
In [16]:
# code to select a random name
df_filt.index[np.random.choice(np.arange(0,n),1)[0]][0]
Out[16]:
'Joshua'
In [17]:
def plot_names(name=None,df=df_log2,list_len=15,dist=log2_normed_dist):
    """
    plot the names most similar to a given name
    
    name: name to find similarity with
    list_len: number of names to plot
    dist: distance matrix
    """
    if name is None:
        name = df.index[np.random.choice(np.arange(0,n),1)[0]][0]
    plotSimilarSlices(
            getSimilarNames(
                    name,
                    list_len=list_len,
                    my_df=df,
                    my_distances=dist
            ),
        my_df=df_filt)

Try it out!

As you will see, the algorithm finds similar dynamic patterns. Note that the plots display the raw data, not what the algorithm "sees". But the log2 normalization and the cosine similarity together appear to capture similar growth trends.

Community detection using time data alone

Perhaps surprisingly, a variety of communities can be detected along demographic lines. Part of this looks like immigration patterns as America becomes a "melting pot", and part reflects the preferences of the majority. Take a look!

First is an example of Hispanic names, probably reflecting a combination of naming trends and changing demographics...

In [18]:
plot_names(name='Lamar')

Here is a cohort of names popular primarily among baby boomers.

In [19]:
plot_names(name='Monte')

Here are new, recently popular names. Notice the common "-n" ending.

In [20]:
plot_names(name='Rayan')

The pre-boomer generation ...

In [21]:
plot_names(name='Wallace')

Hebrew names

In [22]:
plot_names(name='Moshe')

Islamic names

In [23]:
plot_names(name='Amir')

Names that are "back in style"

In [24]:
plot_names(name='August')

Just-past-their-prime names

In [25]:
plot_names(name='Javian')

Conclusion

Using time data alone, we were able to find names that grew with similar trends. These trends effectively isolate communities of similar naming patterns, generational differences, and ethnic groups. In other circumstances this approach could identify trends in data with growth processes.