Python/matplotlib: Plotting occurrences of the main characters in How I Met Your Mother

Normally when I’m playing around with data sets in R I get out ggplot2 to plot some charts to get a feel for the data, but having spent quite a bit of time with Python and the How I met your mother transcripts I haven’t created a single plot. I thought I’d better change that.

After a bit of searching around it seems that matplotlib is the go-to library for this job, and I thought an interesting thing to plot would be how often each of the main characters appears in each episode across the show.

I’ve already got all the sentences from each episode as well as the list of episodes pulled out into CSV files so we can start from there.

This is a sample of the sentences file:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

My first step was to transform the CSV file into an array of words grouped by episode. I created a dictionary, iterated over the CSV file and then used nltk’s word tokeniser to pull out words from sentences:

import csv
import nltk
from collections import defaultdict
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter = ",")
    reader.next()
    for row in reader:
        episodes[row[1]].append([ word for word in nltk.word_tokenize(row[4].lower())] )

Let’s have a quick look at what’s in our dictionary:

>>> episodes.keys()[:10]
['165', '133', '132', '131', '130', '137', '136', '135', '134', '139']

We’ve got some episode numbers as we’d expect. Now let’s have a look at some of the words for one of the episodes:

>>> episodes["165"][5]
['\xe2\x99\xaa', 'how', 'i', 'met', 'your', 'mother', '8x05', '\xe2\x99\xaa']

So we’ve got a list of lists of words for each episode, but gensim (which I wanted to play around with) requires a single array of words per document.

I transformed the data into the appropriate format and fed it into a gensim Dictionary:

from gensim import corpora
texts = []
for id, episode in episodes.iteritems():
    texts.append([item for sublist in episode for item in sublist])
dictionary = corpora.Dictionary(texts)

If we peek into ‘texts’ we can see that the list has been flattened:

>>> texts[0][10:20]
['a', 'bit', 'of', 'a', 'dog', ',', 'and', 'even', 'though', 'he']

We’ll now convert each of our documents into a sparse vector which contains pairs of word ids and the number of times they occur:

corpus = [dictionary.doc2bow(text) for text in texts]

Let’s try and find out how many times the word ‘ted’ occurs in our corpus. First we need to find out the word id for ‘ted’:

>>> dictionary.token2id["ted"]
551

I don’t know how to look up the count for a word id directly from the corpus, but you can get back an individual document (episode) and its word counts quite easily:

>>> corpus[0][:5]
[(0, 8), (1, 1), (2, 2), (3, 13), (4, 20)]

We can then convert that into a dictionary and look up our word:

>>> dict(corpus[0]).get(551)
16

So ‘ted’ occurs 16 times in the first episode. If we generify that code we end up with the following:

words = ["ted", "robin", "barney", "lily", "marshall"]
words_dict = dict()
for word in words:
    word_id = dictionary.token2id[word]
    counts = []
    for episode in corpus:
        count = dict(episode).get(word_id) or 0
        counts.append(count)
    words_dict[word] = counts

There are quite a lot of counts in there, so let’s just preview the first 5 episodes:

>>> for word, counts in words_dict.iteritems():
       print word, counts[:5]
 
lily [3, 20, 47, 26, 41]
marshall [8, 25, 63, 27, 34]
barney [9, 94, 58, 92, 102]
ted [16, 46, 66, 32, 44]
robin [18, 43, 25, 24, 34]

Now it’s time to bring out matplotlib and make this visual! I initially put all the characters on one chart but it looks very messy and there’s a lot of overlap so I decided on separate charts.
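
For reference, the messy all-in-one version was roughly the following (a sketch using the words_dict from above, with an illustrative output filename):

import matplotlib.pyplot as plt
 
plt.figure()
for word, counts in words_dict.iteritems():
    plt.plot(counts, label=word)  # one line per character, all on the same axes
plt.legend(loc='upper left')
plt.ylabel('occurrences')
plt.xlabel('episode')
plt.savefig('images/all_characters.png', dpi=200)  # file name just for illustration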

The only thing I had to do to achieve this was call plt.figure() at the beginning of the loop to create a new plot:

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
 
for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.savefig('images/%s.png' % (word), dpi=200)

This generates plots like this:

[Chart: occurrences per episode for one of the characters]

This is good but I thought it’d be interesting to put in the season demarcations to see if that could give any more insight. We can call plt.axvline and pass in the appropriate episode number to achieve this effect, but I needed to know the episode ID of the last episode in each season, which required a bit of code:

import pandas as pd
df = pd.read_csv('data/import/episodes.csv', index_col=False, header=0)
last_episode_in_season = list(df.groupby("Season").max()["NumberOverall"])
 
>>> last_episode_in_season
[22, 44, 64, 88, 112, 136, 160, 184, 208]

Now let’s plug that into matplotlib:

for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    for episode in last_episode_in_season:
        plt.axvline(x=episode, color = "red")
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.savefig('images/%s.png' % (word), dpi=200)

[Chart: occurrences per episode with red lines marking the end of each season]

The last thing I wanted to do was get all the plots on the same scale, for which I needed the maximum number of occurrences of any character in any episode. It was easier than I expected:

>>> y_max = max([max(count) for count in words_dict.values()])
>>> y_max
260

And now let’s plot again:

for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    for episode in last_episode_in_season:
        plt.axvline(x=episode, color = "red")
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.ylim(0, y_max)
    plt.savefig('images/%s.png' % (word), dpi=200)


Our charts are now easy to compare:

[Charts: two characters’ occurrences per episode, plotted on the same y-axis scale]

For some reason there’s a big spike in the word ‘ted’ in the middle of the 7th season – I’m clearly not a big enough fan to know why that is, but it’s around 30% higher than the next highest value.
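
A quick way to pin down where that spike is, using the words_dict built earlier (a throwaway sketch; the index is just the position along the x-axis of the chart):

# biggest 'ted' count and its position in the plotted series
ted_counts = words_dict["ted"]
print max(ted_counts), ted_counts.index(max(ted_counts))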

The data isn’t perfect – some of the episodes list the speaker of each sentence and some don’t, so it may be that the spikes reflect that rather than anything else.

I find it’s always nice to do a bit of visual exploration of the data anyway and now I know it’s possible to do so pretty easily in Python land.

R: ggplot2 – Each group consist of only one observation. Do you need to adjust the group aesthetic?

I’ve been playing around with some weather data over the last couple of days which I aggregated down to the average temperature per month over the last 4 years and stored in a CSV file.

This is what the file looks like:

$ cat /tmp/averageTemperatureByMonth.csv
"month","aveTemperature"
"January",6.02684563758389
"February",5.89380530973451
"March",7.54838709677419
"April",10.875
"May",13.3064516129032
"June",15.9666666666667
"July",18.8387096774194
"August",18.3709677419355
"September",16.2583333333333
"October",13.4596774193548
"November",9.19166666666667
"December",7.01612903225806

I wanted to create a simple line chart which would show the months of the year in ascending order with the appropriate temperature.

My first attempt was the following:

df = read.csv("/tmp/averageTemperatureByMonth.csv")
df$month = factor(df$month, month.name)
 
ggplot(aes(x = month, y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which resulted in the following error:

geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

My understanding is that the points don’t get joined up by default because the variable on the x axis is not a continuous one but rather a factor variable.

One way to work around this problem is to make it numeric, like so:

ggplot(aes(x = as.numeric(month), y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which results in the following chart:

[Chart: average temperature by month with a numeric x-axis]

This isn’t bad but it’d be much nicer if we could have the month names along the bottom instead.

It turns out we can but we need to specify a group that each point belongs to. ggplot will then connect points which belong to the same group.

In this case we don’t really have one so we’ll define a dummy one instead:

ggplot(aes(x = month, y = aveTemperature, group=1), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

And now we get the visualisation we want:

[Chart: average temperature by month with month names on the x-axis]

Python: Find the highest value in a group

In my continued playing around with a How I met your mother data set I needed to find the last episode in each season so that I could use it in a chart I wanted to plot.

I had this CSV file containing each of the episodes:

$ head -n 10 data/import/episodes.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200

I started out by parsing the CSV file into a dictionary of (seasons -> episode ids):

import csv
from collections import defaultdict
 
seasons = defaultdict(list)
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons[int(row[3])].append(int(row[0]))
 
print seasons

which outputs the following:

$ python blog.py
defaultdict(<type 'list'>, {
  1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 
  2: [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 
  3: [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64], 
  4: [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88], 
  5: [89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112], 
  6: [113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136], 
  7: [137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160], 
  8: [161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184], 
  9: [185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208]})

It’s reasonably easy to transform that into a dictionary of (season -> max episode id) with the following couple of lines:

for season, episode_ids in seasons.iteritems():
    seasons[season] = max(episode_ids)
 
>>> print seasons
defaultdict(<type 'list'>, {1: 22, 2: 44, 3: 64, 4: 88, 5: 112, 6: 136, 7: 160, 8: 184, 9: 208})
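
Alternatively, a dict comprehension gets the same result without mutating the dictionary in place (a small sketch):

last_episode_by_season = {season: max(episode_ids) for season, episode_ids in seasons.iteritems()}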

This works fine but it felt very much like a dplyr problem to me so I wanted to see whether I could write something cleaner using pandas.

I started out by capturing the seasons and episode ids in separate lists and then building up a DataFrame:

import pandas as pd
from pandas import DataFrame
 
seasons, episode_ids = [], []
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons.append(int(row[3]))
        episode_ids.append(int(row[0]))
 
df = DataFrame.from_items([('Season', seasons), ('EpisodeId', episode_ids)])
 
>>> print df.groupby("Season").max()["EpisodeId"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Or we can simplify that and read the CSV file directly into a DataFrame:

df = pd.read_csv('data/import/episodes.csv', index_col=False, header=0)
 
>>> print df.groupby("Season").max()["NumberOverall"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Pretty neat. I need to get more into pandas.

Python/pdfquery: Scraping the FIFA World Player of the Year votes PDF into shape

Last week the FIFA Ballon d’Or 2014 was announced and along with the announcement of the winner the individual votes were also made available.

Unfortunately they weren’t made open in a way that Ben Wellington (of IQuantNY fame) would approve of – the choice of format for the data is a PDF file!

I wanted to extract this data to play around with it but I wanted to automate the extraction as I’d done when working with Google Trends data.

I had a quick look for PDF scraping libraries in Python and R and eventually settled on Python’s pdfquery, mainly because there was lots of documentation which made it easy to get started.

One way to scrape data from a PDF is to locate an element on the page and then grab everything within a bounding box relative to that element.

In my case I had 17 pages all of which had a heading for each of six columns.

[Screenshot: a page of the votes PDF showing its six column headings]

I wanted to grab the data in each of those columns but initially struggled to work out which elements I should be looking for, until I came across the following function which lets you dump an XML version of the PDF to disk:

import pdfquery
pdf = pdfquery.PDFQuery("fboaward_menplayer2014_neutral.pdf")
pdf.load()
pdf.tree.write("/tmp/yadda", pretty_print=True)

The output looks like this:

$ head -n 10 /tmp/yadda
<pdfxml ModDate="D:20150110224554+01'00'" CreationDate="D:20150110224539+01'00'" Producer="Microsoft&#174; Excel&#174; 2010" Creator="Microsoft&#174; Excel&#174; 2010">
  <LTPage bbox="[0, 0, 841.8, 595.2]" height="595.2" pageid="1" rotate="0" width="841.8" x0="0" x1="841.8" y0="0" y1="595.2" page_index="0" page_label="">
    <LTAnon> </LTAnon>
    <LTTextLineHorizontal bbox="[31.08, 546.15, 122.524, 556.59]" height="10.44" width="91.444" word_margin="0.1" x0="31.08" x1="122.524" y0="546.15" y1="556.59"><LTTextBoxHorizontal bbox="[31.08, 546.15, 122.524, 556.59]" height="10.44" index="0" width="91.444" x0="31.08" x1="122.524" y0="546.15" y1="556.59">FIFA Ballon d'Or 2014 </LTTextBoxHorizontal></LTTextLineHorizontal>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>

Having scanned through the file I realised that what I needed to do was locate the ‘LTTextLineHorizontal’ element for each heading and then grab all the ‘LTTextLineHorizontal’ elements that appeared in that column.

I started out by trying to grab the ‘Name’ column on the first page:

>>> name_element = pdf.pq('LTPage[pageid=\'1\'] LTTextLineHorizontal:contains("Name")')[0]
>>> name_element.text
'Name '

Next I needed to get the other elements in that column. With a bit of trial and error I ended up with the following code:

x = float(name_element.get('x0'))
y = float(name_element.get('y0'))
cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'1\']'),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+150, y))
    ])
 
>>> [cell.text.encode('utf-8').strip() for cell in cells['cells']]
['Amiri Islam', 'Cana Lorik', 'Bougherra Madjid', 'Luvu Rafe Talalelei', 'Sonejee Masand Oscar', 'Amaral Felisberto', 'Liddie Ryan', 'Griffith Quinton', 'Messi Lionel', 'Berezovskiy Roman', 'Breinburg Reinhard', 'Jedinak Mile', 'Fuchs Christian', 'Sadigov Rashad', 'Gavin Christie', 'Hasan Mohamed', 'Mamun Md Mamnul Islam', 'Burgess Romelle', 'Kalachou Tsimafei', 'Komany Vincent', 'Eiley Dalton', 'Nusum John', 'Tshering Passang', 'Raldes Ronald', 'D\xc5\xbeeko Edin', 'Da Silva Santos Junior Neymar', 'Ceasar Troy', 'Popov Ivelin', 'Kabore Charles', 'Ntibazonkiza Saidi', 'Kouch Sokumpheak']

I cleaned that up and generified it to work for any page and for columns of different widths. This is what the function looks like:

def extract_cells(page, header, cell_width):
    name_element = pdf.pq('LTPage[pageid=\'%s\'] LTTextLineHorizontal:contains("%s")' % (page, header))[0]
    x = float(name_element.get('x0'))
    y = float(name_element.get('y0'))
    cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'%s\']' %(page)),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+cell_width, y))
    ])
    return [cell.text.encode('utf-8').strip() for cell in cells['cells']]

We can then call that for each column on the page and zip together the resulting arrays to get a tuple for each row:

roles = extract_cells(1, "Vote", 50)
countries = extract_cells(1, "Country", 150)
voters = extract_cells(1, "Name", 170)
first = extract_cells(1, "First (5 points)", 150)
second = extract_cells(1, "Second (3 points)", 150)
third = extract_cells(1, "Third (1 point)", 130)
 
>>> for vote in zip(roles, countries, voters, first, second, third)[:5]:
       print vote
 
('Captain', 'Afghanistan', 'Amiri Islam', 'Messi Lionel', 'Cristiano Ronaldo', 'Ibrahimovic Zlatan')
('Captain', 'Albania', 'Cana Lorik', 'Cristiano Ronaldo', 'Robben Arjen', 'Mueller Thomas')
('Captain', 'Algeria', 'Bougherra Madjid', 'Cristiano Ronaldo', 'Robben Arjen', 'Benzema Karim')
('Captain', 'American Samoa', 'Luvu Rafe Talalelei', 'Neymar', 'Robben Arjen', 'Cristiano Ronaldo')
('Captain', 'Andorra', 'Sonejee Masand Oscar', 'Cristiano Ronaldo', 'Mueller Thomas', 'Kroos Toni')

The next step was to write out each of those rows to a CSV file so we can use it from another program. The full script looks like this:

import pdfquery
import csv
 
def extract_cells(page, header, cell_width):
    name_element = pdf.pq('LTPage[pageid=\'%s\'] LTTextLineHorizontal:contains("%s")' % (page, header))[0]
    x = float(name_element.get('x0'))
    y = float(name_element.get('y0'))
    cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'%s\']' %(page)),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+cell_width, y))
    ])
    return [cell.text.encode('utf-8').strip() for cell in cells['cells']]
 
if __name__ == "__main__":
    pdf = pdfquery.PDFQuery("fboaward_menplayer2014_neutral.pdf")
    pdf.load()
    pdf.tree.write("/tmp/yadda", pretty_print=True)
 
    pages_in_pdf = len(pdf.pq('LTPage'))
 
    with open('votes.csv', 'w') as votesfile:
        writer = csv.writer(votesfile, delimiter=",")
        writer.writerow(["Role", "Country", "Voter", "FirstPlace", "SecondPlace", "ThirdPlace"])
        for page in range(1, pages_in_pdf + 1):
            print page
            roles = extract_cells(page, "Vote", 50)
            countries = extract_cells(page, "Country", 150)
            voters = extract_cells(page, "Name", 170)
            first = extract_cells(page, "First (5 points)", 150)
            second = extract_cells(page, "Second (3 points)", 150)
            third = extract_cells(page, "Third (1 point)", 130)
            votes = zip(roles, countries, voters, first, second, third)
            print votes
            for vote in votes:
                writer.writerow(list(vote))

The code is on github if you want to play around with it or if you just want to grab the votes data that’s there too.

Python/NLTK: Finding the most common phrases in How I Met Your Mother

Following on from last week’s blog post where I found the most popular words in How I met your mother transcripts, in this post we’ll have a look at how we can pull out sentences and then phrases from our corpus.

The first thing I did was tweak the scraping script to pull out the sentences spoken by characters in the transcripts.

Each dialogue is separated by two line breaks so we use that as our separator. I also manually skimmed through the transcripts and found out which tags we need to strip out. I ended up with the following:

import csv
import nltk
import re
import bs4
 
from bs4 import BeautifulSoup, NavigableString
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
episodes_dict = {}
 
def strip_tags(soup, invalid_tags):
    for tag in invalid_tags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()
    return soup
 
def extract_sentences(html):
    clean = []
    brs_in_a_row = 0
    temp = ""
    for item in html:
        if item.name == "br":
            brs_in_a_row = brs_in_a_row + 1
        else:
            temp = temp + item
        if brs_in_a_row == 2:
            clean.append(temp)
            temp = ""
            brs_in_a_row = 0
    return clean
 
speakers = []
with open('data/import/episodes.csv', 'r') as episodes_file, \
     open("data/import/sentences.csv", 'w') as sentences_file:
    reader = csv.reader(episodes_file, delimiter=',')
    reader.next()
 
    writer = csv.writer(sentences_file, delimiter=',')
    writer.writerow(["SentenceId", "EpisodeId", "Season", "Episode", "Sentence"])
    sentence_id = 1
 
    for row in reader:
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
        [ad.extract() for ad in select(raw_text, "hr")]
 
        for tag in ['strong', 'em', "a"]:
            for match in raw_text.findAll(tag):
                match.replace_with_children()
        print row
        for sentence in [
                item.encode("utf-8").strip()
                for item in extract_sentences(raw_text.contents)
            ]:
            writer.writerow([sentence_id, row[0], row[3], row[1], sentence])
            sentence_id = sentence_id + 1

Here’s a preview of the sentences CSV file:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

The next step is to iterate through each of those sentences and create some n-grams to capture the common phrases in the transcripts.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

Python’s nltk library has a function that makes this easy e.g.

>>> import nltk
>>> tokens = nltk.word_tokenize("I want to be in an n gram")
>>> tokens
['I', 'want', 'to', 'be', 'in', 'an', 'n', 'gram']
>>> nltk.util.ngrams(tokens, 2)
[('I', 'want'), ('want', 'to'), ('to', 'be'), ('be', 'in'), ('in', 'an'), ('an', 'n'), ('n', 'gram')]
>>> nltk.util.ngrams(tokens, 3)
[('I', 'want', 'to'), ('want', 'to', 'be'), ('to', 'be', 'in'), ('be', 'in', 'an'), ('in', 'an', 'n'), ('an', 'n', 'gram')]

If we do a similar thing with the HIMYM transcripts while stripping out the speaker’s name – lines are mostly in the form “Speaker: Sentence” – we end up with the following top phrases:

import nltk
import csv
import string
import re
 
from collections import Counter
 
non_speaker = re.compile('[A-Za-z]+: (.*)')
 
def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            phrase_counter[phrase] += 1
 
phrase_counter = Counter()
 
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter=",")
    reader.next()
    for sentence in reader:
        extract_phrases(sentence[4], phrase_counter, 3)
 
most_common_phrases = phrase_counter.most_common(50)
for k,v in most_common_phrases:
    print '{0: <5}'.format(v), k

And if we run that:

$ python extract_phrases.py
1123  (',', 'I', "'m")
1099  ('I', 'do', "n't")
1005  (',', 'it', "'s")
535   ('I', 'ca', "n't")
523   ('I', "'m", 'not')
507   ('I', 'mean', ',')
507   (',', 'you', "'re")
459   (',', 'that', "'s")
458   ('2030', ')', ':')
454   ('(', '2030', ')')
453   ('Ted', '(', '2030')
449   ('I', "'m", 'sorry')
...
247   ('I', 'have', 'to')
247   ('No', ',', 'I')
246   ("'s", 'gon', 'na')
241   (',', 'I', "'ll")
229   ('I', "'m", 'going')
226   ('do', "n't", 'want')
226   ('It', "'s", 'not')

I noticed that quite a few of the phrases had punctuation in them, so my next step was to get rid of any phrases containing punctuation. I updated extract_phrases like so:

def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            if all(word not in string.punctuation for word in phrase):
                phrase_counter[phrase] += 1

Let’s run it again:

$ python extract_phrases.py
1099  ('I', 'do', "n't")
535   ('I', 'ca', "n't")
523   ('I', "'m", 'not')
449   ('I', "'m", 'sorry')
414   ('do', "n't", 'know')
383   ('Ted', 'from', '2030')
338   ("'m", 'gon', 'na')
334   ('I', "'m", 'gon')
300   ('gon', 'na', 'be')
279   ('END', 'OF', 'FLASHBACK')
267   ("'re", 'gon', 'na')
...
155   ('It', "'s", 'just')
151   ('at', 'the', 'bar')
150   ('a', 'lot', 'of')
147   ("'re", 'going', 'to')
144   ('I', 'have', 'a')
142   ('I', "'m", 'so')
138   ('do', "n't", 'have')
137   ('I', 'think', 'I')
136   ('not', 'gon', 'na')
136   ('I', 'can', 'not')
135   ('and', 'I', "'m")

Next I wanted to display each phrase as a string rather than a tuple, which was more difficult than I expected. I ended up with the following function which almost does the job:

def untokenize(ngram):
    tokens = list(ngram)
    return "".join([" "+i if not i.startswith("'") and \
                             i not in string.punctuation and \
                             i != "n't"
                          else i for i in tokens]).strip()
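
A quick check in the REPL (my own examples) shows where it works and where it falls short:

>>> untokenize(('I', "'m", 'sorry'))
"I'm sorry"
>>> untokenize(('do', "n't", 'know'))
"don't know"
>>> untokenize(('gon', 'na', 'be'))
'gon na be'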

I updated extract_phrases to use that function:

def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            if all(word not in string.punctuation for word in phrase):
                phrase_counter[untokenize(phrase)] += 1

Let’s go again:

$ python extract_phrases.py
1099  I don't
535   I can't
523   I'm not
449   I'm sorry
414   don't know
383   Ted from 2030
338   'm gon na
334   I'm gon
300   gon na be
279   END OF FLASHBACK
...
151   at the bar
150   a lot of
147   're going to
144   I have a
142   I'm so
138   don't have
137   I think I
136   not gon na
136   I can not
135   and I'm

These were some of the interesting things that stood out for me and deserve further digging into:

  • A lot of the most popular phrases begin with ‘I’ – it would be interesting to filter those sentences to find the general sentiment.
  • The ‘untokenize’ function struggles to reconstruct the slang phrase ‘gonna’ into a single word.
  • ‘Ted from 2030’ is actually a speaker which doesn’t follow the expected regex pattern and so wasn’t filtered out.
  • ‘END OF FLASHBACK’ shows quite high up and pulling out those flashbacks would probably be an interesting feature to extract to see which episodes reference each other.
  • ‘Marshall and Lily’ and ‘Lily and Marshall’ show up on the list – it would be interesting to explore the frequency of pairs of other characters (see the sketch below).
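
On that last point, the phrase_counter from the script above already lets us compare a few pairings directly (a quick sketch; the extra pairs are just examples):

for pair in ["Marshall and Lily", "Lily and Marshall", "Ted and Robin", "Robin and Ted"]:
    print '{0: <20}'.format(pair), phrase_counter[pair]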

The code is all on github if you want to play with it.

Python: Counter – ValueError: too many values to unpack

I recently came across Python’s Counter tool which makes it really easy to count the number of occurrences of items in a list.

In my case I was trying to work out how many times words occurred in a corpus so I had something like the following:

>>> from collections import Counter
>>> counter = Counter(["word1", "word2", "word3", "word1"])
>>> print counter
Counter({'word1': 2, 'word3': 1, 'word2': 1})

I wanted to write a for loop to iterate over the counter and print the (key, value) pairs and started with the following:

>>> for key, value in counter:
...   print key, value
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack

I’m not sure why I expected this to work, but since Counter is a subclass of dict we need to call iteritems to get an iterator of (key, value) pairs rather than just the keys.

The following does the job:

>>> for key, value in counter.iteritems():
...   print key, value
...
word1 2
word3 1
word2 1

Hopefully future Mark will remember this!

Python: scikit-learn: ImportError: cannot import name __check_build

In part 3 of Kaggle’s series on text analytics I needed to install scikit-learn and, having done so, ran into the following error when trying to use one of its classes:

>>> from sklearn.feature_extraction.text import CountVectorizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/__init__.py", line 37, in <module>
    from . import __check_build
ImportError: cannot import name __check_build

This error doesn’t reveal very much but I found that when I exited the REPL and tried the same command again I got a different error which was a bit more useful:

>>> from sklearn.feature_extraction.text import CountVectorizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/__init__.py", line 38, in <module>
    from .base import clone
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/base.py", line 10, in <module>
    from scipy import sparse
ImportError: No module named scipy

The fix for this is now obvious:

$ pip install scipy

And I can now load CountVectorizer without any problem:

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.feature_extraction.text import CountVectorizer
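
And as a quick sanity check that it actually works (my own throwaway example rather than anything from the tutorial):

>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(["the cat sat", "the cat sat on the mat"])
>>> X.shape
(2, 5)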

Python: gensim – clang: error: unknown argument: ‘-mno-fused-madd’ [-Wunused-command-line-argument-hard-error-in-future]

While working through part 2 of Kaggle’s bag of words tutorial I needed to install the gensim library and initially ran into the following error:

$ pip install gensim
 
...
 
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/Users/markneedham/projects/neo4j-himym/himym/build/gensim/gensim/models -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/numpy/core/include -c ./gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-intel-2.7/./gensim/models/word2vec_inner.o
 
clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]
 
clang: note: this will be a hard error (cannot be downgraded to a warning) in the future
 
command 'cc' failed with exit status 1
 
an integer is required
 
Traceback (most recent call last):
 
  File "<string>", line 1, in <module>
 
  File "/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py", line 166, in <module>
 
    include_package_data=True,
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
 
    dist.run_commands()
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
 
    self.run_command(cmd)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/setuptools/command/install.py", line 59, in run
 
    return orig.install.run(self)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/command/install.py", line 573, in run
 
    self.run_command('build')
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/cmd.py", line 326, in run_command
 
    self.distribution.run_command(command)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/command/build.py", line 127, in run
 
    self.run_command(cmd_name)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/cmd.py", line 326, in run_command
 
    self.distribution.run_command(command)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py", line 71, in run
 
    "There was an issue with your platform configuration - see above.")
 
TypeError: an integer is required
 
----------------------------------------
Cleaning up...
Command /Users/markneedham/projects/neo4j-himym/himym/bin/python -c "import setuptools, tokenize;__file__='/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/sb/6zb6j_7n6bz1jhhplc7c41n00000gn/T/pip-i8aeKR-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/markneedham/projects/neo4j-himym/himym/include/site/python2.7 failed with error code 1 in /Users/markneedham/projects/neo4j-himym/himym/build/gensim
Storing debug log for failure in /Users/markneedham/.pip/pip.log

The exception didn’t make much sense to me but I came across a blog post which explained it:

The Apple LLVM compiler in Xcode 5.1 treats unrecognized command-line options as errors. This issue has been seen when building both Python native extensions and Ruby Gems, where some invalid compiler options are currently specified.

The author suggests this only became a problem with XCode 5.1 so I’m surprised I hadn’t come across it sooner since I haven’t upgraded XCode in a long time.

We can work around the problem by telling the compiler to treat extra command line arguments as a warning rather than an error:

export ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future

Now it installs with no problems.

Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother

After reading Emil’s blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes and then, having manually downloaded the two pages which listed all the episodes, wrote a script to grab each of the transcripts so I could use them on my machine.

I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:

import requests
from bs4 import BeautifulSoup
from soupselect import select
 
episodes = {}
for i in range(1,3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())
 
    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}
 
for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" %(season, episode)
    print filename
 
    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
 
                handle.write(block)

the files containing the lists of episodes are named ‘page-1’ and ‘page-2’

The code is reasonably simple – we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a ‘save’ method that I could use.
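
Incidentally, a simpler way to save each file would be to write response.content in one go. A sketch of the same loop body, assuming the same filename, value and headers variables as above:

with open(filename, 'wb') as handle:
    response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers=headers)
    if response.ok:
        # response.content is the whole body as bytes, so no manual chunking needed
        handle.write(response.content)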

Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence which I’d learnt about from a Kaggle tutorial.

In the end I decided to start simple and extract all the words from the transcripts and count how many times a word occurred in a given episode.

I ended up with the following script which created a dictionary of (episode -> words + occurrences):

import csv
import nltk
import re
 
from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
def count_words(words):
    tally=Counter()
    for elem in words:
        tally[elem] += 1
    return tally
 
episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()
 
    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
 
        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]
 
        episodes_dict[row[0]] = count_words(words)

Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.

s/somewhere/in Neo4j

Neo4j’s query language, Cypher, has a really nice ETL-esque tool called ‘LOAD CSV’ for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:

with open("data/import/words.csv", "w") as words:
    writer = csv.writer(words, delimiter=",")
    writer.writerow(["EpisodeId", "Word", "Occurrences"])
    for episode_id, words in episodes_dict.iteritems():
        for word in words:
            writer.writerow([episode_id, word, words[word]])

This is what the CSV file contents look like:

$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1

Now we need to write some Cypher to get the data into Neo4j:

// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences) }]->(episode);

Having done that we can write some simple queries to explore the words used in How I met your mother:

MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode) 
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10
 
==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted"      | 207      | 11437       |
==> | "Barney"   | 208      | 8052        |
==> | "Marshall" | 208      | 7236        |
==> | "Robin"    | 205      | 6626        |
==> | "Lily"     | 207      | 6330        |
==> | "m"        | 208      | 4777        |
==> | "re"       | 208      | 4097        |
==> | "know"     | 208      | 3489        |
==> | "Oh"       | 197      | 3448        |
==> | "like"     | 208      | 2498        |
==> +-------------------------------------+
==> 10 rows

The main 5 characters occupy the top 5 positions, which is probably what you’d expect. I’m not sure why ‘m’ and ‘re’ are in the next two positions – I expect that might be scraping gone wrong!
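
On closer inspection it’s probably down to the regex in the word counting script, which replaces apostrophes with spaces before tokenising, so “I’m” and “you’re” contribute bare ‘m’ and ‘re’ tokens that then survive the stopword filter. A quick check (my own sketch reusing that regex):

>>> import re, nltk
>>> text = re.sub("[^a-zA-Z]", " ", "I'm sure you're right")
>>> text
'I m sure you re right'
>>> nltk.word_tokenize(text)
['I', 'm', 'sure', 'you', 're', 'right']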

Our next query might focus around checking which character is referred to the most in each episode:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN episode.id, topWord.word AS word, topWord.times AS occurrences
LIMIT 10
 
==> +---------------------------------------+
==> | episode.id | word       | occurrences |
==> +---------------------------------------+
==> | 72         | "Barney"   | 75          |
==> | 143        | "Ted"      | 16          |
==> | 43         | "Lily"     | 74          |
==> | 156        | "Ted"      | 12          |
==> | 206        | "Barney"   | 23          |
==> | 50         | "Marshall" | 51          |
==> | 113        | "Ted"      | 76          |
==> | 178        | "Barney"   | 21          |
==> | 182        | "Barney"   | 22          |
==> | 67         | "Ted"      | 84          |
==> +---------------------------------------+
==> 10 rows

If we dig into it further there’s actually quite a bit of variety in the number of times the top character in each episode is mentioned, which again probably says something about the data:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times)
 
==> +-------------------------------------------------------------------------------------+
==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) |
==> +-------------------------------------------------------------------------------------+
==> | 3                  | 259                | 63.90865384615385  | 42.36255207691068    |
==> +-------------------------------------------------------------------------------------+
==> 1 row

Obviously this is a very simple way of deriving structure from text, here are some of the things I want to try out next:

  • Detecting common phrases/memes/phrases used in the show (e.g. the yellow umbrella) – this should be possible by creating different length n-grams and then searching for those phrases across the corpus.
  • Pull out scenes – some of the transcripts use the keyword ‘scene’ to denote this although some of them don’t. Depending on how many transcripts contain scene demarcations, perhaps we could train a classifier to detect where scenes should be in the transcripts which don’t have them.
  • Analyse who talks to each other or who talks about each other most frequently
  • Create a graph of conversations as my colleagues Max and Michael have previously blogged about.

R: Feature engineering for a linear model

I previously wrote about a linear model I created to predict how many people would RSVP ‘yes’ to a meetup event and, having not found much correlation between any of my independent variables and RSVPs, was a bit stuck.

As luck would have it I bumped into Antonios at a meetup a month ago and he offered to take a look at what I’d tried so far and give me some tips on how to progress.

The first thing he pointed out is that all my features were related to date/time and that I should try and generate some other features. He suggested I start with the following:

  • info about organisers (quantify popularity of organisers, how many people work for them)
  • info about the venue (how many people fit there, how far it is from the centre of the city)
  • number of tweets for the event, during X days before the event

I’d read a lot on Kaggle forums about how feature engineering was the most important part of building statistical models but it didn’t click what that meant until Antonios pointed it out.

The first thing I decided to do was bring in the data for all London’s NoSQL meetups rather than just the Neo4j one to give myself a bit more data to work with.

Group Membership

Having done that, it seemed from visual inspection that the meetup groups with the most members (e.g. Data Science London, Big Data London) got the biggest turnouts.

I thought it’d be interesting to see what the correlation was between group membership and RSVPs so this was the first new feature I added.

I generated this feature by a combination of a Neo4j query and R code which resulted in this data frame as a CSV file.

We can quickly preview it to see some of the events and the group membership at that time:

> df = read.csv("/tmp/membersWithGroupCounts.csv")
> df$eventTime = as.POSIXct(df$eventTime)
> df %>% sample_n(10) %>% select(event.name, g.name, eventTime, groupMembers, rsvps)
 
                                                                  event.name                                   g.name           eventTime groupMembers rsvps
23  Scoring Models, Apache Drill for querying structured & unstructured data                      Data Science London 2014-09-18 18:30:00         3466   159
421                                                      London Office Hours                London MongoDB User Group 2012-08-22 17:00:00          468     6
304                            MongoDB University Study Group London Meet up                London MongoDB User Group 2014-07-16 17:00:00         1256    23
43                                                           December Meetup          London ElasticSearch User Group 2014-12-10 18:30:00          721   126
222                                                          Intro to Graphs                Neo4j - London User Group 2014-09-03 18:30:00         1453    39
207                              Intro to Machine Learning with Scikit-Learn                            Women in Data 2014-11-11 18:15:00          574    41
168                                        NoSQL panel and LevelDB + Node.js                             London NoSQL 2014-04-15 18:30:00          183    51
443                                                      London Office Hours                London MongoDB User Group 2012-11-29 17:00:00          590     3
79                                  Apache Cassandra 1.2 with Jonathan Ellis                         Cassandra London 2013-03-06 19:00:00          399    95
362                                                          Span conference Span: scalable and distributed computing 2014-10-28 09:00:00           67    13

One thing I found difficult was finding features specific to an event – I’m not sure how much that matters. I generated features for the venue or group much more easily.

First let’s see if there’s actually any correlation between these two variables by plotting them:

ggplot(aes(x = groupMembers, y = rsvps), data = df) + 
  geom_point()

[Chart: rsvps vs groupMembers]

It looks like there’s a positive correlation between these two variables but let’s create a single variable linear model to see how much of the variation is explained:

> fit = lm(rsvps ~ groupMembers, data = df)
> fit$coef
 (Intercept) groupMembers 
 20.03579637   0.05382738

Our linear model equation is therefore:

rsvps = 20.03579637 + 0.05382738(groupMembers)

Let’s see how well correlated our predicted RSVPs and actual RSVPs are:

> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.6263096

Not too bad! There is quite a strong correlation between these variables although it’s not perfect.

Hours into the day

In my first model I’d treated time as a categorical variable but Antonios pointed out that it’s often easier to understand the relationship between variables if they’re both continuous, so I transformed the event time like so:

df$hoursIntoDay = as.numeric(df$eventTime - trunc(df$eventTime, "day"), units="hours")

Let’s see how that plots against the RSVP count:

ggplot(aes(x = hoursIntoDay, y = rsvps), data = df) + 
  geom_point()

[Chart: rsvps vs hoursIntoDay]

It’s a bit more difficult to see a trend here as there are quite discrete times at which events happen and the majority start at 6.30 or 7.00. Nevertheless let’s build a linear model with just this variable:

> fit = lm(rsvps ~ hoursIntoDay, data = df)
> fit$coef
 (Intercept) hoursIntoDay 
   -18.79895      4.12984 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.181472

Distance from the centre of London

Next up I tried a feature based on the location of the venue that the events were held at. The hypothesis was that if a venue was closer to the centre of London then people would be more likely to attend.

To calculate this distance I used the distHaversine function from the geosphere package as shown in a previous blog post.

Let’s have a look at the graph for that variable:

ggplot(aes(x = distanceFromCentre, y = rsvps), data = df) + 
  geom_point()

[Chart: rsvps vs distanceFromCentre]

It’s hard to tell much from this plot, mainly because a majority of the points are clustered around the 2,500 metre mark which represents Shoreditch venues. Let’s plug it into a linear model and see what we come up with:

> fit = lm(rsvps ~ distanceFromCentre, data = df)
> fit$coef
       (Intercept) distanceFromCentre 
      57.243646619       -0.001310492 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.02999708

Interestingly there’s barely any correlation here, which surprised me. I tried combining this variable in a multiple variable model with the others but it still didn’t have much impact so I think we’ll park this one for now.

This is as much as I’ve done at the moment and despite spending quite a bit of time on it I still haven’t really explained very much of the variation in RSVP rates!

I have managed to identify some ways that I was able to come up with new features to try out though:

  • Read what other people are doing e.g. I have some ideas for lag variables (e.g. how many people went to your previous meetup) having read about this baseball linear model
  • Talk to other people about your model – they often have ideas you wouldn’t think of being too deep into the problem.
  • Look at what data you already have and try and incorporate that and see where it leads

The next avenue I started exploring is topic modelling as I have a hypothesis that people RSVP for events based on the content of talks but I’m not sure of the best way to go about that.

My current thinking is to pull out some topics/terms by following the example from Chapter 6 of Machine Learning for Hackers.