Mark Needham

Thoughts on Software Development

Archive for the ‘Machine Learning’ Category

How I met your mother: Story arcs


After weeks of playing around with various algorithms to extract story arcs in How I met your mother, I’ve come to the conclusion that I don’t yet have the skills to completely automate this process, so I’m going to change my approach.

The new plan is to treat the outputs of the algorithms as suggestions for possible themes but then have a manual step where I extract what I think are interesting themes in the series.

A theme can consist of a single word or a phrase, and the idea is that once a story arc is identified we’ll search over the corpus and find the episodes where that word or phrase occurs.

We can then generate a CSV file of (story arc) -> (episodeId), store that into our HIMYM graph and use the story arc as another factor for episode similarity.

I ended up with the following script to work out which episodes contained a story arc:

#!/bin/bash
 
find_term() {
  arc=${1}
  searchTerm=${2}
  # episode ids of every sentence that matches the search term
  episodes=$(grep --color -iE "${searchTerm}" data/import/sentences.csv | awk -F"," '{ print $2  }' | sort | uniq)
  for episode in ${episodes}; do
    echo ${arc},${episode}
  done
}
 
find_term "Bro Code" "bro code"
find_term "Legendary" "legen(.*)ary"
find_term "Slutty Pumpkin" "slutty pumpkin"
find_term "Magician's Code" "magician's code"
find_term "Thanksgiving" "thanksgiving"
find_term "The Playbook" "playbook"
find_term "Slap Bet" "slap bet"
find_term "Wrestlers and Robots" "wrestlers"
find_term "Robin Sparkles" "sparkles"
find_term "Blue French Horn" "blue french horn"
find_term "Olive Theory" "olive"
find_term "Thank You Linus" "thank you, linus"
find_term "Have you met...?" "have you met"
find_term "Laser Tag" "laser tag"
find_term "Goliath National Bank" "goliath national bank"
find_term "Challenge Accepted" "challenge accepted"
find_term "Best Man" "best man"

If we run this script we’ll see something like the following:

$ ./scripts/arcs.sh
Bro Code,14
Bro Code,155
Bro Code,170
Bro Code,188
Bro Code,201
Bro Code,61
Bro Code,64
Legendary,117
Legendary,120
Legendary,122
Legendary,136
Legendary,137
Legendary,15
Legendary,152
Legendary,157
Legendary,162
Legendary,171
...
Best Man,208
Best Man,30
Best Man,32
Best Man,41
Best Man,42

I pulled out these themes by eyeballing the output of the following scripts:

  • TF/IDF – calculates TF/IDF scores for ngrams. This helps find important themes in the context of a single episode. I then did some manual searching to see how many of those themes existed in other episodes (there’s a small sketch of this step after the list).
  • Weighted Term Frequency – this returns a weighted term frequency for ngrams of different lengths. The weights are determined by the skewed random discrete distribution I wrote about earlier in the week. I ran it with different skews and ngram lengths.
  • Named entity extraction – this pulls out any phrases that are named entities. It mostly pulled out names of people (which I used as a stop word list in some other algorithms) but also revealed a couple of themes.
  • Topic modelling – I used mallet to extract topics across the corpus. Most of them didn’t make much sense to me but there were a few which identified themes that I recognised.
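
To give an idea of what the first of those looks like in practice, here’s a minimal TF/IDF sketch along the lines of what I mean – scikit-learn based, with the episode file layout and the ngram/stop word parameters as assumptions rather than the exact script:

# Rough sketch of the TF/IDF step: score ngrams per episode so that phrases
# which are unusually frequent in a single episode float to the top.
# The episode file layout here is an assumption, not the real directory.
import glob
from sklearn.feature_extraction.text import TfidfVectorizer

episode_files = sorted(glob.glob("data/import/episodes/*.txt"))
documents = [open(episode_file).read() for episode_file in episode_files]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names()

# print the 10 highest scoring ngrams for the first episode
row = tfidf[0].toarray().ravel()
for idx in row.argsort()[::-1][:10]:
    print("{0}: {1:.3f}".format(terms[idx], row[idx]))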

I can’t remember off the top of my head if any obvious themes have been missed so if you know HIMYM better than me let me know and I’ll try and work out why those didn’t surface.

Next I want to see how these scripts fare against some other TV shows and see how quickly I can extract themes for those. It’d also be cool if I can make the whole process a bit more automated.

Written by Mark Needham

April 3rd, 2015 at 11:31 pm

Posted in Machine Learning


Topic Modelling: Working out the optimal number of topics


In my continued exploration of topic modelling I came across The Programming Historian blog and a post showing how to derive topics from a corpus using the Java library mallet.

The instructions on the blog make it very easy to get up and running but as with other libraries I’ve used, you have to specify how many topics the corpus consists of. I’m never sure what value to select but the authors make the following suggestion:

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run the train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse.

There are computational ways of searching for this, including using MALLETs hlda command, but for the reader of this tutorial, it is probably just quicker to cycle through a number of iterations (but for more see Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Science, 101, 5228-5235).

Since I haven’t yet had the time to dive into the paper or explore how to use the appropriate option in mallet I thought I’d do some variations on the stop words and number of topics and see how that panned out.

As I understand it, the idea is to try and get a reasonably uniform spread of documents across topics, i.e. we don’t want all the documents to end up dominated by the same topic, otherwise any topic similarity calculations we run won’t be that interesting.

I tried running mallet with 10, 15, 20 and 30 topics and also varied the stop words used. I had one version which just stripped out the main characters and the word ‘narrator’, and another where I stripped out the top 20% of words by occurrence and any words that appeared fewer than 10 times.

The reason for doing this was that it should identify interesting phrases across episodes better than TF/IDF can while not just selecting the most popular words across the whole corpus.
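
For reference, this is roughly how that second stop word list could be generated – a minimal sketch rather than the script I actually used; the transcript location and the output file name extra_stop_words.txt are assumptions:

# Sketch: build an extra stop word list from the transcripts by taking the top
# 20% of words by occurrence plus any word appearing fewer than 10 times.
# The input path and output file name are illustrative.
import glob
import re
from collections import Counter

counts = Counter()
for transcript in glob.glob("data/import/episodes/*.txt"):
    with open(transcript) as f:
        counts.update(re.findall(r"[a-z']+", f.read().lower()))

by_frequency = [word for word, _ in counts.most_common()]
top_20_percent = set(by_frequency[:int(len(by_frequency) * 0.2)])
rare_words = set(word for word, count in counts.items() if count < 10)

with open("extra_stop_words.txt", "w") as out:
    out.write("\n".join(sorted(top_20_percent | rare_words)))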

I used mallet from the command line and ran it in two parts.

  1. Generate the model
  2. Work out the allocation of topics to documents based on hyperparameters

I wrote a script to help me out:

#!/bin/sh
 
train_model() {
  ./mallet-2.0.7/bin/mallet import-dir \
    --input mallet-2.0.7/sample-data/himym \
    --output ${2} \
    --keep-sequence \
    --remove-stopwords \
    --extra-stopwords ${1}
}
 
extract_topics() {
  ./mallet-2.0.7/bin/mallet train-topics \
    --input ${2} --num-topics ${1} \
    --optimize-interval 20 \
    --output-state himym-topic-state.gz \
    --output-topic-keys output/himym_${1}_${3}_keys.txt \
    --output-doc-topics output/himym_${1}_${3}_composition.txt
}
 
train_model "stop_words.txt" "output/himym.mallet"
train_model "main-words-stop.txt" "output/himym.main.words.stop.mallet"
 
extract_topics 10 "output/himym.mallet" "all.stop.words"
extract_topics 15 "output/himym.mallet" "all.stop.words"
extract_topics 20 "output/himym.mallet" "all.stop.words"
extract_topics 30 "output/himym.mallet" "all.stop.words"
 
extract_topics 10 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 15 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 20 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 30 "output/himym.main.words.stop.mallet" "main.stop.words"

As you can see, this script first builds a couple of mallet corpus files (one per stop word list) from the text files in ‘mallet-2.0.7/sample-data/himym’ – there is one file per episode of HIMYM. We then use those to generate differently sized topic models.

Each run produces two output files: one containing a list of topics and another describing what percentage of the words in each document come from each topic.

$ cat output/himym_10_all.stop.words_keys.txt
 
0	0.08929	back brad natalie loretta monkey show call classroom mitch put brunch betty give shelly tyler interview cigarette mc laren
1	0.05256	zoey jerry arthur back randy arcadian gael simon blauman blitz call boats becky appartment amy gary made steve boat
2	0.06338	back claudia trudy doug int abby call carl stuart voix rachel stacy jenkins cindy vo katie waitress holly front
3	0.06792	tony wendy royce back jersey jed waitress bluntly lucy made subtitle film curt mosley put laura baggage officer bell
4	0.21609	back give patrice put find show made bilson nick call sam shannon appartment fire robots top basketball wrestlers jinx
5	0.07385	blah bob back thanksgiving ericksen maggie judy pj valentine amanda made call mickey marcus give put dishes juice int
6	0.04638	druthers karen back jen punchy jeanette lewis show jim give pr dah made cougar call jessica sparkles find glitter
7	0.05751	nora mike pete scooter back magazine tiffany cootes garrison kevin halloween henrietta pumpkin slutty made call bottles gruber give
8	0.07321	ranjit back sandy mary burger call find mall moby heather give goat truck made put duck found stangel penelope
9	0.31692	back give call made find put move found quinn part ten original side ellen chicago italy locket mine show
$ head -n 10 output/himym_10_all.stop.words_composition.txt
#doc name topic proportion ...
0	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/1.txt	0	0.70961794636687	9	0.1294699168584466	8	0.07950442338871108	2	0.07192178481473664	4	0.008360809510263838	5	2.7862560133367015E-4	3	2.562409242784946E-4	7	2.1697378721335337E-4	1	1.982849604752168E-4	6	1.749937876710496E-4
1	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/10.txt	2	0.9811551470820473	9	0.016716882136209997	4	6.794128563082893E-4	0	2.807350575301132E-4	5	2.3219634098530471E-4	8	2.3018997315244256E-4	3	2.1354177341696056E-4	7	1.8081798384467614E-4	1	1.6524340216541808E-4	6	1.4583339433951297E-4
2	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/100.txt	2	0.724061485807234	4	0.13624729774423758	0	0.13546964196228636	9	0.0019436342339785994	5	4.5291919356563914E-4	8	4.490055982996677E-4	3	4.1653183421485213E-4	7	3.5270123154213927E-4	1	3.2232165301666123E-4	6	2.8446074162457316E-4
3	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/101.txt	2	0.7815231689893246	0	0.14798271520316794	9	0.023582384458063092	8	0.022251052243582908	1	0.022138209217973336	4	0.0011804626661380394	5	4.0343527385745457E-4	3	3.7102343418895774E-4	7	3.1416667687862693E-4	6	2.533818368250992E-
4	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/102.txt	6	0.6448245189567259	4	0.18612146979166502	3	0.16624873439661025	9	0.0012233726722317548	0	3.4467218590717303E-4	5	2.850788252495599E-4	8	2.8261550915084904E-4	2	2.446611421432842E-4	7	2.2199909869250053E-4	1	2.028774216237081E-
5	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/103.txt	8	0.7531586740033047	5	0.17839539108961253	0	0.06512376460651902	9	0.001282794040111701	4	8.746645156304241E-4	3	2.749100345664577E-4	2	2.5654476523149865E-4	7	2.327819863700214E-4	1	2.1273153572848481E-4	6	1.8774342292520802E-4
6	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/104.txt	7	0.9489502365148181	8	0.030091466847852504	4	0.017936457663121977	9	0.0013482824985091328	0	3.7986419553884905E-4	5	3.141861834124008E-4	3	2.889445824352445E-4	2	2.6964174000656E-4	1	2.2359178288566958E-4	6	1.9732799141958482E-4
7	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/105.txt	8	0.7339694064061175	7	0.1237041841318045	9	0.11889696041555338	0	0.02005288536233353	4	0.0014026751618923005	5	4.793786828705149E-4	3	4.408655780020889E-4	2	4.1141370625324785E-4	1	3.411516484151411E-4	6	3.0107890675777946E-4
8	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/106.txt	5	0.37064909999661005	9	0.3613559917055785	0	0.14857567731040344	6	0.09545466082502917	4	0.022300625744661403	8	3.8725629469313333E-4	3	3.592484711785775E-4	2	3.3524900189121E-4	7	3.041961449432886E-4	1	2.779945050112539E-4

The output is a bit tricky to understand on its own so I did a bit of post processing using pandas and then ran the results of that through matplotlib to see the distribution of documents across topics for different topic sizes with different stop words. You can see the script here.
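
The gist of that post processing is something like the following – a simplified sketch of the linked script, assuming mallet lists each document’s topics in descending order of proportion:

# Sketch: read one of the composition files, grab each document's dominant
# topic (the first topic column, since the topics are sorted by proportion)
# and plot how many documents end up in each topic.
import pandas as pd
import matplotlib.pyplot as plt

composition = pd.read_csv("output/himym_10_all.stop.words_composition.txt",
                          sep="\t", skiprows=1, header=None)

# column 0 = doc id, column 1 = file name, column 2 = highest scoring topic
dominant_topic = composition[2]
dominant_topic.value_counts().sort_index().plot(kind="bar")

plt.xlabel("topic")
plt.ylabel("number of documents")
plt.show()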

I ended up with the following chart:

[Chart: number of documents assigned to each topic, for 10/15/20/30 topics with the two stop word variations]

On the left hand side we’re using more stop words and on the right just the main ones. For most of the variations there are one or two topics which most documents belong to but interestingly the most uniform distribution seems to be when we have few topics.

These are the main words for the most popular topics on the left hand side:

15 topics

8       0.50732 back give call made put find found part move show side ten mine top abby front fire full fianc

20 topics

12      0.61545 back give call made put find show found part move side mine top front ten full cry fire fianc

30 topics

22      0.713   back call give made put find show part found side move front ten full top mine fire cry bottom

All contain more or less the same words, which at first glance seem quite generic, so I’m surprised they weren’t excluded.

On the right hand side we haven’t removed many words so we’d expect common words in the English language to dominate. Let’s see if they do:

10 topics

1       3.79451 don yeah ll hey ve back time guys good gonna love god night wait uh thing guy great make

15 topics

5       2.81543 good time love ll great man guy ve night make girl day back wait god life yeah years thing
 
10      1.52295 don yeah hey gonna uh guys didn back ve ll um kids give wow doesn thing totally god fine

20 topics

1       3.06732 good time love wait great man make day back ve god life years thought big give apartment people work
 
13      1.68795 don yeah hey gonna ll uh guys night didn back ve girl um kids wow guy kind thing baby

30 topics

14      1.42509 don yeah hey gonna uh guys didn back ve um thing ll kids wow time doesn totally kind wasn
 
24      2.19053 guy love man girl wait god ll back great yeah day call night people guys years home room phone
 
29      1.84685 good make ve ll stop time made nice put feel love friends big long talk baby thought things happy

Again we have similar words across each run and as expected they are all quite generic words.

My takeaway from this exploration is that I should vary the stop word percentages as well and see if that leads to an improved distribution.

Taking out very common words like we do with the left hand side charts seems to make sense although I need to work out why there’s a single outlier in each group.

The authors suggest that having the majority of our texts in a small number of topics means we need to create more of them so I will investigate that too.

The code is all on github along with the transcripts so give it a try and let me know what you think.

Written by Mark Needham

March 24th, 2015 at 10:33 pm

Posted in Machine Learning,Python


Python: scikit-learn/lda: Extracting topics from QCon talk abstracts


Following on from Rik van Bruggen’s blog post on a QCon graph he’s created ahead of this week’s conference, I was curious whether we could extract any interesting relationships between talks based on their abstracts.

Talks are already grouped by their hosting track but there’s likely to be some overlap in topics even for talks on different tracks.
I therefore wanted to extract topics and connect each talk to the topic that describes it best.

My first attempt followed an example which uses Non-Negative Matrix Factorization. That worked very well for extracting topics but didn’t seem to provide an obvious way to link those topics to individual talks.

Instead I ended up looking at the lda library which uses Latent Dirichlet Allocation and allowed me to achieve both goals.

I already had some code to run TF/IDF over each of the talks so I thought I’d be able to feed the matrix output from that into the LDA function. This is what I started with:

import csv
 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from collections import defaultdict
from bs4 import BeautifulSoup, NavigableString
from soupselect import select
 
def uri_to_file_name(uri):
    return uri.replace("/", "-")
 
sessions = {}
with open("data/sessions.csv", "r") as sessions_file:
    reader = csv.reader(sessions_file, delimiter = ",")
    reader.next() # header
    for row in reader:
        session_id = int(row[0])
        filename = "data/sessions/" + uri_to_file_name(row[4])
        page = open(filename).read()
        soup = BeautifulSoup(page)
        abstract = select(soup, "div.brenham-main-content p")
        if abstract:
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
        else:
            abstract = select(soup, "div.pane-content p")
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
 
corpus = []
titles = []
for id, session in sorted(sessions.iteritems(), key=lambda t: int(t[0])):
    corpus.append(session["abstract"])
    titles.append(session["title"])
 
n_topics = 15
n_top_words = 50
n_features = 6000
 
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')
matrix =  vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()
 
import lda
import numpy as np
 
vocab = feature_names
 
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(matrix)
topic_word = model.topic_word_
n_top_words = 20
 
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

And if we run it?

Topic 0: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 1: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 2: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 3: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 4: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 5: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 6: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 7: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 8: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 9: 10 faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 10: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 11: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 12: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 13: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 14: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 15: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 16: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 17: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 18: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 19: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure

As you can see, every topic has the same set of words, which isn’t what we want – presumably because LDA works on raw word counts and doesn’t know what to do with fractional TF/IDF weights. Let’s switch out our TF/IDF vectorizer for a simpler count based one:

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')

The rest of the code stays the same and these are the topics that get extracted:

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck
Topic 1: process development delivery platform developer continuous testing rapidly deployment implementing release demonstrate paas advice hard light predictable radically introduce
Topic 2: way open space kind people change meetings ll lead powerful practice times everyday simple qconlondon organization unconference track extraordinary
Topic 3: apache apis processing open spark distributed leading making environments solr cases brooklyn components existing ingestion contributing data target evolved
Topic 4: management million effective cost halo gameplay player billion ad catastrophic store microsoft final music influence information launch research purchased
Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently
Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 8: ve development teams use things world like time learned lessons think methods multiple story say customer developer experiences organisations
Topic 9: software building docker built challenges monitoring gilt application discuss solution decision talk download source center critical decisions bintray customers
Topic 10: years infrastructure tools language different service lot devops talk adoption scala popular clojure advantages introduced effectively looking wasn includes
Topic 11: high does latency session requirements functional performance real world questions problem second engineering patterns gravity explain discuss expected time
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going
Topic 13: need design systems large driven scale software applications slow protocol change needs approach gets new contracts solutions complicated distributed
Topic 14: architecture service micro architectures increasing talk microservices order market value values new present presents services scalable trading practices today
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 16: data services using traditional create ways support uk large user person complex systems production impact art organizations accessing mirage
Topic 17: agile team experience don work doing processes based key reach extra defined pressure machines nightmare practices learn goals guidance
Topic 18: internet new devices programming things iot big number deliver day connected performing growing got state thing provided times automated
Topic 19: cloud including deploy session api government security culture software type attack techniques environment digital secure microservice better creation interaction

Some of the groupings seem to make sense, e.g. Topic 11 contains words related to high performance code with low latency and Topic 15 covers Java, the JVM and other related words, but others are more difficult to decipher.

For example, both Topic 14 and Topic 19 talk about micro services, but the latter mentions ‘government’ and ‘security’ so perhaps the talks linked to that topic come at micro services from a different angle altogether.

Next let’s see which topics a talk is most likely to be about. We’ll look at the first ten:

doc_topic = model.doc_topic_
for i in range(0, 10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    print(doc_topic[i].argsort()[::-1][:3])
 
To the Moon (top topic: 8)
[ 8  0 11]
Evolutionary Architecture and Micro-Services - A Match Enabled by Continuous Delivery (top topic: 14)
[14 19 16]
How SoundCloud uses Cassandra (top topic: 0)
[0 6 5]
DevOps and the Need for Speed (top topic: 18)
[18  5 16]
Neuro-diversity and agile (top topic: 7)
[17  7  2]
Java 8 in Anger (top topic: 7)
[ 7 15 12]
APIs that Change Lifestyles (top topic: 9)
[ 9  6 19]
Elasticsearch powers the Citizen Advice Bureau (CAB) to monitor trends in society before they become issues (top topic: 16)
[16 12 19]
Architecture Open Space (top topic: 2)
[ 2 19 18]
Don’t let Data Gravity crush your infrastructure (top topic: 11)
[11 16  3]

So our third talk on the list ‘How SoundCloud uses Cassandra’ does end up being tagged with topic 0 which mentions SoundCloud so that’s good!

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck

Its next two topics are 5 & 6, which contain the following words…

Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently

…which are not as intuitive. What about Java 8 in Anger? It’s been tagged with topics 7, 15 and 12:

Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going

Topic 15 makes sense since it mentions Java, and perhaps 12 and 7 do as well since they both mention developers.

So while the topics pulled out are not horrendous I don’t think they’re particularly useful yet either. These are some of the areas I need to do some more research around:

  • How do you measure the success of topic modelling? I’ve been eyeballing the output of the algorithm but I imagine there’s an automated way to do that.
  • How do you determine the right number of topics? I found an article written by Christophe Grainger which explains a way of doing that which I need to look at in more detail.
  • It feels like I would be able to pull out better topics if I had an ontology of computer science/software words and then ran the words through that to derive topics.
  • Another approach suggested by Michael is to find the most popular words using the CountVectorizer and tag talks with those instead (there’s a small sketch of this below).
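
For that last suggestion, a minimal sketch of what it might look like – this reuses the corpus and titles lists from the code above, and the cut off of 20 words is an arbitrary assumption:

# Sketch of the suggestion: find the most common words across all the abstracts
# with a CountVectorizer and tag each talk with the popular words that appear
# in its abstract. Not something I've run yet, just the shape of the idea.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(analyzer='word', stop_words='english')
counts = count_vectorizer.fit_transform(corpus)

totals = np.asarray(counts.sum(axis=0)).ravel()
vocab = count_vectorizer.get_feature_names()
popular = [vocab[idx] for idx in totals.argsort()[::-1][:20]]

for title, row in zip(titles, counts.toarray()):
    tags = [word for word in popular if row[count_vectorizer.vocabulary_[word]] > 0]
    print("{0} -> {1}".format(title, tags))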

If you have any suggestions let me know. The full code is on github if you want to play around with it.

Written by Mark Needham

March 5th, 2015 at 8:52 am

Posted in Machine Learning,Python


Kaggle Titanic: Python pandas attempt


Nathan and I have been looking at Kaggle’s Titanic problem and while working through the Python tutorial Nathan pointed out that we could greatly simplify the code if we used pandas instead.

The problem we had with numpy is that you use integers to reference columns. We spent a lot of time being thoroughly confused as to why something wasn’t working only to realise we were using the wrong column.

The algorithm scores an accuracy of 77.99% and the way it works is that we build a ‘survival table’ which works out an average survival rate of people based on 3 features:

  • Passenger Class
  • Passenger Fare (grouped into those who paid 0-9, 10-19, 20-29, 30+)
  • Gender

It looks like this:

[Table: the survival table – average survival rate and number of people for each sex / passenger class / fare bucket combination]

And the code that creates that is:

import pandas as pd
 
def addrow(df, row):
    return df.append(pd.DataFrame(row), ignore_index=True)
 
def fare_in_bucket(fare, fare_bracket_size, bucket):
    return (fare > bucket * fare_bracket_size) & (fare <= ((bucket+1) * fare_bracket_size))
 
def build_survival_table(training_file):    
    fare_ceiling = 40
    train_df = pd.read_csv(training_file)
    # cap fares so that anything from 39 upwards falls into the final (30+) bucket
    train_df.loc[train_df['Fare'] >= 39.0, 'Fare'] = 39.0
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling / fare_bracket_size
    number_of_classes = 3 #There were 1st, 2nd and 3rd classes on board 
 
    survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived', 'NumberOfPeople'])
 
    for pclass in range(1, number_of_classes + 1): # add 1 to handle 0 start
        for bucket in range(0, number_of_price_brackets):
            for sex in ['female', 'male']:
                survival = train_df[(train_df['Sex'] == sex) 
                                    & (train_df['Pclass'] == pclass) 
                                    & fare_in_bucket(train_df["Fare"], fare_bracket_size, bucket)]
 
                row = [dict(Pclass=pclass, Sex=sex, PriceDist = bucket, 
                            Survived = round(survival['Survived'].mean()), 
                            NumberOfPeople = survival.count()[0]) ]
                survival_table = addrow(survival_table, row)
 
    return survival_table.fillna(0)
 
survival_table = build_survival_table("train.csv")

where ‘train.csv’ is structured like so:

$ head -n5 train.csv 
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

After we’ve built that we iterate through the test data set and look up each person in the table and find their survival rate.

def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
    else:
        return 3
 
def calculate_survival(survival_table, row):
    survival_row = survival_table[(survival_table["Sex"] == row["Sex"]) & (survival_table["Pclass"] == row["Pclass"]) & (survival_table["PriceDist"] == select_bucket(row["Fare"]))]
    return int(survival_row["Survived"].iat[0])    
 
test_df = pd.read_csv('test.csv')             
test_df["Survived"] = test_df.apply(lambda row: calculate_survival(survival_table, row), axis=1)

I wrote up the difficulties we had working out how to append the ‘Survived’ column if you want more detail.

‘test.csv’ looks like this:

$ head -n5 test.csv 
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S

We then write out the survival value for each passenger along with their ID:

test_df.to_csv("result.csv", cols=['PassengerId', 'Survived'], index=False)
$ head -n5 result.csv 
PassengerId,Survived
892,0
893,1
894,0
895,0

I’ve pasted the code as a gist for those who want to see it all as one.

Next step: introduce some real machine learning, probably using scikit-learn unless there’s something else we should be using?
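
If we do go down the scikit-learn route, my guess is that a first attempt would look something like this – a sketch using the same three features, not something we’ve actually run against Kaggle yet:

# Sketch of a possible next step: train a random forest on the same three
# features rather than looking survival rates up in a table. Untested.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def prepare(df):
    features = pd.DataFrame()
    features["Pclass"] = df["Pclass"]
    features["Sex"] = (df["Sex"] == "female").astype(int)
    features["Fare"] = df["Fare"].fillna(0)
    return features

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(prepare(train_df), train_df["Survived"])

test_df["Survived"] = model.predict(prepare(test_df))
test_df[["PassengerId", "Survived"]].to_csv("result.csv", index=False)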

Written by Mark Needham

October 30th, 2013 at 7:26 am

Posted in Machine Learning


Feature Extraction/Selection – What I’ve learnt so far


A couple of weeks ago I wrote about some feature extraction work that I’d done on the Kaggle Digit Recognizer data set and, having realised that I had no idea what I was doing, I thought I should probably learn a bit more.

I came across Dunja Mladenic’s ‘Dimensionality Reduction by Feature Selection in Machine Learning‘ presentation in which she sweeps across the landscape of feature selection and explains how everything fits together.

The talk starts off by going through some reasons that we’d want to use dimensionality reduction/feature selection:

  • Improve the prediction performance
  • Improve learning efficiency
  • Provide faster predictors possibly requesting less information on the original data
  • Reduce complexity of the learned results, enable better understanding of the underlying process

Mladenic suggests that there are a few ways we can go about reducing the dimensionality of data:

  • Selecting a subset of the original features
  • Constructing features to replace the original features
  • Using background knowledge to construct new features to be used in addition to the original features

The talk focuses on the first of these, and a lot of it covers how we can go about using feature selection as a pre-processing step on our data sets.

The approach seems to involve either starting with all the features and removing them one at a time to see how the outcome is affected, or starting with none of the features and adding them one at a time.
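
To make the second of those concrete, here’s a minimal sketch of greedy forward selection using scikit-learn – the digits data set, classifier and scoring choices are just placeholders:

# Greedy forward selection: repeatedly add whichever remaining feature improves
# cross validated accuracy the most, stopping when nothing improves the score.
# This is O(number of features squared) model fits, so it's slow on wide data.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

digits = load_digits()
X, y = digits.data, digits.target

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

while remaining:
    scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [feature]], y, cv=3).mean(), feature)
              for feature in remaining]
    score, feature = max(scores)
    if score <= best_score:
        break
    best_score = score
    selected.append(feature)
    remaining.remove(feature)
    print("added feature {0}, cross validated accuracy now {1:.3f}".format(feature, best_score))

The backwards variant works the same way in reverse: start with every feature and repeatedly drop whichever one hurts the score least.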

However, about half way through the talk Mladenic points out that some algorithms actually have feature selection built into them so there’s no need to have the pre-processing step.

I think this is the case with random forests of decision trees because the decision trees are constructed by taking into account which features give the greatest information gain so low impact features are less likely to be used.

I previously wrote a blog post describing how I removed all the features with zero variance from the data set and after submitting a random forest trained on the new data set we saw no change in accuracy which proved the point.

I also came across an interesting paper by Isabelle Guyon & Andre Elisseeff titled ‘An Introduction to Variable and Feature Selection‘ which has a flow chart-ish set of questions to help you work out where to start.

One of the things I picked up from reading this paper is that if you have domain knowledge then you might be able to construct a better set of features by making use of this knowledge.

Another suggestion is to come up with a variable ranking for each feature i.e. how much that feature contributes to the outcome/prediction. This is something also suggested in the Coursera Data Analysis course and in R we can use the glm function to help work this out.

The authors also point out that we should separate the problem of model selection (i.e. working out which features to use) from the problem of testing our classifier.

To test the classifier we’d most likely keep a test set aside but we shouldn’t use this data for testing feature selection, rather we should use the training data. Cross validation probably works best here.

There’s obviously more in the presentation and paper than I’ve covered here, but I’ve found that in general the material I’ve come across tends to drift towards being quite abstract/theoretical and is therefore quite difficult for me to follow.

If anyone has come across any articles/books which explain how to go about feature selection using an example I’d love to read it/them!

Written by Mark Needham

February 10th, 2013 at 3:42 pm

Posted in Machine Learning


Kaggle Digit Recognizer: A feature extraction #fail


I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non-zero pixels, the number of pixels with the value 255, the average pixel value and the average value of the middle pixels of a number.

The code reads like this:

initial <- read.csv("train.csv", header = TRUE)
initial$nonZeros <- apply(initial, 1, function(entries) length(Filter(function (x) x != 0, entries)))
initial$fullHouses <- apply(initial, 1, function(entries) length(Filter(function (x) x == 255, entries)))
initial$meanPixels <- apply(initial, 1, mean)
initial$middlePixels <- apply(initial[,200:500], 1, mean)

I then wrote those features out into a CSV file like so:

newFeatures <- subset(initial, select=c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file="feature-extraction.txt", newFeatures, row.names=FALSE, sep=",")

I then created a 100 tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24% which is pretty useless so it’s back to the drawing board!

Our next task is to try and work out whether we can derive some features which have a stronger correlation with the label values, or whether combining the new features with the existing pixel values has any impact.

As you can probably tell I don’t really understand how you should go about extracting features so if anybody has ideas or papers/articles I can read to learn more please let me know in the comments!

Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning


Kaggle Digit Recognizer: Finding pixels with no variance using R


I’ve written previously about our attempts at the Kaggle Digit Recogniser problem and our approach so far has been to use the data provided and plug it into different algorithms and see what we end up with.

From browsing through the forums we saw others mentioning feature extraction – an approach where we transform the data into another format, the thinking being that we can train a better classifier with better data.

There was quite a nice quote from a post written by Rachel Schutt about the Columbia Data Science course which summed up the mistake we’d made:

The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One thing we’d noticed while visually scanning the data set was that a lot of the features seemed to consistently have a value of 0. We thought we’d try and find out which pixels those were by finding the features which had zero variance in their values.

I started out by loading a subset of the data set and then taking a sample of that to play around with:

initial <- read.csv("train.csv", nrows=10000, header = TRUE)
 
# take a sample of 1000 rows of the input 
sampleSet <- initial[sample(1:nrow(initial), 1000), ]

Just for fun I thought it’d be interesting to see how well the labels were distributed in my sample which we can do with the following code:

# get all the labels
sampleSet.labels <- as.factor(sampleSet$label)
 
> table(sampleSet.labels)
sampleSet.labels
  0   1   2   3   4   5   6   7   8   9 
102 116 100  97  95  91  79 122 102  96

There are a few more 1s and 7s than the other labels but they’re roughly in the same ballpark so it’s ok.

I wanted to exclude the ‘label’ field from the data set because the variance of labels isn’t interesting to us on this occasion. We can do that with the following code:

# get data set excluding label
excludingLabel <- subset( sampleSet, select = -label)

To find all the features with no variance we then do this:

# show all the features which don't have any variance - all have the same value
variances <- apply(excludingLabel, 2, var)
 
# get the names of the labels which have no variance
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
 
 [1] "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"   "pixel6"   "pixel7"  
  [9] "pixel8"   "pixel9"   "pixel10"  "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15" 
 [17] "pixel16"  "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
 [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29"  "pixel30"  "pixel31" 
 [33] "pixel32"  "pixel33"  "pixel51"  "pixel52"  "pixel53"  "pixel54"  "pixel55"  "pixel56" 
 [41] "pixel57"  "pixel58"  "pixel59"  "pixel60"  "pixel82"  "pixel83"  "pixel84"  "pixel85" 
 [49] "pixel86"  "pixel88"  "pixel110" "pixel111" "pixel112" "pixel113" "pixel114" "pixel139"
 [57] "pixel140" "pixel141" "pixel142" "pixel168" "pixel169" "pixel196" "pixel252" "pixel280"
 [65] "pixel308" "pixel335" "pixel336" "pixel364" "pixel365" "pixel392" "pixel393" "pixel420"
 [73] "pixel421" "pixel448" "pixel476" "pixel504" "pixel532" "pixel559" "pixel560" "pixel587"
 [81] "pixel615" "pixel643" "pixel644" "pixel645" "pixel671" "pixel672" "pixel673" "pixel699"
 [89] "pixel700" "pixel701" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731" "pixel752"
 [97] "pixel753" "pixel754" "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
[105] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"

We can count how many of these features there are by using the length function:

# count how many features have no variance
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 109

I then wrote those out to a file so that we could use them as the input to the code which builds up our classifier.

write(file="pointless-features.txt", pointlessFeatures)

Of course we should run the variance test against the full data set rather than just a sample and on the whole data set there are only 76 features with zero variance:

> sampleSet <- read.csv("train.csv", header = TRUE)
> sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
sampleSet.labels
   0    1    2    3    4    5    6    7    8    9 
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188 
> excludingLabel <- subset( sampleSet, select = -label)
> variances <- apply(excludingLabel, 2, var)
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 76

We’ve built decision trees using this reduced data set but haven’t yet submitted the forest to Kaggle to see if it’s any more accurate!

I picked up the little R I know from the Computing for Data Analysis course which started last week and from the book ‘R in a Nutshell‘ which my colleague Paul Lam recommended.

Written by Mark Needham

January 8th, 2013 at 12:48 am

Posted in Machine Learning,R


Mahout: Parallelising the creation of DecisionTrees


A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser Problem and after seeing how long it took to create forests with 500+ trees I wanted to see if this could be sped up by parallelising the process.

From looking at the DecisionTree it seemed like it should be possible to create lots of small forests and then combine them together.

After unsuccessfully trying to achieve this by directly using DecisionForest I decided to just copy all the code from that class into my own version which allowed me to achieve this.

The code to build up the forest ends up looking like this:

List<Node> trees = new ArrayList<Node>();
 
// load a previously created small forest and collect its trees (repeated for each small forest)
MultiDecisionForest smallForest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
trees.addAll(smallForest.getTrees());
 
// combine all the collected trees into one forest we can classify with
MultiDecisionForest forest = new MultiDecisionForest(trees);

We can then use forest to classify values in a test data set and it seems to work reasonably well.

I wanted to try and avoid putting any threading code in so I made use of GNU parallel which is available on Mac OS X with a brew install parallel and on Ubuntu by adding the following repository to /etc/apt/sources.list

deb http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main 
deb-src http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main

…followed by an apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

parallelise-forests.sh

#!/bin/bash 
 
start=`date`
startTime=`date '+%s'`
numberOfRuns=$1
 
seq 1 ${numberOfRuns} | parallel -P 8 "./build-forest.sh"
 
end=`date`
endTime=`date '+%s'`
 
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: " $(expr $endTime - $startTime)

build-forest.sh

#!/bin/bash
 
java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar main.java.MahoutPlaybox

It should be possible to achieve this by using the parallel option in xargs but unfortunately I wasn’t able to achieve the same success with that command.

I hadn’t come across the seq command until today but it works quite well here for allowing us to specify how many times we want to call the script.

I was probably able to achieve about a 30% speed increase when running this on my Air. There was a greater increase running on a high CPU AWS instance although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.

Sadly even with a new classifier with a massive number of trees I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.

Written by Mark Needham

December 27th, 2012 at 12:08 am

Posted in Machine Learning


Weka: Saving and loading classifiers


In our continued machine learning travels Jen and I have been building some classifiers using Weka and one thing we wanted to do was save the classifier and then reuse it later.

There is documentation for how to do this from the command line but we’re doing everything programmatically and wanted to be able to save our classifiers from Java code.

As it turns out it’s not too tricky when you know which classes to call and saving a classifier to a file is as simple as this:

MultilayerPerceptron classifier = new MultilayerPerceptron();
classifier.buildClassifier(instances); // instances gets passed in from elsewhere
 
Debug.saveToFile("/path/to/weka-neural-network", classifier);

If we want to load that classifier up we can make use of the SerializedClassifier class like so:

SerializedClassifier classifier = new SerializedClassifier();
classifier.setModelFile(new File("/path/to/weka-neural-network"));

Simples!

Written by Mark Needham

December 12th, 2012 at 12:04 am

Posted in Machine Learning


Kaggle Digit Recognizer: Weka AdaBoost attempt


In our latest attempt at Kaggle’s Digit Recognizer Jen and I decided to try out boosting on our random forest algorithm, an approach that Jen had come across in a talk at the Clojure Conj.

We couldn’t find any documentation that it was possible to apply boosting to Mahout’s random forest algorithm but we knew it was possible with Weka so we decided to use that instead!

As I understand it the way that boosting works in the context of random forests is that each of the trees in the forest will be assigned a weight based on how accurately it’s able to classify the data set and these weights are then used in the voting stage.
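
As a rough illustration of that weighting idea, here’s the general AdaBoost scheme sketched in Python – this is the textbook algorithm rather than exactly what Weka’s MultiBoostAB does, and train_round is a placeholder for whatever builds each round’s classifier:

# Toy sketch of AdaBoost style weighting: each round's classifier gets a vote
# weight derived from its weighted error rate, and the examples it got wrong
# have their weights boosted for the next round. Illustrative only.
import math

def ada_boost(train_round, X, y, rounds=10):
    n = len(y)
    sample_weights = [1.0 / n] * n
    ensemble = []  # list of (vote_weight, classifier) pairs used at voting time
    for _ in range(rounds):
        classifier = train_round(X, y, sample_weights)
        predictions = [classifier(x) for x in X]
        error = sum(w for w, pred, actual in zip(sample_weights, predictions, y)
                    if pred != actual)
        if error == 0 or error >= 0.5:
            break
        vote_weight = 0.5 * math.log((1 - error) / error)
        ensemble.append((vote_weight, classifier))
        sample_weights = [w * math.exp(vote_weight if pred != actual else -vote_weight)
                          for w, pred, actual in zip(sample_weights, predictions, y)]
        total = sum(sample_weights)
        sample_weights = [w / total for w in sample_weights]
    return ensemble

At voting time each classifier’s prediction is then weighted by its vote_weight rather than every vote counting equally.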

There’s a more detailed explanation of the algorithm in this paper.

We had the following code to train the random forest:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        FastVector attributes = attributes();
 
        Instances instances = new Instances("digit recognizer", attributes, 40000);
        instances.setClassIndex(0);
 
        String[] trainingDataValues = KaggleInputReader.fileAsStringArray("data/train.csv");
 
        for (String trainingDataValue : trainingDataValues) {
            Instance instance = createInstance(trainingDataValue);
            instances.add(instance);
        }
 
        Classifier classifier = buildClassifier(instances);
    }
 
    private static Classifier buildClassifier(Instances instances) throws Exception {
        RandomForest randomForest = new RandomForest();
        randomForest.setNumTrees(200);
 
        MultiBoostAB multiBoostAB = new MultiBoostAB();
        multiBoostAB.setClassifier(randomForest);
        multiBoostAB.buildClassifier(instances);
        return multiBoostAB;
    }
 
    private static FastVector attributes() {
        FastVector attributes = new FastVector();
        attributes.addElement(digit());
 
        for (int i = 0; i <= 783; i++) {
            attributes.addElement(new Attribute("pixel" + i));
        }
 
        return attributes;
    }
 
    private static Attribute digit() {
        FastVector possibleClasses = new FastVector(10);
        possibleClasses.addElement("0");
        possibleClasses.addElement("1");
        possibleClasses.addElement("2");
        possibleClasses.addElement("3");
        possibleClasses.addElement("4");
        possibleClasses.addElement("5");
        possibleClasses.addElement("6");
        possibleClasses.addElement("7");
        possibleClasses.addElement("8");
        possibleClasses.addElement("9");
        return new Attribute("label", possibleClasses, 0);
 
    }
 
}

The code in the KaggleInputReader is used to process the CSV file and is the same as that included in a previous post so I won’t bother including it in this post.

The Weka API is slightly different to the Mahout one in that we have to tell it the names of all the labels that a combination of features belong to whereas with Mahout it seems to work it out for you.

We use the RandomForest class to build up our trees and then wrap it in the MultiBoostAB class to apply the boosting. There is another class we could use to do this called AdaBoostM1 but they both seem to give similar results so we stuck with this one.

Once we’d trained the classifier up we ran it against our test data set like so:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        ...
        String[] testDataValues = KaggleInputReader.fileAsStringArray("data/test.csv");
 
 
        FileWriter fileWriter = new FileWriter("weka-attempts/out-" + System.currentTimeMillis() + ".txt");
        PrintWriter out = new PrintWriter(fileWriter);
        for (String testDataValue : testDataValues) {
            Iteration iterate = iterate(testDataValue, classifier, instances);
            out.println((int) iterate.getPrediction());
            System.out.println("Actual: " + iterate.getActual() + ", Prediction: " + iterate.getPrediction());
        }
        out.close();
    }
 
    private static Iteration iterate(String testDataValue, Classifier classifier, Instances instances) throws Exception {
        Instance predictMe = createTestDataBasedInstanceToPredict(testDataValue, instances);
        double prediction = classifier.classifyInstance(predictMe);
 
        return new Iteration(new Double(testDataValue.split(",")[0]), prediction);
    }
 
    private static Instance createTestDataBasedInstanceToPredict(String testDataValue, Instances instances) {
        String[] columns = testDataValue.split(",");
        Instance instance = new Instance(785);
 
        for (int i = 0; i < columns.length; i++) {
            instance.setValue(new Attribute("pixel" + i, i+1), new Double(columns[i]));
        }
 
        instance.setDataset(instances);
        return instance;
    }
}

We got an accuracy of 96.529% with this code which is 0.2% higher than we managed with the Mahout Random forest without any boosting. The full code for this solution is on github as always!

We still haven’t managed to get an accuracy higher than the default solution provided by Kaggle so any suggestions about what else to try are welcome!

We’ve been playing around with neural networks using encog but they seem a bit magical at the moment and it’s difficult to work out what’s gone wrong when you don’t get the result you expect!

Written by Mark Needham

November 29th, 2012 at 5:09 pm

Posted in Machine Learning
