Mark Needham

Thoughts on Software Development

Neo4j: Generating real time recommendations with Cypher

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

GraphAware have a neat framework that helps you build your own recommendation engine in Java and I was curious what a Cypher version would look like.

This is the sample graph:

CREATE
    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
 
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),
 
    (m)-[:FRIEND_OF]->(d),
    (m)-[:FRIEND_OF]->(l),
    (m)-[:FRIEND_OF]->(a),
    (m)-[:FRIEND_OF]->(v),
    (d)-[:FRIEND_OF]->(v),
    (c)-[:FRIEND_OF]->(v),
    (d)-[:LIVES_IN]->(lon),
    (v)-[:LIVES_IN]->(lon),
    (m)-[:LIVES_IN]->(lon),
    (l)-[:LIVES_IN]->(mum);

We want to recommend some potential friends to ‘Adam’ so the first layer of our query is to find his friends of friends as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
 
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
 
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adam’s potential friends have the same gender as him by comparing the labels on each node. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
        LABELS(me) = LABELS(potentialFriend) AS gender
 
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up let’s calculate the age difference between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if the people live in the same location or have the same gender as Adam, and 0 if not. For ageDifference and friendsInCommon we’ll apply a log curve so that those values don’t have a disproportionate effect on our final score. We’ll use the formula defined in GraphAware’s ParetoScoreTransformer to do this:

    public <OUT> float transform(OUT item, float score) {
        // scores below the threshold don't contribute at all
        if (score < minimumThreshold) {
            return 0;
        }
 
        // alpha is chosen so the curve reaches 80% of maxScore
        // when score == eightyPercentLevel
        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }
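
As a quick sanity check we can reproduce that curve in Python – a throwaway sketch, not part of the GraphAware framework – and confirm it matches the numbers we’ll see in the final query’s output:

from math import exp, log
 
def pareto(score, eighty_percent_level, max_score):
    # reaches 80% of max_score when score == eighty_percent_level
    alpha = log(5.0) / eighty_percent_level
    return max_score * (1 - exp(-alpha * score))
 
print pareto(1, 10, 100)   # friendsInCommon = 1  -> 14.866...
print pareto(10, 20, 10)   # ageDifference = 10  -> 5.5278... (negated in the query)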

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
 
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference:ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of Cypher, so if you’re not comfortable with Java you can still do real time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up, so it’s not as easy to see the whole pipeline as it is with the GraphAware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Written by Mark Needham

March 27th, 2015 at 6:59 am

Posted in neo4j

Python: matplotlib hangs and shows nothing (Mac OS X)

I’ve been playing around with some of the matplotlib demos recently and discovered that simply copying one of the examples didn’t actually work for me.

I was following the bar chart example and had the following code:

import numpy as np
import matplotlib.pyplot as plt
 
N = 5
ind = np.arange(N)
fig, ax = plt.subplots()
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)
width = 0.35       # the width of the bars
rects1 = ax.bar(ind, menMeans, width, color='r', yerr=menStd)
 
plt.show()

When I execute this script from the command line it just hangs and I don’t see anything at all.

Via a combination of different blog posts (which all suggested different things!) I ended up with the following variation of imports which seems to do the job:

import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
 
N = 5
ind = np.arange(N)
fig, ax = plt.subplots()
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)
width = 0.35       # the width of the bars
rects1 = ax.bar(ind, menMeans, width, color='r', yerr=menStd)
 
plt.show()

If I run this script a Python window pops up and contains the following image which is what I expected to happen in the first place!

[image: the expected bar chart]

The thing to notice is that we’ve had to change the backend in order to use matplotlib from the shell:

With the TkAgg backend, which uses the Tkinter user interface toolkit, you can use matplotlib from an arbitrary non-gui python shell.
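
As an aside – not something covered in the original post – you can check which backend matplotlib has picked up with matplotlib.get_backend(), and set it permanently via the backend setting in ~/.matplotlib/matplotlibrc rather than calling matplotlib.use in every script:

import matplotlib
print matplotlib.get_backend()   # should now print 'TkAgg'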

Current state: Wishing for ggplot!

Written by Mark Needham

March 26th, 2015 at 12:02 am

Posted in Python

Topic Modelling: Working out the optimal number of topics

In my continued exploration of topic modelling I came across The Programming Historian blog and a post showing how to derive topics from a corpus using the Java library mallet.

The instructions on the blog make it very easy to get up and running but as with other libraries I’ve used, you have to specify how many topics the corpus consists of. I’m never sure what value to select but the authors make the following suggestion:

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run the train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse.

There are computational ways of searching for this, including using MALLET’s hlda command, but for the reader of this tutorial, it is probably just quicker to cycle through a number of iterations (but for more see Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Science, 101, 5228-5235).

Since I haven’t yet had the time to dive into the paper or explore how to use the appropriate option in mallet I thought I’d do some variations on the stop words and number of topics and see how that panned out.

As I understand it, the idea is to try and get a uniform spread of topics across documents – i.e. we don’t want all the documents to have the same topic, otherwise any topic similarity calculations we run won’t be that interesting.

I tried running mallet with 10, 15, 20 and 30 topics and also varied the stop words used. I had one version which just stripped out the main characters and the word ‘narrator’, and another where I stripped out the top 20% of words by occurrence and any words that appeared fewer than 10 times.

The reason for doing this was that it should identify interesting phrases across episodes better than TF/IDF can while not just selecting the most popular words across the whole corpus.

I used mallet from the command line and ran it in two parts.

  1. Generate the model
  2. Work out the allocation of topics to documents based on hyperparameters

I wrote a script to help me out:

#!/bin/sh
 
# $1 -> extra stop words file, $2 -> model output file
train_model() {
  ./mallet-2.0.7/bin/mallet import-dir \
    --input mallet-2.0.7/sample-data/himym \
    --output ${2} \
    --keep-sequence \
    --remove-stopwords \
    --extra-stopwords ${1}
}
 
# $1 -> number of topics, $2 -> model file, $3 -> label used in output file names
extract_topics() {
  ./mallet-2.0.7/bin/mallet train-topics \
    --input ${2} --num-topics ${1} \
    --optimize-interval 20 \
    --output-state himym-topic-state.gz \
    --output-topic-keys output/himym_${1}_${3}_keys.txt \
    --output-doc-topics output/himym_${1}_${3}_composition.txt
}
 
train_model "stop_words.txt" "output/himym.mallet"
train_model "main-words-stop.txt" "output/himym.main.words.stop.mallet"
 
extract_topics 10 "output/himym.mallet" "all.stop.words"
extract_topics 15 "output/himym.mallet" "all.stop.words"
extract_topics 20 "output/himym.mallet" "all.stop.words"
extract_topics 30 "output/himym.mallet" "all.stop.words"
 
extract_topics 10 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 15 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 20 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 30 "output/himym.main.words.stop.mallet" "main.stop.words"

As you can see, this script first generates two input files from the text files in ‘mallet-2.0.7/sample-data/himym’ – there is one file per episode of HIMYM – one for each set of stop words. We then use those to generate differently sized topic models.

The output of each run is two files: one containing a list of topics and another describing what percentage of the words in each document come from each topic.

$ cat output/himym_10_all.stop.words_keys.txt
 
0	0.08929	back brad natalie loretta monkey show call classroom mitch put brunch betty give shelly tyler interview cigarette mc laren
1	0.05256	zoey jerry arthur back randy arcadian gael simon blauman blitz call boats becky appartment amy gary made steve boat
2	0.06338	back claudia trudy doug int abby call carl stuart voix rachel stacy jenkins cindy vo katie waitress holly front
3	0.06792	tony wendy royce back jersey jed waitress bluntly lucy made subtitle film curt mosley put laura baggage officer bell
4	0.21609	back give patrice put find show made bilson nick call sam shannon appartment fire robots top basketball wrestlers jinx
5	0.07385	blah bob back thanksgiving ericksen maggie judy pj valentine amanda made call mickey marcus give put dishes juice int
6	0.04638	druthers karen back jen punchy jeanette lewis show jim give pr dah made cougar call jessica sparkles find glitter
7	0.05751	nora mike pete scooter back magazine tiffany cootes garrison kevin halloween henrietta pumpkin slutty made call bottles gruber give
8	0.07321	ranjit back sandy mary burger call find mall moby heather give goat truck made put duck found stangel penelope
9	0.31692	back give call made find put move found quinn part ten original side ellen chicago italy locket mine show
$ head -n 10 output/himym_10_all.stop.words_composition.txt
#doc name topic proportion ...
0	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/1.txt	0	0.70961794636687	9	0.1294699168584466	8	0.07950442338871108	2	0.07192178481473664	4	0.008360809510263838	5	2.7862560133367015E-4	3	2.562409242784946E-4	7	2.1697378721335337E-4	1	1.982849604752168E-4	6	1.749937876710496E-4
1	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/10.txt	2	0.9811551470820473	9	0.016716882136209997	4	6.794128563082893E-4	0	2.807350575301132E-4	5	2.3219634098530471E-4	8	2.3018997315244256E-4	3	2.1354177341696056E-4	7	1.8081798384467614E-4	1	1.6524340216541808E-4	6	1.4583339433951297E-4
2	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/100.txt	2	0.724061485807234	4	0.13624729774423758	0	0.13546964196228636	9	0.0019436342339785994	5	4.5291919356563914E-4	8	4.490055982996677E-4	3	4.1653183421485213E-4	7	3.5270123154213927E-4	1	3.2232165301666123E-4	6	2.8446074162457316E-4
3	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/101.txt	2	0.7815231689893246	0	0.14798271520316794	9	0.023582384458063092	8	0.022251052243582908	1	0.022138209217973336	4	0.0011804626661380394	5	4.0343527385745457E-4	3	3.7102343418895774E-4	7	3.1416667687862693E-4	6	2.533818368250992E-
4	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/102.txt	6	0.6448245189567259	4	0.18612146979166502	3	0.16624873439661025	9	0.0012233726722317548	0	3.4467218590717303E-4	5	2.850788252495599E-4	8	2.8261550915084904E-4	2	2.446611421432842E-4	7	2.2199909869250053E-4	1	2.028774216237081E-
5	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/103.txt	8	0.7531586740033047	5	0.17839539108961253	0	0.06512376460651902	9	0.001282794040111701	4	8.746645156304241E-4	3	2.749100345664577E-4	2	2.5654476523149865E-4	7	2.327819863700214E-4	1	2.1273153572848481E-4	6	1.8774342292520802E-4
6	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/104.txt	7	0.9489502365148181	8	0.030091466847852504	4	0.017936457663121977	9	0.0013482824985091328	0	3.7986419553884905E-4	5	3.141861834124008E-4	3	2.889445824352445E-4	2	2.6964174000656E-4	1	2.2359178288566958E-4	6	1.9732799141958482E-4
7	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/105.txt	8	0.7339694064061175	7	0.1237041841318045	9	0.11889696041555338	0	0.02005288536233353	4	0.0014026751618923005	5	4.793786828705149E-4	3	4.408655780020889E-4	2	4.1141370625324785E-4	1	3.411516484151411E-4	6	3.0107890675777946E-4
8	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/106.txt	5	0.37064909999661005	9	0.3613559917055785	0	0.14857567731040344	6	0.09545466082502917	4	0.022300625744661403	8	3.8725629469313333E-4	3	3.592484711785775E-4	2	3.3524900189121E-4	7	3.041961449432886E-4	1	2.779945050112539E-4

The output is a bit tricky to understand on its own so I did a bit of post processing using pandas and then ran the results of that through matplotlib to see the distribution of documents for different topic sizes with different stop words. You can see the script here.
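
For a flavour of that post processing, here’s a minimal sketch (not the actual script) which pulls out each document’s dominant topic proportion – it assumes, as the sample output above suggests, that mallet writes each document’s topic/proportion pairs sorted by proportion:

import pandas as pd
 
# each row: doc id, file name, then alternating topic id / proportion pairs
df = pd.read_csv("output/himym_10_all.stop.words_composition.txt",
                 sep="\t", skiprows=1, header=None)
 
# column 3 holds the first (largest) proportion for each document
print df[3].describe()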

I ended up with the following chart:

[chart: distribution of documents across topics for each topic count and stop word variation]

On the left hand side we’re using more stop words and on the right just the main ones. For most of the variations there are one or two topics which most documents belong to, but interestingly the most uniform distribution seems to be when we have fewer topics.

These are the main words for the most popular topics on the left hand side:

15 topics

8       0.50732 back give call made put find found part move show side ten mine top abby front fire full fianc

20 topics

12      0.61545 back give call made put find show found part move side mine top front ten full cry fire fianc

30 topics

22      0.713   back call give made put find show part found side move front ten full top mine fire cry bottom

All contain more or less the same words which at first glance seem like quite generic words so I’m surprised they weren’t excluded.

On the right hand side we haven’t removed many words so we’d expect common words in the English language to dominate. Let’s see if they do:

10 topics

1       3.79451 don yeah ll hey ve back time guys good gonna love god night wait uh thing guy great make

15 topics

5       2.81543 good time love ll great man guy ve night make girl day back wait god life yeah years thing
 
10      1.52295 don yeah hey gonna uh guys didn back ve ll um kids give wow doesn thing totally god fine

20 topics

1       3.06732 good time love wait great man make day back ve god life years thought big give apartment people work
 
13      1.68795 don yeah hey gonna ll uh guys night didn back ve girl um kids wow guy kind thing baby

30 topics

14      1.42509 don yeah hey gonna uh guys didn back ve um thing ll kids wow time doesn totally kind wasn
 
24      2.19053 guy love man girl wait god ll back great yeah day call night people guys years home room phone
 
29      1.84685 good make ve ll stop time made nice put feel love friends big long talk baby thought things happy

Again we have similar words across each run and as expected they are all quite generic words.

My takeaway from this exploration is that I should vary the stop word percentages as well and see if that leads to an improved distribution.

Taking out very common words like we do with the left hand side charts seems to make sense although I need to work out why there’s a single outlier in each group.

The authors suggest that having the majority of our texts in a small number of topics means we need to create more of them so I will investigate that too.

The code is all on GitHub along with the transcripts so give it a try and let me know what you think.

Written by Mark Needham

March 24th, 2015 at 10:33 pm

Posted in Machine Learning,Python

Python: Equivalent to flatMap for flattening an array of arrays

I found myself wanting to flatten an array of arrays while writing some Python code earlier this afternoon and, being lazy, my first attempt involved building the flattened array manually:

episodes = [
    {"id": 1, "topics": [1,2,3]},
    {"id": 2, "topics": [4,5,6]}
]
 
flattened_episodes = []
for episode in episodes:
    for topic in episode["topics"]:
        flattened_episodes.append({"id": episode["id"], "topic": topic})
 
for episode in flattened_episodes:
    print episode

If we run that we’ll see this output:

$ python flatten.py
 
{'topic': 1, 'id': 1}
{'topic': 2, 'id': 1}
{'topic': 3, 'id': 1}
{'topic': 4, 'id': 2}
{'topic': 5, 'id': 2}
{'topic': 6, 'id': 2}

What I was really looking for was the Python equivalent of the flatMap function, which I learnt can be achieved with a list comprehension like so:

flattened_episodes = [{"id": episode["id"], "topic": topic}
                      for episode in episodes
                      for topic in episode["topics"]]
 
for episode in flattened_episodes:
    print episode

We could also choose to use itertools in which case we’d have the following code:

from itertools import chain, imap
flattened_episodes = chain.from_iterable(
                        imap(lambda episode: [{"id": episode["id"], "topic": topic}
                                             for topic in episode["topics"]],
                             episodes))
for episode in flattened_episodes:
    print episode

We can then simplify this approach a little by wrapping it up in a ‘flatmap’ function:

def flatmap(f, items):
    return chain.from_iterable(imap(f, items))
 
flattened_episodes = flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]], episodes)
 
for episode in flattened_episodes:
    print episode
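
As an aside, itertools.imap is gone in Python 3, where the built-in map is already lazy, so the same function there would look like this:

from itertools import chain
 
def flatmap(f, items):
    return chain.from_iterable(map(f, items))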

I think the list comprehension approach works best here but I need to look into itertools more – it looks like it could work well for other list operations.

Written by Mark Needham

March 23rd, 2015 at 12:45 am

Posted in Python

Python: Simplifying the creation of a stop word list with defaultdict

I’ve been playing around with topic models again and recently read a paper by David Mimno which suggested the following heuristic for working out which words should go onto the stop list:

A good heuristic for identifying such words is to remove those that occur in more than 5-10% of documents (most common) and those that occur fewer than 5-10 times in the entire corpus (least common).

I decided to try this out on the HIMYM dataset that I’ve been working on over the last couple of months.

I started out with the following code to build a dictionary of words, their total occurrences and the episodes they’d been used in:

import csv
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
 
episodes = defaultdict(str)
with open("sentences.csv", "r") as file:
    reader = csv.reader(file, delimiter = ",")
    reader.next()
    for row in reader:
        episodes[row[1]] += row[4]
 
vectorizer = CountVectorizer(analyzer='word', min_df = 0, stop_words = 'english')
matrix = vectorizer.fit_transform(episodes.values())
features = vectorizer.get_feature_names()
 
words = {}
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words.get(word):
            words[word] = {}
 
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

This works fine but the code inside the last for block is ugly and most of it is handling the case when parts of a dictionary aren’t yet initialised, which is defaultdict territory. You’ll notice I’m using defaultdict in the first part of the code but not in the second, as I’d struggled to get it working there.

This was my first attempt at creating the ‘words’ variable with it:

>>> words = defaultdict({})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: first argument must be callable

We can see why this doesn’t work if we try to evaluate ‘{}’ as a function which is what defaultdict does internally:

>>> {}()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict' object is not callable

Instead what we need is to pass in ‘dict':

>>> dict()
{}
 
>>> words = defaultdict(dict)
 
>>> words
defaultdict(<type 'dict'>, {})

That simplifies the first bit of the loop:

words = defaultdict(dict)
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

We’ve still got a couple of other places to simplify though which we can do by defining a custom function and passing that into defaultdict:

def default_dict_function():
   return {"score": 0, "episodes": set()}
 
>>> words = defaultdict(default_dict_function)
 
>>> words
defaultdict(<function default_dict_function at 0x10963fcf8>, {})
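
A lambda works just as well here if you’d rather not define a named function – a small variation on the same idea:

>>> words = defaultdict(lambda: {"score": 0, "episodes": set()})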

And here’s the final product:

def default_dict_function():
   return {"score": 0, "episodes": set()}
words = defaultdict(default_dict_function)
 
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        words[word]["score"] += score
        if score > 0:
            words[word]["episodes"].add(doc_id)

After this we can write out the words to our stop list:

with open("stop_words.txt", "w") as file:
    writer = csv.writer(file, delimiter = ",")
    for word, value in words.iteritems():
        # appears in > 10% of episodes
        if len(value["episodes"]) > int(len(episodes) / 10):
            writer.writerow([word.encode('utf-8')])
 
        # fewer than 10 occurrences (elif so a word can't be written twice)
        elif value["score"] < 10:
            writer.writerow([word.encode('utf-8')])

Written by Mark Needham

March 22nd, 2015 at 1:51 am

Posted in Python

Python: Forgetting to use enumerate

Earlier this evening I found myself writing the equivalent of the following Python code while building a stop list for a topic model…

words = ["mark", "neo4j", "michael"]
word_position = 0
for word in words:
   print word_position, word
   word_position +=1

…which is very foolish given that there’s already a function that makes it really easy to grab the position of an item in a list:

for word_position, word in enumerate(words):
   print word_position, word

Python does make things extremely easy at times – you’re welcome future Mark!

Written by Mark Needham

March 22nd, 2015 at 1:28 am

Posted in Python

Badass: Making users awesome – Kathy Sierra: Book Review

I started reading Kathy Sierra’s new book ‘Badass: Making users awesome‘ a couple of weeks ago and with the gift of flights to/from Stockholm this week I’ve got through the rest of it.

I really enjoyed the book and have found myself returning to it almost every day to check exactly what was said on a particular topic.

There were a few things that I’ve taken away and have been going on about to anyone who will listen.

Paraphrasing, ‘help users acquire skills, don’t throw knowledge at them.’ I found this advice helpful both in my own learning of new things as well as for thinking about how to help users of Neo4j get up and running faster.

Whether we’re doing a talk, workshop or online training, the goal isn’t to teach the user a bunch of information/facts but rather to help them learn skills which they can use to achieve their ‘compelling context‘.

Having said that, it’s very easy to fall into the information/facts trap as that type of information is much easier to prepare and present. You don’t have to spend much time thinking about how the user is going to use it; rather, you hope that if you throw enough information at them some of it will stick.

A user’s compelling context is the problem they’re trying to solve regardless of the tools they use to solve it. The repeated example of this is a camera – we don’t buy a camera because we want to buy a camera, we buy it because we want to take great photographs.

There’s a really interesting section in the middle of the book which talks about expert performance and skill acquisition and how we can achieve this through deliberate practice.

My main takeaway here is that we have only mastered a skill if we can achieve 95% reliability in repeating the task within one to three 45-90 minute sessions.

If we can’t achieve this then the typical reaction is to either give up or keep trying to achieve the goal for many more hours. Neither of these is considered a useful approach.

Instead we should realise that if we can’t do the skill it’s probably because there’s a small sub-skill that we need to master first. So our next step is to break this skill down into its components, master those and then try the original skill again.

Amy Hoy’s ‘doing it backwards‘ guide is very helpful for doing the skill breakdown as it makes you ask the question ‘can I do it tomorrow?‘ – or is there something else that I need to do (learn) first?

I’ve been trying to apply this approach to my machine learning adventures, which most recently have involved various topic modelling attempts on a How I Met Your Mother data set.

I’d heard good things about the MALLET open source library but, having never used it before, I sketched out the goals/skills I wanted to achieve:

Extract topics for HIMYM corpus ->
Train a topic model with mallet ->
Tweak an existing topic model that uses mallet ->
Run an existing topic model that uses mallet -> 
Install mallet

The idea is that you then start from the last action and work your way back up the chain – it should also act as a nice deterrent for yak shaving.

While learning about mallet I came across several more articles on topic modelling that I should read, and while these don’t directly contribute to learning a skill I think they will give me good background and help me pick up some of the intuition behind topic modelling.

My takeaway about gaining knowledge of a skill is that when we’re getting started we should spend more time gaining practical knowledge than just reading, but once we get more into it we’ll naturally become more curious and do the background reading. I often find myself reading non-stop about things but never completely understanding them because I don’t go hands-on, so this was a good reminder.

One of the next things I’m working on is a similar skill breakdown for people learning Neo4j, and then we’ll look to apply this to make our meetup sessions more effective – should be fun!

The other awesome thing about this book is that I’ve come away with a bunch of other books to read as well.

In summary, if learning is your thing get yourself a copy of the book and read it over a few times – so many great tips, I’ve only covered a few.

Written by Mark Needham

March 20th, 2015 at 7:30 am

Posted in Books

Neo4j: Detecting potential typos using EXPLAIN

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out one of the example queries.

For example, one of the queries that we do about half way through finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of knowledge that it has about our schema to come up with a plan that it would run and how much of the graph it thinks that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

[image: query plan for the DIRECTED_IN query]

The first row of the query plan describes an all nodes scan which tells us that the query will start from the ‘director’ but it’s the second row that’s interesting.

The estimated number of rows when expanding the ‘DIRECTED_IN’ relationship is 0, whereas we’d expect at least a positive value if there were any instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

[image: query plan for the DIRECTED query]

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary, if you find your query not returning anything when you expect it to, prefix it with ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 estimated rows’.

Written by Mark Needham

March 17th, 2015 at 10:46 pm

Posted in neo4j

One month of mini habits

I recently read a book in the ‘getting things done’ genre written by Stephen Guise titled ‘Mini Habits‘ and although I generally don’t like those types of books I quite enjoyed this one and decided to give his system a try.

The underlying idea is that there are two parts of actually doing stuff:

  • Planning what to do
  • Doing it

We often get stuck in between the first and second steps because what we’ve planned to do is too big and overwhelming.

Guise’s approach for overcoming this inaction is to shrink the amount of work to do until it’s small enough that we don’t feel any resistance to getting started.

It should be something that you can do in 1 or 2 minutes – stupidly small – something that you can do even on your worst day when you have no time/energy.

I’m extremely good at procrastinating so I thought I’d give it a try and see if it helped. Guise suggests starting with one or two habits but I had four things that I want to do so I’ve ignored that advice for now.

My attempted habits are the following:

  • Read one page of a data science related paper/article a day
  • Read one page of a computer science related paper/article a day
  • Write one line of data science related code a day
  • Write 50 words on blog a day

Sooooo….has it helped?

In terms of doing each of the habits I’ve been successful so far – today is the 35th day in a row that I’ve managed to do each of them. Having said that, there have been some times when I’ve got back home at 11pm and realised that I haven’t done 2 of the habits and need to quickly do the minimum to ‘tick them off’.

The habit I’ve enjoyed doing the most is writing one line of data science related code a day.

My initial intention was that this was only going to involve writing machine learning code but at the moment I’ve made it a bit more generic so it can include things like the Twitter graph or other bits and pieces that I want to get started on.

The main problem I’ve had with making progress on mini projects like that is that I imagine its end state and it feels too daunting to start on. Committing to just one line of code a day has been liberating in some way.

One tweak I have made to all the habits is to have some rough goal of where all the daily habits are leading as I noticed that the stuff I was doing each day was becoming very random. Michael pointed me at Amy Hoy’s ‘Guide to doing it backwards‘ which describes a neat technique for working back from a goal and determining the small steps required to achieve it.

Writing at least 50 words a day has been beneficial for getting blog posts written. Before the last month I found myself writing most of my posts at the end of the month, but I have a more regular cadence now which feels better.

Computer science-wise I’ve been picking up papers which have some sort of link to databases to try and learn more of the low level detail there, e.g. I’ve read the LRU-K cache paper which Neo4j 2.2’s page cache is based on and have been flicking through the original CRDTs paper over the last few days.

I also recently came across the Papers We Love repository so I’ll probably work through some of the distributed systems papers they’ve collated next.

Other observations

I’ve found that if I do stuff early in the morning it feels better as you know it’s out of the way and doesn’t linger over you for the rest of the day.

I sometimes find myself wanting to just tick off the habits for the day even when it might be interesting to spend more time on one of them. I’m not sure what to make of this really – perhaps I should reduce the number of habits to the ones I’m really interested in?

With the writing it does sometimes feel like I’m just writing for the sake of it but it is a good habit to get into as it forces me to explain what I’m working on and get ideas from other people so I’m going to keep doing it.

I’ve enjoyed my experience with ‘mini habits’ so far although I think I’d be better off focusing on fewer habits so that there’s still enough time in the day to read/learn random spontaneous stuff that doesn’t fit into these habits.

Written by Mark Needham

March 17th, 2015 at 1:32 am

Python: Transforming Twitter datetime string to timestamp (‘z’ is a bad directive in format)

I’ve been playing around with importing Twitter data into Neo4j and since Neo4j can’t store dates natively just yet, I needed to convert a date string to a timestamp.

I started with the following which unfortunately throws an exception:

from datetime import datetime
date = "Sat Mar 14 18:43:19 +0000 2015"
 
>>> datetime.strptime(date, "%a %b %d %H:%M:%S %z %Y")
 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 317, in _strptime
    (bad_directive, format))
ValueError: 'z' is a bad directive in format '%a %b %d %H:%M:%S %z %Y'

%z is actually a valid directive for extracting the timezone, but my googling suggests that its not working here is one of the idiosyncrasies of Python 2’s strptime.
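
As an aside, on Python 3.2+ the original approach works unchanged since strptime does understand %z there – a minimal sketch:

from datetime import datetime
 
date = "Sat Mar 14 18:43:19 +0000 2015"
parsed_date = datetime.strptime(date, "%a %b %d %H:%M:%S %z %Y")
print(parsed_date.timestamp())  # 1426358599.0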

I eventually came across the python-dateutil library, as recommended by Joe Shaw on StackOverflow.

Using that library the problem is suddenly much simpler:

$ pip install python-dateutil
from dateutil import parser
parsed_date = parser.parse(date)
 
>>> parsed_date
datetime.datetime(2015, 3, 14, 18, 43, 19, tzinfo=tzutc())

To get to a timestamp we can use calendar as I’ve described before:

import calendar
timestamp = calendar.timegm(parser.parse(date).timetuple())
 
>>> timestamp
1426358599

Written by Mark Needham

March 15th, 2015 at 10:43 pm

Posted in Python
