Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Neo4j: The learning to cycle dependency graph

without comments

Over the past couple of weeks I’ve been reading about skill building and the breakdown of skills into more manageable chunks, and recently had a chance to break down the skills required to learn to cycle.

I initially sketched out the skill progression but quickly realised I had drawn a dependency graph and thought that putting it into Neo4j would simplify things.

I started out with the overall goal for cycling, which was to ‘Be able to cycle through a public park’:

MERGE (:Goal:Task {name: "Be able to cycle through a public park"})

This goal is easy for someone who’s already learnt to cycle but if we’re starting from scratch it’s a bit daunting so we need to break it down into a simpler skill that we can practice.

The mini algorithm that we’re going to employ for task breakdown is this:

  1. Can we do the given task now? If so, we’re done.
  2. If not, break the task down into something simpler and return to 1.

One of the things to keep in mind is that we won’t get the breakdown perfect the first time so we may need to change it. For a diagram drawn on a piece of paper this would be annoying but in Neo4j it’s just a simple refactoring.
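
To give a flavour of what one of those refactorings might look like (the task names here are made up), renaming a task or re-pointing a dependency is only a couple of Cypher statements:

// Rename a task that turned out to be badly worded (hypothetical names)
MATCH (task:Task {name: "Some badly worded task"})
SET task.name = "A better worded task"
 
// Move a dependency from one prerequisite task to another
MATCH (task:Task {name: "A better worded task"})-[rel:DEPENDS_ON]->(:Task {name: "Old prerequisite"})
MATCH (newPrerequisite:Task {name: "New prerequisite"})
DELETE rel
MERGE (task)-[:DEPENDS_ON]->(newPrerequisite)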

Going back to cycling. Since the goal isn’t yet achievable we need to break that down into something a bit easier. Let’s start with something really simple:

MERGE (task:Task {name: "Take a few steps forward while standing over the bike"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[:DEPENDS_ON]->(task)

In the first line we create our new task and then we connect it to our goal which we created earlier.

[Graph visualisation: the goal and its first dependent task]

After we’ve got the hang of walking with the bike we want to get comfortable with cycling forward a few rotations while sitting on the bike, but to do that we need to be able to get the bike moving from a standing start. We might also have another step where we cycle forward while standing on the bike, as that might be slightly easier.

Let’s update our graph:

// First let's get rid of the relationship between our initial task and the goal
MATCH (initialTask:Task {name: "Take a few steps forward while standing over the bike"})
MATCH (goal:Goal {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(initialTask)
DELETE rel
 
WITH initialTask, goal, ["Get bike moving from standing start", "Cycle forward while standing", "Cycle forward while sitting"] AS newTasks
 
// Create some nodes for our new tasks
UNWIND newTasks AS newTask
MERGE (t:Task {name: newTask})
WITH initialTask, goal, COLLECT(t) AS newTasks
WITH initialTask, goal, newTasks, newTasks[0] AS firstTask, newTasks[-1] AS lastTask
 
// Connect the last task to the goal
MERGE (goal)-[:DEPENDS_ON]->(lastTask)
 
// And the first task to our initial task
MERGE (firstTask)-[:DEPENDS_ON]->(initialTask)
 
// And all the tasks to each other
FOREACH(i in RANGE(0, length(newTasks) - 2) |
  FOREACH(t1 in [newTasks[i]] | FOREACH(t2 in [newTasks[i+1]] |
    MERGE (t2)-[:DEPENDS_ON]->(t1) 
)))
 
[Graph visualisation: the updated dependency graph]

We don’t strictly need to learn how to cycle while standing up – we could just go straight from getting the bike moving to cycling forward while sitting. Let’s update the graph to reflect that:

MATCH (sitting:Task {name: "Cycle forward while sitting"})
MATCH (moving:Task {name: "Get bike moving from standing start"})
MERGE (sitting)-[:DEPENDS_ON]->(moving)

[Graph visualisation: the new DEPENDS_ON relationship between the two tasks]

Once we’ve got the hang of those tasks let’s add in a few more to get us closer to our goal:

WITH [
  {skill: "Controlled stop using brakes/feet", dependsOn: "Cycle forward while sitting"},
  {skill: "Steer around stationary objects", dependsOn: "Controlled stop using brakes/feet"},
  {skill: "Steer around people", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a small circular circuit", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Navigate a small circular circuit"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Steer around people"},
  {skill: "Be able to cycle through a public park", dependsOn: "Navigate a loop of a section of the park"}
 
] AS newTasks
 
FOREACH(newTask in newTasks |
  MERGE (t1:Task {name: newTask.skill})   
  MERGE (t2:Task {name: newTask.dependsOn})
  MERGE (t1)-[:DEPENDS_ON]->(t2)
)

Finally let’s get rid of the relationship from our goal to ‘Cycle forward while sitting’ since we’ve replaced that with some intermediate steps:

MATCH (task:Task {name: "Cycle forward while sitting"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(task)
DELETE rel

And here’s what the final dependency graph looks like:

[Graph visualisation: the final dependency graph]

Although I put this into Neo4j in order to visualise the dependencies we can now query the data as well. For example, let’s say I know how to cycle forward while sitting on the bike. What steps are there between me and being able to cycle around a park?

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((g)-[:DEPENDS_ON*]->(t))
RETURN path

[Graph visualisation: the path from the goal back to ‘Cycle forward while sitting’]

Or if we want a list of the tasks we need to do next we could restructure the query slightly:

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((t)<-[:DEPENDS_ON*]-(g))
WITH [n in nodes(path) | n.name] AS tasks
UNWIND tasks AS task
RETURN task
 
==> +--------------------------------------------+
==> | task                                       |
==> +--------------------------------------------+
==> | "Cycle forward while sitting"              |
==> | "Controlled stop using brakes/feet"        |
==> | "Steer around stationary objects"          |
==> | "Steer around people"                      |
==> | "Navigate a loop of a section of the park" |
==> | "Be able to cycle through a public park"   |
==> +--------------------------------------------+
==> 6 rows

That’s all for now but I think this is an interesting way of tracking how you’d learn a skill. I’m trying a similar approach for some statistics topics I’m learning about but I’ve found the order of tasks isn’t so linear there – interestingly it’s much more of a graph than a tree.

Written by Mark Needham

April 7th, 2015 at 8:59 pm

Posted in neo4j


Neo4j: Cypher – Building the query for a movie’s profile page

with one comment


Yesterday I spent the day in Berlin delivering a workshop as part of the Data Science Retreat and one of the exercises we did was to write a query that would pull back all the information you’d need to create the IMDB page for a movie.

Scanning the page we can see that we need to get some basic metadata including the title. Next we’ll need to pull in the actors, directors, producers and finally a recommendation for some other movies the viewer might like to see.

I’d struggle to write this all in one go – it’s non-trivial. However, if we break it down there are actually 5 simpler queries that we can probably write. Our final step is then to work out how to glue them all together.

Let’s get started.

If you want to follow along open up your Neo4j browser, type :play movies and import the built-in data set.

We’re going to create the query for The Matrix home page so the first step is to find the node representing that movie in the database:

match (movie:Movie {title: "The Matrix"})
return movie.title
 
==> +--------------+
==> | movie.title  |
==> +--------------+
==> | "The Matrix" |
==> +--------------+
==> 1 row

Easy enough. Now let’s get back the producers:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
RETURN movie.title, COLLECT(producer.name) AS producers
 
==> +--------------------------------+
==> | movie.title  | producers       |
==> +--------------------------------+
==> | "The Matrix" | ["Joel Silver"] |
==> +--------------------------------+
==> 1 row

We’ve introduced the COLLECT function here as we want to ensure that our final result only has one row regardless of how many producers there are. COLLECT applies an implicit group by ‘movie.title’ and collects the producers for each movie (in this case just The Matrix) into an array.

We’ve used OPTIONAL MATCH because we still want to return a row for the query even if it has no producers. In the case that there are no producers we’d hope to see an empty array.
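
As a quick sanity check we could run the same pattern against a movie which, as far as I can tell, has no :PRODUCED relationship in the sample data set (I’m assuming ‘Top Gun’ is one of those) and make sure we still get a row back with an empty producers array:

match (movie:Movie {title: "Top Gun"})
optional match (producer)-[:PRODUCED]->(movie)
 
RETURN movie.title, COLLECT(producer.name) AS producers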

Now let’s write the same query to pull back the directors of the movie:

match (movie:Movie {title: "The Matrix"})
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------+
==> | movie.title  | directors                           |
==> +----------------------------------------------------+
==> | "The Matrix" | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------+
==> 1 row

We really want to do both of these in one query so we get back a single result with 3 columns. To do that we’re going to introduce the WITH clause which allows us to combine the results of different traversals.

In this case we’ll first do a traversal to get the producers, collect those into an array and then traverse out again to get the directors and collect those. This is what the query looks like:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, producers, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------------------------+
==> | movie.title  | producers       | directors                           |
==> +----------------------------------------------------------------------+
==> | "The Matrix" | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------------------------+
==> 1 row

We can follow the same pattern to return the actors:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
RETURN movie.title, COLLECT(actor.name) AS actors, producers, directors
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie.title  | actors                                                                                | producers       | directors                           |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "The Matrix" | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

So far, so good. We’ve got everything except the other movies recommendation which is a bit trickier so we’ll write it on its own first:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
RETURN otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | otherMovie                                                                                                        | score |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | Node[348]{title:"The Matrix Revolutions",released:2003,tagline:"Everything that has a beginning has an end"}      | 4     |
==> | Node[347]{title:"The Matrix Reloaded",released:2003,tagline:"Free your mind"}                                     | 4     |
==> | Node[490]{title:"Something's Gotta Give",released:2003}                                                           | 1     |
==> | Node[349]{title:"The Devil's Advocate",released:1997,tagline:"Evil has its winning ways"}                         | 1     |
==> | Node[438]{title:"Johnny Mnemonic",released:1995,tagline:"The hottest data on earth. In the coolest head in town"} | 1     |
==> | Node[443]{title:"Cloud Atlas",released:2012,tagline:"Everything is connected"}                                    | 1     |
==> | Node[452]{title:"V for Vendetta",released:2006,tagline:"Freedom! Forever!"}                                       | 1     |
==> | Node[425]{title:"The Replacements",released:2000,tagline:"Pain heals, Chicks dig scars... Glory lasts forever"}   | 1     |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> 8 rows

Our recommendation query finds all the actors in The Matrix and then traverses out to find other movies they’ve acted in and orders those movies based on how many of our actors appeared in them. Not surprisingly the other Matrix movies come out top.

In order to plug this into the rest of the query we need a single row to be returned i.e. our other movie suggestions need to be returned as an array rather than individual rows. Let’s do that:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "The Replacements", score -> 1}] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We’ve introduced a WITH clause for two reasons:

  1. To ensure the movies are ordered by highest score first
  2. Because we can’t do an aggregation within an aggregation i.e. COLLECT(COUNT(…)) would be an illegal operation in Cypher, as the sketch below shows.
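
To make the second point concrete, here’s a minimal sketch of the difference. The first version is rejected because COUNT sits inside COLLECT; the second stages the count through a WITH first:

// Illegal: an aggregation inside an aggregation
// MATCH (:Movie {title: "The Matrix"})<-[:ACTED_IN]-()-[:ACTED_IN]->(otherMovie)
// RETURN COLLECT({movie: otherMovie.title, score: COUNT(*)}) AS recommended
 
// Legal: stage the COUNT(*) in a WITH and then collect
MATCH (:Movie {title: "The Matrix"})<-[:ACTED_IN]-()-[:ACTED_IN]->(otherMovie)
WITH otherMovie, COUNT(*) AS score
RETURN COLLECT({movie: otherMovie.title, score: score}) AS recommended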

Now we’re ready to plug this recommendation query into our main one:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
WITH  movie, COLLECT(actor.name) AS actors, producers, directors
optional match (movie)<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH movie, actors, producers, directors, otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN movie, actors, producers, directors,  
       COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie                                                                           | actors                                                                                | producers       | directors                           | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[338]{title:"The Matrix",released:1999,tagline:"Welcome to the Real World"} | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "The Replacements", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1}] |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

Voila! 4 different types of data gathered and just one query to do it all.

For the eagle-eyed Cypher specialists (Hi Michael!), you’ll have noticed a bit of duplication: we traverse out to the actors twice, once to retrieve them and once to make the movie recommendation.

We could optimise this by collecting the actors once and then using the UNWIND clause but that’s an optimisation which I think slightly obscures the intent of the query so I’ve left it like this for now.
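
For what it’s worth, a rough sketch of that optimisation might look something like the following, where we collect the actor nodes once and then UNWIND the same collection to drive the recommendation traversal. Treat it as illustrative rather than a drop-in replacement:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
// collect the actor nodes once and reuse them below
WITH movie, producers, directors, COLLECT(actor) AS actors
UNWIND actors AS actor
MATCH (actor)-[:ACTED_IN]->(otherMovie)
WHERE otherMovie <> movie
 
WITH movie, producers, directors, actors, otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN movie, [a IN actors | a.name] AS actors, producers, directors,
       COLLECT({movie: otherMovie.title, score: score}) AS recommended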

Written by Mark Needham

April 1st, 2015 at 11:54 am

Posted in neo4j


Neo4j: Generating real time recommendations with Cypher

with one comment

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

GraphAware have a neat framework that helps you to build your own recommendation engine using Java and I was curious what a Cypher version would look like.

This is the sample graph:

CREATE
    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
 
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),
 
    (m)-[:FRIEND_OF]->(d),
    (m)-[:FRIEND_OF]->(l),
    (m)-[:FRIEND_OF]->(a),
    (m)-[:FRIEND_OF]->(v),
    (d)-[:FRIEND_OF]->(v),
    (c)-[:FRIEND_OF]->(v),
    (d)-[:LIVES_IN]->(lon),
    (v)-[:LIVES_IN]->(lon),
    (m)-[:LIVES_IN]->(lon),
    (l)-[:LIVES_IN]->(mum);

We want to recommend some potential friends to ‘Adam’ so the first layer of our query is to find his friends of friends as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
 
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
 
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adam’s potential friends have the same gender as him by comparing the labels each node has. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
        LABELS(me) = LABELS(potentialFriend) AS gender
 
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up let’s calculate the age difference between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.
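
For example, if we run the same query starting from ‘Michal’ instead then the WHERE clause should (based on the sample graph above) filter out Daniela and Vince, who he’s already friends with, leaving just Christophe:

MATCH (me:Person {name: "Michal"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me, potentialFriend, sameLocation, ageDifference, gender, friendsInCommon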

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if the people live in the same location or have the same gender as Adam and 0 if not. For the ageDifference and friendsInCommon we’ll apply a log curve so that those values don’t have a disproportionate effect on our final score. We’ll use the formula defined in the ParetoScoreTransformer to do this:

    public <OUT> float transform(OUT item, float score) {
        if (score < minimumThreshold) {
            return 0;
        }
 
        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }
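
To see how that formula maps onto plain Cypher, here’s a quick sanity check of the transformed friendsInCommon value for one friend in common, using maxScore = 100 and eightyPercentLevel = 10 (the same numbers as the final query below):

// friendsInCommon score for score = 1, maxScore = 100, eightyPercentLevel = 10
RETURN 100 * (1 - exp((-1.0 * (log(5.0) / 10)) * 1)) AS friendsInCommonScore
// returns ~14.866, which matches the friendsInCommon part we see in the results below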

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
 
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference:ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of Cypher so if you’re not comfortable with Java you can still do real time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up so it’s not as easy to see the whole pipeline as if you use the GraphAware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Written by Mark Needham

March 27th, 2015 at 6:59 am

Posted in neo4j


Neo4j: Detecting potential typos using EXPLAIN

without comments

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out one of the example queries.

For example one of the queries that we do about half way through finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of knowledge that it has about our schema to come up with a plan that it would run and how much of the graph it thinks that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

[Query plan for the query with the ‘DIRECTED_IN’ typo]

The first row of the query plan describes an all nodes scan which tells us that the query will start from the ‘director’ but it’s the second row that’s interesting.

The estimated rows when expanding the ‘DIRECTED_IN’ relationship is 0 when we’d expect it to at least be a positive value if there were some instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

[Query plan for the query with the correct ‘DIRECTED’ relationship]

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.
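
If you suspect a typo in a relationship type, another cheap sanity check is to ask the graph which relationship types actually exist. A quick (if not particularly efficient) way of doing that from Cypher:

// List the relationship types in the graph and how often each appears
// (fine on a small data set like the movies graph)
MATCH ()-[r]->()
RETURN TYPE(r) AS type, COUNT(*) AS times
ORDER BY times DESC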

In summary, if you find your query not returning anything when you expect it to, prefix it with ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 estimated rows’.

Written by Mark Needham

March 17th, 2015 at 10:46 pm

Posted in neo4j


Python/Neo4j: Finding interesting computer sciency people to follow on Twitter

with 4 comments

At the beginning of this year I moved from Neo4j’s field team to the dev team and since the code we write there is much lower level than I’m used to I thought I should find some people to follow on Twitter whom I can learn from.

My technique for finding some of those people was to pick a person from the Neo4j kernel team who’s very good at systems programming and uses Twitter, which led me to Mr Chris Vest.

I thought that the people Chris interacts with on Twitter are likely to be interested in this type of programming so I manually traversed out to those people, had a look at who they interacted with and put them all into a list.

This approach has worked well and I’ve picked up lots of reading material from following these people. It does feel like something that could be automated though and we’ll use Python and Neo4j as our toolkit.

We need to find a library to connect to the Twitter API. There are a few to choose from but tweepy seems simple enough so I started using that.

The first thing we need to do is fill in our Twitter API auth details. Since the application is just for me I’m not going to bother with setting up OAuth – instead I’ll just create an application on apps.twitter.com and grab the appropriate tokens.


Once we’ve done that we can navigate to that app and get a consumer key, consumer secret, access token and access token secret. They all reside on the ‘Keys and Access Tokens’ tab.


Now that we’ve got ourselves all auth’d up let’s write some code that starts from Chris and goes out and finds his latest tweets and the people he’s interacted with in them. We’ll write the appropriate information out to a CSV file so that we can import it into Neo4j later on:

import tweepy
import csv
from collections import Counter, deque
 
# Fill these in with the values from your app's 'Keys and Access Tokens' tab
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break

As well as writing out Chris’ tweets I’m also capturing other users who he’s interacted with and putting them in a queue that we’ll drain later on. We’re limiting the number of other users that we’ll process to 50 for now but it’s easy to change.

If we print out the first few lines of ‘tweets.csv’ this is what we’d see:

$ head -n 5 tweets.csv
userName,tweetId,contents,usersMentioned,urls
chvest,575427045167071233,@shipilev http://t.co/WxqFIsfiSF,shipilev,
chvest,575403105174552576,@AlTobey I often use http://t.co/G7Cdn9Udst for small one-off graph diagrams.,AlTobey,http://www.apcjones.com/arrows/
chvest,575337346687766528,RT @theburningmonk: this is why you need composition over inheritance... :s #CompositionOverInheritance http://t.co/aKRwUaZ0qo,theburningmonk,
chvest,575269402083459072,@chvest except…? “Each library implementation should therefore be identical with respect to the public API”,chvest,

We’re capturing the user, tweetId, the tweet itself, any users mentioned in the tweet and any URLs shared in the tweet.

Next we want to get some of the tweets of the people Chris has interacted with:

# Grab the code from here too - https://gist.github.com/mneedham/3188c44b2cceb88c6de0
 
import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break
    users_processed = set([starting_user])
    while True:
        if len(users_processed) >= USERS_TO_PROCESS:
            break
        else:
            if len(users_to_process) > 0:
                next_user = users_to_process.popleft()
                print next_user
                if next_user in users_processed:
                    print "-- user already processed"
                else:
                    print "-- processing user"
                    users_processed.add(next_user)
                    for tweet in tweepy.Cursor(api.user_timeline, id=next_user).items(10):
                        writer.writerow(extract_tweet(tweet))
                        tweets.flush()
                        for user_mentioned in tweet.entities["user_mentions"]:
                            if not len(users_processed) > 50:
                                users_to_process.append(user_mentioned["screen_name"])
                                counter[user_mentioned["screen_name"]] += 1
                            else:
                                break
            else:
                break

Finally let’s take a quick look at the users who show up most frequently:

>>> for user_name, count in counter.most_common(20):
        print user_name, count
 
neo4j 13
devnexus 12
AlTobey 11
bitprophet 11
hazelcast 10
chvest 9
shipilev 9
AntoineGrondin 8
gvsmirnov 8
GatlingTool 8
lagergren 7
tomsontom 6
dorkitude 5
noctarius2k 5
DanHeidinga 5
chris_mahan 5
coda 4
mccv 4
gAmUssA 4
jmhodges 4

A few of the people on that list are in my list which is a good start. We can explore the data set better once it’s in Neo4j though so let’s write some Cypher import statements to create our own mini Twitter graph:

// add people
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (p:Person {userName: row.userName});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users
UNWIND users AS user
MERGE (p:Person {userName: user});
 
// add tweets
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (t:Tweet {id: row.tweetId})
ON CREATE SET t.contents = row.contents;
 
// add urls
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls
UNWIND urls AS url
MERGE (:URL {value: url});
 
// add links
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MATCH (p:Person {userName: row.userName})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:TWEETED]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users, row
UNWIND users AS user
MATCH (p:Person {userName: user})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:MENTIONED_IN]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls, row
UNWIND urls AS url
MATCH (u:URL {value: url})
MATCH (t:Tweet {id: row.tweetId})
MERGE (t)-[:CONTAINS_LINK]->(u);

We can put all those commands in a file and execute them using neo4j-shell:

$ ./neo4j-community-2.2.0-RC01/bin/neo4j-shell --file import.cql

Now let’s write some queries against the graph:

// Find the tweets where Chris mentioned himself
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(n)
RETURN path

[Graph visualisation: tweets where Chris mentioned himself]

// Find the most popular links shared in the network
MATCH (u:URL)<-[r:CONTAINS_LINK]-()
RETURN u.value, COUNT(*) AS times
ORDER BY times DESC
LIMIT 10
 
+-------------------------------------------------------------------------------------------------+
| u.value                                                                                 | times |
+-------------------------------------------------------------------------------------------------+
| "http://www.polyglots.dk/"                                                              | 4     |
| "http://www.java-forum-nord.de/"                                                        | 4     |
| "http://hirt.se/blog/?p=646"                                                            | 3     |
| "http://wp.me/p26jdv-Ja"                                                                | 3     |
| "https://instagram.com/p/0D4I_hH77t/"                                                   | 3     |
| "https://blogs.oracle.com/java/entry/new_java_champion_tom_chindl"                      | 3     |
| "http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html"                  | 2     |
| "https://firstlook.org/theintercept/2015/03/10/ispy-cia-campaign-steal-apples-secrets/" | 2     |
| "http://buff.ly/1GzZXlo"                                                                | 2     |
| "http://buff.ly/1BrgtQd"                                                                | 2     |
+-------------------------------------------------------------------------------------------------+
10 rows

The first link is for a programming language meetup in Copenhagen, the second for a Java conference in Hanover and the third an announcement about the latest version of Java Mission Control. So far so good!

A next step in this area would be to run the links through Prismatic’s interest graph so we can model topics in our graph as well. For now let’s have a look at the interactions between Chris and others in the graph:

// Find the people who Chris interacts with most often
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
RETURN other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------+
| other.userName | times |
+------------------------+
| "gvsmirnov"    | 7     |
| "shipilev"     | 5     |
| "nitsanw"      | 4     |
| "DanHeidinga"  | 3     |
| "AlTobey"      | 3     |
+------------------------+
5 rows

Let’s generalise that to find interactions between any pair of people:

// Find the people who interact most often
MATCH (n:Person)-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
WHERE n <> other
RETURN n.userName, other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------------------------+
| n.userName    | other.userName   | times |
+------------------------------------------+
| "fbogsany"    | "AntoineGrondin" | 8     |
| "chvest"      | "gvsmirnov"      | 7     |
| "chris_mahan" | "bitprophet"     | 6     |
| "maxdemarzi"  | "neo4j"          | 6     |
| "chvest"      | "shipilev"       | 5     |
+------------------------------------------+
5 rows

Let’s combine a couple of these together to come up with a score for each person:

MATCH (n:Person)
 
// number of mentions
OPTIONAL MATCH (n)-[mention:MENTIONED_IN]->()
WITH n, COUNT(mention) AS mentions
 
// number of links shared by someone else
OPTIONAL MATCH (n)-[:TWEETED]->()-[:CONTAINS_LINK]->(link)<-[:CONTAINS_LINK]-()
 
WITH n, mentions, COUNT(link) AS links
RETURN n.userName, mentions + links AS score, mentions, links
ORDER BY score DESC
LIMIT 10
 
+------------------------------------------+
| n.userName    | score | mentions | links |
+------------------------------------------+
| "chvest"      | 17    | 10       | 7     |
| "hazelcast"   | 16    | 10       | 6     |
| "neo4j"       | 15    | 13       | 2     |
| "noctarius2k" | 14    | 4        | 10    |
| "devnexus"    | 12    | 12       | 0     |
| "polyglotsdk" | 11    | 2        | 9     |
| "shipilev"    | 11    | 10       | 1     |
| "AlTobey"     | 11    | 10       | 1     |
| "bitprophet"  | 10    | 9        | 1     |
| "GatlingTool" | 10    | 8        | 2     |
+------------------------------------------+
10 rows

Amusingly Chris is top of his own network but we also see three accounts which aren’t people, but rather products – neo4j, hazelcast and GatlingTool. The rest are legit though.

That’s as far as I’ve got but to make this more useful I think we need to introduce follower/friend links as well as importing more data.
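
The follower links would probably follow the same LOAD CSV pattern as the rest of the import. A sketch, assuming we’d exported a hypothetical followers.csv with userName and followerName columns:

// followers.csv is hypothetical - we'd need to pull this data from the Twitter API first
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/followers.csv" AS row
MERGE (p:Person {userName: row.userName})
MERGE (follower:Person {userName: row.followerName})
MERGE (follower)-[:FOLLOWS]->(p);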

In the meantime I’ve got a bunch of links to go and read!

Written by Mark Needham

March 11th, 2015 at 9:13 pm

Posted in neo4j,Python


Neo4j: TF/IDF (and variants) with cypher

without comments

A few weeks ago I wrote a blog post on running TF/IDF over HIMYM transcripts using scikit-learn to find the most important phrases by episode and afterwards I was curious how difficult it’d be to do in Neo4j.

I started by translating one of Wikipedia’s TF/IDF examples to Cypher to see what the algorithm would look like:

WITH 3 as termFrequency, 2 AS numberOfDocuments, 1 as numberOfDocumentsWithTerm
WITH termFrequency, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency
return termFrequency * inverseDocumentFrequency
 
0.9030899869919435

Next I needed to go over the HIMYM episode transcripts and extract phrases and their corresponding counts in each episode. I used scikit-learn’s CountVectorizer to do this and wrote the results into a CSV file.

Here’s a preview of that file:

$ head -n 10 data/import/words_scikit.csv
EpisodeId,Phrase,Count
1,2005,1
1,2005 seven,1
1,2005 seven just,1
1,2030,3
1,2030 kids,1
1,2030 kids intently,1
1,2030 narrator,1
1,2030 narrator kids,1
1,2030 son,1

Now let’s import that into Neo4j using the LOAD CSV tool:

// phrases
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
// episode -> phrase
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:CONTAINED_PHRASE {times:TOINT(row.Count)}]->(phrase);

Now that all the data’s in we can translate the TF/IDF query to make use of our graph. We’ll start with episode 1:

match (e:Episode)
WITH COUNT(e) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode {id: 1})
WITH numberOfDocuments, p, r.times AS termFrequency
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)
WITH p, COUNT(otherEpisode) AS numberOfDocumentsWithTerm, numberOfDocuments, termFrequency
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +--------------------------------------------------------------------------------------------------------------------+
==> | p.value                | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +--------------------------------------------------------------------------------------------------------------------+
==> | "olives"               | 18            | 2                         | 2.0170333392987803       | 36.306600107378046 |
==> | "yasmine"              | 13            | 1                         | 2.3180633349627615       | 30.1348233545159   |
==> | "signal"               | 11            | 5                         | 1.6127838567197355       | 17.740622423917088 |
==> | "goanna"               | 10            | 4                         | 1.7160033436347992       | 17.16003343634799  |
==> | "flashback date"       | 6             | 1                         | 2.3180633349627615       | 13.908380009776568 |
==> | "scene"                | 17            | 37                        | 0.6989700043360189       | 11.88249007371232  |
==> | "flashback date robin" | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "ted yasmine"          | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "smurf pen1s"          | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> | "eye patch"            | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> +--------------------------------------------------------------------------------------------------------------------+
==> 10 rows

The score we’ve calculated is different to scikit-learn’s but the relative order seems fine so that’s good. The neat thing about calculating this in Neo4j is that we can now vary the ‘inverse document’ part of the equation e.g. to find out the most important phrases in a season rather than an episode:

match (:Season) 
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------------------------------+
==> | p.value         | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +-------------------------------------------------------------------------------------------------------------+
==> | "moby"          | 46            | 1                         | 0.9542425094393249       | 43.895155434208945 |
==> | "int"           | 71            | 3                         | 0.47712125471966244      | 33.87560908509603  |
==> | "ellen"         | 53            | 2                         | 0.6020599913279624       | 31.909179540382006 |
==> | "claudia"       | 104           | 4                         | 0.3010299956639812       | 31.307119549054043 |
==> | "ericksen"      | 59            | 3                         | 0.47712125471966244      | 28.150154028460083 |
==> | "party number"  | 29            | 1                         | 0.9542425094393249       | 27.67303277374042  |
==> | "subtitle"      | 27            | 1                         | 0.9542425094393249       | 25.76454775486177  |
==> | "vo"            | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "ted vo"        | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "future ted vo" | 45            | 3                         | 0.47712125471966244      | 21.47045646238481  |
==> +-------------------------------------------------------------------------------------------------------------+
==> 10 rows

From this query we learn that ‘Moby’ was only mentioned in one season of the whole series and in fact all of those mentions were in the same episode. The occurrence of ‘int’ seems to be more of a data issue – in some episodes the transcript describes location but in many it doesn’t:

$ ack -iw "int" data/import/sentences.csv
2361,8,1,8,"INT. LIVING ROOM, YEAR 2030"
2377,8,1,8,INT. CHINESE RESTAURANT
2395,8,1,8,INT. APARTMENT
2412,8,1,8,INT. APARTMENT
2419,8,1,8,INT. BAR
2472,8,1,8,INT. APARTMENT
2489,8,1,8,INT. BAR
2495,8,1,8,INT. APARTMENT
2506,8,1,8,INT. BAR
2584,8,1,8,INT. APARTMENT
2629,8,1,8,INT. RESTAURANT
2654,8,1,8,INT. APARTMENT
2682,8,1,8,INT. RESTAURANT
2689,8,1,8,(Robin gets up and leaves restaurant) INT. HOSPITAL WAITING AREA

‘vo’ stands for voice over and should probably be added to the stop words since it doesn’t add much value. It shows up here because the transcripts aren’t consistent in the way they represent Future Ted speaking.
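
If we were regenerating the phrases, one way to do that would be to add these tokens to whatever stop word list is being used – for example with NLTK’s English stop words (a hypothetical tweak, not something the original extraction did):

from nltk.corpus import stopwords
 
# hypothetical: extend the stop words with transcript-specific noise tokens
custom_stop_words = set(stopwords.words("english")) | {"vo", "int"}
 
tokens = ["future", "ted", "vo", "int", "apartment"]
print([t for t in tokens if t.lower() not in custom_stop_words])
# => ['future', 'ted', 'apartment']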

Let’s have a look at the final season to see how that fares:

match (:Season)
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "9"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode:Episode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------------------------------------+
==> | p.value              | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +------------------------------------------------------------------------------------------------------------------+
==> | "ring bear"          | 28            | 1                         | 0.9542425094393249       | 26.718790264301095 |
==> | "click options"      | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "thank linus"        | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "vow"                | 39            | 2                         | 0.6020599913279624       | 23.480339661790534 |
==> | "just click"         | 24            | 1                         | 0.9542425094393249       | 22.901820226543798 |
==> | "rehearsal dinner"   | 23            | 1                         | 0.9542425094393249       | 21.947577717104473 |
==> | "linus"              | 36            | 2                         | 0.6020599913279624       | 21.674159687806647 |
==> | "just click options" | 22            | 1                         | 0.9542425094393249       | 20.993335207665147 |
==> | "locket"             | 32            | 2                         | 0.6020599913279624       | 19.265919722494797 |
==> | "cassie"             | 19            | 1                         | 0.9542425094393249       | 18.13060767934717  |
==> +------------------------------------------------------------------------------------------------------------------+

There are several phrases which are specific to Barney & Robin’s wedding (‘vow’, ‘ring bear’, ‘rehearsal dinner’) so it makes sense that those come out top. The ‘linus’ here mostly refers to the server at the bar who interacts with Lily although a quick search over the transcripts reveals that she also had an Uncle Linus!

$ ack -iw "linus" data/import/sentences.csv  | head -n 5
18649,61,3,17,Lily: Why don't we just call Duluth Mental Hospital and say my Uncle Linus can live with us?
59822,185,9,1,Linus.
59826,185,9,1,"Are you my guy, Linus?"
59832,185,9,1,Thank you Linus.
59985,185,9,1,"Thank you, Linus."
...

From doing this exercise I think TF/IDF is an interesting way of exploring unstructured data but for a phrase to be really interesting to us it should appear across multiple episodes/seasons.

One way to achieve that would be to weight those features more so I’ll try that out next.
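
I haven’t settled on the exact weighting yet, but purely as an illustration of the kind of adjustment I mean (not necessarily what I’ll end up using) it might look something like this:

import math
 
def reweighted_score(term_frequency, inverse_document_frequency, episodes_with_phrase):
    # hypothetical: boost phrases that show up in several episodes,
    # with a log so phrases used everywhere don't dominate completely
    return term_frequency * inverse_document_frequency * math.log10(1 + episodes_with_phrase)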

All the code in this post is on github if you want to take a look and improve it.

Written by Mark Needham

March 8th, 2015 at 1:24 pm

Posted in neo4j

Tagged with

Python’s pandas vs Neo4j’s cypher: Exploring popular phrases in How I met your mother transcripts

without comments

I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure what the most appropriate tool for that analysis was, so I decided to explore the data both with Python’s pandas library and by loading it into Neo4j and writing some Cypher queries.

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The cypher version looks quite similar, the main difference being that we use COLLECT to generate an array of phrases for each episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the cypher version we get one row per episode whereas with the Python version we get 3 rows per episode. It might be possible to achieve this effect with pandas too but I wasn’t sure how to do so.
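
One approach that might work (a sketch I haven’t tried against the full data set) is to group the top 3 data frame again and collect the phrase/score pairs into a list per episode:

collapsed = top_words_by_episode \
    .groupby("EpisodeId") \
    .apply(lambda g: list(zip(g["Phrase"], g["Score"])))
 
print(collapsed.head())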

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365

The equivalent query in cypher looks like this:

MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In python we can do some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])
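
The hard coded range assumes we already know there are 208 episodes; we could equally derive the set of ids from the data frame itself (assuming every episode has at least one phrase in the file):

all_episodes = set(df["EpisodeId"].unique())
robin_episodes = set(df[df["Phrase"] == "robin"]["EpisodeId"])
 
print(all_episodes - robin_episodes)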

In cypher land a query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – how many of the top phrases in Episode 1 were used in other episodes:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In cypher we can write a query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – for some of the queries I found it easier in cypher and for others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Written by Mark Needham

February 19th, 2015 at 12:52 am

Posted in neo4j

Tagged with ,

Neo4j: Building a topic graph with Prismatic Interest Graph API

without comments

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success, so I was enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics, each with an associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API and I’ve found the requests library the easiest library to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here"),
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args,**kargs):
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait>0:
                time.sleep(leftToWait)
            ret = func(*args,**kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~ 200 episodes we can create a dictionary to store that data:

episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()
    for episode in episodes_reader:
        episodes[int(episode[0])] = {"title": episode[6], "sentences" : [] }
 
with open("data/import/sentences.csv", "r") as sentencesfile:
     sentences_reader = csv.reader(sentencesfile, delimiter=",")
     sentences_reader.next()
     for sentence in sentences_reader:
         episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"])).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on github.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r

Graph

Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other

Graph  1

Or maybe we want to find the episode which has the most topics in common:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to get the names of those topics:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall 168 (out of 208) of the other episodes have a topic in common with the first episode so perhaps just having a topic in common isn’t the best indication of similarity.

An interesting next step would be to calculate cosine or Jaccard similarity between the episodes and store that value in the graph for querying later on.
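
To give a flavour of what that could look like, here’s a sketch (not part of the original pipeline) which computes Jaccard similarity between the Pilot and every other episode straight from the topics.csv file we generated earlier:

import csv
from collections import defaultdict
 
# build a set of topic ids per episode from the CSV we wrote earlier
episode_topics = defaultdict(set)
with open("data/import/topics.csv", "r") as topicsfile:
    for row in csv.DictReader(topicsfile):
        episode_topics[row["EpisodeId"]].add(row["TopicId"])
 
def jaccard(a, b):
    # size of the intersection divided by the size of the union
    return float(len(a & b)) / len(a | b)
 
pilot = episode_topics["1"]
similarities = [(other, jaccard(pilot, topics))
                for other, topics in episode_topics.items() if other != "1"]
 
for episode_id, score in sorted(similarities, key=lambda x: x[1], reverse=True)[:5]:
    print("%s %.3f" % (episode_id, score))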

I’ve also calculated the most common bigrams across all the transcripts so it would be interesting to see if there are any interesting insights at the intersection of episodes, topics and phrases.

Written by Mark Needham

February 13th, 2015 at 11:38 pm

Posted in neo4j,Python

Tagged with

Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother

without comments

After reading Emil’s blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes, manually downloaded the two pages which list them, and then wrote a script to grab each of the transcripts so I could use them on my machine.

I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:

import requests
from bs4 import BeautifulSoup
from soupselect import select
 
episodes = {}
for i in range(1,3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())
 
    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}
 
for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" %(season, episode)
    print filename
 
    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
 
                handle.write(block)

The files containing the lists of episodes are named ‘page-1’ and ‘page-2’.

The code is reasonably simple – we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a ‘save’ method that I could use.
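
Looking at it again, requests’ response.content would let us write each transcript in one go, assuming the pages are small enough to hold in memory, so the body of the download loop could be replaced with something like:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
if response.ok:
    with open(filename, 'wb') as handle:
        handle.write(response.content)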

Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence which I’d learnt about from a Kaggle tutorial.

In the end I decided to start simple and extract all the words from the transcripts and count how many times a word occurred in a given episode.

I ended up with the following script which created a dictionary of (episode -> words + occurrences):

import csv
import nltk
import re
 
from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
def count_words(words):
    tally=Counter()
    for elem in words:
        tally[elem] += 1
    return tally
 
episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()
 
    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
 
        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]
 
        episodes_dict[row[0]] = count_words(words)

Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.

s/somewhere/in Neo4j

Neo4j’s query language, Cypher, has a really nice ETL-esque tool called ‘LOAD CSV’ for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:

with open("data/import/words.csv", "w") as words:
    writer = csv.writer(words, delimiter=",")
    writer.writerow(["EpisodeId", "Word", "Occurrences"])
    for episode_id, words in episodes_dict.iteritems():
        for word in words:
            writer.writerow([episode_id, word, words[word]])

This is what the CSV file contents look like:

$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1

Now we need to write some Cypher to get the data into Neo4j:

// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences) }]->(episode);

Having done that we can write some simple queries to explore the words used in How I met your mother:

MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode) 
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10
 
==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted"      | 207      | 11437       |
==> | "Barney"   | 208      | 8052        |
==> | "Marshall" | 208      | 7236        |
==> | "Robin"    | 205      | 6626        |
==> | "Lily"     | 207      | 6330        |
==> | "m"        | 208      | 4777        |
==> | "re"       | 208      | 4097        |
==> | "know"     | 208      | 3489        |
==> | "Oh"       | 197      | 3448        |
==> | "like"     | 208      | 2498        |
==> +-------------------------------------+
==> 10 rows

The main 5 characters occupy the top 5 positions which is probably what you’d expect. The ‘m’ and ‘re’ in the next two positions look like the remains of contractions such as "I’m" and "you’re" (stripping out the non-alphabetic characters turns apostrophes into spaces before tokenisation) rather than the scraping going wrong.
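
We can see where those come from by running the same substitution over a contraction heavy sentence:

import re
 
print(re.sub("[^a-zA-Z]", " ", "I'm sure you're right"))
# => I m sure you re right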

Our next query might focus around checking which character is referred to the most in each episode:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN episode.id, topWord.word AS word, topWord.times AS occurrences
LIMIT 10
 
==> +---------------------------------------+
==> | episode.id | word       | occurrences |
==> +---------------------------------------+
==> | 72         | "Barney"   | 75          |
==> | 143        | "Ted"      | 16          |
==> | 43         | "Lily"     | 74          |
==> | 156        | "Ted"      | 12          |
==> | 206        | "Barney"   | 23          |
==> | 50         | "Marshall" | 51          |
==> | 113        | "Ted"      | 76          |
==> | 178        | "Barney"   | 21          |
==> | 182        | "Barney"   | 22          |
==> | 67         | "Ted"      | 84          |
==> +---------------------------------------+
==> 10 rows

If we dig into it further there’s actually quite a bit of variety in the number of times the top character in each episode is mentioned which again probably says something about the data:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times)
 
==> +-------------------------------------------------------------------------------------+
==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) |
==> +-------------------------------------------------------------------------------------+
==> | 3                  | 259                | 63.90865384615385  | 42.36255207691068    |
==> +-------------------------------------------------------------------------------------+
==> 1 row

Obviously this is a very simple way of deriving structure from text. Here are some of the things I want to try out next:

  • Detecting common phrases/memes used in the show (e.g. the yellow umbrella) – this should be possible by creating different length n-grams and then searching for those phrases across the corpus (see the sketch after this list).
  • Pull out scenes – some of the transcripts use the keyword ‘scene’ to denote this although some of them don’t. Depending on how many transcripts contain scene demarcations, perhaps we could train a classifier to detect where scenes should be in the transcripts which don’t have them.
  • Analyse who talks to each other or who talks about each other most frequently
  • Create a graph of conversations as my colleagues Max and Michael have previously blogged about.
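
For the n-grams idea, NLTK’s ngrams helper makes generating candidate phrases of different lengths straightforward. A rough sketch (counting them across the corpus would still need doing):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams
 
tokens = word_tokenize("Have you met Ted under the yellow umbrella")
for n in (2, 3):
    print(list(ngrams(tokens, n)))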

Written by Mark Needham

January 10th, 2015 at 1:22 am

Posted in neo4j,Python

Tagged with ,

Neo4j 2.1.6 – Cypher: FOREACH slowness

without comments

A common problem that people have when using Neo4j for social network applications is updating a person with their newly imported friends.

We’ll have an array of friends that we want to connect to a single Person node. Assuming the following schema…

$ schema
Indexes
  ON :Person(id) ONLINE
 
No constraints

…a simplified version would look like this:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
 
FOREACH(f IN friends |
  MERGE (friend:Person {id: f})
  MERGE (friend)-[:FRIENDS]->p);

If we execute that on an empty database we’ll see something like this:

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
19173 ms

This took much longer than we’d expect so let’s have a look at the PROFILE output:

EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager
      |
      +UpdateGraph(1)
        |
        +Extract
          |
          +Null
 
+----------------+------+---------+-------------+--------------------------------------+
|       Operator | Rows |  DbHits | Identifiers |                                Other |
+----------------+------+---------+-------------+--------------------------------------+
|    EmptyResult |    0 |       0 |             |                                      |
| UpdateGraph(0) |    1 | 3015012 |             |                              Foreach |
|          Eager |    1 |       0 |             |                                      |
| UpdateGraph(1) |    1 |       5 |        p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |       0 |             |                              friends |
|           Null |    ? |       ? |             |                                      |
+----------------+------+---------+-------------+--------------------------------------+

The DbHits value on the 2nd row seems suspiciously high suggesting that FOREACH might not be making use of the Person#id index and is instead scanning all Person nodes each time.

I’m not sure how to drill into that further but an alternative approach is to try the same query using UNWIND instead and check its profile output:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
WITH p, friends
UNWIND friends AS f
MERGE (friend:Person {id: f})
MERGE (friend)-[:FRIENDS]->p;
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
343 ms
EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager(0)
      |
      +UpdateGraph(1)
        |
        +UNWIND
          |
          +Eager(1)
            |
            +UpdateGraph(2)
              |
              +Extract
                |
                +Null
 
+----------------+------+--------+-------------------------+--------------------------------------+
|       Operator | Rows | DbHits |             Identifiers |                                Other |
+----------------+------+--------+-------------------------+--------------------------------------+
|    EmptyResult |    0 |      0 |                         |                                      |
| UpdateGraph(0) | 1001 |      0 | friend, p,   UNNAMED136 |                         MergePattern |
|       Eager(0) | 1001 |      0 |                         |                                      |
| UpdateGraph(1) | 1001 |   5005 |          friend, friend |            MergeNode; f; :Person(id) |
|         UNWIND | 1001 |      0 |                         |                                      |
|       Eager(1) |    1 |      0 |                         |                                      |
| UpdateGraph(2) |    1 |      5 |                    p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |      0 |                         |                              friends |
|           Null |    ? |      ? |                         |                                      |
+----------------+------+--------+-------------------------+--------------------------------------+

That’s much quicker and doesn’t touch as many nodes as FOREACH did. I expect the index issue will be sorted out in a future release but until then UNWIND is our friend.

Written by Mark Needham

December 28th, 2014 at 4:28 am

Posted in neo4j

Tagged with ,