Mark Needham

Thoughts on Software Development

Clojure: First steps with reducers

I’ve been playing around with Clojure a bit today in preparation for a talk I’m giving next week and found myself writing the following code to apply the same function to three different scores:

(defn log2 [n]
  (/ (Math/log n) (Math/log 2)))
 
(defn score-item [n]
  (if (= n 0) 0 (log2 n)))
 
(+ (score-item 12) (score-item 13) (score-item 5))
9.60733031374961

I’d forgotten about folding over a collection but quickly remembered that I could achieve the same result with the following code:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5])
9.60733031374961

The added advantage here is that if I want to add a 4th score to the mix all I need to do is append it to the end of the vector:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5 6])
12.192292814470767

However, while Googling to remind myself of the order of the arguments to reduce I kept coming across articles and documentation about reducers which I’d heard about but never used.

As I understand it, they're used to achieve performance gains and easier composition of functions over collections. I'm not sure how useful they'll be to me but I thought I'd give them a try.

Our first step is to bring the namespace into scope:

(require '[clojure.core.reducers :as r])

Now we can compute the same result using the reduce function:

(r/reduce #(+ %1 (score-item %2)) 0 [12 13 5 6])
12.192292814470767

So far, so identical. If we wanted to calculate individual scores and then filter out those below a certain threshold the code would behave a little differently:

(->> [12 13 5 6]
     (map score-item)
     (filter #(> % 3)))
(3.5849625007211565 3.700439718141092)
 
(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3)))
#object[clojure.core.reducers$folder$reify__19192 0x5d0edf21 "clojure.core.reducers$folder$reify__19192@5d0edf21"]

Instead of giving us a sequence of scores, the reducers version returns a reducer, which we can pass to reduce or fold if we want an accumulated result, or to into if we want to output a collection. In this case we want the latter:

(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3))
     (into []))
[3.5849625007211565 3.700439718141092]

With a measly 4-item collection I don't think reducers are going to provide much speed improvement here; we'd also need to use the fold function if we wanted the collection to be processed in parallel.
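
To give a flavour of what the parallel version would look like, here's a minimal sketch (my addition, not from the original post) over a larger vector. r/fold needs an associative combining function, which + is:

;; r/fold splits a foldable collection (such as a vector) into chunks,
;; reduces each chunk in parallel via fork/join and then combines the
;; partial results. Passed a single function, + acts as both the
;; per-chunk reducing function and the combining function.
(r/fold + (r/map score-item (vec (range 1 100000))))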

One for next time!

Written by Mark Needham

January 24th, 2016 at 10:01 pm

Posted in Clojure

Neo4j: Cypher – avoid duplicate calls to NOT patterns

I’ve been reacquainting myself with the meetup.com dataset ahead of Wednesday’s meetup in London and wanted to write a collaborative filtering type query to work out which groups people in my groups were in.

This started simple enough:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

And doesn’t take too long to run:

Cypher version: CYPHER 2.3, planner: COST. 1084378 total db hits in 1103 ms.

However, it was showing up several groups that I’m already a member of so I added in a “WHERE NOT” clause to sort that out:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

Unfortunately when I ran this the number of db hits increased by 13x and it now took 3x as long to run:

Cypher version: CYPHER 2.3, planner: COST. 14061442 total db hits in 3364 ms.


The problem is that we’re making lots of duplicate calls to NOT (member)-[:MEMBER_OF]->(otherGroup) because each group shows up lots of times.

This is the ‘reduce cardinality of work in progress’ tip from Michael Hunger’s blog post:

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you'll encounter duplicates. If you're not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don't have to be executed multiple times.

We can reduce the WIP in our query by doing the counting of common members first and then filtering out the groups we’re already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WITH otherGroup, member, COUNT(*) AS commonMembers
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, commonMembers
ORDER BY commonMembers DESC
LIMIT 5

This gets us back down to something closer to the running time/db hits of our initial query:

Cypher version: CYPHER 2.3, planner: COST. 1097114 total db hits in 1004 ms.

Written by Mark Needham

January 17th, 2016 at 12:19 pm

Posted in neo4j

2015: A year in the life of the Neo4j London meetup group

Given we’ve only got a few more hours left of 2015 I thought it’d be fun to do a quick overview of how things have been going in the London chapter of the Neo4j meetup using Neo4j with a bit of R mixed in.

We’re going to be using the RNeo4j library to interact with the database along with a few other libraries which will help us out with different tasks:

library(RNeo4j)
library(ggplot2)
library(dplyr)
library(zoo)
library(scales) # for date_format, used below

graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "myPassword")

Let’s get to it:

Members

query = "MATCH (:Group {name: {name}})<-[membership:MEMBER_OF]-()
         RETURN membership.joined AS timestamp"
 
joinedDF = cypher(graph, query, name = "Neo4j - London User Group")
joinedDF$joinDate = as.Date(as.POSIXct(joinedDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = n, label = n), 
       data = joinedDF %>% mutate(year = format(joinDate, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  ggtitle("Number of new members by year") +
  geom_text(vjust=-0.5)

[Chart: number of new members by year]

A bit down on 2014 but not too far away. We’re still attracting new people who are interested in learning about graphs. Let’s drill into those numbers a bit:

byYearMon = joinedDF %>% 
  filter(format(joinDate, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(joinDate))) %>% 
  count(yearmon)
 
ggplot(aes(x = yearmon, y = n, label = n), data = byYearMon) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(labels = date_format("%B"), breaks = "1 month") +
  ggtitle("Number of new members by month/year")

[Chart: number of new members by month/year]

We had a bit of an end of year surge in October/November which was unexpected. December has been low in previous years; the April dip I think is because we stopped doing events before Graph Connect 2015. I'm not sure about the September dip so let's have a look:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               RETURN event.time + event.utcOffset AS timestamp"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
eventsByYearMon = eventsDF %>% 
  filter(format(timestamp, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(timestamp))) %>% 
  count(yearmon) 
 
merge(eventsByYearMon, byYearMon, by="yearmon")
 
      yearmon n.x n.y
1  2015-01-01   3  80
2  2015-02-01   6  76
3  2015-03-01   2  70
4  2015-04-01   2  53
5  2015-05-01   4  78
6  2015-06-01   5  83
7  2015-07-01   3  73
8  2015-08-01   5  73
9  2015-09-01   3  40
10 2015-10-01   3  94
11 2015-11-01   4 117
12 2015-12-01   3  48

In the merged table n.x is the number of events held and n.y the number of new members that month. At first glance there doesn't seem to be any correlation between the two so I think we'll have to look for another predictor of that variable!
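
To put a number on that hunch, a quick check (my addition, not from the original post) would be to compute the correlation between the two columns:

# Pearson correlation between events held (n.x) and new members (n.y)
monthly = merge(eventsByYearMon, byYearMon, by = "yearmon")
cor(monthly$n.x, monthly$n.y)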

Events

Next let’s have a look at the events we ran in 2015. We’ll start with a quick chart showing the number of events we’ve run over the years:

ggplot(aes(x = year, y = n, label = n), data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of events")

[Chart: number of events by year]

So fewer events than last year, but how many people RSVP'd 'yes' to the ones we did host?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WHERE event.time + event.utcOffset < timestamp()
               WITH event, COUNT(*) AS rsvps
               RETURN event.time + event.utcOffset AS timestamp, rsvps"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = rsvps), 
       data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% group_by(year) %>% summarise(rsvps= sum(rsvps)) ) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of attendees")

[Chart: number of attendees by year]

Slightly more ‘yes’ RSVPs than last year. Now let’s drill into the repeat events we ran this year:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               RETURN event.name AS event, COUNT(*) AS times
               ORDER BY times DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times
1                      Relational to graph: A worked example     7
2                                            Intro to Graphs     6
3                          Graph Modelling - Do's and Don'ts     5
4          Hands On Intro to Cypher - Neo4j's Query Language     3
5 Build your own recommendation engine with Neo4j in an hour     2
6                                Fraud Detection using Neo4j     2

I thought we’d run ‘Intro to Graphs’ most often but the data doesn’t lie – it’s all about relational to graph. And which were the most popular repeat events in terms of ‘yes’ RSVPs?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS yesRSVPs
               WITH event.name AS event, COUNT(*) AS times, SUM(yesRSVPs) AS rsvps
               RETURN event, times, rsvps, rsvps / times AS rsvpsPerEvent
               ORDER BY rsvpsPerEvent DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times rsvps rsvpsPerEvent
1                                Fraud Detection using Neo4j     2   150            75
2                                            Intro to Graphs     6   352            58
3                          Graph Modelling - Do's and Don'ts     5   281            56
4                      Relational to graph: A worked example     7   367            52
5 Build your own recommendation engine with Neo4j in an hour     2    85            42
6          Hands On Intro to Cypher - Neo4j's Query Language     3   104            34

It looks like fraud is a popular topic although we’ve only run it twice so perhaps best not to read too much into that. We’re running that one again in a couple of weeks if you’re interested.

Ignoring repeat events let’s see which event drew the biggest crowd:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event.id AS id, event.name AS event, COUNT(*) AS rsvps
               RETURN event, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% head(5)
 
                                                                         event rsvps
1 Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133
2                          Modelling a recommendation engine: A worked example   118
3                    Building a repository of biomedical ontologies with Neo4j   107
4                     GraphHack @ Graph Connect: The night before Election Day    91
5                                        Bootstrapping a Recommendation Engine    88

A double header featuring Nicole White and Matt Wright proved to be the most popular event of the year and in fact the most popular in terms of ‘yes’ RSVPs so far:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS rsvps
               RETURN event.name AS event, event.time + event.utcOffset AS time, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$time = as.Date(as.POSIXct(eventsDF$time / 1000, origin="1970-01-01"))
eventsDF %>% mutate(year = format(time, "%Y")) %>% dplyr::select(-time) %>% head(10)
 
                                                                          event rsvps year
1  Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133 2015
2                           Modelling a recommendation engine: A worked example   118 2015
3                     Building a repository of biomedical ontologies with Neo4j   107 2015
4                                                    Real world Neo4j use cases    98 2014
5                                                           The transport graph    94 2014
6                                                     The Visualisation Special    93 2014
7                  Impossible is Nothing by Jim Webber, Neo4j's Chief Scientist    93 2014
8                      GraphHack @ Graph Connect: The night before Election Day    91 2015
9                                         Bootstrapping a Recommendation Engine    88 2015
10                                    Scraping and Graphing the Apple app store    88 2015

3 of the top 4 belong to 2015 and 6 of the top 10. Let’s see what 2016 has in store.

Thanks to everyone who’s come along to one of our meetups and Happy New Year!

Written by Mark Needham

December 31st, 2015 at 1:58 pm

Posted in neo4j

Study until your mind wanders

I've previously found it very difficult to read math-heavy content, which has made it challenging to get through Distributed Computing, a book I bought last May.

After several false starts – I'd get frustrated that I couldn't understand things the first time around, and I'd forget everything if I left it for a couple of days – I decided to try again with a different approach.

I’ve been trying a technique I learned from Mini Habits where every day I have a (very small) goal of reading one page of the book. Having such a small goal means that I can read the material as slowly as I like (repeating previous days if necessary).

So far I’ve read 4 chapters (~100 pages) over the last month – some days I read 6 or 7 pages, other days I only manage one. The key is to keep the rhythm of reading something.

I tried doing the reading at different times of the day – on the bus on the way to work, in the evening before going to sleep – and found that for me the best time is immediately when I wake up. To minimise my mind wandering I don’t read any emails, chat messages or social media accounts before I start.

Despite this, I’ve noticed that after a while my mind starts to wander while reading proofs and that’s a signal for me to stop for the day. When I pick up again the next day I’ve often found that I understand what I was having difficulty with.

I’ve read that meditation prior to studying is an effective way to quiet the mind and would be a more ‘on demand’ way of achieving the concentration required to read this type of material. I’ve never done any meditation so if anyone has any tips on where to start that’d be helpful.

Written by Mark Needham

December 31st, 2015 at 10:47 am

R: Error in approxfun(x.values.1, y.values.1, method = “constant”, f = 1, : zero non-NA points

I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.

Our dataset has a class imbalance, i.e. most people RSVP 'no' to events, which can lead to a misleading accuracy score: a model that predicts 'no' every time achieves deceptively high accuracy.

Source: local data frame [2 x 2]
 
  attended     n
     (dbl) (int)
1        0  1541
2        1    53
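
For reference, that breakdown comes from a dplyr count along these lines (a sketch; df is the data frame built in the tutorial):

# Number of RSVPs in each class of the response variable
df %>% count(attended)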

I up-sampled the data using caret's upSample function to avoid this:

attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))

I then trained a logistic regression model but when I tried to plot the area under the curve I ran into trouble:

p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
 
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1,  : 
  zero non-NA points

I don't have any NA values in my data frame so this message was a bit confusing to start with. As usual Stack Overflow came to the rescue with the suggestion that I was probably missing positive/negative examples of the response variable, i.e. 'attended'.

A quick count on the test data frame using dplyr confirmed my mistake:

> test %>% count(attended)
Source: local data frame [1 x 2]
 
  attended     n
     (dbl) (int)
1        1   582

I’ll have to randomly sort the data frame and then reassign my training and test data frames to work around it.
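
A minimal sketch of that workaround (my addition; the 80/20 split and the upSampledDf variable from earlier are assumptions):

# Shuffle the rows so both classes end up in the train and test sets,
# then re-split 80/20
set.seed(123)
shuffled = upSampledDf[sample(nrow(upSampledDf)), ]
cutoff = floor(0.8 * nrow(shuffled))
train = shuffled[1:cutoff, ]
test = shuffled[(cutoff + 1):nrow(shuffled), ]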

Written by Mark Needham

December 27th, 2015 at 12:24 pm

Posted in R

Python: Squashing ‘duplicate’ pairs together

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn't work out how to solve the problem immediately so I simplified it to pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays – or a list of lists, since we're going to script this in Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]
 
print find_matching_index(("A", "B"), [set(["D", "E"])])
[]
 
print find_matching_index(("B", "C"), [set(["A", "B"])])
[0]
 
print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)
 
        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.pop(index)
            dups.append(matching_dup.union([pair[0], pair[1]]))
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]
 
            dups.pop(index1)
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
            dups.append(dup1.union(dup2))
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]
 
for test_case in test_cases:
    print extract_groups(test_case)
 
[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.
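
For a bigger dataset a disjoint-set (union-find) structure would scale much better than scanning the list of groups for every pair. A minimal sketch of that approach (my addition, not from the original post):

def extract_groups_union_find(pairs):
    parent = {}

    def find(x):
        # walk to the root, compressing the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two sets

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())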

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.

Written by Mark Needham

December 20th, 2015 at 12:12 pm

Posted in Python

Neo4j: Specific relationship vs Generic relationship + property

For optimal traversal speed in Neo4j queries we should make our relationship types as specific as possible.

Let’s take a look at an example from the ‘modelling a recommendations engine‘ talk I presented at Skillsmatter a couple of weeks ago.

I needed to decide how to model the 'RSVP' relationship between a Member and an Event. A person can RSVP 'yes' or 'no' to an event and I'd like to capture both responses.

i.e. we can choose between:

[Diagram: candidate model 1]

and:

[Diagram: candidate model 2]

When deciding on a model we mainly need to think about the types of queries that we want to write. We shouldn’t forget about updating the model but in my experience more time is spent querying graphs than updating them.

Let’s take a look at each of those in turn:

What queries do we want to write?

The first query was going to use previous ‘yes’ RSVPs as an indicator of interest for future events. We’re not interested in ‘no’ RSVPs for this query.

I started out with the generic RSVP relationship type with a ‘response’ property to distinguish between ‘yes’ and ‘no’:

MATCH (member:Member {name: "Mark Needham"})
MATCH (futureEvent:Event) WHERE futureEvent.time >= timestamp()
MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
 
OPTIONAL MATCH (member)-[rsvp:RSVPD {response: "yes"}]->(pastEvent)<-[:HOSTED_EVENT]-(group)
WHERE pastEvent.time < timestamp()
 
RETURN group.name, futureEvent.name, COUNT(rsvp) AS previousEvents
ORDER BY  previousEvents DESC


This ran reasonably quickly but I was curious whether I could get the query to run any quicker by changing to the more specific model. Using the more specific relationship type our query reads:

MATCH (member:Member {name: "Mark Needham"})
MATCH (futureEvent:Event) WHERE futureEvent.time >= timestamp()
MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
 
OPTIONAL MATCH (member)-[rsvp:RSVP_YES]->(pastEvent)<-[:HOSTED_EVENT]-(group)
WHERE pastEvent.time < timestamp()
 
RETURN group.name, 
       futureEvent.name, 
       COUNT(rsvp) AS previousEvents
ORDER BY  previousEvents DESC

We can now profile our query and compare the db hits of both solutions:

RSVPD {response: "yes"}
Cypher version: CYPHER 2.3, planner: COST. 688635 total db hits in 232 ms.
 
RSVP_YES
Cypher version: CYPHER 2.3, planner: COST. 559866 total db hits in 207 ms.

So we get a slight gain by using the more specific relationship type. The number of db hits is lower partly because we've removed the need to look up the 'response' property on every 'RSVPD' relationship and check that it matches 'yes'. We're also evaluating fewer relationships since we only look at positive RSVPs; negative ones are ignored.

Our next query might be to capture all the RSVPs made by a member and list them alongside the events:

MATCH (member:Member {name: "Mark Needham"})-[rsvp:RSVPD]->(event)
WHERE event.time < timestamp()
RETURN event.name, event.time, rsvp.response
ORDER BY event.time DESC
MATCH (member:Member {name: "Mark Needham"})-[rsvp:RSVP_YES|:RSVP_NO]->(event)
WHERE event.time < timestamp()
RETURN event.name, event.time, CASE TYPE(rsvp) WHEN "RSVP_YES" THEN "yes" ELSE "no" END AS response
ORDER BY event.time DESC

Again we see a marginal db hits win for the more specific relationship type:

RSVPD {response: "yes"} / RSVPD {response: "no"}
Cypher version: CYPHER 2.3, planner: COST. 684 total db hits in 37 ms.
 
RSVP_YES / RSVP_NO
Cypher version: CYPHER 2.3, planner: COST. 541 total db hits in 24 ms.

However, the query is quite unwieldy and, unless we store the response as a property on the relationship, the code to return 'yes' or 'no' is a bit awkward. The specific-relationship query would become even more painful to deal with if we introduced the 'waitlist' RSVP, which we've chosen to exclude.

Will we need to update the relationship?

Yes! Users are able to change their RSVP up until the event happens so we need to be able to handle that.

Let’s have a look at the queries we’d have to write to handle a change in RSVP using both models:

Generic relationship type

MATCH (event:Event {id: {event_id}})
MATCH (member:Member {id: {member_id}})
MERGE (member)-[rsvpRel:RSVPD {id: {rsvp_id}}]->(event)
ON CREATE SET rsvpRel.created = toint({mtime})
ON MATCH  SET rsvpRel.lastModified = toint({mtime})
SET rsvpRel.response = {response}

Specific relationship type

MATCH (event:Event {id: {event_id}})
MATCH (member:Member {id: {member_id}})
 
FOREACH(ignoreMe IN CASE WHEN {response} = "yes" THEN [1] ELSE [] END |
  MERGE (member)-[rsvpYes:RSVP_YES {id: {rsvp_id}}]->(event)
  ON CREATE SET rsvpYes.created = toint({mtime})
  ON MATCH  SET rsvpYes.lastModified = toint({mtime})
 
  MERGE (member)-[oldRSVP:RSVP_NO]->(event)
  DELETE oldRSVP
)
 
FOREACH(ignoreMe IN CASE WHEN {response} = "no" THEN [1] ELSE [] END |
  MERGE (member)-[rsvpNo:RSVP_NO {id: {rsvp_id}}]->(event)
  ON CREATE SET rsvpNo.created = toint({mtime})
  ON MATCH  SET rsvpNo.lastModified = toint({mtime})
 
  MERGE (member)-[oldRSVP:RSVP_YES]->(event)
  DELETE oldRSVP
)

As you can see, the code to update an RSVP is more complicated when using the specific relationship type due in part to Cypher not yet having first class support for conditionals.

In summary, for our meetup.com model we gain speed improvements by using more specific relationship types but at the expense of some more complicated read queries and a significantly more convoluted update query.

Depending on the cardinality of relationships in your model your mileage may vary but it’s worth doing some profiling to compare all your options.
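
Profiling is just a matter of prefixing a candidate query with the PROFILE keyword; a minimal sketch (the query itself is an illustrative example, not from the post):

// PROFILE returns the execution plan along with db hits per operator
PROFILE
MATCH (member:Member {name: "Mark Needham"})-[rsvp:RSVP_YES]->(event)
RETURN COUNT(*) AS yesRSVPs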

Written by Mark Needham

December 13th, 2015 at 9:22 pm

Posted in neo4j

Neo4j: Facts as nodes

On Tuesday I spoke at the Neo4j London user group about incrementally building a recommendation engine and described the ‘facts as nodes’ modeling pattern, defined as follows in the Graph Databases book:

When two or more domain entities interact for a period of time, a fact emerges. We represent a fact as a separate node with connections to each of the entities engaged in that fact.

Modeling an action in terms of its product—that is, in terms of the thing that results from the action—produces a similar structure: an intermediate node that represents the outcome of an interaction between two or more entities.

We started with the following model describing a meetup member and the groups they’ve joined:

[Diagram: member MEMBER_OF group model]

This model works well for the query it was defined for – find groups similar to ones that I’m already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)
WITH member, topic, COUNT(*) AS score
MATCH (topic)<-[:HAS_TOPIC]-(otherGroup) 
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as score
ORDER BY score DESC

Prefixing that query with the ‘PROFILE’ keyword yields a query plan and the following summary text:

Cypher version: CYPHER 2.3, planner: COST. 89100 total db hits in 113 ms.

In this model it feels like there is a membership fact waiting to become a node.

[Diagram: membership represented as its own node]

We can refactor towards that model with the following query:

MATCH (member:Member)-[rel:MEMBER_OF]->(group)
 
MERGE (membership:Membership {id: member.id + "_" + group.id})
SET membership.joined = rel.joined
 
MERGE (member)-[:HAS_MEMBERSHIP]->(membership)
MERGE (membership)-[:OF_GROUP]->(group);

We’d answer our initial question with the following query:

MATCH (member:Member {name: "Mark Needham"})-[:HAS_MEMBERSHIP]->()-[:OF_GROUP]->(group:Group)-[:HAS_TOPIC]->(topic)
WITH member, topic, COUNT(*) AS score
MATCH (topic)<-[:HAS_TOPIC]-(otherGroup) 
WHERE NOT (member)-[:HAS_MEMBERSHIP]->(:Membership)-[:OF_GROUP]->(otherGroup:Group)
RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as score
ORDER BY score DESC

at the following cost:

Cypher version: CYPHER 2.3, planner: COST. 468201 total db hits in 346 ms.

The membership node hasn't proved its value yet – it does over 5x the db hits (468,201 vs 89,100) to get the same result. However, the next question we want to answer is 'what group do people join after the Neo4j user group?' where it might come in handy.

First we’ll add a ‘NEXT’ relationship between a user’s adjacent group memberships by writing the following query:

MATCH (member:Member)-[:HAS_MEMBERSHIP]->(membership)
 
WITH member, membership ORDER BY member.id, membership.joined
 
WITH member, COLLECT(membership) AS memberships
UNWIND RANGE(0,SIZE(memberships) - 2) as idx
 
WITH memberships[idx] AS m1, memberships[idx+1] AS m2
MERGE (m1)-[:NEXT]->(m2);

And now for the query:

MATCH (group:Group {name: "Neo4j - London User Group"})<-[:OF_GROUP]-(membership)-[:NEXT]->(nextMembership),         
      (membership)<-[:HAS_MEMBERSHIP]-(member:Member)-[:HAS_MEMBERSHIP]->(nextMembership),
      (nextMembership)-[:OF_GROUP]->(nextGroup)
RETURN nextGroup.name, COUNT(*) AS times
ORDER BY times DESC

Cypher version: CYPHER 2.3, planner: COST. 23671 total db hits in 39 ms.

And for comparison – the same query using the initial model:

MATCH (group:Group {name: "Neo4j - London User Group"})<-[membership:MEMBER_OF]-(member),
      (member)-[otherMembership:MEMBER_OF]->(otherGroup)
WHERE membership.joined < otherMembership.joined
WITH member, otherGroup 
ORDER BY otherMembership.joined
WITH member, COLLECT(otherGroup)[0] AS nextGroup
RETURN nextGroup.name, COUNT(*) AS times
ORDER BY times DESC

Cypher version: CYPHER 2.3, planner: COST. 86179 total db hits in 138 ms.

This time the membership model does 3x less work, so depending on the question a different model works better.

Given this observation we might choose to keep both models. The disadvantage of doing that is that we pay write and maintenance penalties to keep them both in sync, e.g. this is what the queries to add a new membership or remove one would look like:

Adding group membership

WITH "Mark Needham" AS memberName, 
     "Neo4j - London User Group" AS groupName,
     timestamp() AS now
 
MATCH (group:Group {name: groupName})
MATCH (member:Member {name: memberName})
 
MERGE (member)-[memberOfRel:MEMBER_OF]->(group)
ON CREATE SET memberOfRel.time = now
 
MERGE (membership:Membership {id: member.id + "_" + group.id})
ON CREATE SET membership.joined = now
MERGE (member)-[:HAS_MEMBERSHIP]->(membership)
MERGE (membership)-[:OF_GROUP]->(group)

Removing group membership

WITH "Mark Needham" AS memberName, 
     "Neo4j - London User Group" AS groupName,
     timestamp() AS now
 
MATCH (group:Group {name: groupName})
MATCH (member:Member {name: memberName})
 
MATCH (member)-[memberOfRel:MEMBER_OF]->(group)
 
MATCH (membership:Membership {id: member.id + "_" + group.id})
MATCH (member)-[hasMembershipRel:HAS_MEMBERSHIP]->(membership)
MATCH (membership)-[ofGroupRel:OF_GROUP]->(group)
 
DELETE memberOfRel, hasMembershipRel, ofGroupRel

The dataset is on github so take a look at it and send any questions my way.

Written by Mark Needham

December 4th, 2015 at 7:52 am

Posted in neo4j

Python: Parsing a JSON HTTP chunking stream

I’ve been playing around with meetup.com’s API again and this time wanted to consume the chunked HTTP RSVP stream and filter RSVPs for events I’m interested in.

I use Python for most of my hacking these days and if HTTP requests are required the requests library is my first port of call.

I started out with the following script:

import requests
import json
 
def stream_meetup_initial():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    for chunk in response.iter_content(chunk_size = None):
        yield chunk
 
for raw_rsvp in stream_meetup_initial():
    print raw_rsvp
    try:
        rsvp = json.loads(raw_rsvp)
    except ValueError as e:
        print e
        continue

This mostly worked but I also noticed the following error from time to time:

No JSON object could be decoded

Although less frequent, I also saw errors suggesting I was trying to parse an incomplete JSON object. I tweaked the function to keep a local buffer and only yield that if the chunk ended in a new line character:

def stream_meetup_newline():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        if chunk.endswith("\n"):
            buffer += chunk
            yield buffer
            buffer = ""
        else:
            buffer += chunk

This mostly works although I’m sure I’ve seen some occasions where two JSON objects were being yielded and then the call to ‘json.loads’ failed. I haven’t been able to reproduce that though.
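
One way to guard against that would be to split the buffer on newlines and yield each complete line separately. A minimal sketch of that idea (my addition, not from the original post):

def stream_meetup_split_lines():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        buffer += chunk
        # yield every complete line; keep any partial JSON in the buffer
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line:
                yield line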

A second read through the requests documentation made me realise I hadn’t read it very carefully the first time since we can make our lives much easier by using ‘iter_lines’ rather than ‘iter_content’:

r = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
for raw_rsvp in r.iter_lines():
    if raw_rsvp:
        rsvp = json.loads(raw_rsvp)
        print rsvp

We can then process ‘rsvp’, filtering out the ones we’re interested in.
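
For example, to keep only the RSVPs for a particular group we might do something like this (a sketch: the group/group_name fields are assumptions about the stream's payload, and the group name is just an example):

r = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
for raw_rsvp in r.iter_lines():
    if not raw_rsvp:
        continue
    rsvp = json.loads(raw_rsvp)
    # 'group' / 'group_name' assumed from the RSVP stream payload
    if rsvp.get("group", {}).get("group_name") == "Neo4j - London User Group":
        print rsvp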

Written by Mark Needham

November 28th, 2015 at 1:56 pm

Posted in Python

jq: Cannot iterate over number / string and number cannot be added

In my continued parsing of meetup.com’s JSON API I wanted to extract some information from the following JSON file:

$ head -n40 data/members/18313232.json
[
  {
    "status": "active",
    "city": "London",
    "name": ". .",
    "other_services": {},
    "country": "gb",
    "topics": [],
    "lon": -0.13,
    "joined": 1438866605000,
    "id": 92951932,
    "state": "17",
    "link": "http://www.meetup.com/members/92951932",
    "photo": {
      "thumb_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/thumb_250896203.jpeg",
      "photo_id": 250896203,
      "highres_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/highres_250896203.jpeg",
      "photo_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/member_250896203.jpeg"
    },
    "lat": 51.49,
    "visited": 1446745707000,
    "self": {
      "common": {}
    }
  },
  {
    "status": "active",
    "city": "London",
    "name": "Abdelkader Idryssy",
    "other_services": {},
    "country": "gb",
    "topics": [
      {
        "name": "Weekend Adventures",
        "urlkey": "weekend-adventures",
        "id": 16438
      },
      {
        "name": "Community Building",
        "urlkey": "community-building",

In particular I want to extract the member’s id, name, join date and the ids of topics they’re interested in. I started with the following jq query to try and extract those attributes:

$ jq -r '.[] | [.id, .name, .joined, (.topics[] | .id | join(";"))] | @csv' data/members/18313232.json
Cannot iterate over number (16438)

Annoyingly this treats topic ids on an individual basis rather than as an array as I wanted. I tweaked the query to the following with no luck:

$ jq -r '.[] | [.id, .name, .joined, (.topics[].id | join(";"))] | @csv' data/members/18313232.json
Cannot iterate over number (16438)

As a guess I decided to wrap ‘.topics[].id’ in an array literal to see if it had any impact:

$ jq -r '.[] | [.id, .name, .joined, ([.topics[].id] | join(";"))] | @csv' data/members/18313232.json
92951932,". .",1438866605000,""
jq: error (at data/members/18313232.json:60013): string ("") and number (16438) cannot be added

Woot! A different error message at least and this one seems to be due to a type mismatch between the string we want to end up with and the array of numbers that we currently have.

We can cast our way to victory with the ‘tostring’ function:

$ jq -r '.[] | [.id, .name, .joined, ([.topics[].id | tostring] | join(";"))] | @csv' data/members/18313232.json
...
92951932,". .",1438866605000,""
193866304,"Abdelkader Idryssy",1445195325000,"16438;20727;15401;9760;20246;20923;3336;2767;242;58259;4417;1789;10454;20274;10232;563;25375;16433;15187;17635;26273;21808;933;7789;23884;16212;144477;15322;21067;3833;108403;20221;1201;182;15083;9696;4377;15360;18296;15121;17703;10161;1322;3880;18333;3485;15585;44584;18692;21681"
28643052,"Abhishek Chanda",1439688955000,"646052;520302;15167;563;65735;537492;646072;537502;24959;1025832;8599;31197;24410;26118;10579;1064;189;48471;16216;18062;33089;107633;46831;20479;1423042;86258;21441;3833;21681;188;9696;58162;20398;113032;18060;29971;55324;30928;15261;58259;638;16475;27591;10107;242;109595;10470;26384;72514;1461192"
39523062,"Adam Kinder-Jones",1438677474000,"70576;21549;3833;42277;164111;21522;93380;48471;15167;189;563;25435;87614;9696;18062;58162;10579;21681;19882;108403;128595;15582;7029"
194119823,"Adam Lewis",1444867994000,"10209"
14847001,"Adam Rogers",1422917313000,""
87709042,"Adele Green",1436867381000,"15167;18062;102811;9696;30928;18060;78565;189;7029;48471;127567;10579;58162;563;3833;16216;21441;37708;209821;15401;59325;31792;21836;21900;984862;15720;17703;96823;4422;85951;87614;37428;2260;827;121802;19672;38660;84325;118991;135612;10464;1454542;17936;21549;21520;17628;148303;20398;66339;29661"
11497937,"Adrian Bridgett",1421067940000,"30372;15046;25375;638;498;933;374;27591;18062;18060;15167;10581;16438;15672;1998;1273;713;26333;15099;15117;4422;15892;242;142180;563;31197;20479;1502;131178;15018;43256;58259;1201;7319;15940;223;8652;66493;15029;18528;23274;9696;128595;21681;17558;50709;113737"
14151190,"adrian lawrence",1437142198000,"7029;78565;659;85951;15582;48471;9696;128595;563;10579;3833;101960;16137;1973;78566;206;223;21441;16216;108403;21681;186;1998;15731;17703;15043;16613;17885;53531;48375;16615;19646;62326;49954;933;22268;19243;37381;102811;30928;455;10358;73511;127567;106382;16573;36229;781;23981;1954"
183557824,"Adrien Pujol",1421882756000,"108403;563;9696;21681;188;24410;1064;32743;124668;15472;21123;1486432;1500742;87614;46831;1454542;46810;166000;126177;110474"
...

Written by Mark Needham

November 24th, 2015 at 12:12 am

Posted in Software Development
