Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Neo4j: A procedure for the SLM clustering algorithm

without comments

In the middle of last year I blogged about the Smart Local Moving algorithm which is used for community detection in networks and with the upcoming introduction of procedures in Neo4j I thought it’d be fun to make that code accessible as one.

If you want to grab the code and follow along it’s sitting on the SLM repository on my github.

At the moment the procedure is hard coded to work with a KNOWS relationship between two nodes but that could easily be changed.

To check it’s working correctly I thought it’d make most sense to use the Karate Club data set described on the SLM home page. I think this data set is originally from Networks, Crowds and Markets.

I wrote the following LOAD CSV script to create the graph in Neo4j:

LOAD CSV FROM "file:///Users/markneedham/projects/slm/karate_club_network.txt" as row
FIELDTERMINATOR "\t"
MERGE (person1:Person {id: row[0]})
MERGE (person2:Person {id: row[1]})
MERGE (person1)-[:KNOWS]->(person2)
Graph

Next we need to call the procedure which will add an appropriate label to each node depending which community it belongs to. This is what the procedure code looks like:

public class ClusterAllTheThings
{
    @Context
    public org.neo4j.graphdb.GraphDatabaseService db;
 
    @Procedure
    @PerformsWrites
    public Stream<Cluster> knows() throws IOException
    {
        String query = "MATCH (person1:Person)-[r:KNOWS]->(person2:Person) \n" +
                       "RETURN person1.id AS p1, person2.id AS p2, toFloat(1) AS weight";
 
        Result rows = db.execute( query );
 
        ModularityOptimizer.ModularityFunction modularityFunction = ModularityOptimizer.ModularityFunction.Standard;
        Network network = Network.create( modularityFunction, rows );
 
        double resolution = 1.0;
        int nRandomStarts = 1;
        int nIterations = 10;
        long randomSeed = 0;
 
        double modularity;
 
        Random random = new Random( randomSeed );
 
        double resolution2 = modularityFunction.resolution( resolution, network );
 
        Map<Integer, Node> cluster = new HashMap<>();
        double maxModularity = Double.NEGATIVE_INFINITY;
 
        for ( int randomStart = 0; randomStart < nRandomStarts; randomStart++ )
        {
            network.initSingletonClusters();
 
            int iteration = 0;
            do
            {
                network.runSmartLocalMovingAlgorithm( resolution2, random );
                iteration++;
 
                modularity = network.calcQualityFunction( resolution2 );
            } while ( (iteration < nIterations) );
 
            if ( modularity > maxModularity )
            {
                network.orderClustersByNNodes();
                cluster = network.getNodes();
                maxModularity = modularity;
            }
        }
 
        for ( Map.Entry<Integer, Node> entry : cluster.entrySet() )
        {
            Map<String, Object> params = new HashMap<>();
            params.put("userId", String.valueOf(entry.getKey()));
 
            db.execute("MATCH (person:Person {id: {userId}})\n" +
                       "SET person:`" + (format( "Community-%d`", entry.getValue().getCluster() )),
                    params);
        }
 
        return cluster
                .entrySet()
                .stream()
                .map( ( entry ) -> new Cluster( entry.getKey(), entry.getValue().getCluster() ) );
    }
 
    public static class Cluster
    {
        public long id;
        public long clusterId;
 
        public Cluster( int id, int clusterId )
        {
            this.id = id;
            this.clusterId = clusterId;
        }
    }
}

I’ve hardcoded some parameters to use defaults which could be exposed through the procedure to allow more control if necessary. The Network#create function assumes it is going to receive a stream of rows containing columns ‘p1’, ‘p2’ and ‘weight’ to represent the ‘source’, ‘destination’ and ‘weight’ of the relationship between them.

We call the procedure like this:

CALL org.neo4j.slm.knows()

It will return each of the nodes and the cluster it’s been assigned to and if we then visualise the network in the neo4j browser we’ll see this:

Graph  1

which is similar to the visualisation from the SLM home page:

Network

If you want to play around with the code feel free. You’ll need to run the following commands to create the JAR for the plugin and deploy it.

$ mvn clean package 
$ cp target/slm-1.0.jar /path/to/neo4j/plugins/ 
$ ./path/to/neo4j/bin/neo4j restart

And you’ll need the latest milestone of Neo4j which has procedures enabled.

Written by Mark Needham

February 28th, 2016 at 8:40 pm

Posted in neo4j

Tagged with

Neo4j: Cypher – avoid duplicate calls to NOT patterns

with 2 comments

I’ve been reacquainting myself with the meetup.com dataset ahead of Wednesday’s meetup in London and wanted to write a collaborative filtering type query to work out which groups people in my groups were in.

This started simple enough:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

And doesn’t take too long to run:

Cypher version: CYPHER 2.3, planner: COST. 1084378 total db hits in 1103 ms.

However, it was showing up several groups that I’m already a member of so I added in a “WHERE NOT” clause to sort that out:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

Unfortunately when I ran this the amount of db hits increased by 14x and it now took 3x as long to run:

Cypher version: CYPHER 2.3, planner: COST. 14061442 total db hits in 3364 ms.


The problem is that we’re making lots of duplicate calls to NOT (member)-[:MEMBER_OF]->(otherGroup) because each group shows up lots of times.

This is the ‘reduce cardinality of work in progress’ tip from Michael Hunger’s blog post:

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.

We can reduce the WIP in our query by doing the counting of common members first and then filtering out the groups we’re already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WITH otherGroup, member, COUNT(*) AS commonMembers
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, commonMembers
ORDER BY commonMembers DESC
LIMIT 5

This gets us back down to something closer to the running time/db hits of our initial query:

Cypher version: CYPHER 2.3, planner: COST. 1097114 total db hits in 1004 ms.

Written by Mark Needham

January 17th, 2016 at 12:19 pm

Posted in neo4j

Tagged with

2015: A year in the life of the Neo4j London meetup group

without comments

Given we’ve only got a few more hours left of 2015 I thought it’d be fun to do a quick overview of how things have been going in the London chapter of the Neo4j meetup using Neo4j with a bit of R mixed in.

We’re going to be using the RNeo4j library to interact with the database along with a few other libraries which will help us out with different tasks:

library(RNeo4j)
library(ggplot2)
library(dplyr)
library(zoo)
 
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "myPassword")

Let’s get to it:

Members

query = "MATCH (:Group {name: {name}})<-[membership:MEMBER_OF]-()
         RETURN membership.joined AS timestamp"
 
joinedDF = cypher(graph, query, name = "Neo4j - London User Group")
joinedDF$joinDate = as.Date(as.POSIXct(joinedDF$timestamp / 1000, origin="1970-01-01"))
joinedDF$joinDate = as.Date(as.POSIXct(joinedDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = n, label = n), 
       data = joinedDF %>% mutate(year = format(joinDate, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  ggtitle("Number of new members by year") +
  geom_text(vjust=-0.5)
2015 12 31 12 23 06

A bit down on 2014 but not too far away. We’re still attracting new people who are interested in learning about graphs. Let’s drill into those numbers a bit:

byYearMon = joinedDF %>% 
  filter(format(joinDate, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(joinDate))) %>% 
  count(yearmon)
 
ggplot(aes(x = yearmon, y = n, label = n), data = byYearMon) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(labels = date_format("%B"), breaks = "1 month") +
  ggtitle("Number of new members by month/year")
2015 12 31 12 39 04

We had a bit of an end of year surge in October/November which was unexpected. December has been low in previous years, there was an April dip which I think is because we stopped doing events before Graph Connect 2015. I’m not sure about the September dip so let’s have a look:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               RETURN event.time + event.utcOffset AS timestamp"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
eventsByYearMon = eventsDF %>% 
  filter(format(timestamp, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(timestamp))) %>% 
  count(yearmon) 
 
merge(eventsByYearMon, byYearMon, by="yearmon")
 
      yearmon n.x n.y
1  2015-01-01   3  80
2  2015-02-01   6  76
3  2015-03-01   2  70
4  2015-04-01   2  53
5  2015-05-01   4  78
6  2015-06-01   5  83
7  2015-07-01   3  73
8  2015-08-01   5  73
9  2015-09-01   3  40
10 2015-10-01   3  94
11 2015-11-01   4 117
12 2015-12-01   3  48

At first glance there doesn’t seem to be any correlation between the number of events held and the number of new members so I think we’ll have to look for another predictor of that variable!

Events

Next let’s have a look at the events we ran in 2015. We’ll start with a quick chart showing the number of events we’ve run over the years:

ggplot(aes(x = year, y = n, label = n), data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of events")

2015 12 31 13 43 15

So less events than last year but how many people RSVPD ‘yes’ to the ones we did host?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WHERE event.time + event.utcOffset < timestamp()
               WITH event, COUNT(*) AS rsvps
               RETURN event.time + event.utcOffset AS timestamp, rsvps"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = rsvps), 
       data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% group_by(year) %>% summarise(rsvps= sum(rsvps)) ) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of attendees")
2015 12 31 13 54 50

Slightly more ‘yes’ RSVPs than last year. Now let’s drill into the repeat events we ran this year:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               RETURN event.name AS event, COUNT(*) AS times
               ORDER BY times DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times
1                      Relational to graph: A worked example     7
2                                            Intro to Graphs     6
3                          Graph Modelling - Do's and Don'ts     5
4          Hands On Intro to Cypher - Neo4j's Query Language     3
5 Build your own recommendation engine with Neo4j in an hour     2
6                                Fraud Detection using Neo4j     2

I thought we’d run ‘Intro to Graphs’ most often but the data doesn’t lie – it’s all about relational to graph. And which were the most popular repeat events in terms of ‘yes’ RSVPs?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS yesRSVPs
               WITH event.name AS event, COUNT(*) AS times, SUM(yesRSVPs) AS rsvps
               RETURN event, times, rsvps, rsvps / times AS rsvpsPerEvent
               ORDER BY rsvpsPerEvent DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times rsvps rsvpsPerEvent
1                                Fraud Detection using Neo4j     2   150            75
2                                            Intro to Graphs     6   352            58
3                          Graph Modelling - Do's and Don'ts     5   281            56
4                      Relational to graph: A worked example     7   367            52
5 Build your own recommendation engine with Neo4j in an hour     2    85            42
6          Hands On Intro to Cypher - Neo4j's Query Language     3   104            34

It looks like fraud is a popular topic although we’ve only run it twice so perhaps best not to read too much into that. We’re running that one again in a couple of weeks if you’re interested.

Ignoring repeat events let’s see which event drew the biggest crowd:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event.id AS id, event.name AS event, COUNT(*) AS rsvps
               RETURN event, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% head(5)
 
                                                                         event rsvps
1 Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133
2                          Modelling a recommendation engine: A worked example   118
3                    Building a repository of biomedical ontologies with Neo4j   107
4                     GraphHack @ Graph Connect: The night before Election Day    91
5                                        Bootstrapping a Recommendation Engine    88

A double header featuring Nicole White and Matt Wright proved to be the most popular event of the year and in fact the most popular in terms of ‘yes’ RSVPs so far:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS rsvps
               RETURN event.name AS event, event.time + event.utcOffset AS time, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$time = as.Date(as.POSIXct(eventsDF$time / 1000, origin="1970-01-01"))
eventsDF %>% mutate(year = format(time, "%Y")) %>% dplyr::select(-time) %>% head(10)
 
                                                                          event rsvps year
1  Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133 2015
2                           Modelling a recommendation engine: A worked example   118 2015
3                     Building a repository of biomedical ontologies with Neo4j   107 2015
4                                                    Real world Neo4j use cases    98 2014
5                                                           The transport graph    94 2014
6                                                     The Visualisation Special    93 2014
7                  Impossible is Nothing by Jim Webber, Neo4j's Chief Scientist    93 2014
8                      GraphHack @ Graph Connect: The night before Election Day    91 2015
9                                         Bootstrapping a Recommendation Engine    88 2015
10                                    Scraping and Graphing the Apple app store    88 2015

3 of the top 4 belong to 2015 and 6 of the top 10. Let’s see what 2016 has in store.

Thanks to everyone who’s come along to one of our meetups and Happy New Year!

Written by Mark Needham

December 31st, 2015 at 1:58 pm

Posted in neo4j

Tagged with

Neo4j: Specific relationship vs Generic relationship + property

with one comment

For optimal traversal speed in Neo4j queries we should make our relationship types as specific as possible.

Let’s take a look at an example from the ‘modelling a recommendations engine‘ talk I presented at Skillsmatter a couple of weeks ago.

I needed to decided how to model the ‘RSVP’ relationship between a Member and an Event. A person can RSVP ‘yes’ or ‘no’ to an event and I’d like to capture both of these responses.

i.e. we can choose between:

2015 12 13 20 39 05

and:

2015 12 13 20 39 54

When deciding on a model we mainly need to think about the types of queries that we want to write. We shouldn’t forget about updating the model but in my experience more time is spent querying graphs than updating them.

Let’s take a look at each of those in turn:

What queries do we want to write?

The first query was going to use previous ‘yes’ RSVPs as an indicator of interest for future events. We’re not interested in ‘no’ RSVPs for this query.

I started out with the generic RSVP relationship type with a ‘response’ property to distinguish between ‘yes’ and ‘no’:

MATCH (member:Member {name: "Mark Needham"})
MATCH (futureEvent:Event) WHERE futureEvent.time >= timestamp()
MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
 
OPTIONAL MATCH (member)-[rsvp:RSVPD {response: "yes"}]->(pastEvent)<-[:HOSTED_EVENT]-(group)
WHERE pastEvent.time < timestamp()
 
RETURN group.name, futureEvent.name, COUNT(rsvp) AS previousEvents
ORDER BY  previousEvents DESC


This ran reasonably quickly but I was curious whether I could get the query to run any quicker by changing to the more specific model. Using the more specific relationship type our query reads:

MATCH (member:Member {name: "Mark Needham"})
MATCH (futureEvent:Event) WHERE futureEvent.time >= timestamp()
MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
 
OPTIONAL MATCH (member)-[rsvp:RSVP_YES]->(pastEvent)<-[:HOSTED_EVENT]-(group)
WHERE pastEvent.time < timestamp()
 
RETURN group.name, 
       futureEvent.name, 
       COUNT(rsvp) AS previousEvents
ORDER BY  previousEvents DESC

We can now profile our query and compare the db hits of both solutions:

RSVPD {response: "yes"}
Cypher version: CYPHER 2.3, planner: COST. 688635 total db hits in 232 ms.
 
RSVP_YES
Cypher version: CYPHER 2.3, planner: COST. 559866 total db hits in 207 ms.

So we get a slight gain by using the more specific relationship type. The reason the db hits is lower is partly because we’ve removed the need to lookup the ‘response’ property on every ‘RSVP’ property and check that it matches ‘yes’. We’re also evaluating fewer relationships since we only look at positive RSVPs, negative ones are ignored.

Our next query might be to capture all the RSVPs made by a member and list them alongside the events:

MATCH (member:Member {name: "Mark Needham"})-[rsvp:RSVPD]->(event)
WHERE event.time < timestamp()
RETURN event.name, event.time, rsvp.response
ORDER BY event.time DESC
MATCH (member:Member {name: "Mark Needham"})-[rsvp:RSVP_YES|:RSVP_NO]->(event)
WHERE event.time < timestamp()
RETURN event.name, event.time, CASE TYPE(rsvp) WHEN "RSVP_YES" THEN "yes" ELSE "no" END AS response
ORDER BY event.time DESC

Again we see a marginal db hits win for the more specific relationship type:

RSVPD {response: "yes"} / RSVPD {response: "no"}
Cypher version: CYPHER 2.3, planner: COST. 684 total db hits in 37 ms.
 
RSVP_YES / RSVP_NO
Cypher version: CYPHER 2.3, planner: COST. 541 total db hits in 24 ms.

However, the query is quite unwieldy and unless we store the response as a property on the relationship the code to return ‘yes’ or ‘no’ is a bit awkward. The more specific approach query would become even more painful to deal with if we introduced the ‘waitlist’ RSVP which we’ve chosen to exclude.

Will we need to update the relationship?

Yes! Users are able to change their RSVP up until the event happens so we need to be able to handle that.

Let’s have a look at the queries we’d have to write to handle a change in RSVP using both models:

Generic relationship type

MATCH (event:Event {id: {event_id}})
MATCH (member:Member {id: {member_id}})
MERGE (member)-[rsvpRel:RSVPD {id: {rsvp_id}}]->(event)
ON CREATE SET rsvpRel.created = toint({mtime})
ON MATCH  SET rsvpRel.lastModified = toint({mtime})
SET rsvpRel.response = {response}

Specific relationship type

MATCH (event:Event {id: {event_id}})
MATCH (member:Member {id: {member_id}})
 
FOREACH(ignoreMe IN CASE WHEN {response} = "yes" THEN [1] ELSE [] END |
  MERGE (member)-[rsvpYes:RSVP_YES {id: {rsvp_id}}]->(event)
  ON CREATE SET rsvpYes.created = toint({mtime})
  ON MATCH  SET rsvpYes.lastModified = toint({mtime})
 
  MERGE (member)-[oldRSVP:RSVP_NO]->(event)
  DELETE oldRSVP
)
 
FOREACH(ignoreMe IN CASE WHEN {response} = "no" THEN [1] ELSE [] END |
  MERGE (member)-[rsvpNo:RSVP_NO {id: {rsvp_id}}]->(event)
  ON CREATE SET rsvpNo.created = toint({mtime})
  ON MATCH  SET rsvpNo.lastModified = toint({mtime})
 
  MERGE (member)-[oldRSVP:RSVP_YES]->(event)
  DELETE oldRSVP
)

As you can see, the code to update an RSVP is more complicated when using the specific relationship type due in part to Cypher not yet having first class support for conditionals.

In summary, for our meetup.com model we gain speed improvements by using more specific relationship types but at the expense of some more complicated read queries and a significantly more convoluted update query.

Depending on the cardinality of relationships in your model your mileage may vary but it’s worth doing some profiling to compare all your options.

Written by Mark Needham

December 13th, 2015 at 9:22 pm

Posted in neo4j

Tagged with

Neo4j: Facts as nodes

without comments

On Tuesday I spoke at the Neo4j London user group about incrementally building a recommendation engine and described the ‘facts as nodes’ modeling pattern, defined as follows in the Graph Databases book:

When two or more domain entities interact for a period of time, a fact emerges. We represent a fact as a separate node with connections to each of the entities engaged in that fact.

Modeling an action in terms of its product—that is, in terms of the thing that results from the action—produces a similar structure: an intermediate node that represents the outcome of an interaction between two or more entities.

We started with the following model describing a meetup member and the groups they’ve joined:

2015 12 04 07 26 11

This model works well for the query it was defined for – find groups similar to ones that I’m already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)
WITH member, topic, COUNT(*) AS score
MATCH (topic)<-[:HAS_TOPIC]-(otherGroup) 
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as score
ORDER BY score DESC

Prefixing that query with the ‘PROFILE’ keyword yields a query plan and the following summary text:

Cypher version: CYPHER 2.3, planner: COST. 89100 total db hits in 113 ms.

In this model it feels like there is a membership fact waiting to become a node.

2015 12 04 07 35 38

We can refactor towards that model with the following query:

MATCH (member:Member)-[rel:MEMBER_OF]->(group)
 
MERGE (membership:Membership {id: member.id + "_" + group.id})
SET membership.joined = rel.joined
 
MERGE (member)-[:HAS_MEMBERSHIP]->(membership)
MERGE (membership)-[:OF_GROUP]->(group);

We’d answer our initial question with the following query:

MATCH (member:Member {name: "Mark Needham"})-[:HAS_MEMBERSHIP]->()-[:OF_GROUP]->(group:Group)-[:HAS_TOPIC]->(topic)
WITH member, topic, COUNT(*) AS score
MATCH (topic)<-[:HAS_TOPIC]-(otherGroup) 
WHERE NOT (member)-[:HAS_MEMBERSHIP]->(:Membership)-[:OF_GROUP]->(otherGroup:Group)
RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as score
ORDER BY score DESC

at the following cost:

Cypher version: CYPHER 2.3, planner: COST. 468201 total db hits in 346 ms.

The membership node hasn’t proved its value yet – it does 4x more work to get the same result. However, the next question we want to answer is ‘what group do people join after the Neo4j user group?’ where it might come in handy.

First we’ll add a ‘NEXT’ relationship between a user’s adjacent group memberships by writing the following query:

MATCH (member:Member)-[:HAS_MEMBERSHIP]->(membership)
 
WITH member, membership ORDER BY member.id, membership.joined
 
WITH member, COLLECT(membership) AS memberships
UNWIND RANGE(0,SIZE(memberships) - 2) as idx
 
WITH memberships[idx] AS m1, memberships[idx+1] AS m2
MERGE (m1)-[:NEXT]->(m2);

And now for the query:

MATCH (group:Group {name: "Neo4j - London User Group"})<-[:OF_GROUP]-(membership)-[:NEXT]->(nextMembership),         
      (membership)<-[:HAS_MEMBERSHIP]-(member:Member)-[:HAS_MEMBERSHIP]->(nextMembership),
      (nextMembership)-[:OF_GROUP]->(nextGroup)
RETURN nextGroup.name, COUNT(*) AS times
ORDER BY times DESC
Cypher version: CYPHER 2.3, planner: COST. 23671 total db hits in 39 ms.

And for comparison – the same query using the initial model:

MATCH (group:Group {name: "Neo4j - London User Group"})<-[membership:MEMBER_OF]-(member),
      (member)-[otherMembership:MEMBER_OF]->(otherGroup)
WHERE membership.joined < otherMembership.joined
WITH member, otherGroup 
ORDER BY otherMembership.joined
WITH member, COLLECT(otherGroup)[0] AS nextGroup
RETURN nextGroup.name, COUNT(*) AS times
ORDER BY times DESC
Cypher version: CYPHER 2.3, planner: COST. 86179 total db hits in 138 ms.

This time the membership model does 3x less work, so depending on the question a different model works better.

Given this observation we might choose to keep both models. The disadvantage of doing that is that we pay write and maintenance penalties to keep them both in sync. e.g. this is what queries to add a new membership or remove one would look like

Adding group membership

WITH "Mark Needham" AS memberName, 
     "Neo4j - London User Group" AS groupName,
     timestamp() AS now
 
MATCH (group:Group {name: groupName})
MATCH (member:Member {name: memberName})
 
MERGE (member)-[memberOfRel:MEMBER_OF]->(group)
ON CREATE SET memberOfRel.time = now
 
MERGE (membership:Membership {id: member.id + "_" + group.id})
ON CREATE SET membership.joined = now
MERGE (member)-[:HAS_MEMBERSHIP]->(membership)
MERGE (membership)-[:OF_GROUP]->(group)

Removing group membership

WITH "Mark Needham" AS memberName, 
     "Neo4j - London User Group" AS groupName,
     timestamp() AS now
 
MATCH (group:Group {name: groupName})
MATCH (member:Member {name: memberName})
 
MATCH (member)-[memberOfRel:MEMBER_OF]->(group)
 
MATCH (membership:Membership {id: member.id + "_" + group.id})
MATCH (member)-[hasMembershipRel:HAS_MEMBERSHIP]->(membership)
MATCH (membership)-[ofGroupRel:OF_GROUP]->(group)
 
DELETE memberOfRel, hasMembershipRel, ofGroupRel

The dataset is on github so take a look at it and send any questions my way.

Written by Mark Needham

December 4th, 2015 at 7:52 am

Posted in neo4j

Tagged with

Neo4j: Summarising neo4j-shell output

without comments

I frequently find myself trying to optimise a set of cypher queries and I tend to group them together in a script that I fed to the Neo4j shell.

When tweaking the queries it’s easy to make a mistake and end up not creating the same data so I decided to write a script which will show me the aggregates of all the commands executed.

I want to see the number of constraints created, indexes added, nodes, relationships and properties created. The first 2 don’t need to match across the scripts but the latter 3 should be the same.

I put together the following script:

import re
import sys
from tabulate import tabulate
 
lines = sys.stdin.readlines()
 
def search(term, line):
    m =  re.match(term + ": (.*)", line)
    return (int(m.group(1)) if m else 0)
 
nodes_created, relationships_created, constraints_added, indexes_added, labels_added, properties_set = 0, 0, 0, 0, 0, 0
for line in lines:
    nodes_created = nodes_created + search("Nodes created", line)
    relationships_created = relationships_created + search("Relationships created", line)
    constraints_added = constraints_added + search("Constraints added", line)
    indexes_added = indexes_added + search("Indexes added", line)
    labels_added = labels_added + search("Labels added", line)
    properties_set = properties_set + search("Properties set", line)
 
    time_match = re.match("real.*([0-9]+m[0-9]+\.[0-9]+s)$", line)
 
    if time_match:
        time = time_match.group(1)
 
table = [
            ["Constraints added", constraints_added],
            ["Indexes added", indexes_added],
            ["Nodes created", nodes_created],
            ["Relationships created", relationships_created],
            ["Labels added", labels_added],
            ["Properties set", properties_set],
            ["Time", time]
         ]
print tabulate(table)

Its input is the piped output of the neo4j-shell command which will contain a description of all the queries it executed.

$ cat import.sh
#!/bin/sh
 
{ ./neo4j-community-2.2.3/bin/neo4j stop; } 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
{ ./neo4j-community-2.2.3/bin/neo4j start; } 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file $1; } 2>&1

We can use the script in two ways.

Either we can pipe the output of our shell straight into it and just get the summary e.g.

$ ./import.sh local.import.optimised.cql | python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m17.595s
---------------------  ---------

…or we can make use of the ‘tee’ function in Unix and pipe the output into stdout and into the file and then either tail the file on another window or inspect it afterwards to see the detailed timings. e.g.

$ ./import.sh local.import.optimised.cql | tee /tmp/output.txt |  python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m11.428s
---------------------  ---------
$ tail -f /tmp/output.txt
+-------------+
| appearances |
+-------------+
| 3771        |
+-------------+
1 row
Nodes created: 3439
Properties set: 3439
Labels added: 3439
289 ms
+------------------------------------+
| appearances -> player, match, team |
+------------------------------------+
| 3771                               |
+------------------------------------+
1 row
Relationships created: 10317
1006 ms
...

My only dependency is the tabulate package to get the pretty table:

$ cat requirements.txt
 
tabulate==0.7.5

The cypher script I’m running creates a BBC football graph which is available as a github project. Feel free to grab it and play around – any problems let me know!

Written by Mark Needham

August 21st, 2015 at 8:59 pm

Posted in neo4j

Tagged with

Neo4j 2.2.3: Unmanaged extensions – Creating gzipped streamed responses with Jetty

without comments

Back in 2013 I wrote a couple of blog posts showing examples of an unmanaged extension which had a streamed and gzipped response but two years on I realised they were a bit out of date and deserved a refresh.

When writing unmanaged extensions in Neo4j a good rule of thumb is to try and reduce the amount of objects you keep hanging around. In this context this means that we should stream our response to the client as quickly as possible rather than building it up in memory and sending it in one go.

The documentation has a good example showing how to stream a list of colleagues but in this blog post we’ll look at how to do something simpler – we’ll create a couple of nodes representing people and then write an unmanaged extension to return them.

We’ll first create an unmanaged extension which runs a cypher query, iterates through the rows returned and sends them to the client:

@Path("/example")
public class ExampleResource {
    private final GraphDatabaseService db;
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
 
    public ExampleResource(@Context GraphDatabaseService db) {
        this.db = db;
    }
 
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @Path("/people")
    public Response allNodes() throws IOException {
        StreamingOutput stream = streamQueryResponse("MATCH (n:Person) RETURN n.name AS name");
        return Response.ok().entity(stream).type(MediaType.APPLICATION_JSON).build();
    }
 
    private StreamingOutput streamQueryResponse(final String query) {
        return new StreamingOutput() {
                @Override
                public void write(OutputStream os) throws IOException, WebApplicationException {
                    JsonGenerator jg = OBJECT_MAPPER.getJsonFactory().createJsonGenerator(os, JsonEncoding.UTF8);
                    jg.writeStartArray();
 
                    writeQueryResultTo(query, jg);
 
                    jg.writeEndArray();
                    jg.flush();
                    jg.close();
                }
            };
    }
 
    private void writeQueryResultTo(String query, JsonGenerator jg) throws IOException {
        try (Result result = db.execute(query)) {
            while (result.hasNext()) {
                Map<String, Object> row = result.next();
 
                jg.writeStartObject();
                for (Map.Entry<String, Object> entry : row.entrySet()) {
                    jg.writeFieldName(entry.getKey());
                    jg.writeString(entry.getValue().toString());
                }
                jg.writeEndObject();
            }
        }
    }
}

There’s nothing too complicated going on here although notice that we make much more fine grained calls to the JSON Library rather than created a JSON object in memory and calling ObjectMapper#writeValueAsString on it.

To get this to work we’d build a JAR containing this class, put that into the plugins folder and then add the following property to conf/neo4j-server.properties (or the Neo4j desktop equivalent) before restarting the server:

org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.unmanaged=/unmanaged

We can then test it out like this:

$ curl http://localhost:7474/unmanaged/example/people
[{"name":"Mark"},{"name":"Nicole"}]

I’ve put in a couple of test people nodes – full instructions are available on the github README page.

Next we want to make it possible to send that response in the gzip format. To do that we need to add a GzipFilter to the Neo4j lifecycle. This class has moved to a different namespace in Jetty 9 which Neo4j 2.2.3 depends on, but the following class does the job:

import org.eclipse.jetty.servlets.GzipFilter;
 
public class GZipInitialiser implements SPIPluginLifecycle {
    private WebServer webServer;
 
    @Override
    public Collection<Injectable<?>> start(NeoServer neoServer) {
        webServer = getWebServer(neoServer);
        GzipFilter filter = new GzipFilter();
 
        webServer.addFilter(filter, "/*");
        return Collections.emptyList();
    }
 
    private WebServer getWebServer(final NeoServer neoServer) {
        if (neoServer instanceof AbstractNeoServer) {
            return ((AbstractNeoServer) neoServer).getWebServer();
        }
        throw new IllegalArgumentException("expected AbstractNeoServer");
    }
 
    @Override
    public Collection<Injectable<?>> start(GraphDatabaseService graphDatabaseService, Configuration configuration) {
        throw new IllegalAccessError();
    }
 
    @Override
    public void stop() {
 
    }
}

I needed to include the jersey-servlets JAR in my unmanaged extension JAR in order for this to work correctly. Once we redeploy the JAR and restart Neo4j we can try making the same request as above but with a gzip header:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people
��V�K�MU�R�M,�V�Ձ��2��sR�jcf(#

We can unpack that on the fly by piping it through gunzip to check we get a sensible result:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people | gunzip
[{"name":"Mark"},{"name":"Nicole"}]

And there we have it – a gzipped streamed response. All the code is on github so give it a try and give me a shout if it doesn’t work. The fastest way to get me is probably on our new shiny neo4j-users Slack group.

Written by Mark Needham

August 10th, 2015 at 11:57 pm

Posted in neo4j

Tagged with

Neo4j: Cypher – Removing consecutive duplicates

without comments

When writing Cypher queries I sometimes find myself wanting to remove consecutive duplicates in collections that I’ve joined together.

e.g we might start with the following query where 1 and 7 appear consecutively:

RETURN [1,1,2,3,4,5,6,7,7,8] AS values
 
==> +-----------------------+
==> | values                |
==> +-----------------------+
==> | [1,1,2,3,4,5,6,7,7,8] |
==> +-----------------------+
==> 1 row

We want to end up with [1,2,3,4,5,6,7,8]. We can start by exploding our array and putting consecutive elements next to each other:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
RETURN idx, idx+1, values[idx], values[idx+1]
 
==> +-------------------------------------------+
==> | idx | idx+1 | values[idx] | values[idx+1] |
==> +-------------------------------------------+
==> | 0   | 1     | 1           | 1             |
==> | 1   | 2     | 1           | 2             |
==> | 2   | 3     | 2           | 3             |
==> | 3   | 4     | 3           | 4             |
==> | 4   | 5     | 4           | 5             |
==> | 5   | 6     | 5           | 6             |
==> | 6   | 7     | 6           | 7             |
==> | 7   | 8     | 7           | 7             |
==> | 8   | 9     | 7           | 8             |
==> +-------------------------------------------+
==> 9 rows

Next we can filter out rows which have the same values since that means they have consecutive duplicates:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN a,b
 
==> +-------+
==> | a | b |
==> +-------+
==> | 1 | 2 |
==> | 2 | 3 |
==> | 3 | 4 |
==> | 4 | 5 |
==> | 5 | 6 |
==> | 6 | 7 |
==> | 7 | 8 |
==> +-------+
==> 7 rows

Now we need to join the collection back together again. Most of the values we want are in field ‘b’ but we also need to grab the first value from field ‘a’:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

What about if we have more than 2 duplicates in a row?

WITH [1,1,1,2,3,4,5,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

Still happy, good times! Of course if we have a non consecutive duplicate that wouldn’t be removed:

WITH [1,1,1,2,3,4,5,5,6,7,7,8,1] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +---------------------+
==> | noDuplicates        |
==> +---------------------+
==> | [1,2,3,4,5,6,7,8,1] |
==> +---------------------+
==> 1 row

Written by Mark Needham

July 30th, 2015 at 6:23 am

Posted in neo4j

Tagged with ,

Neo4j: MERGE’ing on super nodes

with one comment

In my continued playing with the Chicago crime data set I wanted to connect the crimes committed to their position in the FBI crime type hierarchy.

These are the sub graphs that I want to connect:

2015 07 26 22 19 04

We have a ‘fbiCode’ on each ‘Crime’ node which indicates which ‘Crime Sub Category’ the crime belongs to.

I started with the following query to connect the nodes together:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
MERGE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

I had this running inside a Python script which incremented ‘skip’ by 10,000 on each iteration as long as ‘crimesProcessed’ came back with a value > 0.

To start with the ‘CATEGORY’ relationships were being created very quickly but it slowed down quite noticeably about 1 million nodes in.

I profiled the queries but the query plans didn’t show anything obviously wrong. My suspicion was that I had a super node problem where the cypher run time was iterating through all of the sub category’s relationships to check whether one of them pointed to the crime on the other side of the ‘MERGE’ statement.

I cancelled the import job and wrote a query to check how many relationships each sub category had. It varied from 1,000 to 93,000 somewhat confirming my suspicion.

Michael suggested tweaking the query to use the shortestpath function to check for the existence of the relationship and then use the ‘CREATE’ clause to create it if it didn’t exist.

The neat thing about the shortestpath function is that it will start from the side with the lowest cardinality and as soon as it finds a relationship it will stop searching. Let’s have a look at that version of the query:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
WITH crime, subCat, shortestPath((crime)-[:CATEGORY]->(subCat)) AS path
FOREACH(ignoreMe IN CASE WHEN path is NULL THEN [1] ELSE [] END |
  CREATE (crime)-[:CATEGORY]->(subCat))
RETURN COUNT(*)

This worked much better – 10,000 nodes processed in ~ 2.5 seconds – and the time remained constant as more relationships were added. This allowed me to create all the category nodes but we can actually do even better if we use CREATE UNIQUE instead of MERGE

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
CREATE UNIQUE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

Using this query 10,000 nodes took ~ 250ms -900ms second to process which means we can process all the nodes in 5-6 minutes – good times!

I’m not super familiar with the ‘CREATE UNIQUE’ code so I’m not sure that it’s always a good substitute for ‘MERGE’ but on this occasion it does the job.

The lesson for me here is that if a query is taking longer than you think it should try and use other constructs / a combination of other constructs and see whether things improve – they just might!

Written by Mark Needham

July 28th, 2015 at 9:04 pm

Posted in neo4j

Tagged with

Neo4j: From JSON to CSV to LOAD CSV via jq

without comments


In my last blog post I showed how to import a Chicago crime categories & sub categories JSON document using Neo4j’s cypher query language via the py2neo driver. While this is a good approach for people with a developer background, many of the users I encounter aren’t developers and favour using Cypher via the Neo4j browser.

If we’re going to do this we’ll need to transform our JSON document into a CSV file so that we can use the LOAD CSV command on it. Michael pointed me to the jq tool which comes in very handy.

To recap, this is a part of the JSON file:

{
    "categories": [
        {
            "name": "Index Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        },
        {
            "name": "Non-Index Crime",
            "sub_categories": [
                {
                    "code": "01B",
                    "description": "Involuntary Manslaughter"
                },
            ]
        },
        {
            "name": "Violent Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        }
    ]
}

We want to get one row for each sub category which contains three columns – category name, sub category code, sub category description.

First we need to pull out the categories:

$ jq ".categories[]" categories.json
 
{
  "name": "Index Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}
{
  "name": "Non-Index Crime",
  "sub_categories": [
    {
      "code": "01B",
      "description": "Involuntary Manslaughter"
    },
  ]
}
{
  "name": "Violent Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}

Next we want to create a row for each sub category with the category alongside it. We can use the pipe function to combine the two selectors:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]}" categories.json
 
{
  "name": "Index Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}
...
{
  "name": "Non-Index Crime",
  "sub_category": {
    "code": "01B",
    "description": "Involuntary Manslaughter"
  }
}
...
{
  "name": "Violent Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}

Now we want to un-nest the sub category:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description]" categories.json
 
[
  "Index Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]
 
[
  "Non-Index Crime",
  "01B",
  "Involuntary Manslaughter"
]
 
[
  "Violent Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]

And finally let’s use the @csv filter to generate CSV lines:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"\"Index Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Index Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Index Crime\",\"03\",\"Robbery\""
"\"Index Crime\",\"04A\",\"Aggravated Assault\""
"\"Index Crime\",\"04B\",\"Aggravated Battery\""
"\"Index Crime\",\"05\",\"Burglary\""
"\"Index Crime\",\"06\",\"Larceny\""
"\"Index Crime\",\"07\",\"Motor Vehicle Theft\""
"\"Index Crime\",\"09\",\"Arson\""
"\"Non-Index Crime\",\"01B\",\"Involuntary Manslaughter\""
"\"Non-Index Crime\",\"08A\",\"Simple Assault\""
"\"Non-Index Crime\",\"08B\",\"Simple Battery\""
"\"Non-Index Crime\",\"10\",\"Forgery & Counterfeiting\""
"\"Non-Index Crime\",\"11\",\"Fraud\""
"\"Non-Index Crime\",\"12\",\"Embezzlement\""
"\"Non-Index Crime\",\"13\",\"Stolen Property\""
"\"Non-Index Crime\",\"14\",\"Vandalism\""
"\"Non-Index Crime\",\"15\",\"Weapons Violation\""
"\"Non-Index Crime\",\"16\",\"Prostitution\""
"\"Non-Index Crime\",\"17\",\"Criminal Sexual Abuse\""
"\"Non-Index Crime\",\"18\",\"Drug Abuse\""
"\"Non-Index Crime\",\"19\",\"Gambling\""
"\"Non-Index Crime\",\"20\",\"Offenses Against Family\""
"\"Non-Index Crime\",\"22\",\"Liquor License\""
"\"Non-Index Crime\",\"24\",\"Disorderly Conduct\""
"\"Non-Index Crime\",\"26\",\"Misc Non-Index Offense\""
"\"Violent Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Violent Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Violent Crime\",\"03\",\"Robbery\""
"\"Violent Crime\",\"04A\",\"Aggravated Assault\""
"\"Violent Crime\",\"04B\",\"Aggravated Battery\""

The only annoying thing about this output is that all the double quotes are escaped. We can sort that out by passing the ‘-r’ flag when we call jq:

$ jq -r ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"
"Non-Index Crime","01B","Involuntary Manslaughter"
"Non-Index Crime","08A","Simple Assault"
"Non-Index Crime","08B","Simple Battery"
"Non-Index Crime","10","Forgery & Counterfeiting"
"Non-Index Crime","11","Fraud"
"Non-Index Crime","12","Embezzlement"
"Non-Index Crime","13","Stolen Property"
"Non-Index Crime","14","Vandalism"
"Non-Index Crime","15","Weapons Violation"
"Non-Index Crime","16","Prostitution"
"Non-Index Crime","17","Criminal Sexual Abuse"
"Non-Index Crime","18","Drug Abuse"
"Non-Index Crime","19","Gambling"
"Non-Index Crime","20","Offenses Against Family"
"Non-Index Crime","22","Liquor License"
"Non-Index Crime","24","Disorderly Conduct"
"Non-Index Crime","26","Misc Non-Index Offense"
"Violent Crime","01A","Homicide 1st & 2nd Degree"
"Violent Crime","02","Criminal Sexual Assault"
"Violent Crime","03","Robbery"
"Violent Crime","04A","Aggravated Assault"
"Violent Crime","04B","Aggravated Battery"

Excellent. The only thing left is to write a header and then direct the output into a CSV file and get it into Neo4j:

$ echo "category,sub_category_code,sub_category_description" > categories.csv
$ jq -r ".categories[] |
         {name: .name, sub_category: .sub_categories[]} |
         [.name, .sub_category.code, .sub_category.description] |
         @csv " categories.json >> categories.csv
$ head -n10 categories.csv
category,sub_category_code,sub_category_description
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/categories.csv" AS row
MERGE (c:CrimeCategory {name: row.category})
MERGE (sc:SubCategory {code: row.sub_category_code})
ON CREATE SET sc.description = row.sub_category_description
MERGE (c)-[:CHILD]->(sc)

And that’s it!

Graph  25

Written by Mark Needham

July 25th, 2015 at 11:05 pm

Posted in neo4j

Tagged with