Mark Needham

Thoughts on Software Development

neo4j: Loading data – REST API vs Batch Import

without comments

A couple of weeks ago when I first started playing around with my football data set I was loading all the data into neo4j using the REST API via neography which was taking around 4 minutes to load.

The data set consisted of just over 250 matches which translated into 8,000 nodes & 30,000 relationships so it’s very small by all means.

Ashok and I were discussing how that could be quicker and the first thing we tried was to store inserted nodes in an in memory hash map and look them up from there rather than doing an index lookup each time.

These were the timings for different numbers of matches when I did that:

--------------------------------------------------------------------
| Matches | Cache-Hits | Cache-Misses | Lucene        | In memory  |
--------------------------------------------------------------------
| 25      | 501        | 325          | 26.692s       | 22.877s    |
| 50      | 1275       | 373          | 50.491s       | 38.304s    |
| 263     | 8016       | 480          | 4m 11.031s    | 2m 49.951s |
--------------------------------------------------------------------

For the full data set it was about 30% faster which was a nice improvement but still left me waiting around for a bit longer than I wanted to!

I’ve previously used the batch inserter and I was planning to use that again to get a significant improvement in loading time until Ashok pointed out Michael Hunger’s batch-import which seemed worth a try.

I had to add an extra step to the pipeline to put all the nodes and relationships into CSV files and then pass those files to the batch-import JAR.

There was a massive improvement in the load time using it. These were the timings:

-------------------------------------------------------------------
| Matches | Lucene        | In memory  | Batch Import             |
-------------------------------------------------------------------
| 25      | 26.692s       | 22.877s    | 0.378s + 0.921s = 1.299s |
| 50      | 50.491s       | 38.304s    | 0.392s + 1.025s = 1.417s |
| 263     | 4m 11.031s    | 2m 49.951s | 0.524s + 1.239s = 1.763s |
-------------------------------------------------------------------

(the two numbers represent the time taken to generate the CSV files and then the time to import them)

From my brief skimming of the code it seems to take in the files and then route them through the batch importer API so I imagine similar results would be had by calling that directly.

I know this is not a very fair comparison given that you probably shouldn’t be using the REST API to insert data but since I’ve done it a couple of times I thought it’d be interesting to measure anyway!

Written by Mark Needham

February 28th, 2013 at 11:36 pm

Posted in neo4j

Tagged with