Neo4j: From Graph Model to Neo4j Import
In case you haven’t come across this dataset before, Tomaz Bratanic has a great blog post explaining it.
The tl;dr is that we have articles, authors, and venues. Authors can write articles, articles can reference other articles, and articles are presented at a venue. Below is the graph model for this dataset:
Tomaz then goes on to show how to load the data into Neo4j using Cypher and the APOC library. Unfortunately this process is quite slow because there’s a lot of data, so, since I’m not as patient as Tomaz, I wanted to find a faster way to get the data into the graph.
At a high level, the diagram below shows what happens when we import data using Cypher:
When we use the Neo4j import tool we’re able to skip that middle bit:
We skip the whole transactional machinery, so it’s much faster. We do pay a price for that extra speed:
We can only use this tool to create a brand new database
If there’s an error while it runs we need to start again from the beginning
We need to get our data into the format that the tool expects
If we’re happy with that trade-off then Neo4j Import is the way to go.
The raw data is in JSON files, so our process will be as follows:
We don’t have to use a Python script to transform the raw data, but that’s the scripting language I’m most familiar with so I tend to use that.
Before we look at the script, let’s explore the JSON files that we need to transform. Below is a sample of one of these files:
Each row contains one JSON document and, as we can see, these documents contain the following keys:
abstract - the abstract of the article
authors - the authors who wrote the article
n_citation - we won’t use this key
references - the other articles that an article cites
title - the title of the article
venue - the venue where the article was published
year - the year the article was published
id - the id of the article
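The sample document itself isn’t reproduced here, but an illustrative (entirely made-up) document with the same keys would look like this:

```json
{
  "abstract": "An example abstract describing the article.",
  "authors": ["Jane Doe", "John Smith"],
  "n_citation": 5,
  "references": ["article-id-2", "article-id-3"],
  "title": "A Hypothetical Article Title",
  "venue": "Journal of Examples",
  "year": 2019,
  "id": "article-id-1"
}
```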
For each of these JSON documents we’re going to create the following graph structure:
(:Article)-[:VENUE]->(:Venue)
(:Article)-[:AUTHOR]->(:Author)
(:Article)-[:REFERENCES]->(:Article)
() indicates a node, e.g. (:Article) means that we have a node with the label Article. [] indicates a relationship, e.g. [:VENUE] means that we have a relationship with the type VENUE.
The properties from our JSON document will be assigned as follows:
authors - we create one Author node per value in the array, with the value assigned to the name property
references - we create one Article node per value in the array, with the value assigned to the id property
venue - we create one Venue node, with the value assigned to the name property
The remaining keys (id, title, abstract, year) become properties on the Article node itself.
So we know what data we’re going to extract from our JSON file, but what should the CSV files that we create look like?
The Neo4j Import Tool expects separate CSV files containing the nodes and relationships that we want to create. For each of those files we can either include a header line at the top of the file, or we can store the header line in a separate file. We’re going to take the latter approach in this blog post.
The fields in those headers have a very specific format; it’s almost a mini DSL.
Let’s take the example of the Venue nodes and the corresponding relationship. We’ll have three CSV files. The node headers will look like this:

For node files, one of the fields needs to act as an identifier. We define this by including :ID in that field.
In our example we’re also specifying an optional node group in parentheses, e.g. :ID(Venue). The node group is used to indicate that the identifier only needs to be unique within that group, rather than across all nodes.
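As an illustration of what that header file might contain for the Venue nodes (the property name name is an assumption about how the script lays out the file), a venues_header.csv could hold the single field:

```
name:ID(Venue)
```

The corresponding venues.csv data file would then contain one venue name per line.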
And the relationships header will look like this:
Again our header fields need to contain some custom values:
:START_ID - indicates that this field contains the identifier for the source node
:END_ID - indicates that this field contains the identifier for the target node
As with the nodes, we can specify a node group in parentheses, e.g. :START_ID(Article).
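Under the same assumptions about file layout, an article_VENUE_venue_header.csv file might look like this:

```
:START_ID(Article),:END_ID(Venue)
```

Each row of the matching data file would then contain an article id followed by a venue name.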
So now we know what files we need to create, let’s have a look at the script that generates these files:
There’s nothing too clever going on here - we’re iterating over the JSON files and writing into CSV files. For the relationships, we write those directly to the CSV files as soon as we see them. For the nodes, we collect those in in-memory data structures to remove duplicates, and then write them to the CSV files at the end.
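The original script isn’t reproduced here, but a minimal sketch of the approach just described might look like this. The transform function, the file names, and the exact Article property layout are my assumptions, not the original code:

```python
import csv
import json
from pathlib import Path

def transform(json_lines, out_dir):
    """Turn one-JSON-document-per-line input into the CSV files the
    Neo4j Import Tool expects. Relationships are written as soon as we
    see them; nodes are collected in sets so duplicates are removed."""
    out = Path(out_dir)
    authors, venues = set(), set()
    with open(out / "articles.csv", "w", newline="") as articles_f, \
         open(out / "article_AUTHOR_author.csv", "w", newline="") as author_rels_f, \
         open(out / "article_VENUE_venue.csv", "w", newline="") as venue_rels_f, \
         open(out / "article_REFERENCES_article.csv", "w", newline="") as ref_rels_f:
        articles = csv.writer(articles_f)
        author_rels = csv.writer(author_rels_f)
        venue_rels = csv.writer(venue_rels_f)
        ref_rels = csv.writer(ref_rels_f)
        for line in json_lines:
            doc = json.loads(line)
            articles.writerow([doc["id"], doc.get("title", ""),
                               doc.get("abstract", ""), doc.get("year", "")])
            # Relationships go straight to disk as we see them
            for author in doc.get("authors", []):
                authors.add(author)
                author_rels.writerow([doc["id"], author])
            venue = doc.get("venue")
            if venue:
                venues.add(venue)
                venue_rels.writerow([doc["id"], venue])
            for reference in doc.get("references", []):
                ref_rels.writerow([doc["id"], reference])
    # Node files are written last, once all duplicates have been removed
    with open(out / "authors.csv", "w", newline="") as f:
        csv.writer(f).writerows([a] for a in sorted(authors))
    with open(out / "venues.csv", "w", newline="") as f:
        csv.writer(f).writerows([v] for v in sorted(venues))
```

A real version would also need to handle Article nodes that only ever appear in a references array, which is another reason the original script de-duplicates nodes in memory before writing them out.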
Let’s have a look at a subset of the data in those CSV files:
This script takes a few minutes to run on my machine. There are certainly some ways we could improve the script, but this is good enough as a first pass.
Now it’s time to import the data into Neo4j. If we’re using the Neo4j Desktop we can access the Neo4j Import Tool via the 'Terminal' tab within a database.
We can then paste in the script below:
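The script itself isn’t reproduced here, but judging from the file names in the output further down, the invocation is along these lines. This is a sketch assuming Neo4j 3.5’s neo4j-admin import syntax and requires a Neo4j installation to run:

```shell
./bin/neo4j-admin import \
  --nodes:Author "data/authors_header.csv,data/authors.csv" \
  --nodes:Article "data/articles_header.csv,data/articles.csv" \
  --nodes:Venue "data/venues_header.csv,data/venues.csv" \
  --relationships:REFERENCES "data/article_REFERENCES_article_header.csv,data/article_REFERENCES_article.csv" \
  --relationships:AUTHOR "data/article_AUTHOR_author_header.csv,data/article_AUTHOR_author.csv" \
  --relationships:VENUE "data/article_VENUE_venue_header.csv,data/article_VENUE_venue.csv"
```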
Let’s go through the arguments we’re passing to the Neo4j Import Tool. The line below imports our author nodes:

--nodes:Author "data/authors_header.csv,data/authors.csv"

The syntax of this part of the command is --nodes:Label "header_file,data_file". It treats the files that we provide as if they’re one big file, and the first line of the first file needs to contain the header line. So in this line we’re saying that we want to create one node for each entry in the authors.csv file, and each of those nodes should have the label Author.
The line below creates relationships between our Article and Venue nodes:

--relationships:VENUE "data/article_VENUE_venue_header.csv,data/article_VENUE_venue.csv"

The syntax of this part of the command is --relationships:TYPE "header_file,data_file". Again, it treats the files we provide as if they’re one big file, so the first line of the first file must contain the header line. In this line we’re saying that we want to create a relationship with the type VENUE for each entry in the article_VENUE_venue.csv file.
Now if we run the script, we’ll see the following output:
Neo4j version: 3.5.3
Importing the contents of these files into /home/markhneedham/.config/Neo4j Desktop/Application/neo4jDatabases/database-32ed444f-69cd-422c-81aa-0c634210ad50/installation-3.5.3/data/databases/graph.db:

Nodes:
  :Author
  /home/markhneedham/projects/dblp/data/authors_header.csv
  /home/markhneedham/projects/dblp/data/authors.csv

  :Article
  /home/markhneedham/projects/dblp/data/articles_header.csv
  /home/markhneedham/projects/dblp/data/articles.csv

  :Venue
  /home/markhneedham/projects/dblp/data/venues_header.csv
  /home/markhneedham/projects/dblp/data/venues.csv

Relationships:
  :REFERENCES
  /home/markhneedham/projects/dblp/data/article_REFERENCES_article_header.csv
  /home/markhneedham/projects/dblp/data/article_REFERENCES_article.csv

  :AUTHOR
  /home/markhneedham/projects/dblp/data/article_AUTHOR_author_header.csv
  /home/markhneedham/projects/dblp/data/article_AUTHOR_author.csv

  :VENUE
  /home/markhneedham/projects/dblp/data/article_VENUE_venue_header.csv
  /home/markhneedham/projects/dblp/data/article_VENUE_venue.csv

Available resources:
  Total machine memory: 31.27 GB
  Free machine memory: 1.93 GB
  Max heap memory : 910.50 MB
  Processors: 8
  Configured max memory: 27.34 GB
  High-IO: false

WARNING: heap size 910.50 MB may be too small to complete this import. Suggested heap size is 1.00 GB
Import starting 2019-03-27 09:43:54.934+0000
  Estimated number of nodes: 7.84 M
  Estimated number of node properties: 24.08 M
  Estimated number of relationships: 36.99 M
  Estimated number of relationship properties: 0.00
  Estimated disk space usage: 4.78 GB
  Estimated required memory usage: 1.09 GB

IMPORT DONE in 2m 52s 306ms.
Imported:
  4850632 nodes
  37215467 relationships
  15328803 properties
Peak memory usage: 1.08 GB
And now let’s start the database and have a look at the graph we’ve imported:
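As a first exploratory query (illustrative only; the property names follow the assumptions above), we could ask for a few articles along with their venue and authors:

```cypher
MATCH (article:Article)-[:VENUE]->(venue:Venue)
OPTIONAL MATCH (article)-[:AUTHOR]->(author:Author)
RETURN article.title, venue.name, collect(author.name) AS authors
LIMIT 5
```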