29 Jan 2020

Neo4j: Finding the longest path

One on my favourite things about storing data in a graph database is executing path based queries against that data. I’ve been trying to find a way to write such queries against the Australian Open QuickGraph, and in this blog post we’re going to write what I think of as longest path queries against this graph.

Figure 1. Finding longest paths in Neo4j

Setting up Neo4j

We’re going to use the following Docker Compose configuration in this blog post:

Dockerfile

version: '3.7'

services:
  neo4j:
    image: neo4j:4.0.0-enterprise
    container_name: "quickgraph-aus-open"
    volumes:
      - ./plugins:/plugins
      - ./data:/data
      - ./import:/var/lib/neo4j/import
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - "NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
      - "NEO4J_AUTH=neo4j/neo"
      - NEO4J_apoc_import_file_use__neo4j__config=true
      - NEO4J_apoc_import_file_enabled=true
      - NEO4J_apoc_export_file_enabled=true
      - NEO4JLABS_PLUGINS=["apoc"]

Once we’ve created that file we need to open a terminal session where that file lives and then run docker-compose up to launch Neo4j.

Importing our dataset

We’re going to learn how to write longest path queries using a dataset that contains the finalists of the Australian Open tennis tournament.

:use system;
CREATE DATABASE blog;
:use blog

The following query creates tournaments and relationships between them

CALL apoc.load.json("https://github.com/mneedham/australian-open-neo4j/raw/master/blog_data/tournaments.json")
YIELD value
CALL apoc.merge.node(value.t1.labels, value.t1.properties) YIELD node AS t1
CALL apoc.merge.node(value.t2.labels, value.t2.properties) YIELD node AS t2
CALL apoc.merge.relationship(t1, value.rel.label, {}, {}, t2, {}) YIELD rel
RETURN count(*);

The following query creates matches and players and associated relationships

CALL apoc.load.json("https://github.com/mneedham/australian-open-neo4j/raw/master/blog_data/finalists.json")
YIELD value
CALL apoc.merge.node(value.winner.labels, value.winner.properties) YIELD node AS winner
CALL apoc.merge.node(value.loser.labels, value.loser.properties) YIELD node AS loser
CALL apoc.merge.node(value.match.labels, value.match.properties) YIELD node AS match
CALL apoc.merge.node(value.t.labels, value.t.properties) YIELD node AS tournament
CALL apoc.merge.relationship(winner, value.winnerRel.label, {}, {}, match, {}) YIELD rel AS winnerRel
CALL apoc.merge.relationship(loser, value.loserRel.label, {}, {}, match, {}) YIELD rel AS loserRel
CALL apoc.merge.relationship(match, value.tournRel.label, {}, {}, tournament, {}) YIELD rel AS tournRel
return count(*);

We can see the imported graph in the Neo4j Browser visualisation below:

Figure 2. Australian Open Womens Finalists

Finding longest paths

Let’s start with a variable length path query that starts with the Tournament in the year 2000 and follows the NEXT_TOURNAMENT relationship as many times as possible by using the * syntax after the relationship type:

MATCH path = (:Tournament {year: 2000})-[:NEXT_TOURNAMENT*]->(next)
RETURN [t in nodes(path) | t.year] AS allTournaments

If we run this query, we’ll see the following output:

Table 1. Results
allTournaments
[2000, 2001]
[2000, 2001, 2002]
[2000, 2001, 2002, 2003]
[2000, 2001, 2002, 2003, 2004]
[2000, 2001, 2002, 2003, 2004, 2005]
[2000, 2001, 2002, 2003, 2004, 2005, 2006]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

As we can see from the results, we’ve returned all the intermediate paths between the tournament in the year 2000 and the one in the year 2019.

But what if we only want to return the longest path? i.e the last one in the above results table.

We can tweak our query to do this by adding a WHERE clause that filters out any paths where the last node in that path has a NEXT_TOURNAMENT relationship. The following query does this:

MATCH path = (:Tournament {year: 2000})-[:NEXT_TOURNAMENT*]->(next)
WHERE not((next)-[:NEXT_TOURNAMENT]->())
RETURN [t in nodes(path) | t.year] AS allTournaments

And if we run that query, we only get one row back, as expected:

Table 2. Results
allTournaments
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

Now let’s see how we can use this technique for a more complex query.

We want to find which players have lost multiple finals in a row. So we need to find all the Match nodes that have round "F" where the Player has a LOSER relationship to that match. And we then want to see if that same player had a LOSER relationship to the final match in tournaments in the following years. The following query is our first attempt to do this:

// Find losing finalists in a tournament and a path of all the tournaments
// after that tournament
MATCH path = (t:Tournament)-[:NEXT_TOURNAMENT*]->(t2:Tournament),
             (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player)

// Check that the player lost the final in every subsequent tournament
WITH nodes(path) AS tournaments, player
WHERE all(t in tournaments[1..]
          WHERE (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player)
)

RETURN player.name, [t IN tournaments | t.year] AS years

If we run this query, we’ll see the following results:

Table 3. Results
player.name	years
"Martina Hingis"	[2000, 2001]
"Martina Hingis"	[2000, 2001, 2002]
"Martina Hingis"	[2001, 2002]

Poor Martina Hingis! But despite telling us all the finals that Martina lost, we’re still returning some rows that we want to exclude. Ideally we only want to return the longest path, which contains [2000,2001,2002].

We need to update our query to filter out any paths where the losing finalist didn’t lose the final the year before

// Find losing finalists in a tournament and a path of all the tournaments
// after that tournament
MATCH path = (t:Tournament)-[:NEXT_TOURNAMENT*]->(t2:Tournament),
             (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player)

// Get the first and last tournaments in the list
WITH nodes(path) AS tournaments, player
WITH tournaments, tournaments[0] AS first, tournaments[-1] AS last, player

// Get the tournament that happened immediately before the first one in the list and
// the tournament that happened immediately after the last one in the list
WITH tournaments, player,
     [(last)-[:NEXT_TOURNAMENT]->(next) | next][0] AS next,
     [(previous)-[:NEXT_TOURNAMENT]->(first) | previous][0] AS previous

// Check that the player lost the final in every subsequent tournament and
// that the player lost the final in the tournament immediately before and
// that the player lost the final in the tournament immediately after
WHERE all(t in tournaments[1..]
          WHERE (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player)
          AND not((next)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player))
          AND not((previous)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[:LOSER]-(player))
)

RETURN player.name, [t IN tournaments | t.year] AS years;

If we run this query, we’ll see the following results:

Table 4. Results
player.name	years
"Martina Hingis"	[2000, 2001, 2002]

Perfect, that’s exactly what we want the results to look like! We can now extend this query to find the players who reached consecutive finals:

// Find finalists in a tournament and a path of all the tournaments
// after that tournament
MATCH path = (t:Tournament)-[:NEXT_TOURNAMENT*]->(t2:Tournament),
             (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<--(player)

// Get the first and last tournaments in the list
WITH nodes(path) AS tournaments, player
WITH tournaments, tournaments[0] AS first, tournaments[-1] AS last, player

// Get the tournament that happened immediately before the first one in the list and
// the tournament that happened immediately after the last one in the list
WITH tournaments, player,
     [(last)-[:NEXT_TOURNAMENT]->(next) | next][0] AS next,
     [(previous)-[:NEXT_TOURNAMENT]->(first) | previous][0] AS previous

// Check that the player reached the final in every subsequent tournament and
// that the player reached the final in the tournament immediately before and
// that the player reached the final in the tournament immediately after
WHERE all(t in tournaments[1..]
          WHERE (t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<--(player)
          AND not((next)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<--(player))
          AND not((previous)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<--(player))
)

RETURN player.name,
       // Create a list containing the year of the final and the relationship type from the
       // player to the final match in that tournament
       apoc.coll.flatten(
         [t IN tournaments | [(t)<-[:IN_TOURNAMENT]-(:Match {round: "F"})<-[type]-(player) |
           [t.year, type(type)]][0]]
       ) AS years
ORDER BY years[0]

If we run that query, we’ll see the following results:

Table 5. Results
player.name	years
"Martina Hingis"	[2000, "LOSER", 2001, "LOSER", 2002, "LOSER"]
"Jennifer Capriati"	[2001, "WINNER", 2002, "WINNER"]
"Maria Sharapova"	[2007, "LOSER", 2008, "WINNER"]
"Serena Williams"	[2009, "WINNER", 2010, "WINNER"]
"Victoria Azarenka"	[2012, "WINNER", 2013, "WINNER"]
"Na Li"	[2013, "LOSER", 2014, "WINNER"]
"Serena Williams"	[2015, "WINNER", 2016, "LOSER", 2017, "WINNER"]

There are a lot more players who reached multiple finals, but noone won more than 2 finals in a row. I was expecting to see more dominance by a single player!

It would be interesting to run a similar query that looked at the finalists of all Grand Slam tournaments and not sure just consecutive Australian Opens. Perhaps that can be the topic of another blog post.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.