Mark Needham

Thoughts on Software Development

Neo4j: Using aliases to handle messy data

with 4 comments

One of the common problems when building data heavy applications is that names of things in the domain are often named differently depending on which system you get the data from.

This means that we’ll typically end up running the data from different sources through a normalisation process to ensure that we have consistent naming in the database:

Data nromalisation

I’ve recently started linking the football stadium a match was played in to the match in my football graph but unfortunately different match compilers use different spellings or even names for the same stadium.

My first instinct was to write a normalisation layer but instead I decided to store the stadium names as they were but separately create an ‘alias’ relationship back in to the correct spelling.

The model looks like this:

2013 11 26 00 16 42

I add a ‘Stadium’ label to all the stadiums which means that if we do a query to find the number of games at each stadium we’ll get back games at the misspelled versions too:

MATCH (stadium:Stadium)<-[:in_stadium]-(game)
RETURN stadium.name, count(game) AS games
ORDER BY games DESC
==> +-------------------------------+
==> | stadium.name          | games |
==> +-------------------------------+
==> | "Craven Cottage"      | 57    |
==> | "Villa Park"          | 57    |
==> | "Anfield"             | 57    |
==> | "Stamford Bridge"     | 57    |
==> | "Britannia Stadium"   | 57    |
==> | "Emirates Stadium"    | 57    |
==> | "Etihad Stadium"      | 57    |
==> | "Stadium of Light"    | 57    |
==> | "Old Trafford"        | 57    |
==> | "The Hawthorns"       | 57    |
==> | "White Hart Lane"     | 57    |
==> | "Goodison Park"       | 57    |
==> | "DW Stadium"          | 39    |
==> | "Molineux Stadium"    | 38    |
==> | "Liberty Stadium"     | 38    |
==> | "Ewood Park"          | 38    |
==> | "Carrow Road"         | 38    |
==> | "Reebok Stadium"      | 38    |
==> | "Loftus Road Stadium" | 37    |
==> | "St James' Park"      | 34    |
==> | "Upton Park"          | 33    |
==> | "Bloomfield Road"     | 19    |
==> | "Madejski Stadium"    | 19    |
==> | "St. James' Park"     | 19    |
==> | "St Andrews Stadium"  | 19    |
==> | "St. Mary's Stadium"  | 19    |
==> | "The DW Stadium"      | 18    |
==> | "Boleyn Ground"       | 5     |
==> | "St. James Park"      | 4     |
==> | "Loftus Road"         | 1     |
==> +-------------------------------+
==> 30 rows

We only want to show the proper spellings of stadiums which we can do with the following query:

MATCH (stadium:Stadium)<-[:alias_of*0..1]-()<-[:in_stadium]-(game)
WHERE NOT(stadium-[:alias_of]->())
return stadium.name, count(game) AS games
ORDER BY games DESC

Here we get all the stadiums, then get any incoming aliases and find the games played in those stadiums. We then filter out any stadium which has an outgoing ‘alias_of’ relationship because that indicates that the node isn’t the original node for the stadium.

If we run the query we’ll get the expected result:

==> +-------------------------------+
==> | stadium.name          | games |
==> +-------------------------------+
==> | "St James' Park"      | 57    |
==> | "Craven Cottage"      | 57    |
==> | "Villa Park"          | 57    |
==> | "Anfield"             | 57    |
==> | "Stamford Bridge"     | 57    |
==> | "Britannia Stadium"   | 57    |
==> | "Emirates Stadium"    | 57    |
==> | "Etihad Stadium"      | 57    |
==> | "Stadium of Light"    | 57    |
==> | "Old Trafford"        | 57    |
==> | "The Hawthorns"       | 57    |
==> | "White Hart Lane"     | 57    |
==> | "The DW Stadium"      | 57    |
==> | "Goodison Park"       | 57    |
==> | "Loftus Road Stadium" | 38    |
==> | "Molineux Stadium"    | 38    |
==> | "Upton Park"          | 38    |
==> | "Liberty Stadium"     | 38    |
==> | "Ewood Park"          | 38    |
==> | "Carrow Road"         | 38    |
==> | "Reebok Stadium"      | 38    |
==> | "Bloomfield Road"     | 19    |
==> | "Madejski Stadium"    | 19    |
==> | "St Andrews Stadium"  | 19    |
==> | "St. Mary's Stadium"  | 19    |
==> +-------------------------------+
==> 25 rows

The nice thing about this approach is that I only need to collect aliases in one place and I can just import the data as is from the source.

On the other hand it does add some complexity to queries as you need to take aliases into account each time.

I’d love to hear your thoughts on what you think of this approach.

Written by Mark Needham

November 26th, 2013 at 12:12 am

Posted in neo4j

Tagged with

  • Luanne Misquitta

    To me, if you need to maintain the aliases then it seems like a good solution. Will you put stadium details/properties on the original node then? i.e no properties on the alias nodes?

  • http://www.markhneedham.com/blog Mark Needham

    @luannemisquitta:disqus what do you mean by maintain the aliases?

    My thinking was that in some environments different systems would actually legitimately have different names for a product (the example is a bit contrived!). Some of the time you care about doing queries with the aliases but at some stage you’d want to do a query where you don’t care about the aliases anymore so you want to aggregate everything up.

    I was thinking that details/properties would only go on the original node yeh.

  • Luanne Misquitta

    Cool. Was trying to think of cases when I’d want the aliases maintained in the graph and not reconciled outside it but yes, I see your point

  • http://www.markhneedham.com/blog Mark Needham

    @luannemisquitta:disqus an example could be a supply chain where different parts of the chain use different names for the same product.

    e.g. a product may be called ’50g chips’ in the Lays part of the supply chain but then when that same product is in the supermarket part of the chain it might be known as ‘Lays 50g chips’ yet both those names refer to the same thing.

    So I think for that example it would make sense to maintain the two different aliases…