Mark Needham

Thoughts on Software Development

Why you shouldn’t use name as a key a.k.a. I am an idiot

with 5 comments

I think one of the first things that I learnt about dealing with users in a data store is that you should never use name as a primary key because there might be two people with the same name.

Despite knowing that, I foolishly chose to ignore it when building my neo4j graph and used name as the key for the Lucene index.

I thought I’d got away with it but NO!

Earlier today I was trying to work out who the most connected person at ThoughtWorks is and the graph was suggesting that ‘Rahul Singh’ was the most connected, having worked with 540 people.

I mentioned this to Jen, who felt something was probably wrong since he’d only started working at ThoughtWorks a couple of years ago.

Amusingly, Jen found an email from 18 months ago sent by Rahul #1 explaining that there were in fact two people with exactly the same name, and that each was getting emails intended for the other.

I now have first-hand knowledge of what can happen if you ignore one of the most basic rules of software development!

My gamble that there probably wouldn’t be two people with the same name in such a small dataset has totally failed and from now on I’ll be sure to use a unique key!
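
To make the failure mode concrete, here is a minimal sketch in Python. It is not the actual neo4j/Lucene code: the employee IDs, the colleague names and the `build_index` helper are all made up for illustration. It only shows how keying on name merges two distinct people into a single entry and inflates the connection count.

```python
from collections import defaultdict

# Hypothetical records: (employee_id, name, colleague).
# The IDs and colleague names are invented for this example.
RECORDS = [
    ("e1", "Rahul Singh", "Colleague A"),
    ("e1", "Rahul Singh", "Colleague B"),
    ("e2", "Rahul Singh", "Colleague C"),
    ("e2", "Rahul Singh", "Colleague D"),
]

def build_index(records, key_field):
    """Group colleagues under the chosen key (0 = employee_id, 1 = name)."""
    index = defaultdict(set)
    for record in records:
        index[record[key_field]].add(record[2])
    return index

by_name = build_index(RECORDS, key_field=1)  # name as key: two people collapse into one
by_id = build_index(RECORDS, key_field=0)    # unique id as key: they stay separate

print(len(by_name["Rahul Singh"]))         # 4
print(len(by_id["e1"]), len(by_id["e2"]))  # 2 2
```

Keyed by name, the two people look like one person with four colleagues; keyed by a unique ID, each keeps their own two.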

Written by Mark Needham

June 24th, 2012 at 10:55 pm

  • http://twitter.com/ikwattro Christophe Willemsen

    Hi,

    Thx for your post. Which workaround did you implement after this?

    I have an application with dogName – kennelName – dateOfBirth. In fact, over the life of a kennel we can have the same dog name with the same kennelName. I was thinking about generating a salt from these three properties, storing it as a fourth property and basing my search on this salt.

    Do you have better solutions?

    Thank you

    Chris

  • http://www.markhneedham.com/blog Mark Needham

    I have a unique ID which is assigned to each person so I’m going to use that. It’s not really human recognisable but it will be unique. Another way could be to use a UUID and just create a random one of those each time.
    IMO the key doesn’t necessarily need to contain any of the data that it represents.

  • http://twitter.com/ikwattro Christophe Willemsen

    In that case, do you have to know this unique id before the search?

  • http://www.markhneedham.com/blog Mark Needham

    yeh that’s true. So you’re talking about a situation where you don’t know the unique id? In that case I suppose you can do a search on a Lucene index or equivalent and find which records match before picking your specific row?

  • http://twitter.com/ikwattro Christophe Willemsen

    Yes I think I didn’t match your use case from the start. Thx
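
The approach discussed in the comments above can be sketched roughly as follows, again in Python rather than against Lucene. This is only an illustration under assumptions: the `add_person` and `find_by_name` helpers and the `office` attribute are hypothetical. Each record gets a random UUID as its key, and name becomes a non-unique secondary index that returns candidates to pick from.

```python
import uuid
from collections import defaultdict

people = {}                     # primary store: unique id -> record
name_index = defaultdict(list)  # secondary index: name -> list of unique ids

def add_person(name, office):
    """Store a person under a freshly generated random UUID (hypothetical helper)."""
    person_id = str(uuid.uuid4())
    people[person_id] = {"id": person_id, "name": name, "office": office}
    name_index[name].append(person_id)
    return person_id

def find_by_name(name):
    """Return every record with this name; there may be more than one."""
    return [people[pid] for pid in name_index.get(name, [])]

# "office" is an invented attribute used only to tell the two records apart.
first = add_person("Rahul Singh", "Office A")
second = add_person("Rahul Singh", "Office B")

matches = find_by_name("Rahul Singh")
chosen = next(m for m in matches if m["office"] == "Office B")
print(len(matches), chosen["id"] == second)
```

The name search narrows things down to the matching records, and the unique ID of the chosen record is what gets used from then on.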