1 Dec 2017

Python: Learning about defaultdict's handling of missing keys

While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__

>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])

>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])

What seems to happen is that when we try to find a key that doesn’t exist in the dictionary an entry gets created with a value equal to the number of items in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

>>> vocabulary["baz"] = "Mark
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3

Now let’s see what the dictionary contains:

>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])

All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:

"""
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value
"""
pass

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.

All in all, very neat!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.