Mark Needham

Thoughts on Software Development

Archive for the ‘collections’ tag

Python: Learning about defaultdict’s handling of missing keys


While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__
 
>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])
 
>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])

What seems to happen is that when we look up a key that doesn’t exist in the dictionary, an entry gets created whose value is the number of items that were already in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

>>> vocabulary["baz"] = "Mark"
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3

Now let’s see what the dictionary contains:

>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])

That all makes sense so far. If we look at the source code, the pseudo-code in the docstring for __missing__ confirms that this is exactly what’s going on:

"""
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value
"""
pass
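To make that pseudo-code concrete, here is a minimal sketch of the same behaviour written as a plain dict subclass. The class name CountingDict is made up for illustration and is not part of the standard library or scikit-learn; it only relies on the fact that dict's item lookup calls __missing__ when a key is absent.

```python
class CountingDict(dict):
    """Hypothetical stand-in for defaultdict with default_factory set to len."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        # The length is read before the new entry is stored.
        value = len(self)
        self[key] = value
        return value


counting = CountingDict()
first = counting["foo"]    # missing key, so __missing__ runs
second = counting["bar"]   # missing key again, one entry already present

# Membership tests and .get() do not go through __getitem__,
# so they never create new entries.
has_baz = "baz" in counting
maybe_baz = counting.get("baz")
```

Only square-bracket lookups create entries, which is why checking for a key with `in` or `.get()` leaves the dictionary unchanged.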

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.
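As a rough illustration of that use case, the sketch below assigns a column index to each token the first time it is seen. The function build_vocabulary and the sample documents are invented for this example; this is not scikit-learn's actual implementation, just the same trick applied to a toy input.

```python
from collections import defaultdict


def build_vocabulary(documents):
    """Hypothetical helper: map each token to a column index in first-seen order."""
    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__

    rows = []
    for tokens in documents:
        # Looking up an unseen token assigns it the next free column index.
        rows.append([vocabulary[token] for token in tokens])

    # Return a plain dict so later lookups cannot add columns by accident.
    return dict(vocabulary), rows


documents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocabulary, rows = build_vocabulary(documents)
```

Converting to a plain dict at the end is a design choice for this sketch: once the vocabulary is built, a lookup for an unknown token raises KeyError instead of silently growing the mapping.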

All in all, very neat!

Written by Mark Needham

December 1st, 2017 at 3:26 pm

Posted in Python


Wrapping collections: Inheritance vs Composition


I wrote previously about the differences between wrapping collections and just creating extension methods to make our use of collections in the code base more descriptive. In code I’ve been reading recently, though, I’ve noticed that there appear to be two ways of wrapping a collection: using composition, as I described previously, or extending the collection by using inheritance.

I was discussing this with Lu Ning recently and he pointed out that if what we have is actually a collection then it might make more sense to extend the collection with a custom class whereas if the collection is just an implementation detail of some other domain concept then it would be better to use composition.

In the latter case we probably don’t want to expose all of the methods available on the collection since we don’t want to make it possible for clients of the object to perform all of these operations.
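The composition option might look something like the sketch below. It is written in Python rather than C# purely for brevity, and the Customers type and its methods are hypothetical names chosen for illustration, not code from our code base.

```python
class Customers:
    """Hypothetical domain type that keeps its collection private (composition)."""

    def __init__(self, names):
        # Copy into a tuple so later changes to the caller's list don't leak in.
        self._names = tuple(names)

    def __iter__(self):
        return iter(self._names)

    def __len__(self):
        return len(self._names)

    def starting_with(self, prefix):
        # Querying returns a new Customers rather than exposing the inner tuple.
        return Customers(name for name in self._names if name.startswith(prefix))


customers = Customers(["Alice", "Bob", "Anna"])
a_names = list(customers.starting_with("A"))
```

Clients can iterate, count and query, but there is no remove or append to call because the wrapper never delegates to them.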

We have both of these approaches in our code base. Where we have used inheritance, we are extending a ‘SelectList’ from the .NET API to be a ‘PleaseSelectList’, which adds the value ‘please select’ to the top of the drop down lists that we create on our UI, instructing the user to select an option.
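A loose Python analogue of the inheritance option is sketched here. It extends the built-in list rather than the .NET ‘SelectList’, and the constructor shown is an assumption made for illustration, not the real class from our code base.

```python
class PleaseSelectList(list):
    """Hypothetical analogue: a list of options that always starts with a prompt."""

    PROMPT = "please select"

    def __init__(self, options=()):
        # Put the prompt first, followed by the real options in order.
        super().__init__([self.PROMPT, *options])


colours = PleaseSelectList(["red", "green"])
```

Because the class is a list, every list method (including remove) comes along for free, which is both the convenience and the risk of extending the collection.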

In this case I think we really do have a ‘SelectList’, so I’m not too bothered that we’re using inheritance instead of composition. In general, though, my preference is to use composition because I’ve found that most of the time we don’t actually want to expose all the methods available on a collection API when we pass this data around our code.

For example a fairly common scenario might be that we load up a collection of values from a persistence mechanism and then maybe do some querying on them before displaying something to the user.

I’ve noticed that we rarely want clients of this code to be able to remove an item from this collection, but removal is one of the operations that is typically exposed if we pass around a ‘List’, which is often what we do.

Even if we don’t pass around a ‘List’, a client can still convert an ‘IEnumerable’ value into a ‘List’ and then do whatever they want to it, unless the collection has been defined as ‘read only’, in which case we now have an API which is potentially misleading.

I’m moving towards the opinion that it only makes sense to create a new type if doing so lets us express the intent of our code more easily, and a lot of the time extending a collection just so that we have our own named type doesn’t seem to add much. In addition, it can be quite painful later on if we decide we want to change the way we represent that data.

I think inheritance is often the quickest choice in the short term, since we can make use of all the methods that already exist on the collection being extended instead of having to write code to delegate to those methods, as we must if we use composition.

There does seem to be a fine line between using inheritance for good reasons and using it just because it saves us from spending a lot of time thinking about the best solution to the problem we’re trying to solve.

Perhaps the choice comes down to analysing the trade offs between composition and inheritance, as Phil Haack points out in a post he wrote a couple of years ago.

Written by Mark Needham

July 24th, 2009 at 1:07 am

Posted in Coding
