17 Jun 2015

Coding: Explore and retreat

When refactoring code or looking for the best way to integrate a new piece of functionality I generally favour a small steps/incremental approach but recent experiences have led me to believe that this isn’t always the quickest approach.

Sometimes it seems to make more sense to go on little discovery missions in the code, make some bigger steps and then if necessary retreat and revert our changes and apply the lessons learnt on our next discovery mission. This technique which isn’t anything novel but I think is quite effective.

Michael and I were recently looking at the Smart Local Moving algorithm which is used for community detection in large networks and decided to refactor the code to make sure we understood how it worked. When we started the outline of the main class was like this:

public class Network implements Cloneable, Serializable
{
    private static final long serialVersionUID = 1;

    private int numberOfNodes;
    private int[] firstNeighborIndex;
    private int[] neighbor;
    private double[] edgeWeight;
    private double totalEdgeWeightSelfLinks;
    private double[] nodeWeight;
    private int nClusters;
    private int[] cluster;

    private double[] clusterWeight;
    private int[] numberNodesPerCluster;
    private int[][] nodePerCluster;
    private boolean clusteringStatsAvailable;
...
}

My initial approach was to put methods around things to make it a bit easier to understand and then step by step replace each of those fields with nodes and relationships. I spent the first couple of hours doing this and while it was making the code more readable it wasn’t progressing very quickly and I wasn’t much wiser about how the code worked.

Michael and I paired on it for a few hours and he adopted a slightly different but more successful approach where we looked at slightly bigger chunks of code e.g. all the loops that used the firstNeighborIndex field and then created a hypothesis of what that code was doing.

In this case firstNeighborIndex acts as an offset into neighbor and is used to iterate through a node’s relationships. We thought we could probably replace that with something more similar to the Neo4j model where you have classes for nodes and relationships and a node has a method which returns a collection of relationships.

We tried tearing out everywhere that used those two fields and replacing them with our new nodes/relationships code but that didn’t work because we hadn’t realised that edgeWeight and nodeWeight are also tied to the contents of the original fields.

We therefore needed to retreat and try again. This time I put the new approach alongside the existing approach and then slowly replaced existing bits of code.

Along the way I came up with other ideas about how to restructure the code, tried some more bigger leaps to validate my ideas and then moved back into incremental mode again.

In summary I’ve found the combination of incrementally changing code and going on bigger exploratory missions works quite well.

Now I’m trying to work out when each approach is appropriate and I’ll write that up when I learn more! You can see my progress via the github commits.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.