Archive for the ‘Software Development’ tag
The 5 whys: Another attempt
Towards the end of the week before last and the beginning of last week we’d been having quite a few problems with our QA environment to the point where we were unable to deploy anything to it for 3 days.
A few weeks ago I wrote about a 5 whys exercise that we did in a retrospective, so in our weekly code review we decided to give it a go on these problems and see what we could learn.
We started with the question ‘Why was there a mess?’ and then branched out the first level whys since it was fairly clear that more than one thing had contributed to our problems.
We ended up with 5 answers to the first why:
- There was a DNS change
- Volume was deleted from our QA server
- System tests failing
- Change in one project hanging QA deployment
- Main build broken for a while
We then worked across the whiteboard taking each of these in turn.
I think our approach allowed us to avoid part of ‘the cult of the root cause’ which Don Reinertsen wrote about.
It still wasn’t quite spot on due to some mistakes I made while facilitating but these were my observations:
- Once we got to answering the whys for the 4th and 5th first level whys the whiteboard was way too cluttered and it had become quite difficult to see exactly where we’d got up to.
As a result we lost the discipline around answering the question why and drifted off into general discussion around the original question but stopped drilling down further looking for a potential root cause.
The next time I think it would probably work better to look for the first why and collect any potential other whys on the same level in a ‘parking lot’ type area which we could then go to later on.
- Having said that, a neat thing about having the whys alongside each other was that we were able to see that the first two whys were linked to each other.
Both changes had been done by someone in the operations team based on conversations they had with people on our team.
We realised that our communication with the operations team hadn’t been entirely clear and had left room for doubt which had led to unexpected changes being made to the servers.
This was an example of us stopping before we’d drilled down to 5 levels having realised that we could influence the situation positively even if we hadn’t found the root cause of the problem.
- Drilling down into the ‘System tests failing’ led to the most interesting insights:
System tests failing
- No one cares about them
- We can push to QA even if they’re broken
- Used to them failing
- Perception amongst devs that they’re flaky
- There had previously been a time when data changed frequently and broke them.
- Perception amongst devs that they’re flaky
- Seen as being owned by the QAs
- The tests were defined by QAs
- The time from checkin to system tests failing is quite long
Looking back at this now we probably should have drilled a bit further down on some of the whys.
We actually ended up discussing the perception amongst the developers that the tests were flaky and it was pointed out that most of the failures were actually real.
We don’t currently have a ‘stop the line’ mentality if the system tests fail but have agreed to adopt that approach for the next iteration and check at the end of this week to see if we’ve improved.
- Even though I didn’t facilitate the exercise perfectly I think there was still a far greater level of analysis done by the team in this exercise than in others that I’ve seen.
I’ve noticed that a lot of retrospective type exercises tend to only encourage surface level analysis so we never really go deeper into a subject and see if we can actually make some useful changes to the way that we work.
Working with external identifiers
As part of the ingestion process for our application we import XML documents and corresponding PDFs into a database and onto the file system respectively.
Since the user needs to be able to search for documents by the userFacingId we reference it by that identifier in the database and the web application.
Each document also has an external identifier and we use this to identify the PDFs on the file system.
We can’t use the raw userFacingId to do this because there are some documents which have the same ID when we import them.
Most of the time we only need to care about the userFacingId in the web application but when the user wants to download a PDF we need to map from the userFacingId to the externalId so we can locate the file on the file system.
The first implementation of this involved some mapping code in the web application which constructed an externalId from a given userFacingId.
Unfortunately this logic drifted into a few different places and it started to become really difficult to tell whether we were dealing with a userFacingId or an externalId.
We wanted to try and isolate the translation logic into one place on the edge of the system but Pat pointed out that it would actually be simpler if we never had to care about the externalId in our code.
We changed the ingestion process to add the externalId to each document so that we’d be able to get hold of it when we needed to.
We had to change the design of the code so that whenever the user wants to download a PDF (for example) we make a call to the database by the userFacingId to look up the externalId.
The disadvantage of the approach is that we’re making an extra (internal) network call to look up the ID but it’s the type of code that should be easily cacheable if it becomes a performance problem so it should be fine.
I think this approach is much better than having potentially flawed translation logic in the application.
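A minimal sketch of that lookup, in Python purely for illustration. The table layout, the identifiers, the PDF directory and the function names are all assumptions rather than anything from our code base, and the cache is only there to show how the extra call could be avoided if it ever became a problem:

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema: each ingested document row stores both identifiers.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE documents (user_facing_id TEXT NOT NULL, external_id TEXT NOT NULL)"
)
connection.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("DOC-1", "ext-0001"), ("DOC-2", "ext-0002")],
)


@lru_cache(maxsize=10_000)
def external_id_for(user_facing_id):
    """Return the externalId stored at ingestion time, or None if unknown.

    Caching is safe here only because the data is assumed not to change
    while the application is running.
    """
    row = connection.execute(
        "SELECT external_id FROM documents WHERE user_facing_id = ?",
        (user_facing_id,),
    ).fetchone()
    return row[0] if row else None


def pdf_path_for(user_facing_id, pdf_root="/data/pdfs"):
    """Build the (hypothetical) PDF location from the looked-up externalId."""
    external_id = external_id_for(user_facing_id)
    if external_id is None:
        return None
    return f"{pdf_root}/{external_id}.pdf"
```

The point of the design is that no code outside `external_id_for` ever tries to derive one identifier from the other.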
Canonical Identifiers
Duncan and I had an interesting problem recently where we had to make it possible to search within an ‘item’ to find possible sub items that exist inside it.
The URI for the item was something like this:
/items/234
Let’s say Item 234 contains the following sub items:
- Mark
- duncan
We have a search box on the page which allows us to type in the name of a sub item and go to the sub item’s page if it exists or see an error message if it doesn’t.
If the user types in the sub item name exactly right then there’s no problem:
items/234?subItem=Mark
redirects to:
items/234/subItem/Mark
It becomes more interesting if the user gets the case of the sub item wrong e.g. they type ‘mark’ instead of ‘Mark’.
It’s not very friendly in terms of user experience to give the user an error message if they do that so I suggested that we just make the lookup of the sub item case insensitive.
items/234/subItem/mark
would therefore find us the ‘Mark’ sub item.
Duncan pointed out that we’d now have more than 1 URI for the same document which isn’t particularly great since theoretically there should be a one to one mapping between a URI and a given document.
He pointed out that we could do a look up to find the ‘canonical identifier’ before we did the redirect such that if you typed in ‘mark’:
items/234?subItem=mark
would redirect to:
items/234/subItem/Mark
The logic for checking the existence of a sub item would be the bit that’s case insensitive and makes it more user friendly.
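As a rough sketch of that idea in Python (the function names are made up for illustration, and a real implementation would look the sub items up in a data store rather than take them as a list):

```python
from urllib.parse import quote


def canonical_sub_item(sub_items, requested_name):
    """Return the stored (canonical) name matching requested_name, ignoring case."""
    wanted = requested_name.casefold()
    for name in sub_items:
        if name.casefold() == wanted:
            return name
    return None


def redirect_target(item_id, sub_items, requested_name):
    """Return the single canonical URI to redirect to, or None if nothing matches."""
    canonical = canonical_sub_item(sub_items, requested_name)
    if canonical is None:
        return None
    return f"/items/{item_id}/subItem/{quote(canonical)}"


print(redirect_target(234, ["Mark", "duncan"], "mark"))   # /items/234/subItem/Mark
print(redirect_target(234, ["Mark", "duncan"], "Nobody"))  # None
```

Only the existence check is case insensitive; the URI that comes back always uses the stored spelling, so each sub item still has exactly one address.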
The ‘window fixing’ wall
On my current project we have a wall where we keep track of ‘window fixing’ tasks – things that people want to fix in the code base but chose to defer until a later date.
Every now and then we take what’s on the wall and prioritise it according to Fabio Pereira’s effort/pain matrix so that we know which clean up tasks will provide the greatest value to the team.
While I think it’s a nice way of getting a team understanding of technical debt I think it can lead to a couple of problems which come with most attempts at group responsibility for something.
By writing the task up on the wall we’ve effectively pushed the responsibility for keeping the code clean away from us and onto the ‘team’.
It also seems to make it more acceptable to make a mess in the code because we’ve acknowledged that we’ve done that and either we or a team mate will fix it later.
In a way I suppose it’s good that people are at least conscious that they’re taking short cuts at times and we have a reasonable log of where those short cuts have been taken.
On the other hand, from my experience, when people are really motivated to fix a piece of code then they’ll find the time/way to do that whether or not it’s written up on the wall.
I think this is also a good thing even though the refactoring won’t have been prioritised by the rest of the team.
Sometimes it’s easier to go and fix something when you know what needs doing rather than deferring it and having to explain the problem to someone else later.
In summary I think the centralised wall is a good idea but not a complete replacement for people being diligent and taking care of the code base themselves.
gawk: Getting story numbers from git commit messages
As I mentioned in my previous post I’ve been writing a little application to create graphs based on our git repository history and in one of them we wanted to try and create a graph showing which people had been working on which stories.
I needed a way to extract a story number from the git commit message and then store them all in a text file.
A typical commit with a story number in might look like this:
Mark/Uday #689 some awesome scala refactoring
I couldn’t think of an easy way to do this with my current knowledge of sed or the Mac version of awk but the match function of gawk (GNU awk) makes this really easy.
match(string, regexp [, array])
Search string for the longest, leftmost substring matched by the regular expression, regexp and return the character position, or index, at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.
…
If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp.
The array argument is what I needed and it’s only available as a gawk extension according to the documentation.
I ended up with the following command to strip the story numbers:
git log --no-merges --pretty="format:%s" |
gawk '{ match($0, /#([0-9]+)/, arr); if(arr[1] != "") print arr[1] }'
I had to install gawk using MacPorts on my Mac but on Fedora the default installation of awk is gawk.
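For comparison, roughly the same extraction can be sketched in Python, for machines where gawk isn't available. The function names are illustrative; the git flags are the same ones used in the command above:

```python
import re
import subprocess

STORY_NUMBER = re.compile(r"#([0-9]+)")


def commit_subjects(repository_path="."):
    """Return the subject line of every non-merge commit in the repository."""
    result = subprocess.run(
        ["git", "log", "--no-merges", "--pretty=format:%s"],
        cwd=repository_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def story_numbers(subjects):
    """Return the first story number in each subject, skipping subjects without one."""
    numbers = []
    for subject in subjects:
        found = STORY_NUMBER.search(subject)
        if found:
            numbers.append(found.group(1))
    return numbers


print(story_numbers(["Mark/Uday #689 some awesome scala refactoring", "no story here"]))
# ['689']
```

Like the gawk version, this only keeps the first story number in a commit message.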
Learning Regular Expressions: Non capturing match
I’ve been working my way slowly through the O’Reilly ‘Mastering Regular Expressions’ book and recently read about the non capturing match operator which came in useful for some Git log parsing I’ve been doing.
On the project I’m working on we all commit as the same user and then put our names at the beginning of the commit message.
We wanted to try and find out the statistics of who’d been pairing with each other and therefore needed to extract the pairs from commits.
Unfortunately everyone writes their names in a slightly different way so the regular expression which I used to parse each commit needed to try and handle that.
For example these are some of the ways that commit messages start:
Uday/Charles #67 did some stuff
mark,suzuki more stuff
pat, tom: very important stuff
Uday:Marc #87 stuff
The separator between the names is different in each case but in the majority of cases can be satisfied by the following regular expression:
([\/,][ ]?|:)
It’s either:
- A forward slash or comma followed by an optional space
- A colon
Since I want to express the fact that the separator can be one thing or the other I need to group those two things together in parentheses.
Unfortunately that means that the separator will be included in the array of captures that we have when parsing the commit.
I only wanted to have the names of the two people included in that array.
The non capturing match operator ‘(?:’ allows us to match against the expected separator without actually capturing it:
(?:[\/,][ ]?|:)
That regular expression is part of a much larger/probably over complicated one which also helps to capture the names of the people pairing:
var pairRegex = /^\[?([\w-]+)[ ]?[^\/, ]*(?:[\/,][ ]?|:)([\w-]+)\]?[^\/]*[\s:]/
Using the regex with the non capturing match gives us:
"charles/mark: adios to play, hello scalatra".match(pairRegex)
["charles/mark: adios to play, hello ", "charles", "mark"]
Whereas if we used a normal capture we’d also capture the ‘/’:
"charles/mark: adios to play, hello scalatra".match(pairRegex)
["charles/mark: adios to play, hello ", "charles", "/", "mark"]
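The same difference can be seen in isolation with a cut-down pattern. This sketch uses Python's `re` module, which supports the same `(?:` syntax, and deliberately leaves out the extra handling for brackets and trailing text that the full expression has; the names are illustrative only:

```python
import re

NAME = r"([\w-]+)"
CAPTURING_SEPARATOR = r"([/,][ ]?|:)"
NON_CAPTURING_SEPARATOR = r"(?:[/,][ ]?|:)"


def split_pair(message, separator):
    """Return the captured groups for 'name<separator>name' at the start of message."""
    found = re.match(NAME + separator + NAME, message)
    return found.groups() if found else None


print(split_pair("charles/mark: adios to play", NON_CAPTURING_SEPARATOR))
# ('charles', 'mark')
print(split_pair("charles/mark: adios to play", CAPTURING_SEPARATOR))
# ('charles', '/', 'mark')
```

In both cases the separator has to match for the pattern to succeed; the only thing that changes is whether it takes up a slot in the captures.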
Parsing XML from the unix terminal/shell
I spent a bit of time today trying to put together a quick script which would allow me to grab story numbers from the commits in our Git repository and then work out which functional areas those stories were in by querying Mingle.
To do that I wanted to make a curl request to Mingle, pipe the result somewhere and run an XPath expression to get my element.
I didn’t want to have to write code in another script file and then reference that file from the shell and in my search to achieve that I came across XMLStarlet on stackoverflow.
It’s installable via MacPorts:
sudo port install xmlstarlet
And I was then able to pipe the results of my Mingle request and locate the following bit of XML:
<property type_description="Managed text list" hidden="false">
  <name>Functional Area</name>
  <value>Our Functional Area</value>
</property>
curl -s http://user:password@mingleurl:8888/api/v2/projects/project_name/cards/1.xml | xmlstarlet sel -t -v "//property/name[. = 'Functional Area']/../value"
There’s much more you can do with the command which is listed on the documentation page.
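If the result needs to be used from a script rather than the shell, the same lookup can be sketched with Python's standard library. The `<card>` and `<properties>` wrapper elements here are an assumption made so that the example is a complete document; only the `<property>` fragment comes from the real response:

```python
import xml.etree.ElementTree as ET

# Assumed surrounding structure; only the <property> element is taken from the response above.
CARD_XML = """
<card>
  <properties>
    <property type_description="Managed text list" hidden="false">
      <name>Functional Area</name>
      <value>Our Functional Area</value>
    </property>
  </properties>
</card>
"""


def property_value(card_xml, property_name):
    """Return the <value> text of the <property> whose <name> matches, or None."""
    root = ET.fromstring(card_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name") == property_name:
            return prop.findtext("value")
    return None


print(property_value(CARD_XML, "Functional Area"))  # Our Functional Area
```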
The read-only database
The last couple of applications I’ve worked on have had almost completely read only databases where we had to populate the database in an offline process and then provide various ways for users to access the data.
This creates an interesting situation with respect to how we should set up our development environment.
Our normal setup would probably have an individual version of that database on every development machine and we would populate and then truncate the database during various test scenarios.
This actually means that our tests are interacting with the database in a different way than we would see during the running of the application.
It also means that we have more infrastructure to take care of and more software updates to do although using tools like Chef or Puppet can reduce the pain this causes once the initial setup of those scripts has been done.
On the project I worked on last year we started off with the individual database approach but eventually moved to having a shared database used by all the developers.
We only made the move once we had the real production data and the script which would populate that data into our database ready.
The disadvantage of having this shared database is that our tests become more indirect.
We wrote our tests against data which we knew would be in our production data set which meant if anything failed you had a bit more investigation to do since the data setup was done elsewhere.
On the other hand no one had to worry about getting it set up on their machines, which had proved tricky to totally automate.
We have a similar situation on the application I’m currently working on and have noticed that we run into problems that don’t usually exist as a result of adding data to the database on each test.
For example in one test the database takes a bit of time to sort out its indexes which means that some tests intermittently fail.
We found a bit of a hacky way around this by forcing the database to reindex in the test and waiting until it has done so but we’ve now solved a problem which doesn’t actually exist in production.
This approach wouldn’t work as well if we had a read/write database since we’d end up with tests failing since another developer machine had mutated the data it relied on.
With a read only database it seems to be ok though.
Pain Driven Development
My colleague Pat Fornasier has been using an interesting spin on the idea of making decisions at the last responsible moment by encouraging our team to ‘feel the pain’ before introducing any constraint in our application.
These are some of the decisions which we’ve been delaying/are still delaying:
Dependency Injection
Everyone in our team comes from a Java/C# background and one of the first technical decisions that gets made on applications in those languages is which dependency injection container to use.
We decided to just create a trait where we wired up the dependencies ourselves and then inject that trait into the entry point of our application. Effectively it acts as the ApplicationContext that a framework like Spring would provide.
I was fairly sure that we’d need to introduce a container fairly quickly but it’s been 10 weeks and we still haven’t felt the need to do that and our application is simpler as a result.
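The shape of that hand-wired context looks something like the following. This is a sketch in Python rather than the Scala trait we actually use, and every class name in it is invented for illustration:

```python
class DocumentRepository:
    """Stand-in for whatever talks to the database."""

    def __init__(self, documents):
        self._documents = documents

    def find(self, document_id):
        return self._documents.get(document_id)


class DocumentService:
    """Depends on a repository, which is handed in rather than looked up."""

    def __init__(self, repository):
        self._repository = repository

    def title_of(self, document_id):
        document = self._repository.find(document_id)
        return document["title"] if document else None


class ApplicationContext:
    """Wires every dependency by hand, in one place, with no container."""

    def __init__(self, documents):
        self.repository = DocumentRepository(documents)
        self.service = DocumentService(self.repository)


def main(context, document_id):
    """Entry point: receives the fully wired context."""
    return context.service.title_of(document_id)


context = ApplicationContext({"1": {"title": "Hello"}})
print(main(context, "1"))  # Hello
```

Tests can build their own context with stub collaborators, which covers most of what we would otherwise want a container for.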
Data Ingestion
As I mentioned in an earlier post we have to import around 5 million documents into our database by the time the application goes live.
Our initial attempt at writing this code was single threaded and it was clear that there were many places where performance optimisations could be made.
Since we were only ingesting a few thousand documents at that stage it still ran pretty quickly so Pat encouraged us to wait until we felt the pain before making any changes.
That duly happened once the number of documents increased and it started taking 3/4 hours to run the job in our QA environment.
We then spent a couple of days working out how to make it possible to process the documents more quickly.
Complex markup in documents
As I mentioned a couple of months ago the application we’re working on is mainly about taking data from a database and applying some transformations on it before showing it to the user.
We decided to incrementally add different types of documents into the database.
This meant that initially all our transformations involved just getting a text representation of XML nodes even though we knew that eventually we’d need to do more processing on the data depending on which tags appeared.
These data transformations actually turned out to be more complicated than we’d imagined so we might have delayed the pain here a little bit too long.
On the other hand we were able to show early progress to our business stakeholders which probably wouldn’t have been the case if we’d tried to take on the complex markup all at once.
N.B.
One thing to note with this approach is that we need to make sure there is a feedback mechanism to recognise when we are feeling pain otherwise we’ll end up going beyond the last responsible moment more frequently.
There will probably also be more complaints about things not being done ‘properly’ since we’re waiting for longer until we actually do that.
We have a code review that the whole team attends for an hour each week which acts as the feedback mechanism and we recently started using Fabio’s effort/pain wall to work out which things were causing us most pain.
Performance tuning our data import: Gather precise data
One of the interesting problems that we have to solve on my current project is working out how to import a few million XML documents into our database in a reasonable amount of time.
The stages of the import process are as follows:
- Extract a bunch of ZIP files to the disc
- Processing only the XML documents…
- Load the XML document and determine whether the document is valid to import
- Add some meta data to the document for database indexing
- Import the document into the database
We’ve been working on this quite a bit recently and one of the main things we’ve learnt is the value of gathering detailed information about what’s actually happening in the code.
When we started we only gathered the end to end time for the whole job to run against a certain number of documents.
The problem with doing this was that we couldn’t see where the constraint in the process was, so we went ahead and parallelised the process using Akka, which gave some improvement but not as much as expected.
Having realised that we didn’t really know where the bottleneck was we added in much more logging to our code to try and identify where the most time was being taken.
For each document there are effectively 3 main things that we’re doing:
- Loading the XML file
- Applying the XPath expressions against the file
- Importing the document into the database
We ran our import process a few times and recorded how long was being taken on each stage.
It was then much easier to see where we needed to focus our attention if we wanted to see big improvements.
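A minimal sketch of that kind of per-stage measurement, in Python for illustration. The `StageTimer` class and the `load`, `extract` and `store` callables are placeholders for the three stages listed above, not our actual import code:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates elapsed wall-clock time and call counts per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, stage):
        started = time.perf_counter()
        try:
            yield
        finally:
            self.totals[stage] += time.perf_counter() - started
            self.counts[stage] += 1

    def report(self):
        """Return (stage, total seconds, calls) tuples, slowest stage first."""
        rows = [(stage, self.totals[stage], self.counts[stage]) for stage in self.totals]
        return sorted(rows, key=lambda row: row[1], reverse=True)


def import_documents(paths, load, extract, store, timer):
    """Run the three stages for each document, timing each one separately."""
    for path in paths:
        with timer.measure("load"):
            document = load(path)
        with timer.measure("xpath"):
            metadata = extract(document)
        with timer.measure("store"):
            store(document, metadata)
```

Printing `timer.report()` after a run gives the stages in order of total time, which is the view that made the slow XML loading stand out for us.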
We gathered this data for our local environment and QA environment and noticed that there was a big difference on the loading of the XML files – it was 6 or 7 times quicker on the QA environment.
By chance I ended up running the import on a laptop on the train and noticed that it aborted because it couldn’t access an external DTD which was referenced in each XML file.
The QA machine is sitting inside a data centre with a high speed connection which means that the downloading of the DTD files is significantly faster than we can achieve locally.
We realised that we could solve this problem by forcing the parser to load the DTDs locally and immediately saw a huge decrease in the overall time.
Without collecting the data and seeing so clearly where the constraint was it would have taken us much longer to realise where we needed to make improvements.
We still have many more improvements to make but measuring the performance instead of speculating seems to be the way to go.