24 Sep 2009

Book Club: Versioning your database (K. Scott Allen)

In our latest technical book club we discussed a series of posts written by K. Scott Allen about getting your database under version control.

These are some of my thoughts and our discussion:

We had an interesting discussion around when it’s ok to go and change checked in change scripts - on previous projects I’ve worked on we’ve actually had the rule that once you’ve checked in a change script to source control then you can no longer change it but instead need to add another change script that does what you want. Several colleagues pointed out that this approach can lead to us having quite a lot of fine grained change scripts - each maybe adding or altering one column - and that an approach they had used with some success was to allow changes to be made up until a change script had been applied in an environment outside of the team’s control, typically UAT.
On a previous project I worked on where we didn’t script the database, we made use of Hibernate annotations to specify the mappings between domain objects and the database so the SQL code for database creation was being auto generated by a Hibernate tool. This doesn’t seem to fit in with the idea of allowing us to incrementally change our database but Tom pointed out that it might be quite a useful way to get an initial baseline for a database. After that we probably want to make use of change scripts.
We discussed the coupling between the model of a system that we have in a database and the one that exists in our code when using ORM tools and how we often end up with these two different models being quite tightly coupled meaning that changing our domain model can become quite difficult. Tom pointed out that the database is just another data source but that it often seems to be treated differently to other data sources in this respect. He also suggested the idea of creating an anti corruption layer in our code between the database and our domain model if they start to become too different rather than trying to keep them as similar as possible by some imaginative use of ORM mappings.
Another interesting are of discussion was around how to deal with test data and existing data in the system with respect to our change scripts. On projects I’ve worked on if we had reference data that was required across all environments if we wanted to make changes to this data then we would just make use of another change script to do that. Ahrum suggested that if we had environment specific test data then we’d probably have other environment specific scripts that we could run after the change scripts had been executed. One of the more interesting problems when making changes to tables which already have data in is working out what we want to do with the existing data if we change the type of a column. We can typically either create a temporary column and copy all the values there first before creating an update script that converts each of the values to the data type of the new column or we can just forget about the data which is a less acceptable options from what I’ve seen. The effort involved in doing this often seems to mean that we are more reluctant to make changes to column types after their initial creation which means that we need to choose the type quite accurately early on.
A couple of colleagues pointed out that one benefit they had found with taking this approach to working with databases actually helped to expose any problems in the process - there can’t be any under the table changes made to databases if they are being manipulated by change scripts otherwise we end up with different setups in different environments. Since we will be recreating and updating databases quite frequently these problems will become obvious quite quickly.
Dean pointed out that the change script approach is actually really useful for performance testing - a benefit of this approach that I hadn’t considered previously considered. When doing this type of testing we know which version of the database we were using at the time and if there suddenly becomes a performance problem then it should be much easier to track down which change resulted in the problem.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.