Archive for February, 2012
Haskell: Creating a sliding window over a collection
A couple of years ago when I was playing around with F# I came across the Seq.windowed function which allows you to create a sliding window of a specific size over a collection.
Taking an example from the F# documentation page:
let seqNumbers = [ 1.0; 1.5; 2.0; 1.5; 1.0; 1.5 ] :> seq<float> let seqWindows = Seq.windowed 3 seqNumbers
We end up with this:
Initial sequence: 1.0 1.5 2.0 1.5 1.0 1.5 Windows of length 3: [|1.0; 1.5; 2.0|] [|1.5; 2.0; 1.5|] [|2.0; 1.5; 1.0|] [|1.5; 1.0; 1.5|]
Problem 8 of Project Euler is defined like so:
Find the greatest product of five consecutive digits in the 1000-digit number.
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
In order to do that I need to convert that number into arrays of size 5 which first involves converting the number into an array of the individual digits:
charToArray :: [Char] -> [Int] charToArray [] = [] charToArray (x:xs) = (read [x]::Int) : (charToArray xs) number = "7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450"
As far as I know there isn’t a windowed function in Haskell so I wrote the following one to get the arrays of 5 consecutive elements:
windowed :: Int -> [a] -> [[a]] windowed size ls = (case ls of [] -> [] x:xs -> if length ls >= size then (take size ls) : windowed size xs else windowed size xs)
We can then use it like this:
*Main> maximum (map (\ls -> foldl (*) 1 ls) (windowed 5 (charToArray number))) 40824
Haskell: Getting the nth element in a list
I started trying to solve some of the Project Euler problems as a way to learn a bit of Haskell and problem 7 is defined like so:
By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10 001st prime number?
I read that the Sieve of Eratosthenes is a useful algorithm for working out all the prime numbers and there’s a page on the Literate Programs wiki explaining how to derive them using it.
The most naive implementation of all the primes ends up reading like this:
primes = 2 : sieve [3,5..] where sieve [] = [] sieve (p:xs) = p : sieve (xs `minus` [p,p+2*p..])
That gives an infinite sequence of all the prime numbers but I wanted to be able to specifically pick the 10,001st prime number which I assumed would be named ‘nth’ or something like that.
As it turns out we actually need to use the ‘!!’ operator which I found out from Mark Chu-Carroll’s blog post:
*Main> :t (!!) (!!) :: [a] -> Int -> a
problem_7 = primes !! 10000 -- 0 indexed so we need to get the position one before 10,001
It takes a while to run but we end up with the answer:
*Main> problem_7 104743
Java: Faking a closure with a factory to create a domain object
Recently we wanted to create a domain object which needed to have an external dependency in order to do a calculation and we wanted to be able to stub out that dependency in our tests.
Originally we were just new’ing up the dependency inside the domain class but that makes it impossible to control it’s value in a test.
Equally it didn’t seem like we should be passing that dependency into the constructor of the domain object since it’s not a piece of state which defines the object, just something that it uses.
We ended up with something similar to the following code where we have our domain object as an inner class:
public class FooFactory { private final RandomService randomService; public FooFactory(RandomService randomService) { this.randomService = randomService; } public Foo createFoo(String bar, int baz) { return new Foo(bar, baz); } class Foo { private String bar; private int baz; public Foo(String bar, int baz) { this.bar = bar; this.baz = baz; } public int awesomeStuff() { int random = randomService.random(bar, baz); return random * 3; } } }
A test on that code could then read like this:
public class FooFactoryTest { @Test public void createsAFoo() { RandomService randomService = mock(RandomService.class); when(randomService.random("bar", 12)).thenReturn(13); FooFactory.Foo foo = new FooFactory(randomService).createFoo("bar", 12); assertThat(foo.awesomeStuff(), equalTo(39)); } }
It’s a bit of a verbose way of getting around the problem but it seems to work reasonably well.
Haskell: Viewing the steps of a reduce
I’ve been playing around with Haskell a bit over the last week and in the bit of code I was working on I wanted to fold over a collection but see the state of the fold after each step.
I remembered Don Syme showing me how to do something similar during the F# Exchange last year while we were writing some code to score a tennis game by using Seq.scan.
I didn’t realise there was also a scan function in Haskell which could be defined in terms of foldl if we wanted to:
foldscanl :: (a -> b -> a) -> a -> [b] -> [a] foldscanl fn acc ls = foldl (\ x y -> x ++ [fn (last x) y]) [acc] ls
*Main> foldscanl (+) 0 [1..10] [0,1,3,6,10,15,21,28,36,45,55]
The actual function is defined like so:
scanl :: (a -> b -> a) -> a -> [b] -> [a] scanl f q ls = q : (case ls of [] -> [] x:xs -> scanl f (f q x) xs)
*Main> scanl (*) 1 [1..10] [1,1,2,6,24,120,720,5040,40320,362880,3628800]
I’ve been finding scanl particularly useful for working out whether the function I’m using to fold over the collection is actually working correctly.
There’s also a scanr if we want to fold over the collection from the right hand side.
Thou shalt storm
On the majority of the teams that I’ve worked on there’s been a time where everyone seems to be disagreeing with each other about almost everything and the whole situation becomes pretty tense for all involved.
The first time I came across this it seemed quite dysfunctional but I was introduced to Bruce Tuckman’s model of group development which helps to explain what’s going on.
Tuckman outlines four stages which teams tend to go through – forming, storming, norming and performing.
The forming stage tends to be fairly nice because people have mostly just met each other and don’t want to cause any conflict but eventually we move into the storming stage which is where it gets interesting:
The team addresses issues such as what problems they are really supposed to solve, how they will function independently and together and what leadership model they will accept. Team members open up to each other and confront each other’s ideas and perspectives.
The storming stage is necessary to the growth of the team. It can be contentious, unpleasant and even painful to members of the team who are averse to conflict. Tolerance of each team member and their differences should be emphasized. Without tolerance and patience the team will fail.
In agile teams this conflict often tends to rear its head in a retrospective and people will tend to go away from it feeling pretty bad about the whole thing and wishing they could work on another team!
In reality it’s just a stage that the team is going through and as long as team members address the differences of opinion they have rather than hiding from that conversation then the team can move on to being more productive.
One cool thing I’ve noticed is that if people have worked with each other before then they’ve probably already previously got over this hurdle and can therefore either skip/spend much less time on the storming stage.
A colleague of mine pointed out that we shouldn’t be alarmed at the way people interact with each other in the storming stage but we should be aware when we’re in it which I think is a pretty good way of looking at it.
Gina Abudi has an article from a couple of years ago which goes through the storming stage and all the other ones in more detail.
Optimising for typing
My colleague Ola Bini recently wrote a post describing his thoughts on the syntax of programming languages and while the post in general is interesting the bit that most resonates with me at the moment is the following:
Typing fewer characters doesn’t actually optimize for writing either – the intuition behind that statement is quite easy: imagine you had to write a book. However, instead of writing it in English, you just wrote the gzipped version of the book directly. You would definitely have to type much less – but would that in any way help you write the book? No, probably it would make it harder. So typing I definitely don’t want to optimize.
On the application I’m currently working on we have a full time DBA on the team who favours a much more concise style of naming in tables than most developers would be used to.
We have an acronym for each table and if there’s a foreign key reference to a field in that table elsewhere then we’ll use the acronym as part of the column name in the other table.
For example if we had a reference to a table called Person then our foreign key column name might be per_id.
One of the problems we were trying to overcome is that from what I understand Oracle has a field name limit of 30 characters so we would eventually exceed that if our naming was too verbose.
By using acronyms we can keep a more consistent style of naming rather than having some names being spelt out fully and others being abbreviated.
The other argument used for the more concise naming convention is that it’s much quicker to type columns named in this way and we don’t have auto complete functionality available in sqlplus as we do in most IDEs.
The disadvantage of the approach is that it’s now very difficult to understand what data is being stored in a table just by looking at it – you need to first learn all the acronyms and then do the translations in your head.
If you’re working with the tables all day long then it becomes second nature but if your use of them is more sporadic it takes a bit longer.
We’re trying to keep interactions with the database reasonably minimal so hopefully we won’t need to do any really complex queries with joins in which would be truly difficult to understand when we come back to them!
Coding: Packaging by vertical slice
On most of the applications I’ve worked on we’ve tended to organise/package classes by the function that they have or the layer that they fit in.
A typical package structure might therefore end up looking like this:
- com.awesome.project
- common
- StringUtils
- controllers
- LocationController
- PricingController
- domain
- Address
- Cost
- CostFactory
- Location
- Price
- repositories
- LocationRepository
- PriceRepository
- services
- LocationService
- common
This works reasonably well and allows you to find code which is similar in function but I find that more often than not a lot of the code that lives immediately around where you currently are isn’t actually relevant at the time.
On the last couple of applications that I’ve worked on we’ve been trying to group code around a domain concept or vertical slice of functionality.
Therefore instead of the above code we’d end up with something more like this:
- com.awesome.project
- location
- Address
- Location
- LocationController
- LocationRepository
- LocationService
- platform
- StringUtils
- price
- Cost
- CostFactory
- Distance
- Price
- PriceController
- PriceRepository
- location
We were having a discussion about grouping code like this last week and I was struggling to describe what I prefer about the latter approach.
In the code base that I’m currently working on, which provides an API for other systems to do stuff with, it seems to lead to a design where we have created lots of potential micro services which could be deployed separately if we wanted.
That possibility wasn’t as clear to me until we started grouping code this way.
Another cool thing is that it’s made us think about the domain of the code more and whether the grouping of classes actually makes sense. We can also see which classes fall inside an aggregate root.
In the above example under ‘pricing’ we can tell that Price is an aggregate root because it has a repository which allows us to get one and we can also tell that Cost is probably contained by Price since we don’t have a way of directly getting a Cost.
We stop thinking about the domain classes as a whole, instead we think about them in their groups and how their aggregate roots might interact with each other if at all.
One disadvantage of grouping code like this is that if we’re writing a new repository, for example, we’ve got further to navigate to find another one to base ours on.
On the other hand you could argue that if we’re doing that then perhaps there’s an abstraction we can pull out to remove the problem.
It’s an interesting approach to grouping code and one thing we’ve started noticing is that we end up with some packages which have a lot of classes in them and others which have very few.
We’re not sure whether this is a symptom of us not breaking down those particular packages enough or if there are just some areas of the domain that are bigger than others.
These are just some of my early observations so it’d be interesting to hear other’s thoughts on whether this is a good/bad idea.
Tech Leads & The Progress Principle
I’ve been reading The Progress Principle on and off for the last couple of months and one of my favourite quotes from the book is the following:
Truly effective video game designers know how to create a sense of progress for players within all stages of a game. Truly effective managers know how to do the same for their subordinates.
While a tech lead might not like to be referred to as a manager I think part of the role does involve helping developers to make progress and the best ones I’ve worked with seem to do that instinctively.
They’re able to see when someone has got very stuck with what they’re doing and can then work out whether they just need to provide some advice on how they can move forward or if that’s not working they can come and work together on the problem.
Work progress and setbacks matter so much because work matters so much. It’s simply part of being human
For the last couple of weeks a colleague and I have been trying to work out how to migrate reference data from an existing application for an application which we’re rewriting.
We’ve run into problems along the way such as looking at a test database which we didn’t realise was a couple of years out of date and therefore didn’t have the data in the right format to trying to link up data from tables which weren’t meant to be linked.
On each setback the tech lead (Alex) helped us progress first by recognising that the test and production databases might not be synced up and giving us other ideas about table design based on his greater knowledge of the whole system.
On this occasion although we were making some progress we couldn’t quite figure out how to get the data into the shape we wanted which was becoming increasingly frustrating.
The effect of setbacks on emotions is stronger than the effect of progress…the power of setbacks to diminish happiness is more than twice as strong as the power of progress to boost happiness.
Eventually Alex came and paired on the problem and was able to see something that we’d missed which allowed us to solve the problem.
I think the balance of initially allowing us to try and solve it ourselves and giving advice before taking a more hands on approach matches up fairly closely with what the authors of The Progress Principle suggest.
It’s certainly been an interesting book to read and one which I think is very applicable to people working in software teams.
Reading Code: boilerpipe
I’m a big fan of the iPad application Flipboard, especially it’s ability to filter out the non important content on web pages and just show me the main content so I’ve been looking around at open source libraries which provide that facility.
I came across a quora page where someone had asked how this was done and the suggested libraries were readability, Goose and boilerpipe.
boilerpipe was written by Christian Kohlschütter and has a corresponding paper and video as well.
At a very high level this is my understanding of what the code is doing:
It is based around a pipes/filters architectural style whereby a TextDocument is passed through filters which perform transformations on it. After they’ve all been applied we can retrieve the main content of the article via a method call.
I’ve used the pipes/filters approach when playing around with clojure/F# but the problems I was working on were much smaller than this.
In the code there around about 7 or 8 fields being manipulated so I did sometimes find it difficult to know how fields could end up with certain values which often involved looking at other filters and seeing what they did to the document.
I always thought it should be possible to view each filter completely independently but when there’s state manipulation involved that doesn’t seem to be the case.
Luckily Christian has comments in his code which explain how you might compose the different filters again and why certain filters don’t make sense on their own, only if they’re combined with others.
For example the BlockProximityFusion class, which is used to merge together adjacent text blocks, contains the following comment:
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.
I suppose the same thing could also have been achieved with some automated tests showing scenarios where different filters are composed.
Christian makes use of the logical OR (“|”) operator throughout the code base to ensure that all the filters get executed even if a previous one has successfully made changes to the document.
For example the main entry point into the code is ArticleExtractor which reads like this:
public final class ArticleExtractor extends ExtractorBase { public static final ArticleExtractor INSTANCE = new ArticleExtractor(); public static ArticleExtractor getInstance() { return INSTANCE; } public boolean process(TextDocument doc) throws BoilerpipeProcessingException { return TerminatingBlocksFinder.INSTANCE.process(doc) | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc) | NumWordsRulesClassifier.INSTANCE.process(doc) // cut for brevity | ExpandTitleToContentFilter.INSTANCE.process(doc); } }
I noticed a similar thing in the underscore.js code but in that case the ‘&&’ operator was used to execute code on the right hand side only if the expression on the left had been successful.
If we’re not using any libraries that simulate first class collections in Java (totallylazy/Guava for example) then something like this could also work:
public final class ArticleExtractor extends ExtractorBase { ... public boolean process(TextDocument doc) throws BoilerpipeProcessingException { List<BoilerpipeFilter> filters = asList(TerminatingBlocksFinder.INSTANCE, new DocumentTitleMatchClassifier(doc.getTitle()), ExpandTitleToContentFilter.INSTANCE); boolean result = true; for (BoilerpipeFilter filter : filters) { result = result | filter.process(doc); } return result; } }
I originally started just browsing the code and thought I roughly understood it before realising I couldn’t explain what it actually did. I therefore changed my approach and started writing some unit tests around it to see what the current behaviour was.
From what I can tell the main algorithm in the code is contained inside NumWordsRulesClassifier where each text block in the document is classified as being either content or non content.
I wrote tests covering all the scenarios in this class and then refactored the code to see if I could make it a bit more expressive. I ended up with this:
private boolean currentBlockHasContent(final TextBlock prev, final TextBlock curr, final TextBlock next) { if (fewLinksInCurrentBlock(curr)) { if (fewLinksInPreviousBlock(prev)) { return curr.getNumWords() > 16 || next.getNumWords() > 15 || prev.getNumWords() > 4; } else { return curr.getNumWords() > 40 || next.getNumWords() > 17; } } return false; } private boolean fewLinksInCurrentBlock(TextBlock curr) { return curr.getLinkDensity() <= 0.333333; } private boolean fewLinksInPreviousBlock(TextBlock prev) { return prev.getLinkDensity() <= 0.555556; }
The logic is all based around examining the text blocks immediately before and after the current one to work out whether or not it’s likely to be boiler plate content.
The logic around the next/previous text blocks is written quite imperatively and feels like it could be made more concise by using something like F#’s ‘Seq.windowed’ over the collection but I can’t quite see how at the moment!
You can read more about the algorithm on pages 4-7 of the paper.
From running the code against a few articles I’ve got saved to ReadItLater it does seem to work reasonably well.
Overall…
I haven’t read every single bit of the code base but from what I have read I think boilerpipe is a pretty cool library and the approach to filtering content is neat.
I found it especially useful to be able to read parts of the paper and then go and look at the corresponding code. Often that type of thing remains up to the imagination of the reader from my experience!
Oracle Spatial: java.sql.SQLRecoverableException: No more data to read from socket
We’re using Oracle Spatial on my current project so that we can locate points within geographical regions and decided earlier in the week to rename the table where we store the SDO_GEOMETRY objects for each region.
We did that by using a normal table alter statement but then started seeing the following error when we tried to insert test data in that column which takes an SDO_GEOMETRY object:
org.hibernate.exception.JDBCConnectionException: could not execute native bulk manipulation query
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:99)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.engine.query.NativeSQLQueryPlan.performExecuteUpdate(NativeSQLQueryPlan.java:219)
at org.hibernate.impl.SessionImpl.executeNativeUpdate(SessionImpl.java:1310)
at org.hibernate.impl.SQLQueryImpl.executeUpdate(SQLQueryImpl.java:396)
at $Proxy53.insertTariffZone(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:71)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.sql.SQLRecoverableException: No more data to read from socket
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1157)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:1044)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1329)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3584)
at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:3665)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeUpdate(OraclePreparedStatementWrapper.java:1352)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeUpdate(NewProxyPreparedStatement.java:105)
at org.hibernate.engine.query.NativeSQLQueryPlan.performExecuteUpdate(NativeSQLQueryPlan.java:210)
... 39 moreWe couldn’t see anything particularly wrong in what we’d done and none of the error messages we got were being particularly helpful.
Eventually we asked the DBA on our team to help out and he showed us how to look up the Oracle system logs which in our case were located at:
/u01/app/oracle/diag/rdbms/orcl/orcl/trace
We ran the query again and noticed new files were being written to that location, one of which had the following error message:
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x40] [PC:0x2FDAE7D, mdidxid()+2563] [flags: 0x0, count: 1] DDE: Problem Key 'ORA 7445 [mdidxid()+2563]' was flood controlled (0x2) (incident: 3851) ORA-07445: exception encountered: core dump [mdidxid()+2563] [SIGSEGV] [ADDR:0x40] [PC:0x2FDAE7D] [Address not mapped to object] [] ORA-13203: failed to read USER_SDO_GEOM_METADATA view ssexhd: crashing the process... Shadow_Core_Dump = PARTIAL
We’d forgotten to rename the table in the USER_SDO_GEOM_METADATA view, exactly as the message says!
Running the following statement sorted us out:
UPDATE user_sdo_geom_metadata SET table_name = 'NEW_NAME' where table_name = 'OLD_NAME';