Mark Needham

Thoughts on Software Development

Haskell: An impressively non performant union find

with 5 comments

I’ve spent the best part of the last day debugging a clustering algorithm I wrote as part of the Algorithms 2 course, eventually coming to the conclusion that the union find data structure I was using wasn’t working as expected.

In our algorithm we’re trying to group together points which are ‘close’ to each other and the data structure is particular useful for doing that.

To paraphrase from my previous post about how we use the union find data structure:

We start out with n connected components i.e. every point is in its own connected component.

We then merge these components together as calculate the neighbours of each point until we’ve iterated through all the points and have grouped all the points into the appropriate components.

I came across 3 libraries which implement this data structure – union-find, equivalence and persistent-equivalence.

union-find seemed like it had the easiest API to understand so I plugged it into my program only to eventually realise that it wasn’t putting the points into components as I expected.

I eventually narrowed the problem down to the following example:

> let uf = emptyEquivalence (0,9)
[(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 0 1 uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,8)]
 
> components $ equate 0 8 $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,0),(9,8)]

We start out with a union-find where every point is in its own component. The next line puts points ’0′ and ’1′ into the same component which it does by making indexes ’0′ and ’1′ of the array both have the same value, in this case 0, which is known as the component’s leader.

We continue doing that for points ’8′ and ’9′ which works fine and our union find now consists of 8 components – the ones with leaders 8 & 0 which have two elements and then ones with leaders 2,3,4,5,6 & 7 which only contain themselves.

Things go wrong on our next step where we try to join nodes ’0′ and ’8′. As I understand it what should happen here is that all the points connected to either ’0′ or ’8′ should end up in the same component so we should have a component containing points ’0′, ’1′, ’8′ and ’9′ but ’9′ has been missed off in this case.

The implementation is deliberately written to work like this so I thought I’d try writing my own version based on the following Ruby version:

class UnionFind
  def initialize(n)
    @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders }
  end
 
  def connected?(id1,id2)
    @leaders[id1] == @leaders[id2]
  end
 
  def union(id1,id2)
    leader_1, leader_2 = @leaders[id1], @leaders[id2]
    @leaders.map! {|i| (i == leader_1) ? leader_2 : i }
  end
end

This is my Haskell equivalent which I adapted from the union-find one that I mentioned above:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
module Leaders (UnionSet, create, components, numberOfComponents, indexes, inSameComponent, union) where
 
import Control.Concurrent.MVar
import Control.Monad
import Data.Array.Diff as ArrayDiff
import Data.IORef
import qualified Data.List
import Data.Maybe
import System.IO.Unsafe
import qualified Data.Set as Set
 
arrayFrom :: (IArray a e, Ix i) => (i,i) -> (i -> e) -> a i e
arrayFrom rng f = array rng [ (x, f x) | x <- range rng ]
 
ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)
 
data UnionSet i = UnionSet { leaders :: IORef (DiffArray i i) }
 
create :: Ix i => (i, i) -> UnionSet i
create is = UnionSet (ref (arrayFrom is id))
 
extractComponents :: Ix i => DiffArray i i -> [(i, i)]    
extractComponents  = Set.toList . Set.fromList . ArrayDiff.assocs
 
components :: Ix i => UnionSet i -> [(i,i)]
components (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (extractComponents l)
 
numberOfComponents :: Ix i => UnionSet i -> Int
numberOfComponents (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (length $ extractComponents l) 
 
indexes :: Ix i => UnionSet i -> [(i,i)]
indexes (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (ArrayDiff.assocs l)       
 
inSameComponent :: Ix i => UnionSet i -> i -> i -> Bool
inSameComponent (UnionSet leaders) x y = unsafePerformIO $ do
    l <- readIORef leaders
    return (l ! x == l ! y)
 
union x y (UnionSet leaders)  = unsafePerformIO $ do
    ls <- readIORef leaders
    let leader1 = ls ! x 
        leader2 = ls ! y
        newLeaders = map (\(index, value) -> (index, leader2)) . filter (\(index, value) -> value == leader1) $ assocs ls        
    modifyIORef leaders (\l -> l // newLeaders)
    return $ UnionSet leaders

We can recreate the above example like so:

> indexes $ Leaders.union 0 8 $ Leaders.union 8 9 $ Leaders.union 0 1 $ create (0,9)
[(0,9),(1,9),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,9),(9,9)]

Unfortunately it takes 44 seconds to execute which is mainly down to the call to assocs on line 46. assocs gives us a list of all the indexes and their corresponding values which we use to work out which indexes need to be updated with a new leader.

The rest of the code is mostly boiler plate around getting the array out of the IORef. The IORef allows us to have a mutable array in this instance. There is a page on the c2 wiki which explains how to use IORef in more detail.

Although using the DiffArray allows us to provide a pure external interface around its use, it is known to be 10-100x slower than an MArray.

I’ve been playing around with a version of the union-find data structure which makes use of an MArray instead and has decreased the execution time to 34 seconds.

Unless anyone has any ideas for how I can get this to perform more quickly I’m thinking that perhaps an array isn’t a good choice of underlying data structure at least when using Haskell.

Written by Mark Needham

December 31st, 2012 at 8:44 pm

Posted in Haskell

Tagged with

  • Pingback: Haskell: Strictness and the monadic bind at Mark Needham

  • Pingback: A new year’s idea: Share what you learn at Mark Needham

  • Dave C Turner

    Hi Mark,

    I recently had cause to write a version of Kruskal’s algorithm in Haskell and did it using my own union-find implementation in the ST monad, which is a better way of having mutable state and destructive updates than using IORefs and unsafe??? functions.

    If it’s any help it’s lines 73-114 of http://hpaste.org/80494 (Kruskal is lines 138 onwards). There’s not a lot to it. I haven’t done any serious performance testing on it, or even properly stressed it for correctness, but it should give a flavour if nothing else.

    Cheers,

    David

  • http://www.cse.chalmers.se/~abela Andreas Abel

    You write that union-find is buggy, but the example you give is using the API of persistent-equivalence. Please consider checking your claims:

    “I came across 3 libraries which implement this data structure – union-find, equivalence and persistent-equivalence.

    union-find seemed like it had the easiest API to understand so I
    plugged it into my program only to eventually realise that it wasn’t
    putting the points into components as I expected.”

  • http://www.markhneedham.com/blog Mark Needham

    Hi Andreas,

    I wrote this a long time ago so I can’t remember exactly what I was trying to do…I don’t think union find was buggy, just didn’t work how I expected it to.

    I can’t remember which library was which so I’ll take your word for it!

    Mark