31 Dec 2012

Haskell: An impressively non performant union find

I’ve spent the best part of the last day debugging a clustering algorithm I wrote as part of the Algorithms 2 course, eventually coming to the conclusion that the union find data structure I was using wasn’t working as expected.

In our algorithm we’re trying to group together points which are 'close' to each other and the data structure is particular useful for doing that.

To paraphrase from my previous post about how we use the union find data structure:</p>

We start out with n connected components i.e. every point is in its own connected component.

We then merge these components together as calculate the neighbours of each point until we’ve iterated through all the points and have grouped all the points into the appropriate components.

I came across 3 libraries which implement this data structure - union-find, equivalence and persistent-equivalence.

union-find seemed like it had the easiest API to understand so I plugged it into my program only to eventually realise that it wasn’t putting the points into components as I expected.

I eventually narrowed the problem down to the following example:

> let uf = emptyEquivalence (0,9)
[(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]

> components $ equate 0 1 uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]

> components $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,8)]

> components $ equate 0 8 $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,0),(9,8)]

We start out with a union-find where every point is in its own component. The next line puts points '0' and '1' into the same component which it does by making indexes '0' and '1' of the array both have the same value, in this case 0, which is known as the component’s leader.

We continue doing that for points '8' and '9' which works fine and our union find now consists of 8 components - the ones with leaders 8 & 0 which have two elements and then ones with leaders 2,3,4,5,6 & 7 which only contain themselves.

Things go wrong on our next step where we try to join nodes '0' and '8'. As I understand it what should happen here is that all the points connected to either '0' or '8' should end up in the same component so we should have a component containing points '0', '1', '8' and '9' but '9' has been missed off in this case.

The implementation is deliberately written to work like this so I thought I’d try writing my own version based on the following Ruby version:

class UnionFind
  def initialize(n)
    @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders }
  end

  def connected?(id1,id2)
    @leaders[id1] == @leaders[id2]
  end

  def union(id1,id2)
    leader_1, leader_2 = @leaders[id1], @leaders[id2]
    @leaders.map! {|i| (i == leader_1) ? leader_2 : i }
  end
end

This is my Haskell equivalent which I adapted from the union-find one that I mentioned above:

module Leaders (UnionSet, create, components, numberOfComponents, indexes, inSameComponent, union) where

import Control.Concurrent.MVar
import Control.Monad
import Data.Array.Diff as ArrayDiff
import Data.IORef
import qualified Data.List
import Data.Maybe
import System.IO.Unsafe
import qualified Data.Set as Set

arrayFrom :: (IArray a e, Ix i) => (i,i) -> (i -> e) -> a i e
arrayFrom rng f = array rng [ (x, f x) | x <- range rng ]

ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)

data UnionSet i = UnionSet { leaders :: IORef (DiffArray i i) }

create :: Ix i => (i, i) -> UnionSet i
create is = UnionSet (ref (arrayFrom is id))

extractComponents :: Ix i => DiffArray i i -> [(i, i)]
extractComponents  = Set.toList . Set.fromList . ArrayDiff.assocs

components :: Ix i => UnionSet i -> [(i,i)]
components (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (extractComponents l)

numberOfComponents :: Ix i => UnionSet i -> Int
numberOfComponents (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (length $ extractComponents l)

indexes :: Ix i => UnionSet i -> [(i,i)]
indexes (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (ArrayDiff.assocs l)

inSameComponent :: Ix i => UnionSet i -> i -> i -> Bool
inSameComponent (UnionSet leaders) x y = unsafePerformIO $ do
    l <- readIORef leaders
    return (l ! x == l ! y)

union x y (UnionSet leaders)  = unsafePerformIO $ do
    ls <- readIORef leaders
    let leader1 = ls ! x
        leader2 = ls ! y
        newLeaders = map (\(index, value) -> (index, leader2)) . filter (\(index, value) -> value == leader1) $ assocs ls
    modifyIORef leaders (\l -> l // newLeaders)
    return $ UnionSet leaders

We can recreate the above example like so:

> indexes $ Leaders.union 0 8 $ Leaders.union 8 9 $ Leaders.union 0 1 $ create (0,9)
[(0,9),(1,9),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,9),(9,9)]

Unfortunately it takes 44 seconds to execute which is mainly down to the call to http://zvon.org/other/haskell/Outputarray/index.html on line 46. assocs gives us a list of all the indexes and their corresponding values which we use to work out which indexes need to be updated with a new leader.

The rest of the code is mostly boiler plate around getting the array out of the IORef. The IORef allows us to have a mutable array in this instance. There is a page on the c2 wiki which explains how to use IORef in more detail.

Although using the DiffArray allows us to provide a pure external interface around its use, it is known to be 10-100x slower than an MArray.

I’ve been playing around with a version of the union-find data structure which makes use of an MArray instead and has decreased the execution time to 34 seconds.

Unless anyone has any ideas for how I can get this to perform more quickly I’m thinking that perhaps an array isn’t a good choice of underlying data structure at least when using Haskell.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.