Mark Needham

Thoughts on Software Development

Archive for the ‘Haskell’ tag

Knapsack Problem in Haskell

with one comment

I recently described two versions of the Knapsack problem written in Ruby and Python and one common thing is that I used a global cache to store the results of previous calculations.

From my experience of coding in Haskell, that style isn't considered very idiomatic and, although I haven't actually tried it, is potentially more tricky to achieve.

I thought it’d be interesting to try and write the algorithm in Haskell with that constraint in mind and my first version looked like this:

ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)

knapsackCached1 :: [[Int]] -> Int -> Int -> IORef (Map.Map (Int, Int) Int) -> Int
knapsackCached1 rows knapsackWeight index cacheContainer = unsafePerformIO $ do
  cache <- readIORef cacheContainer
  if index == 0 || knapsackWeight == 0 then do
    return 0
  else
    let (value:weight:_) = rows !! index
        best = knapsackCached1 rows knapsackWeight prevIndex cacheContainer in
    if weight > knapsackWeight && lookupPreviousIn cache == Nothing then do
      let updatedCache = Map.insert (prevIndex, knapsackWeight) best cache
      writeIORef cacheContainer updatedCache
      return $ fromJust $ lookupPreviousIn updatedCache
    else
      if lookupPreviousIn cache == Nothing then do
        let newBest = maximum [best, value + knapsackCached1 rows (knapsackWeight-weight) prevIndex cacheContainer]
            updatedCache = Map.insert (prevIndex, knapsackWeight) newBest cache
        writeIORef cacheContainer updatedCache
        return $ fromJust $ lookupPreviousIn updatedCache
      else
        return $ fromJust $ lookupPreviousIn cache
  where lookupPreviousIn cache = Map.lookup (prevIndex, knapsackWeight) cache
        prevIndex = index - 1

We then call it like this:

let (knapsackWeight, numberOfItems, rows) = process contents
    cache = ref (Map.empty :: Map.Map (Int, Int) Int)
knapsackCached1 rows knapsackWeight (numberOfItems-1) cache

As you can see, we’re passing around the cache as a parameter where the cache is a Map wrapped inside an IORef – a data type which allows us to pass around a mutable variable in the IO monad.

We write our new value into the cache via writeIORef in both branches so that our updates to the map will be picked up in the other recursive steps.

Apart from that the shape of the code is the same as the Ruby and Python versions except I’m now only using a map with a pair as the key instead of an array + map as in the other versions.

The annoying thing about this solution is that we have to pass the cache around as a parameter when it’s just a means of optimisation and not part of the actual problem.

An alternative solution could be the following where we abstract the writing/reading of the map into a memoize function which we wrap our function in:

memoize :: ((Int, Int) -> Int) -> (Int, Int) -> Int                  
memoize fn mapKey = unsafePerformIO $ do 
  let cache = ref (Map.empty :: Map.Map (Int, Int) Int)
  items <- readIORef cache
  if Map.lookup mapKey items == Nothing then do
    let result = fn mapKey
    writeIORef cache $  Map.insert mapKey result items
    return result
  else
    return (fromJust $ Map.lookup mapKey items)        
 
knapsackCached :: [[Int]] -> Int -> Int -> Int
knapsackCached rows weight numberOfItems = 
  inner (numberOfItems-1, weight)
  where inner = memoize (\(i,w) -> if i < 0 || w == 0 then 0
                                   else
                                     let best = inner (i-1,w) 
                                         (vi:wi:_) = rows !! i in 
                                     if wi > w then best
                                     else maximum [best, vi + inner (i-1, w-wi)])

We can call that function like this:

let (knapsackWeight, numberOfItems, rows) = process contents
knapsackCached rows knapsackWeight numberOfItems

Here we define an inner function inside knapsackCached which is a partial application of the memoize function. We then pass our cache key to that function on the previous line.

One thing which I noticed while writing this code is that there is some strangeness around the use of ‘in’ after let statements. It seems like if you’re inside an if/else block you need to use ‘in’ unless you’re in the context of a Monad (do statement) in which case you don’t need to.

I was staring at a screen of compilation errors for about an hour until I realised this!
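To make that concrete, here's a tiny sketch of the two forms (my own example, not from the knapsack code):

```haskell
-- Outside a do block, a 'let' needs a matching 'in':
squareSum :: Int -> Int -> Int
squareSum x y = let s = x + y in s * s

-- Inside a do block, 'let' stands on its own, no 'in' required:
main :: IO ()
main = do
  let s = squareSum 2 3
  print s
```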

These are the timings for the two versions of the algorithm:

# First one
$ time ./k knapsack2.txt 
real	0m14.993s
user	0m14.646s
sys	0m0.320s
 
# Second one
$ time ./k knapsack2.txt 
real	0m12.594s
user	0m12.259s
sys	0m0.284s

I’m still trying to understand exactly how to profile and then optimise the program so any tips are always welcome.

Written by Mark Needham

January 9th, 2013 at 12:12 am

Posted in Algorithms, Haskell


Haskell: Reading files

without comments

In writing the clustering algorithm which I've mentioned way too many times already, I needed to process a text file containing all the points. My initial approach looked like this:

import System.IO
 
main = do    
    withFile "clustering2.txt" ReadMode (\handle -> do  
        contents <- hGetContents handle   
        putStrLn contents)

It felt a bit clunky but I didn’t realise there was an easier way until I came across this thread. We can simplify reading a file to the following by using the readFile function:

main = do    
    contents <- readFile "clustering2.txt" 
    putStrLn contents

We need to read the file in the IO monad which explains why we have the ‘do’ notation on the first line.
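As a self-contained sketch of the readFile approach (the sample file name and contents here are my own, written out first so the snippet runs anywhere):

```haskell
main :: IO ()
main = do
  writeFile "sample.txt" "24\n379\n1669\n"  -- create a small input file just for this example
  contents <- readFile "sample.txt"
  print (length (lines contents))            -- how many lines we read back
```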

Another thing I didn’t realise until recently was that you don’t actually need to worry about the ‘do’ notation if you try to read from the IO monad inside GHCI.

In this context we’re reading from the IO monad when we bind ‘readFile’ to the variable ‘contents’ since ‘readFile’ returns type ‘IO String’:

> :t readFile
readFile :: FilePath -> IO String

We can therefore play around with the code pretty easily:

> contents <- readFile "clustering2.txt"
> let (bits, nodes) = process contents 
> bits
24
> length nodes
19981
> take 10 nodes
[379,1669,5749,6927,7420,9030,9188,9667,11878,12169]

I think we’re able to do this because by being in GHCI we’re already in the context of the IO monad but I’m happy to be corrected if I haven’t explained that correctly.

Written by Mark Needham

January 2nd, 2013 at 12:16 am

Posted in Haskell


Haskell: Downloading the core library source code

without comments

I’ve started playing around with Haskell again and since I’m doing so on a new machine I don’t have a copy of the language source code.

I wanted to rectify that situation but my Google fu was weak and it took me way too long to figure out how to get it so I thought I’d better document it for future me.

The easiest way is to clone the copy of the GHC repository on github:

git clone https://github.com/ghc/ghc.git

Initially that doesn’t have any of the code for the core libraries but running the following command (which takes ages!) sorts it out:

cd ghc
./sync-all get

Noticing that the core libraries weren’t there initially I thought I must have done something wrong so I went to the documentation for the function that I wanted to see the source for – getAssocs.

From that page there is a link to the source for that function and a bit of tweaking of the URL lets us know that this function is defined in the array package.

Most of the base packages are available from the Haskell darcs repository so I ended up cloning the ones I wanted in the time that it took for the sync-all script to run.

darcs get http://darcs.haskell.org/packages/base/ # gets most of the packages we'd be interested in
darcs get http://darcs.haskell.org/packages/array/ # gets the array package

Of course I could have just been patient and waited for the script to finish…

Written by Mark Needham

December 31st, 2012 at 10:39 pm

Posted in Haskell


Haskell: Strictness and the monadic bind

with one comment

As I mentioned towards the end of my post about implementing the union find data structure in Haskell, I wrote another version using a mutable array. Having not seen much of a performance improvement, I started commenting out code to try and find the problem.

I eventually narrowed it down to the union function which was defined like so:

union :: IO (IOArray Int Int) -> Int -> Int -> IO (IOArray Int Int)
union arrayContainer x y = do
    actualArray <- arrayContainer
    ls <- getAssocs actualArray
    leader1 <- readArray actualArray x
    leader2 <- readArray actualArray y
    let newValues = (map (\(index, value) -> (index, leader1)) . filter (\(index, value) -> value == leader2)) ls
    sequence $ map (\(idx, val) -> writeArray actualArray idx val) newValues
    return actualArray

I was using Unix’s time function to get the execution time since this meant I didn’t need to make any changes to the program and this level of granularity was ok.

The first time I ran the program it executed in 36.379 seconds and my first hunch was that a lot of time was being taken up writing to the array so I commented out that line:

union :: IO (IOArray Int Int) -> Int -> Int -> IO (IOArray Int Int)
union arrayContainer x y = do
    actualArray <- arrayContainer
    ls <- getAssocs actualArray
    leader1 <- readArray actualArray x
    leader2 <- readArray actualArray y
    let newValues = (map (\(index, value) -> (index, leader1)) . filter (\(index, value) -> value == leader2)) ls
    -- sequence $ map (\(idx, val) -> writeArray actualArray idx val) newValues
    return actualArray

The execution time decreased to 33.381 seconds so the writing of the array was actually only a small part of the total execution time.

I thought it was quite strange that it was taking so long to execute since things are generally lazily evaluated in Haskell and my assumption was that newValues wasn’t being evaluated since I hadn’t used it anywhere. I decided to comment that out to see what difference it would make:

union :: IO (IOArray Int Int) -> Int -> Int -> IO (IOArray Int Int)
union arrayContainer x y = do
    actualArray <- arrayContainer
    ls <- getAssocs actualArray
    leader1 <- readArray actualArray x
    leader2 <- readArray actualArray y
    -- let newValues = (map (\(index, value) -> (index, leader1)) . filter (\(index, value) -> value == leader2)) ls
    -- sequence $ map (\(idx, val) -> writeArray actualArray idx val) newValues
    return actualArray

The execution time was now 33.579 seconds so commenting out that line hadn’t actually made any difference. I assumed ls wasn’t being evaluated since it isn’t being used but I thought I’d better check:

union :: IO (IOArray Int Int) -> Int -> Int -> IO (IOArray Int Int)
union arrayContainer x y = do
    actualArray <- arrayContainer
    -- ls <- getAssocs actualArray
    leader1 <- readArray actualArray x
    leader2 <- readArray actualArray y
    -- let newValues = (map (\(index, value) -> (index, leader1)) . filter (\(index, value) -> value == leader2)) ls
    -- sequence $ map (\(idx, val) -> writeArray actualArray idx val) newValues
    return actualArray

The execution time now reduced to 3.882 seconds thereby suggesting that getAssocs was being strictly evaluated.

The ‘<-’ is what's called a monadic bind, which in the IO monad is strictly evaluated, but it isn't necessarily evaluated like this in every monad:

Monad operations (bind and return) have to be non-strict in fact, always! However other operations can be specific to each monad. For instance some are strict (like IO), and some are non-strict (like []).

From my observations it would seem that IO is one of those monads which evaluates bind strictly.
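A minimal illustration of that behaviour (my own example, not from the original program): binding an action with ‘<-’ runs the action even when its result is never used, which is why the unused getAssocs call still cost time.

```haskell
import Data.IORef

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  let action = modifyIORef counter (+ 1) >> return "expensive result"
  _unused <- action     -- the bind runs the action even though we ignore its result
  n <- readIORef counter
  print n               -- the counter proves the action executed
```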

I tried looking at the Haskell source code to see if I could find any code to prove what I’d observed but I’m not entirely sure what I should be looking for!

Written by Mark Needham

December 31st, 2012 at 10:27 pm

Posted in Haskell


Haskell: An impressively non performant union find

with 5 comments

I’ve spent the best part of the last day debugging a clustering algorithm I wrote as part of the Algorithms 2 course, eventually coming to the conclusion that the union find data structure I was using wasn’t working as expected.

In our algorithm we're trying to group together points which are ‘close’ to each other and the data structure is particularly useful for doing that.

To paraphrase from my previous post about how we use the union find data structure:

We start out with n connected components i.e. every point is in its own connected component.

We then merge these components together as we calculate the neighbours of each point until we've iterated through all the points and have grouped all the points into the appropriate components.

I came across 3 libraries which implement this data structure – union-find, equivalence and persistent-equivalence.

union-find seemed like it had the easiest API to understand so I plugged it into my program only to eventually realise that it wasn’t putting the points into components as I expected.

I eventually narrowed the problem down to the following example:

> let uf = emptyEquivalence (0,9)
[(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 0 1 uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)]
 
> components $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,8)]
 
> components $ equate 0 8 $ equate 8 9 $ equate 0 1 $ uf
[(0,0),(1,0),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,0),(9,8)]

We start out with a union-find where every point is in its own component. The next line puts points ‘0’ and ‘1’ into the same component which it does by making indexes ‘0’ and ‘1’ of the array both have the same value, in this case 0, which is known as the component’s leader.

We continue doing that for points ‘8’ and ‘9’ which works fine, and our union find now consists of 8 components – the ones with leaders 8 & 0, which contain two elements each, and the ones with leaders 2, 3, 4, 5, 6 & 7, which only contain themselves.

Things go wrong on our next step where we try to join nodes ‘0’ and ‘8’. As I understand it what should happen here is that all the points connected to either ‘0’ or ‘8’ should end up in the same component so we should have a component containing points ‘0’, ‘1’, ‘8’ and ‘9’ but ‘9’ has been missed off in this case.

The implementation is deliberately written to work like this so I thought I’d try writing my own version based on the following Ruby version:

class UnionFind
  def initialize(n)
    @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders }
  end
 
  def connected?(id1,id2)
    @leaders[id1] == @leaders[id2]
  end
 
  def union(id1,id2)
    leader_1, leader_2 = @leaders[id1], @leaders[id2]
    @leaders.map! {|i| (i == leader_1) ? leader_2 : i }
  end
end

This is my Haskell equivalent which I adapted from the union-find one that I mentioned above:

module Leaders (UnionSet, create, components, numberOfComponents, indexes, inSameComponent, union) where
 
import Control.Concurrent.MVar
import Control.Monad
import Data.Array.Diff as ArrayDiff
import Data.IORef
import qualified Data.List
import Data.Maybe
import System.IO.Unsafe
import qualified Data.Set as Set
 
arrayFrom :: (IArray a e, Ix i) => (i,i) -> (i -> e) -> a i e
arrayFrom rng f = array rng [ (x, f x) | x <- range rng ]
 
ref :: a -> IORef a
ref x = unsafePerformIO (newIORef x)
 
data UnionSet i = UnionSet { leaders :: IORef (DiffArray i i) }
 
create :: Ix i => (i, i) -> UnionSet i
create is = UnionSet (ref (arrayFrom is id))
 
extractComponents :: Ix i => DiffArray i i -> [(i, i)]    
extractComponents  = Set.toList . Set.fromList . ArrayDiff.assocs
 
components :: Ix i => UnionSet i -> [(i,i)]
components (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (extractComponents l)
 
numberOfComponents :: Ix i => UnionSet i -> Int
numberOfComponents (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (length $ extractComponents l) 
 
indexes :: Ix i => UnionSet i -> [(i,i)]
indexes (UnionSet leaders) = unsafePerformIO $ do
    l <- readIORef leaders
    return (ArrayDiff.assocs l)       
 
inSameComponent :: Ix i => UnionSet i -> i -> i -> Bool
inSameComponent (UnionSet leaders) x y = unsafePerformIO $ do
    l <- readIORef leaders
    return (l ! x == l ! y)
 
union x y (UnionSet leaders)  = unsafePerformIO $ do
    ls <- readIORef leaders
    let leader1 = ls ! x 
        leader2 = ls ! y
        newLeaders = map (\(index, value) -> (index, leader2)) . filter (\(index, value) -> value == leader1) $ assocs ls        
    modifyIORef leaders (\l -> l // newLeaders)
    return $ UnionSet leaders

We can recreate the above example like so:

> indexes $ Leaders.union 0 8 $ Leaders.union 8 9 $ Leaders.union 0 1 $ create (0,9)
[(0,9),(1,9),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,9),(9,9)]

Unfortunately it takes 44 seconds to execute, which is mainly down to the call to assocs in the union function. assocs gives us a list of all the indexes and their corresponding values which we use to work out which indexes need to be updated with a new leader.

The rest of the code is mostly boilerplate around getting the array out of the IORef. The IORef allows us to have a mutable array in this instance. There is a page on the c2 wiki which explains how to use IORef in more detail.

Although using the DiffArray allows us to provide a pure external interface around its use, it is known to be 10-100x slower than an MArray.

I’ve been playing around with a version of the union-find data structure which makes use of an MArray instead and has decreased the execution time to 34 seconds.

Unless anyone has any ideas for how I can get this to perform more quickly I’m thinking that perhaps an array isn’t a good choice of underlying data structure at least when using Haskell.

Written by Mark Needham

December 31st, 2012 at 8:44 pm

Posted in Haskell


Bitwise operations in Ruby and Haskell

with 4 comments

Part of one of the most recent problems in the Algorithms 2 course required us to find the ‘neighbours’ of binary values.

In this case a neighbour is described as being any other binary value which has an equivalent value or differs in 1 or 2 bits.

e.g. the neighbours of ‘10000’ would be ‘00000’, ‘00001’, ‘00010’, ‘00100’, ‘01000’, ‘10001’, ‘10010’, ‘10011’, ‘10100’, ‘10101’, ‘10110’, ‘11000’, ‘11001’, ‘11010’ and ‘11100’

I initially treated ‘10000’ as an array of 1s and 0s and wrote a function to recursively come up with the above combinations before it was pointed out to me that it'd be much easier to use bitwise logic instead.

I don’t remember ever having to write any code using bitwise operations since university so I thought I’d document what I learnt.

The first thing to do is convert those binary values into a decimal equivalent which is pretty easy in Ruby:

> "10000".to_i(2)
=> 16

In Haskell I stole a function from PLEAC to do the job:

> import Data.Char
> (foldr (\c s -> s * 2 + c) 0 . reverse . map digitToInt) "10000"
16
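The reverse-then-foldr dance can also be written as a single left fold, which I find easier to read (my own variation, same result):

```haskell
import Data.Char (digitToInt)

-- Convert a binary string to its decimal value with one left fold:
-- each step shifts the accumulator left and adds the next digit.
toDecimal :: String -> Int
toDecimal = foldl (\acc c -> acc * 2 + digitToInt c) 0

main :: IO ()
main = print (toDecimal "10000")
```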

I initially worked out the neighbours by hand but I needed to work out how to convert that into code, so I ran through an example.

We want to get from ‘10000’ (16) to ‘00000’ (0) which we can do by using an XOR operation:

Binary XOR Operator copies the bit if it is set in one operand but not both.

'10000' XOR '10000'

In Ruby it would read like this:

> 16 ^ 16
=> 0

If we do the same XOR operation but changing the other bits instead we end up with all the neighbours of distance 1:

> [0,1,2,4,16].map { |x| 16 ^ x }
=> [16, 17, 18, 20, 0]

We can generalise the function that creates that array of values like so:

> bits = 5
> offsets = (0..(bits - 1)).map { |x| 2 ** x }
=> [1, 2, 4, 8, 16]

or if we want to use bit shifting:

> offsets = (0..(bits - 1)).map { |x| 1 << x }
=> [1, 2, 4, 8, 16]

With that approach, the “left operand’s value is moved left by the number of bits specified by the right operand”, i.e. we move ‘1’ left by 0 bits (from ‘1’ to ‘1’), then by 1 bit (from ‘1’ to ‘10’), then by 2 bits (from ‘1’ to ‘100’) and so on.

We still need to get all the neighbours which differ by 2 bits, which we can do by getting all the ways we can combine those offsets together. The combination function and a bitwise OR do the trick here:

> offsets.combination(2).to_a
=> [[1, 2], [1, 4], [1, 8], [1, 16], [2, 4], [2, 8], [2, 16], [4, 8], [4, 16], [8, 16]]
> offsets.combination(2).to_a.map { |a,b| a|b }
=> [3, 5, 9, 17, 6, 10, 18, 12, 20, 24]

Now if we put it all together:

> initial_value = "10000"
=> "10000"
> bits = initial_value.length
=> 5
> value = "10000".to_i(2)
=> 16
> single_bit_offsets = (0..(bits - 1)).map { |x| 1 << x }
=> [1, 2, 4, 8, 16]
> all_the_offsets = single_bit_offsets + single_bit_offsets.combination(2).to_a.map { |a,b| a|b }
=> [1, 2, 4, 8, 16, 3, 5, 9, 17, 6, 10, 18, 12, 20, 24]
> all_the_offsets.map { |offset| value ^ offset }.sort 
=> [0, 1, 2, 4, 8, 17, 18, 19, 20, 21, 22, 24, 25, 26, 28]

The final Haskell version looks like this:

> import Data.Char
> import Data.Bits
> import Data.List
 
> let initialValue = "10000"
> let bits = length initialValue
> let value = (foldr (\c s -> s * 2 + c) 0 . reverse . map digitToInt) initialValue
> let singleBitOffsets = map (shiftL 1) [0..(bits - 1)] :: [Int]
> let allTheOffsets = singleBitOffsets ++ map (\(x:y:_) -> (x .|. y)) (combinationsOf 2 singleBitOffsets) :: [Int]
> Data.List.sort $ map (xor value) allTheOffsets 
[0,1,2,4,8,17,18,19,20,21,22,24,25,26,28]

I took the combinationsOf function from David Amos’ Combinatorics Generation library and it reads like so:

-- subsets of size k
combinationsOf 0 _ = [[]]
combinationsOf _ [] = []
combinationsOf k (x:xs) = map (x:) (combinationsOf (k-1) xs) ++ combinationsOf k xs
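Putting the GHCi session and combinationsOf together as one compiled sketch (my own assembly of the snippets above):

```haskell
import Data.Bits (shiftL, xor, (.|.))
import Data.Char (digitToInt)
import Data.List (sort)

-- subsets of size k (from David Amos' Combinatorics Generation library)
combinationsOf :: Int -> [a] -> [[a]]
combinationsOf 0 _ = [[]]
combinationsOf _ [] = []
combinationsOf k (x:xs) = map (x:) (combinationsOf (k-1) xs) ++ combinationsOf k xs

-- all values within hamming distance 1 or 2 of the given binary string
neighbours :: String -> [Int]
neighbours binary = sort (map (xor value) allOffsets)
  where bits             = length binary
        value            = foldr (\c s -> s * 2 + c) 0 (reverse (map digitToInt binary))
        singleBitOffsets = map (shiftL 1) [0 .. bits - 1]
        allOffsets       = singleBitOffsets ++
                           map (\(x:y:_) -> x .|. y) (combinationsOf 2 singleBitOffsets)

main :: IO ()
main = print (neighbours "10000")
```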

A real life version of this problem occurs in the world of genome analysis where scientists need to work out whether phylogenetic profiles are functionally linked.

Written by Mark Needham

December 31st, 2012 at 1:14 pm

Posted in Haskell,Ruby


Haskell: Using qualified imports to avoid polluting the namespace

with one comment

In most of the Haskell code I've read, functions from other modules are imported directly into the namespace, and I reached the stage where I had this list of imports in a file:

import System.IO
import Data.List.Split
import Data.Char
import Data.Bits
import Control.Monad 
import Data.Map 
import Data.Set 
import Data.List 
import Data.Maybe

This becomes a problem when you want to use a function which is defined in multiple modules such as filter:

clustering.hs:53:43:
    Ambiguous occurrence `filter'
    It could refer to either `Data.List.filter',
                             imported from `Data.List' at clustering.hs:11:1-16
                             (and originally defined in `GHC.List')
                          or `Data.Set.filter',
                             imported from `Data.Set' at clustering.hs:10:1-16
                          or `Data.Map.filter',
                             imported from `Data.Map' at clustering.hs:9:1-16

One way to solve this is to change occurrences of filter to Data.List.filter, but that's a bit long winded, and in this case there is a filter function in the Prelude module which is available without us importing anything.

Unfortunately we’d have to use the prefix Prelude to refer to it since all the other versions of the function have made it ambiguous.
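For example, with those unqualified imports in scope, a minimal sketch of the situation looks like this (the function name is my own):

```haskell
import Data.Map
import Data.List

-- a bare 'filter' here would be ambiguous between Prelude, Data.Map and Data.List,
-- so we have to spell out which one we mean
onlyEvens :: [Int] -> [Int]
onlyEvens = Prelude.filter even

main :: IO ()
main = print (onlyEvens [1, 2, 3, 4])
```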

We still want to use some of the functions in those other modules though so we can do a qualified import which will make the functions available to us but only if we refer to them by their full name. It won’t import them into our namespace.

For example to initialise a map we’d do this:

> import qualified Data.Map
> Data.Map.assocs $ Data.Map.fromList [(1,2), (3,7)]
[(1,2),(3,7)]

That’s a bit long winded though so we can rename imports with a shorter name to make our life a bit easier:

import System.IO
import Data.List.Split
import Data.Char
import Data.Bits
import qualified Control.Monad as Monad
import qualified Data.Map as Map
import qualified Data.Set as Set
import qualified Data.List as List
import qualified Data.Maybe as Maybe

We can then use functions in those packages like so:

> Maybe.maybeToList (Just 3)
[3]

For this particular function I haven't come across any other with the same name, so we might want to import it directly into our namespace but require the use of the Maybe prefix for any other functions:

import qualified Data.Maybe as Maybe
import Data.Maybe (maybeToList)
> maybeToList (Just 3)
[3]

There’s a wiki entry I came across which explains Haskell modules in a bit more detail and is worth a read.

Written by Mark Needham

December 30th, 2012 at 11:16 pm

Posted in Haskell


Haskell: Pattern matching a list

with one comment

As I mentioned in a post yesterday I’ve been converting a clustering algorithm into Haskell and I wanted to get the value from doing a bitwise OR on two values in a list.

I forgot it was possible to pattern match on lists until I came across a post I wrote about 8 months ago where I’d done this, so my initial code looked like this:

> import Data.Bits
> map (\pair -> (pair !! 0) .|. (pair !! 1)) [[1,2], [3,4]]
[3,7]

We can pattern match against the list like so:

> map (\(x:y:_) -> x .|. y) [[1,2], [3,4]]
[3,7]

Here x takes the first value, y takes the second value and the rest of the list is assigned to _ which we don’t use in this case.
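The same pattern works in a named function too, where we can also add a fallback clause for lists that are too short (the function name and the fallback value of 0 are my own choices):

```haskell
import Data.Bits ((.|.))

-- bitwise OR of the first two elements of a list
orPair :: [Int] -> Int
orPair (x:y:_) = x .|. y
orPair _       = 0   -- assumed fallback for lists with fewer than two elements

main :: IO ()
main = print (map orPair [[1,2], [3,4]])
```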

There are loads of examples of pattern matching against different data structures in Learn You A Haskell and hopefully next time I’ll remember and won’t write hideous code like the first example!

Written by Mark Needham

December 30th, 2012 at 10:39 pm

Posted in Haskell


Haskell: A cleaner way of initialising a map

with 7 comments

I recently wrote a blog post showing a way of initialising a Haskell map and towards the end of the post I realised how convoluted my approach was and wondered if there was an easier way. Indeed there is!

To recap, this is the code I ended up with to populate a map with binary based values as the keys and node ids as the values:

import Data.Map 
 
toMap :: [Int] -> Map Int [Int]
toMap nodes = fromList $ map asMapEntry $ (groupIgnoringIndex . sortIgnoringIndex) nodesWithIndexes
              where nodesWithIndexes = (zip [0..] nodes)
 
groupIgnoringIndex = groupBy (\(_,x) (_,y) -> x == y) 
sortIgnoringIndex = sortBy (\(_,x) (_,y) -> x `compare` y)
 
asMapEntry :: [(Int, Int)] -> (Int, [Int]) 
asMapEntry nodesWithIndexes = 
   ((snd . head) nodesWithIndexes, Prelude.foldl (\acc (x,_) -> acc ++ [x]) [] nodesWithIndexes)
> assocs $ toMap [1,2,5,7,2,4]
[(1,[0]),(2,[1,4]),(4,[5]),(5,[2]),(7,[3])]

To sum up what we’re trying to do: when a key doesn’t have an entry we want to create one with a list containing the appropriate value and if the key already has a value then we want to append that value to the existing list.

As it turns out the insertWith function does exactly what we want:

> let emptyMap = empty :: Map Int [Int]
> assocs $ foldl (\acc (id,val) -> insertWith (++) val [id] acc) emptyMap nodesWithIndexes
[(1,[0]),(2,[4,1]),(4,[5]),(5,[2]),(7,[3])]
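Wrapped up as a replacement for toMap, the whole thing shrinks to a single fold (my own packaging of the snippet above):

```haskell
import qualified Data.Map as Map

-- build a map from value -> list of indexes in one pass;
-- insertWith prepends the new index when the key already exists
toMap :: [Int] -> Map.Map Int [Int]
toMap nodes = foldl insertNode Map.empty (zip [0 ..] nodes)
  where insertNode acc (idx, val) = Map.insertWith (++) val [idx] acc

main :: IO ()
main = print (Map.assocs (toMap [1, 2, 5, 7, 2, 4]))
```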

Based on this experience it would appear that the same type of thing applies when coding in Haskell as when coding in Clojure.

To paraphrase Jay Fields:

If you’ve written a fair amount of Clojure code […] then chances are you’ve probably reinvented a few functions that are already available in the standard library.

Written by Mark Needham

December 29th, 2012 at 8:14 pm

Posted in Haskell


Haskell: Initialising a map

with 2 comments

I’ve been converting a variation of Kruskal’s algorithm from Ruby into Haskell and one thing I needed to do was create a map of binary based values to node ids.

In Ruby I wrote the following code to do this:

nodes = [1,2,5,7,2,4]
@magical_hash = {}
nodes.each_with_index do |node, index|
  @magical_hash[node] ||= []
  @magical_hash[node] << index
end
 
=> {1=>[0], 2=>[1, 4], 5=>[2], 7=>[3], 4=>[5]}

From looking at the documentation it seemed like the easiest way to do this in Haskell would be to convert the nodes into an appropriate list and then call the fromList function to build the map.

I needed to get the data to look like this:

> let nodesMap = Data.Map.fromList [(1, [0]), (2, [1,4]), (5, [2]), (7, [3]), (4, [5])]
> Data.Map.assocs nodesMap
[(1,[0]),(2,[1,4]),(4,[5]),(5,[2]),(7,[3])]

The first step was to create a list of tuples with the nodes ids and values:

> zip [0..] [1,2,5,7,2,4]
[(0,1),(1,2),(2,5),(3,7),(4,2),(5,4)]

I wanted to group the collection by value so that in this instance the 2 nodes with a value of ‘2’ could map from the same key in the map.

The following code helped do that:

groupIgnoringIndex = groupBy (\(_,x) (_,y) -> x == y)   
sortIgnoringIndex = sortBy (\(_,x) (_,y) -> x `compare` y)
> (groupIgnoringIndex . sortIgnoringIndex) (zip [0..] [1,2,5,7,2,4])
[[(0,1)],[(1,2),(4,2)],[(5,4)],[(2,5)],[(3,7)]]

I wrote the following function to convert that collection into one that could be passed to fromList:

asMapEntry :: [(Int, Int)] -> (Int, [Int])
asMapEntry nodesWithIndexes = ((snd . head) nodesWithIndexes, 
                              foldl (\acc (x,_) -> acc ++ [x]) [] nodesWithIndexes)
> asMapEntry [(1,2),(4,2)]
(2,[1,4])

We can then put all those functions together into the following top level function:

toMap :: [Int] -> Map Int [Int]
toMap nodes = Data.Map.fromList $ map asMapEntry $ (groupIgnoringIndex . sortIgnoringIndex) nodesWithIndexes
              where nodesWithIndexes = (zip [0..] nodes)
> Data.Map.assocs $ toMap [1,2,5,7,2,4]
[(1,[0]),(2,[1,4]),(4,[5]),(5,[2]),(7,[3])]

I haven’t properly investigated all the functions available in the Data.Map package but my gut feeling is there must be a better way to do this – the sort/group combination is ugly in the extreme!

Written by Mark Needham

December 29th, 2012 at 7:27 pm

Posted in Haskell
