Kruskal's Algorithm using union find in Ruby
I recently wrote a blog post describing my implementation of Kruskal’s algorithm  a greedy algorithm using to find a minimum spanning tree (MST) of a graph  and while it does the job it’s not particularly quick.
It takes 20 seconds to calculate the MST for a 500 node, ~2000 edge graph.
One way that we can improve the performance of the algorithm is by storing the MST in a union find/http://en.wikipedia.org/wiki/Disjointset_data_structure[disjoint set] data structure.
To refresh, Kruskal’s algorithm does the following:

sort edges E in order of increasing cost

let T = minimum spanning tree, m = number of edges

for i=1 to m:

let e = selected edge (u,v)

if there’s no way to go between u and v using edges in T:

add e to T



return T
In the first version of the algorithm we stored the MST in an array and then we did a depth first search to check that adding an edge to the array wasn’t going to create a cycle which would mean we no longer had an MST.
As I understand it the way that we use the union find data structure is that we start out with n connected components i.e. every node is in its own connected component.
We then merge these components together as we come across new edges which connect nodes in different components until we only have one component left at which point we should have our MST.

sort edges E in order of increasing cost

initialise unionfind uf with n connected components

let T = minimum spanning tree, m = number of edges

for i=1 to m:

let e = selected edge (u,v)

if u and v not in same connected component in uf:

merge u and v into same connected component in uf

add e to T



return T
I came across this bit of code written by Michael Luckeneder which implements the union find data structure in Ruby and adapted that slightly to fit the nomenclature used in the Algorithms 2 videos.
The union find data structure looks like this:
class UnionFind
def initialize(n)
@leaders = 1.upto(n).inject([]) { leaders, i leaders[i] = i; leaders }
end
def connected?(id1,id2)
@leaders[id1] == @leaders[id2]
end
def union(id1,id2)
leader_1, leader_2 = @leaders[id1], @leaders[id2]
@leaders.map! {i (i == leader_1) ? leader_2 : i }
end
end
We have two methods:

connected? which we use to check whether or not two nodes are in the same connected component.

union which we use to put two nodes into the same connected component.
The way it’s implemented is that we have a collection of 'leaders' and initially each node is its own leader. As we find edges which belong in the MST we call union passing the two nodes as arguments. After this method executes every node which has the same leader as the first node will now instead have the same leader as the second node.
For example if we have a simple graph with edges 1 > 2 > 3 > 4 > 5 our initial union find data structure would look like this:
> uf = UnionFind.new 5
=> #<UnionFind:0x45e5a9b3 @leaders=[nil, 1, 2, 3, 4, 5]>
At this stage each node is in its own connected component but if we process the edge 1 > 2 we’d first check if those two nodes were already in the same connected component:
> uf.connected?(1,2)
=> false
Given that they aren’t we can call union:
> uf.union(1,2)
=> [nil, 2, 2, 3, 4, 5]
As we can see from the output nodes 1 & 2 now both have the same leader since they are in the same connected component while the other nodes still remain on their own.
We could then process the 2 > 3 edge which would put nodes 1, 2 & 3 together:
> uf.union(2,3)
=> [nil, 3, 3, 3, 4, 5]
The outline for Kruskal’s algorithm which makes use of this data structure is like so:
set = UnionFind.new number_of_nodes
@minimum_spanning_tree = []
edges = file.drop(1).
map { x x.gsub(/\n/, "").split(" ").map(&:to_i) }.
map { one, two, weight { :from => one, :to => two, :weight => weight}}.
sort_by { x x[:weight]}
edges.each do edge
if !set.connected?(edge[:from], edge[:to])
@minimum_spanning_tree << edge
set.union(edge[:from], edge[:to])
end
end
puts "MST: #{@minimum_spanning_tree}"
puts "Cost: #{@minimum_spanning_tree.inject(0) { acc, x acc + x[:weight]}}"
This version of the algorithm runs in 1.9 seconds, a significant improvement on the initial version. The full code is on github as usual!
About the author
I'm currently working on realtime userfacing analytics with Apache Pinot at StarTree. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also coauthored the O'Reilly Graph Algorithms Book with Amy Hodler.