22 Apr 2015

R: Replacing for loops with data frames

In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?

To recap, this was my final solution:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names

  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }
      }
    }
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1 / sum(l1)
        4         6         8        12        20
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Although it works we have nested for loops which aren’t very idiomatic R so let’s try and get rid of them.

The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:

scores = rep(1.0 / length(names), length(names))
df = data.frame(score = scores, name = names)

Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:

library(dplyr)
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)

  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }

  return(df)
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20

Finally we’ll tidy up the scores so they’re relatively weighted against each other:

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)

  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }

  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>%
    mutate(likelihood = 1.0 / roll,
           score = ifelse(roll < observation, 0, likelihood))

  return(obsDf)
}

dice = c(4,6,8,12,20)
l1 = likelihoods3(dice, c(6))

> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000

Now we need to iterate over the data frame, grouping by 'roll' so that we end up with one row for each one.

We’ll add a new column which stores the posterior probability for each dice. This will be calculated by multiplying the prior probability by the product of the 'score' entries.

This is what our new likelihood function looks like:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>%
    mutate(likelihood = 1.0 / roll,
           score = ifelse(roll < observation, 0, likelihood))

  return (obsDf %>%
    group_by(roll) %>%
    summarise(s = (1.0/length(names)) * prod(score) ) %>%
    ungroup() %>%
    mutate(weighted = s / sum(s)) %>%
    select(roll, weighted))
}

l1 = likelihoods3(dice, c(6))
> l1
Source: local data frame [5 x 2]

  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
Source: local data frame [5 x 2]

  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302

We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.