R: dplyr – Error in (list: invalid subscript type ‘double’
In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).
e.g. imagine we have the following CDF represented in a data frame:
library(dplyr)
df = data.frame(score = c(5,7,8,10,12,20), 
                percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:
> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7
Things become more tricky if we want to return multiple percentiles in one go.
My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:
targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)

> percentilesDf %>% 
    group_by(targetPercentile) %>% 
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  : 
  invalid subscript type 'double'
Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.
Instead he suggested the following variant, which uses the do function:
df = data.frame(score = c(5,7,8,10,12,20), 
                percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
targetPercentiles = c(0.05, 0.2)

> data.frame(targetPercentile = targetPercentiles) %>% 
    group_by(targetPercentile) %>% 
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>% 
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.05     7
2             0.20    12
We can then wrap this up in a function:
percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
  
  data.frame(targetPercentile = targetPercentiles) %>% 
    group_by(targetPercentile) %>% 
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>% 
    select(targetPercentile, score)
}
which we call like this:
df = data.frame(score = c(5,7,8,10,12,20), 
                percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))

> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.08     7
2             0.10     8
Note that we don't actually get any rows back for 0.50 or 0.80 since our data frame doesn't have any entries with percentiles greater than those values. With a complete CDF, whose percentiles run all the way up to 1.0, we would, so the function does its job.
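To sanity check that claim, here's the same function run against a made-up CDF whose percentiles do reach 1.0 (the numbers are invented for illustration; the function is repeated so this snippet stands alone):

```r
library(dplyr)

percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
  
  data.frame(targetPercentile = targetPercentiles) %>% 
    group_by(targetPercentile) %>% 
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>% 
    select(targetPercentile, score)
}

# an invented CDF that goes all the way up to 1.0
fullDf = data.frame(score      = c(5, 7, 8, 10, 12, 20),
                    percentile = c(0.05, 0.1, 0.25, 0.5, 0.75, 1.0))

percentiles(fullDf, c(0.5, 0.8))
```

This time both target percentiles come back with a row: score 12 for the 0.5 percentile and 20 for 0.8.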
Deliberate Practice: Watching yourself fail
I’ve recently been reading the literature written by K. Anders Ericsson and colleagues on Deliberate Practice, and one of the suggestions for increasing our competence at a skill is to put ourselves in a situation where we can fail.
I’ve been reading Think Bayes – an introductory text on Bayesian statistics, something I know nothing about – and each chapter concludes with a set of exercises to practice, a potentially perfect exercise in failure!
I’ve been going through the exercises and capturing my screen while I do so, an idea I picked up from one of the papers:
our most important breakthrough was developing a relatively inexpensive and efficient way for students to record their exercises on video and to review and analyze their own performances against well-defined criteria
Ideally I’d get a coach to review the video but that seems too much of an ask of someone. Antonios has taken a look at some of my answers, however, and made suggestions for how he’d solve them which has been really helpful.
After each exercise I watch the video and look for areas where I get stuck or don’t make progress so that I can go and practice more in that area. I also try to find inefficiencies in how I solve a problem as well as the types of approaches I’m taking.
These are some of the observations from watching myself back over the last week or so:

* I was most successful when I had some idea of what I was going to try first. Most of the time the first code I wrote didn't end up being correct, but it moved me closer to the answer or ruled out an approach. It's much easier to see the error in an approach if there is an approach! On one occasion where I hadn't planned out an approach I ended up staring at the question for 10 minutes and didn't make any progress at all.
* I could either solve the problems within 20 minutes, or I wasn't going to solve them and needed to chunk down to a simpler problem and then try the original exercise again. e.g. one exercise was to calculate the 5th percentile of a posterior distribution, which I flailed around with for 15 minutes before giving up. Watching the video back it was obvious that I hadn't completely understood what a probability mass function was. I read the Wikipedia entry, retried the exercise and this time got the answer.
* Knowing that you're going to watch the video back stops you from getting distracted by email, Twitter, Facebook etc.
* It's a painful experience watching yourself struggle – you can see exactly which functions you don't know or things you need to look up on Google.
* I deliberately don't copy/paste any code while doing these exercises. I want to see how well I can do the exercises from scratch, so that would defeat the point.
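For what it's worth, the percentile exercise I struggled with boils down to a cumulative sum over the PMF. A minimal sketch of the idea (my own code with made-up names and data, not the book's):

```r
# minimal sketch: the 5th percentile of a posterior is the smallest value
# whose cumulative probability reaches 0.05
percentileOfPmf = function(values, probs, p) {
  ord = order(values)
  cdf = cumsum(probs[ord])
  values[ord][which(cdf >= p)[1]]
}

posterior = c(0.02, 0.04, 0.14, 0.5, 0.2, 0.1)  # an invented PMF over 1..6
percentileOfPmf(1:6, posterior, 0.05)
# [1] 2
```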
One of the other suggestions that Ericsson makes is to focus on 'technique' during practice sessions rather than only on outcome, but I haven't yet worked out exactly what that would involve in a programming context.
If you have any ideas or thoughts on this approach do let me know in the comments.
R: Think Bayes Locomotive Problem – Posterior probabilities for different priors
In my continued reading of Think Bayes the next problem to tackle is the Locomotive problem which is defined thus:
A railroad numbers its locomotives in order 1..N.
One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.
The interesting thing about this question is that it initially seems that we don’t have enough information to come up with any sort of answer. However, we can get an estimate if we come up with a prior to work with.
The simplest prior is to assume that there's one railroad operator with, say, between 1 and 1000 locomotives, each fleet size having an equal probability.
We can then write similar code as with the dice problem to update the prior based on the trains we’ve seen.
First we'll create a data frame which captures the product of 'number of locomotives' and the observations of locomotives that we've seen (in this case we've only seen one locomotive, with number '60'):
library(dplyr)

possibleValues = 1:1000
observations = c(60)

l = list(value = possibleValues, observation = observations)
df = expand.grid(l)

> df %>% head()
  value observation
1     1          60
2     2          60
3     3          60
4     4          60
5     5          60
6     6          60
Next we want to add a column which represents the probability that the observed locomotive could have come from a particular fleet. If the fleet size is less than 60 then we have a 0 probability, otherwise we have 1 / fleetSize:
prior = 1 / length(possibleValues)
df = df %>% mutate(score = ifelse(value < observation, 0, 1/value))

> df %>% sample_n(10)
     value observation       score
179    179          60 0.005586592
1001  1001          60 0.000999001
400    400          60 0.002500000
438    438          60 0.002283105
667    667          60 0.001499250
661    661          60 0.001512859
284    284          60 0.003521127
233    233          60 0.004291845
917    917          60 0.001090513
173    173          60 0.005780347
To find the probability of each fleet size we write the following code:
weightedDf = df %>% 
  group_by(value) %>% 
  summarise(aggScore = prior * prod(score)) %>%
  ungroup() %>%
  mutate(weighted = aggScore / sum(aggScore))

> weightedDf %>% sample_n(10)
Source: local data frame [10 x 3]

   value     aggScore     weighted
1    906 1.102650e-06 0.0003909489
2    262 3.812981e-06 0.0013519072
3    994 1.005031e-06 0.0003563377
4    669 1.493275e-06 0.0005294465
5    806 1.239455e-06 0.0004394537
6    673 1.484400e-06 0.0005262997
7    416 2.401445e-06 0.0008514416
8    624 1.600963e-06 0.0005676277
9     40 0.000000e+00 0.0000000000
10   248 4.028230e-06 0.0014282246
Let’s plot the data frame to see how the probability varies for each fleet size:
library(ggplot2)
ggplot(aes(x = value, y = weighted), data = weightedDf) + 
  geom_line(color = "dark blue")
The most likely choice is a fleet size of 60 based on this diagram but an alternative would be to find the mean of the posterior which we can do like so:
> weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()
[1] 333.6561
Now let’s create a function with all that code in so we can play around with some different priors and observations:
meanOfPosterior = function(values, observations) {
  l = list(value = values, observation = observations)
  df = expand.grid(l) %>% mutate(score = ifelse(value < observation, 0, 1/value))
  prior = 1 / length(values)
  
  weightedDf = df %>% 
    group_by(value) %>% 
    summarise(aggScore = prior * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore))
  
  return(weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum())
}
If we update our observed railroads to have numbers 60, 30 and 90 we’d get the following means of posteriors assuming different priors:
> meanOfPosterior(1:500, c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, c(60, 30, 90))
[1] 171.3382
At the moment the function assumes that we always want to have a uniform prior i.e. every option has an equal opportunity of being chosen, but we might want to vary the prior to see how different assumptions influence the posterior.
We can refactor the function to take in values & priors instead of calculating the priors in the function:
meanOfPosterior = function(values, priors, observations) {
  priorDf = data.frame(value = values, prior = priors)
  l = list(value = priorDf$value, observation = observations)
  
  df = merge(expand.grid(l), priorDf, by.x = "value", by.y = "value") %>% 
    mutate(score = ifelse(value < observation, 0, 1 / value))
  
  df %>% 
    group_by(value) %>% 
    summarise(aggScore = max(prior) * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore)) %>%
    mutate(mean = value * weighted) %>%
    select(mean) %>%
    sum()
}
Now let’s check we get the same posterior means for the uniform priors:
> meanOfPosterior(1:500, 1/length(1:500), c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, 1/length(1:1000), c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, 1/length(1:2000), c(60, 30, 90))
[1] 171.3382
Now, instead of a uniform prior, let's use a power law prior (i.e. 1/x), where the assumption is that smaller fleets are more likely:
> meanOfPosterior(1:500, sapply(1:500, function(x) x ** -1), c(60, 30, 90))
[1] 130.7085
> meanOfPosterior(1:1000, sapply(1:1000, function(x) x ** -1), c(60, 30, 90))
[1] 133.2752
> meanOfPosterior(1:2000, sapply(1:2000, function(x) x ** -1), c(60, 30, 90))
[1] 133.9975
> meanOfPosterior(1:5000, sapply(1:5000, function(x) x ** -1), c(60, 30, 90))
[1] 134.212
> meanOfPosterior(1:10000, sapply(1:10000, function(x) x ** -1), c(60, 30, 90))
[1] 134.2435
Now we get very similar posterior means which converge on 134, so that's our best prediction.
R: Replacing for loops with data frames
In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:
Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.
Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?
To recap, this was my final solution:
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names
  
  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      } else {
        scores[paste(name)] = scores[paste(name)] * (1.0 / name)
      }
    }
  }
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471
Although it works we have nested for loops which aren’t very idiomatic R so let’s try and get rid of them.
The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:
scores = rep(1.0 / length(names), length(names))
df = data.frame(score = scores, name = names)
Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:
library(dplyr)

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)
  
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)))
  }
  return(df)
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20
Finally we’ll tidy up the scores so they’re relatively weighted against each other:
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)
  
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)))
  }
  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471
Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:
likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid, l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))
  return(obsDf)
}

dice = c(4,6,8,12,20)

l1 = likelihoods3(dice, c(6))
> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000
Now we need to iterate over the data frame, grouping by ‘roll’ so that we end up with one row for each one.
We’ll add a new column which stores the posterior probability for each dice. This will be calculated by multiplying the prior probability by the product of the ‘score’ entries.
This is what our new likelihood function looks like:
likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid, l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))
  
  return(obsDf %>% 
           group_by(roll) %>% 
           summarise(s = (1.0 / length(names)) * prod(score)) %>%
           ungroup() %>% 
           mutate(weighted = s / sum(s)) %>% 
           select(roll, weighted))
}

l1 = likelihoods3(dice, c(6))
> l1
Source: local data frame [5 x 2]

  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
Source: local data frame [5 x 2]

  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302
We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.
R: Numeric keys in the nested list/dictionary
Last week I described how I’ve been creating fake dictionaries in R using lists and I found myself using the same structure while solving the dice problem in Think Bayes.
The dice problem is described as follows:
Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.
Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?
Here’s a simple example of the nested list that I started with:
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice

> priors
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2
I wanted to retrieve the prior for the 8-sided dice, which I tried to do like this:
> priors[8]
<NA> 
  NA
That comes back with 'NA' because we're actually looking up the numeric index 8 (i.e. the 8th element, which doesn't exist) rather than the entry named '8'.
As far as I understand if we want to look up values by name we have to use a string so I tweaked the code to store names as strings:
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = sapply(dice, paste)

> priors["8"]
  8 
0.2
That works much better but with some experimentation I realised I didn’t even need to run ‘dice’ through the sapply function, it already works the way it was:
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice

> priors["8"]
  8 
0.2
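The reason this works, as far as I can tell, is that names are always stored as character strings, so assigning a numeric vector to names() coerces the numbers to strings and the lookup by "8" then matches:

```r
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice

# the names were coerced to character when they were assigned
class(names(priors))
# [1] "character"
```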
Now that we’ve got that working we can write a likelihood function which takes in observed dice rolls and tells us how likely it was that we rolled each type of dice. We start simple by copying the above code into a function:
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2
Next we’ll update the score for a particular dice to 0 if one of the observed rolls is greater than the dice’s maximum score:
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names
  
  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      }
    }
  }
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1
  4   6   8  12  20 
0.0 0.2 0.2 0.2 0.2
The 4-sided dice has been ruled out since we've rolled a 6! Now let's put in the else condition, which updates our score by the probability of getting the observed roll with each valid dice, i.e. we have a 1/20 chance of rolling any number with the 20-sided dice, a 1/8 chance with the 8-sided dice, etc.
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names
  
  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      } else {
        scores[paste(name)] = scores[paste(name)] * (1.0 / name)
      }
    }
  }
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1
         4          6          8         12         20 
0.00000000 0.03333333 0.02500000 0.01666667 0.01000000
And finally let’s normalise those scores so they’re a bit more readable:
> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471
R: non-numeric argument to binary operator
When debugging R code, given my Java background, I often find myself trying to print out the state of variables along with an appropriate piece of text like this:
names = c(1,2,3,4,5,6)

> print("names: " + names)
Error in "names: " + names : non-numeric argument to binary operator
We might try this next:
> print("names: ", names) [1] "names: " 
which doesn’t actually print the names variable – only the first argument to the print function is printed.
We’ll find more success with the paste function:
> print(paste("names: ", names))
[1] "names: 1" "names: 2" "names: 3" "names: 4" "names: 5" "names: 6"
This is an improvement but it repeats the ‘names:’ prefix multiple times which isn’t what we want. Introducing the toString function gets us over the line:
> print(paste("names: ", toString(names)))
[1] "names: 1, 2, 3, 4, 5, 6"
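If we're doing this a lot, sprintf with toString is an equivalent one-liner, just a different formatting style:

```r
names = c(1,2,3,4,5,6)

# sprintf substitutes the flattened vector into the format string
print(sprintf("names: %s", toString(names)))
# [1] "names: 1, 2, 3, 4, 5, 6"
```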
R: Removing for loops
In my last blog post I showed the translation of a likelihood function from Think Bayes into R and in my first attempt at this function I used a couple of nested for loops.
likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names
  
  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
Names = c("Bowl 1", "Bowl 2") bowl1Mix = c(0.75, 0.25) names(bowl1Mix) = c("vanilla", "chocolate") bowl2Mix = c(0.5, 0.5) names(bowl2Mix) = c("vanilla", "chocolate") Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix) Mixes Observations = c("vanilla", "vanilla", "vanilla", "chocolate") l = likelihoods(Names, Mixes, Observations) > l / sum(l) Bowl 1 Bowl 2 0.627907 0.372093 
We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl and the observations that we’ve made. The function tells us that there’s an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 of being Bowl 2.
In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.
Let's start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:
likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
  
  for(name in names) {
    scores[name] = Reduce(function(acc, observation) acc * mixes[[name]][observation], 
                          observations, 1)
  }
  return(scores)
}
l2 = likelihoods2(Names, Mixes, Observations)

> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093
So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name) 
                 Reduce(function(acc, observation) acc * mixes[[name]][observation], 
                        observations, 1), 
               names)
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)
> l3
$`Bowl 1`
  vanilla 
0.1054688 

$`Bowl 2`
vanilla 
 0.0625
We end up with a list instead of a vector which we need to fix by using the unlist function:
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name) 
                 Reduce(function(acc, observation) acc * mixes[[name]][observation], 
                        observations, 1), 
               names)
  return(unlist(scores))
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093
Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name) 
                 Reduce(function(acc, observation) acc * mixes[[name]][observation], 
                        observations, 1), 
               names)
  result = unlist(scores)
  names(result) = names
  return(result)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093
A slightly cleaner alternative makes use of the sapply function:
likelihoods3 = function(names, mixes, observations) {
  scores = sapply(names, function(name) 
                    Reduce(function(acc, observation) acc * mixes[[name]][observation], 
                           observations, 1))
  names(scores) = names
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093
That's the best I've got for now, but I wonder if we could write a version of this using matrix operations somehow – but that's for next time!
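Here's one attempt at that matrix idea – a sketch of my own (likelihoods4 is a made-up name), building a bowls-by-observations matrix of per-observation likelihoods and then reducing each row with a product. It assumes at least two observations, since sapply collapses to a plain vector when there's only one:

```r
likelihoods4 = function(names, mixes, observations) {
  # one row per bowl, one column per observation
  m = sapply(observations, function(observation) 
        sapply(names, function(name) mixes[[name]][observation]))
  
  # the likelihood of each bowl is the product across its row
  scores = apply(m, 1, prod)
  names(scores) = names
  return(scores)
}

bowl1Mix = c(vanilla = 0.75, chocolate = 0.25)
bowl2Mix = c(vanilla = 0.5,  chocolate = 0.5)
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

l4 = likelihoods4(c("Bowl 1", "Bowl 2"), Mixes, 
                  c("vanilla", "vanilla", "vanilla", "chocolate"))
l4 / sum(l4)
```

This gives the same 0.627907 / 0.372093 split as the loop-based versions.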
R: Think Bayes – More posterior probability calculations
As I mentioned in a post last week I've been reading through Think Bayes and translating some of the examples from Python to R.
After my first post Antonios suggested a more idiomatic way of writing the function in R so I thought I’d give it a try to calculate the probability that combinations of cookies had come from each bowl.
In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:
f = function(names, likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
  
  # create a data frame with all info you have
  dt = data.frame(names, priors, likelihoods)
  
  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)
  
  # specify what you want the function to return
  list(names = dt$names, priors = dt$priors, likelihoods = dt$likelihoods, posteriors = dt$post)
}
We assume a prior probability of 0.5 for each bowl.
Given the following probabilities of different cookies being in each bowl (shown here in the book's Python notation)…
mixes = {
  'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
}
…we can simulate taking one vanilla cookie with the following parameters:
Likelihoods = c(0.75, 0.5)
Names = c("Bowl 1", "Bowl 2")

res = f(Names, Likelihoods)

> res$posteriors[res$name == "Bowl 1"]
[1] 0.6
> res$posteriors[res$name == "Bowl 2"]
[1] 0.4
If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:
Likelihoods = c((0.75 ** 3) * (0.25 ** 1), (0.5 ** 3) * (0.5 ** 1))
Names = c("Bowl 1", "Bowl 2")

res = f(Names, Likelihoods)

> res$posteriors[res$name == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$name == "Bowl 2"]
[1] 0.372093
That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.
First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

> Mixes
$`Bowl 1`
  vanilla chocolate 
     0.75      0.25 

$`Bowl 2`
  vanilla chocolate 
      0.5       0.5
Now let’s tweak our function to take in observations rather than likelihoods and then calculate those likelihoods internally:
likelihoods = function(names, mixes, observations) {
  scores = c(1, 1)
  names(scores) = names
  
  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}

f = function(names, mixes, observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
  
  # create a data frame with all info you have
  dt = data.frame(names, priors)
  dt$likelihoods = likelihoods(names, mixes, observations)
  
  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)
  
  # specify what you want the function to return
  list(names = dt$names, priors = dt$priors, likelihoods = dt$likelihoods, posteriors = dt$post)
}
And if we call that function:
Names = c("Bowl 1", "Bowl 2") bowl1Mix = c(0.75, 0.25) names(bowl1Mix) = c("vanilla", "chocolate") bowl2Mix = c(0.5, 0.5) names(bowl2Mix) = c("vanilla", "chocolate") Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix) Mixes Observations = c("vanilla", "vanilla", "vanilla", "chocolate") res=f(Names,Mixes,Observations) > res$posteriors[res$names == "Bowl 1"] [1] 0.627907 > res$posteriors[res$names == "Bowl 2"] [1] 0.372093 
Exactly the same result as before! #win
Spark: Generating CSV files to import into Neo4j
About a year ago Ian pointed me at a Chicago Crime data set which seemed like a good fit for Neo4j and after much procrastination I’ve finally got around to importing it.
The data set covers crimes committed from 2001 until now. It contains around 4 million crimes and metadata around those crimes such as the location, type of crime and year, to name a few.
The contents of the file follow this structure:
$ head -n 10 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,"(41.75017626412204, -87.55494559131228)"
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,"(41.729576153145636, -87.57568059471686)"
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,"(41.884543798701515, -87.72803579358926)"
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,"(41.785633535413176, -87.74148516669783)"
9460355,HX113738,01/14/2014 04:21:00 AM,070XX S PEORIA ST,0820,THEFT,$500 AND UNDER,STREET,true,false,0733,007,17,68,06,1171480,1858195,2014,01/16/2014 12:40:00 AM,41.766348042591375,-87.64702037047671,"(41.766348042591375, -87.64702037047671)"
9461140,HX113909,01/14/2014 03:17:00 AM,016XX W HUBBARD ST,0610,BURGLARY,FORCIBLE ENTRY,COMMERCIAL / BUSINESS OFFICE,false,false,1215,012,27,24,05,1165029,1903111,2014,01/16/2014 12:40:00 AM,41.889741146006095,-87.66939334853973,"(41.889741146006095, -87.66939334853973)"
9460361,HX113731,01/14/2014 03:12:00 AM,022XX S WENTWORTH AVE,0820,THEFT,$500 AND UNDER,CTA TRAIN,false,false,0914,009,25,34,06,1175363,1889525,2014,01/20/2014 12:40:05 AM,41.85223460427207,-87.63185047834335,"(41.85223460427207, -87.63185047834335)"
9461691,HX114506,01/14/2014 03:00:00 AM,087XX S COLFAX AVE,0650,BURGLARY,HOME INVASION,RESIDENCE,false,false,0423,004,7,46,05,1195052,1847362,2014,01/17/2014 12:40:17 AM,41.73607283858007,-87.56097809501115,"(41.73607283858007, -87.56097809501115)"
9461792,HX114824,01/14/2014 03:00:00 AM,012XX S CALIFORNIA BLVD,0810,THEFT,OVER $500,STREET,false,false,1023,010,28,29,06,1157929,1894034,2014,01/17/2014 12:40:17 AM,41.86498077118534,-87.69571529596696,"(41.86498077118534, -87.69571529596696)"
Since I wanted to import this into Neo4j I needed to do some massaging of the data, since the neo4j-import tool expects to receive CSV files containing the nodes and relationships we want to create.
I’d been looking at Spark towards the end of last year and the preprocessing of the big initial file into smaller CSV files containing nodes and relationships seemed like a good fit.
I therefore needed to create a Spark job to do this. We’ll then pass this job to a Spark executor running locally and it will spit out CSV files.
We start by creating a Scala object with a main method that will contain our processing code. Inside that main method we’ll instantiate a Spark context:
import org.apache.spark.{SparkConf, SparkContext}

object GenerateCSVFiles {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Chicago Crime Dataset")
    val sc = new SparkContext(conf)
  }
}
Easy enough. Next we’ll read in the CSV file. I found the easiest way to reference this was with an environment variable but perhaps there’s a more idiomatic way:
import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object GenerateCSVFiles {
  def main(args: Array[String]) {
    val crimeFile = System.getenv("CSV_FILE")
    if (crimeFile == null || !new File(crimeFile).exists()) {
      throw new RuntimeException("Cannot find CSV file [" + crimeFile + "]")
    }
    println("Using %s".format(crimeFile))

    val conf = new SparkConf().setAppName("Chicago Crime Dataset")
    val sc = new SparkContext(conf)
    val crimeData = sc.textFile(crimeFile).cache()
  }
}
The type of crimeData is RDD[String] – Spark’s way of representing the (lazily evaluated) lines of the CSV file. This also includes the header of the file so let’s write a function to get rid of that since we’ll be generating our own headers for the different files:
import org.apache.spark.rdd.RDD

// http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAEYYnxYuEaie518ODdnfR7VvD39d71=CgB_Dxw_4COVXgmYYQ@mail.gmail.com%3E
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    if (idx == 0) {
      lines.drop(1)
    }
    lines
  })
}
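To make the per-partition behaviour concrete, here's a minimal plain-Python sketch of the same idea. The `drop_header` name and the list-of-lists model of partitions are mine for illustration, not Spark's API:

```python
def drop_header(partitions):
    """Skip the first line of the first partition only, mirroring the
    mapPartitionsWithIndex logic above: the file header lives in
    partition 0, so every other partition passes through untouched."""
    for idx, lines in enumerate(partitions):
        it = iter(lines)
        if idx == 0:
            next(it, None)  # drop the header line (no-op if partition is empty)
        yield from it
```

For example, `list(drop_header([["header", "row1"], ["row2"]]))` keeps only the data rows.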
Now we’re ready to start generating our new CSV files so we’ll write a function which parses each line and extracts the appropriate columns. I’m using Open CSV for this:
import java.io.File
import au.com.bytecode.opencsv.CSVParser
import org.apache.hadoop.fs.FileUtil
import org.apache.spark.rdd.RDD

def generateFile(file: String, withoutHeader: RDD[String], fn: Array[String] => Array[String], header: String,
                 distinct: Boolean = true, separator: String = ",") = {
  FileUtil.fullyDelete(new File(file))
  val tmpFile = "/tmp/" + System.currentTimeMillis() + "-" + file

  val rows: RDD[String] = withoutHeader.mapPartitions(lines => {
    val parser = new CSVParser(',')
    lines.map(line => {
      val columns = parser.parseLine(line)
      fn(columns).mkString(separator)
    })
  })

  if (distinct) rows.distinct().saveAsTextFile(tmpFile) else rows.saveAsTextFile(tmpFile)
}
We then call this function like this:
generateFile("/tmp/crimes.csv", withoutHeader,
  columns => Array(columns(0), "Crime", columns(2), columns(6)),
  "id:ID(Crime),:LABEL,date,description", false)
The output into ‘tmpFile’ is actually 32 ‘part files’ but I wanted to be able to merge those together into individual CSV files that were easier to work with.
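To see what that merging amounts to, here is a minimal Python sketch that concatenates the part files in order. The helper name `merge_part_files` and the paths are mine for illustration; the actual job on github does its own merging:

```python
import glob

def merge_part_files(parts_dir, output_file):
    """Concatenate Spark's part-00000, part-00001, ... files,
    in order, into a single CSV file."""
    with open(output_file, "w") as out:
        for part in sorted(glob.glob(parts_dir + "/part-*")):
            with open(part) as f:
                out.write(f.read())
```

Sorting the glob results keeps the rows in the same order Spark wrote the partitions.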
I won’t paste the full job here but if you want to take a look it’s on github.
Now we need to submit the job to Spark. I’ve wrapped this in a script if you want to follow along but these are the contents:
./spark-1.1.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar \
  $@
If we execute that we’ll see the following output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Crimes_-_2001_to_present.csv
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/04/15 00:31:44 INFO SparkContext: Running Spark version 1.3.0
...
15/04/15 00:47:26 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/04/15 00:47:26 INFO DAGScheduler: Stage 8 (saveAsTextFile at GenerateCSVFiles.scala:51) finished in 2.702 s
15/04/15 00:47:26 INFO DAGScheduler: Job 4 finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.715588 s
real 0m44.935s
user 4m2.259s
sys 0m14.159s 
and these CSV files will be generated:
$ ls -alh /tmp/*.csv
-rwxrwxrwx  1 markneedham  wheel  3.0K 14 Apr 07:37 /tmp/beats.csv
-rwxrwxrwx  1 markneedham  wheel  217M 14 Apr 07:37 /tmp/crimes.csv
-rwxrwxrwx  1 markneedham  wheel   84M 14 Apr 07:37 /tmp/crimesBeats.csv
-rwxrwxrwx  1 markneedham  wheel  120M 14 Apr 07:37 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx  1 markneedham  wheel  912B 14 Apr 07:37 /tmp/primaryTypes.csv
Let’s have a quick check what they contain:
$ head -n 10 /tmp/beats.csv
id:ID(Beat),:LABEL
1135,Beat
1421,Beat
2312,Beat
1113,Beat
1014,Beat
2411,Beat
1333,Beat
2521,Beat
1652,Beat
$ head -n 10 /tmp/crimes.csv
id:ID(Crime),:LABEL,date,description
9464711,Crime,01/14/2014 05:00:00 AM,SIMPLE
9460704,Crime,01/14/2014 04:55:00 AM,ARMED: HANDGUN
9460339,Crime,01/14/2014 04:44:00 AM,TO PROPERTY
9461467,Crime,01/14/2014 04:43:00 AM,$500 AND UNDER
9460355,Crime,01/14/2014 04:21:00 AM,$500 AND UNDER
9461140,Crime,01/14/2014 03:17:00 AM,FORCIBLE ENTRY
9460361,Crime,01/14/2014 03:12:00 AM,$500 AND UNDER
9461691,Crime,01/14/2014 03:00:00 AM,HOME INVASION
9461792,Crime,01/14/2014 03:00:00 AM,OVER $500
$ head -n 10 /tmp/crimesBeats.csv
:START_ID(Crime),:END_ID(Beat),:TYPE
5896915,0733,ON_BEAT
9208776,2232,ON_BEAT
8237555,0111,ON_BEAT
6464775,0322,ON_BEAT
6468868,0411,ON_BEAT
4189649,0524,ON_BEAT
7620897,0421,ON_BEAT
7720402,0321,ON_BEAT
5053025,1115,ON_BEAT
Looking good. Let’s get them imported into Neo4j:
$ ./neo4j-community-2.2.0/bin/neo4j-import --into /tmp/myneo --nodes /tmp/crimes.csv --nodes /tmp/beats.csv --nodes /tmp/primaryTypes.csv --relationships /tmp/crimesBeats.csv --relationships /tmp/crimesPrimaryTypes.csv
Nodes [*>:45.76 MB/s---PROPERTIES(2)=============NODE:3--v:118.05 MB/]  4M Done in 5s 605ms
Prepare node index [*RESOLVE:64.85 MB-----------------------------------------]  4M Done in 4s 930ms
Calculate dense nodes [>:42.33 MB/s---*PREPARE(7)===================================CALCULATOR]  8M Done in 5s 417ms
Relationships [>:42.33 MB/s---*PREPARE(7)==========================RELATIONSHIP--v:44.]  8M Done in 6s 62ms
Node --> Relationship [*>:??-----------------------------------------------------------]  4M Done in 324ms
Relationship --> Relationship [*LINK--------------------------------------------------------]  8M Done in 1s 984ms
Node counts [*>:??-----------------------------------------------------------]  4M Done in 360ms
Relationship counts [*>:??-----------------------------------------------------------]  8M Done in 653ms

IMPORT DONE in 26s 517ms
Next I updated conf/neo4j-server.properties to point to my new database:
#***************************************************************
# Server configuration
#***************************************************************

# location of the database directory
#org.neo4j.server.database.location=data/graph.db
org.neo4j.server.database.location=/tmp/myneo
Now I can start up Neo and start exploring the data:
$ ./neo4j-community-2.2.0/bin/neo4j start
MATCH (:Crime)-[r:CRIME_TYPE]->() RETURN r LIMIT 10
There are lots more relationships and entities that we could pull out of this data set – what I’ve done is just a start. So if you’re up for some more Chicago crime exploration, the code and instructions explaining how to run it are on github.
R: Creating an object with functions to calculate conditional probability
I’ve been working through Alan Downey’s Thinking Bayes and I thought it’d be an interesting exercise to translate some of the code from Python to R.
The first example is a simple one about conditional probability and the author creates a class ‘Pmf’ (Probability Mass Function) to solve the following problem:
Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.
What is the probability that it came from Bowl 1?
In Python the code looks like this:
pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)
pmf.Normalize()
print pmf.Prob('Bowl 1')
The ‘PMF’ class is defined here.
* ‘Set’ defines the prior probability of picking a cookie from either bowl i.e. in our case it’s random.
* ‘Mult’ defines the likelihood of picking a vanilla biscuit from either bowl.
* ‘Normalize’ applies a normalisation so that our posterior probabilities add up to 1.
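Those three steps are just Bayes’ theorem applied to the cookie problem. As a sketch of the arithmetic (the `posterior` helper is my own illustration, not the book’s Pmf class):

```python
def posterior(priors, likelihoods, option):
    """P(H|D) = P(H) * P(D|H) / sum over all hypotheses of P(H') * P(D|H')."""
    norm = sum(priors[h] * likelihoods[h] for h in priors)
    return priors[option] * likelihoods[option] / norm
```

With priors of 0.5 for each bowl and vanilla likelihoods of 0.75 and 0.5, `posterior(..., "Bowl 1")` works out to 0.375 / 0.625 = 0.6.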
We want to create something similar in R and the actual calculation is straightforward:
pBowl1 = 0.5
pBowl2 = 0.5
pVanillaGivenBowl1 = 0.75
pVanillaGivenBowl2 = 0.5

> (pBowl1 * pVanillaGivenBowl1) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.6
> (pBowl2 * pVanillaGivenBowl2) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.4
The problem is we have quite a bit of duplication and it doesn’t read as cleanly as the Python version.
I’m not sure of the idiomatic way of handling this type of problem with mutable state in R, but it seems like we can achieve it using functions.
I ended up writing the following function which returns a list of other functions to call.
create.pmf = function() {
  priors <<- c()
  likelihoods <<- c()
  list(
    prior = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      priors <<- c(priors, l)
    },
    likelihood = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      likelihoods <<- c(likelihoods, l)
    },
    posterior = function(option) {
      names = names(priors)
      normalised = 0.0
      for (name in names) {
        normalised = normalised + (priors[name] * likelihoods[name])
      }
      (priors[option] * likelihoods[option]) / normalised
    }
  )
}
I couldn’t work out how to get ‘priors’ and ‘likelihoods’ to be lexically scoped so I’ve currently got those defined as global variables. I’m using a list as a kind of dictionary following a suggestion on Stack Overflow.
The code doesn’t handle the unhappy path very well but it seems to work for the example from the book:
pmf = create.pmf()

pmf$prior("Bowl 1", 0.5)
pmf$prior("Bowl 2", 0.5)
pmf$likelihood("Bowl 1", 0.75)
pmf$likelihood("Bowl 2", 0.5)

> pmf$posterior("Bowl 1")
Bowl 1
   0.6
> pmf$posterior("Bowl 2")
Bowl 2
   0.4
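For comparison, the same “list of functions closing over shared state” idea can be sketched in Python, where the state is lexically scoped inside the constructor rather than global. The `create_pmf` name and dict-of-functions shape are my own sketch, not a direct translation of the book’s code:

```python
def create_pmf():
    # priors and likelihoods live in this closure, not in global scope
    priors, likelihoods = {}, {}

    def prior(option, p):
        priors[option] = p

    def likelihood(option, p):
        likelihoods[option] = p

    def posterior(option):
        # normalise over every hypothesis we've been told about
        norm = sum(priors[h] * likelihoods[h] for h in priors)
        return priors[option] * likelihoods[option] / norm

    return {"prior": prior, "likelihood": likelihood, "posterior": posterior}
```

Each call to `create_pmf()` gets its own independent `priors`/`likelihoods`, which is the behaviour the global-variable version of the R code can’t provide.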
How would you solve this type of problem? Is there a cleaner/better way?