## Archive for the ‘R’ Category

## R: ggplot – Displaying multiple charts with a for loop

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

```r
library(RNeo4j)
library(dplyr)

graph = startGraph("http://127.0.0.1:7474/db/data/")

eventId = "220750415"
query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
               (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
         WITH rsvp, person
         MATCH (person)-[:RSVPD]->(otherRSVP)
         WITH person, rsvp, otherRSVP
         ORDER BY person.id, otherRSVP.time
         WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
         return rsvp.time, earliestRSVP.time, person.id"

df = cypher(graph, query, id = eventId)

> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773
```

We’re not doing anything too clever here, just using a couple of WITH clauses to order the RSVPs so we can pick out the earliest one for each person.

Once we’ve done that we’ll tidy up the data frame by adding columns that capture the cohort in which each member attended their first event:

```r
timestampToDate <- function(x) as.POSIXct(x / 1000, origin = "1970-01-01", tz = "GMT")

df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03
```

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

```r
library(ggplot2)

byCohort = df %>% count(earliestDate)

ggplot(aes(x = earliestDate, y = n), data = byCohort) +
  geom_bar(stat = "identity", fill = "dark blue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1))
```

This is good and gives us the insight that most of the members attending this version of intro to graphs had only recently joined the group. The event was on 7th April and most people joined in March, which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

```r
plotEvent = function(eventId) {
  query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                 (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
           WITH rsvp, person
           MATCH (person)-[:RSVPD]->(otherRSVP)
           WITH person, rsvp, otherRSVP
           ORDER BY person.id, otherRSVP.time
           WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
           return rsvp.time, earliestRSVP.time, person.id"

  df = cypher(graph, query, id = eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

  byCohort = df %>% count(earliestDate)

  ggplot(aes(x = earliestDate, y = n), data = byCohort) +
    geom_bar(stat = "identity", fill = "dark blue") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1))
}
```

We’d call it like this for the Intro to graphs meetup:

```r
> plotEvent("220750415")
```

Next I tweaked the code to look up all Intro to Graphs events and then loop through and output a chart for each event:

```r
events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")

# the column comes back named "e.id", matching the RETURN clause
for(event in events$e.id) {
  plotEvent(as.character(event))
}
```

Unfortunately that doesn’t print anything at all because ggplot objects are only drawn when printed, and auto-printing doesn’t happen inside a for loop. We can fix this by storing our plots in a list and then displaying them afterwards:

```r
library(gridExtra)

p = list()
for(i in 1:nrow(events)) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}

do.call(grid.arrange, p)
```
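The underlying rule is worth knowing: R only auto-prints values at the top level of the console, never inside a for loop or function body, which is why the plots silently disappeared. A minimal base R illustration:

```r
# Values are auto-printed at the top level only; inside a loop they are not.
for (x in 1:3) { x }          # prints nothing
for (x in 1:3) { print(x) }   # prints each value on its own line
```

Calling `print()` on each ggplot object inside the loop would therefore also have worked, but collecting them for `grid.arrange` gives us a single combined figure.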

This visualisation is probably better without any of the axes so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart, which requires a slight tweak to the query:

```r
plotEvent = function(eventId) {
  query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                 (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
           WITH e, rsvp, person
           MATCH (person)-[:RSVPD]->(otherRSVP)
           WITH e, person, rsvp, otherRSVP
           ORDER BY person.id, otherRSVP.time
           WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
           return rsvp.time, earliestRSVP.time, person.id, e.time"

  df = cypher(graph, query, id = eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

  byCohort = df %>% count(earliestDate)

  ggplot(aes(x = earliestDate, y = n), data = byCohort) +
    geom_bar(stat = "identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    labs(title = df$eventTime[1])
}
```

I think this makes it a bit easier to read, although I’ve made the mistake of not putting all the charts on the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

## R: Cohort heatmap of Neo4j London meetup

A few months ago I had a go at doing some cohort analysis of the Neo4j London meetup group which was an interesting experiment but unfortunately resulted in a chart that was completely illegible.

I wasn’t sure how to progress from there but a few days ago I came across the cohort heatmap which seemed like a better way of visualising things over time.

The underlying idea is still the same – we’re comparing different cohorts of users against each other to see whether a change or intervention we made at a certain time had any impact.

However, the way we display the cohorts changes and I think for the better.

To recap, we start with the following data frame:

```r
df = read.csv("/tmp/df.csv")

> df %>% sample_n(5)
        rsvp.time person.id                time    date
255  1.354277e+12  12228948 2012-11-30 12:05:08 2012-11
2475 1.407342e+12  19057581 2014-08-06 16:26:04 2014-08
3988 1.421769e+12  66122172 2015-01-20 15:58:02 2015-01
4411 1.419377e+12 165750262 2014-12-23 23:27:44 2014-12
1010 1.383057e+12  74602292 2013-10-29 14:24:32 2013-10
```

And we need to transform this into a data frame which is grouped by cohort (members who attended their first meetup in a particular month). The following code gets us there:

```r
library(dplyr)
library(zoo)   # for as.yearmon

firstMeetup = df %>%
  group_by(person.id) %>%
  summarise(firstEvent = min(time), count = n()) %>%
  arrange(desc(count))
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id

  attendance = df %>%
    filter(person.id %in% members) %>%
    count(person.id, date) %>%
    ungroup() %>%
    count(date)

  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all.x = TRUE)
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members), members = n)
}

cohorts = collect(df %>% select(date) %>% unique())[,1]

cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance, countsForCohort(df, firstMeetup, cohort))
}

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep = "")
  }
}

cohortAttendanceWithMonthNumber = cohortAttendance %>%
  group_by(row_number()) %>%
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(!is.na(members)) %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>%
  arrange(monthNumber)

> cohortAttendanceWithMonthNumber %>% head(10)
Source: local data frame [10 x 7]
Groups: row_number()

      date n  cohort retention members row_number() monthNumber
1  2011-06 4 2011-06      1.00       4            1           0
2  2011-07 1 2011-06      0.25       1            2           1
3  2011-08 1 2011-06      0.25       1            3           2
4  2011-09 2 2011-06      0.50       2            4           3
5  2011-10 1 2011-06      0.25       1            5           4
6  2011-11 1 2011-06      0.25       1            6           5
7  2012-01 1 2011-06      0.25       1            7           7
8  2012-04 2 2011-06      0.50       2            8          10
9  2012-05 1 2011-06      0.25       1            9          11
10 2012-06 1 2011-06      0.25       1           10          12
```

Now let’s create our first heatmap.

```r
t <- max(cohortAttendanceWithMonthNumber$members)
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y = cohort, x = date, fill = members)) +
  theme_minimal() +
  geom_tile(colour = "white", linewidth = 2, width = .9, height = .9) +
  scale_fill_gradientn(colours = cols,
                       limits = c(0, t),
                       breaks = seq(0, t, by = t/4),
                       labels = c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.text.x = element_text(size = 8, angle = 90, hjust = .5, vjust = .5, face = "plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")
```

‘t’ is the maximum number of members within a cohort who attended a meetup in a given month. This makes it easy to see which cohorts started with the most members but makes it difficult to compare their retention over time.

We can fix that by showing the percentage of members in the cohort who attend each month rather than using absolute values. To do that we must first add an extra column containing the percentage values:

```r
cohortAttendanceWithMonthNumber$retentionPercentage =
  ifelse(!is.na(cohortAttendanceWithMonthNumber$retention),
         cohortAttendanceWithMonthNumber$retention * 100, 0)

t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y = cohort, x = date, fill = retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour = "white", linewidth = 2, width = .9, height = .9) +
  scale_fill_gradientn(colours = cols,
                       limits = c(0, t),
                       breaks = seq(0, t, by = t/4),
                       labels = c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.text.x = element_text(size = 8, angle = 90, hjust = .5, vjust = .5, face = "plain")) +
  ggtitle("Cohort Activity Heatmap (% of cohort members who attended event)")
```

This version allows us to compare cohorts against each other, but now we don’t have the exact numbers, which means earlier cohorts will look better since there are fewer people in them. We can get the best of both worlds by keeping this visualisation but showing the actual values inside each box:

```r
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y = cohort, x = date, fill = retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour = "white", linewidth = 2, width = .9, height = .9) +
  scale_fill_gradientn(colours = cols,
                       limits = c(0, t),
                       breaks = seq(0, t, by = t/4),
                       labels = c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.text.x = element_text(size = 8, angle = 90, hjust = .5, vjust = .5, face = "plain")) +
  ggtitle("Cohort Activity Heatmap (% of cohort members who attended event)") +
  geom_text(aes(label = members), size = 3)
```

What we can learn overall is that the majority of people seem to have a passing interest and then we have a smaller percentage who will continue to come to events.

It seems like we did a better job at retaining attendees in the middle of last year – one hypothesis is that the events we ran around then were more compelling but I need to do more analysis.

Next I’m going to drill further into some of the recent events and see what cohorts the attendees came from.

## R: Neo4j London meetup group – How many events do people come to?

Earlier this week the number of members in the Neo4j London meetup group crept over the 2,000 mark and I thought it’d be fun to re-explore the data that I previously imported into Neo4j.

How often do people come to meetups?

```r
library(RNeo4j)
library(dplyr)

graph = startGraph("http://localhost:7474/db/data/")

query = "MATCH (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(profile)-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g)
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.id, event.time + event.utc_offset AS eventTime, profile.id, membership.joined"

df = cypher(graph, query)

> df %>% head()
  event.id    eventTime profile.id membership.joined
1 20616111 1.309372e+12    6436797      1.307285e+12
2 20616111 1.309372e+12   12964956      1.307275e+12
3 20616111 1.309372e+12   14533478      1.307290e+12
4 20616111 1.309372e+12   10793775      1.307705e+12
5 24528711 1.311793e+12   10793775      1.307705e+12
6 29953071 1.314815e+12   10595297      1.308154e+12
```

```r
byEventsAttended = df %>% count(profile.id)

> byEventsAttended %>% sample_n(10)
Source: local data frame [10 x 2]

   profile.id  n
1   128137932  2
2   126947632  1
3    98733862  2
4    20468901 11
5    48293132  5
6   144764532  1
7    95259802  1
8    14524524  3
9    80611852  2
10  134907492  2
```

Now let’s visualise the number of people that have attended a certain number of events:

```r
library(ggplot2)

# geom_histogram, since binwidth is a histogram parameter in current ggplot2
ggplot(aes(x = n), data = byEventsAttended) +
  geom_histogram(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0, 750, by = 50))
```

Most people come to one meetup and then there’s a long tail after that with fewer and fewer people coming to lots of meetups.

The chart has lots of blank space due to the sparseness of people on the right hand side. If we exclude any people who’ve attended more than 20 events we might get a more interesting visualisation:

```r
ggplot(aes(x = n), data = byEventsAttended %>% filter(n <= 20)) +
  geom_histogram(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0, 750, by = 50))
```

Nicole suggested a more interesting visualisation would be a box plot so I decided to try that next:

```r
ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  coord_flip()
```

This visualisation really emphasises that the majority are between 1 and 3 and it’s much less obvious how many values there are at the higher end. A quick check of the data with the summary function reveals as much:

```r
> summary(byEventsAttended$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   2.837   3.000  69.000
```

Now to figure out how to move that box plot a bit to the right…

## R: dplyr – Error in (list: invalid subscript type ‘double’

In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).

e.g. imagine we have the following CDF represented in a data frame:

```r
library(dplyr)

df = data.frame(score      = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))
```

and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:

```r
> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7
```

Things become more tricky if we want to return multiple percentiles in one go.

My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:

```r
targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)

> percentilesDf %>%
    group_by(targetPercentile) %>%
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  :
  invalid subscript type 'double'
```

Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.

Instead he suggested the following variant, which uses the do function:

```r
df = data.frame(score      = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))
targetPercentiles = c(0.05, 0.2)

> data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>%
    filter(percentile > targetPercentile) %>%
    slice(1) %>%
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.05     7
2             0.20    12
```

We can then wrap this up in a function:

```r
percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)

  data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>%
    filter(percentile > targetPercentile) %>%
    slice(1) %>%
    select(targetPercentile, score)
}
```

which we call like this:

```r
df = data.frame(score      = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))

> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.08     7
2             0.10     8
```

Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries with a percentile greater than those values. With a proper CDF we would, so the function does its job.
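To see the function succeed on every target, here’s a quick base R sketch of the same lookup against a denser, made-up CDF (the `cdf` data and `minScoreFor` helper below are hypothetical, mirroring the dplyr pipeline above):

```r
# A fuller, hypothetical CDF: percentiles 0.1 .. 1.0 with scores 10 .. 100
cdf <- data.frame(score      = seq(10, 100, by = 10),
                  percentile = seq(0.1, 1.0, by = 0.1))

# base R version of the lookup: the first score whose percentile exceeds the target
minScoreFor <- function(df, target) {
  df <- df[order(df$percentile), ]   # make sure the percentiles are in order
  df$score[df$percentile > target][1]
}

minScoreFor(cdf, 0.50)  # 60
minScoreFor(cdf, 0.80)  # 90
```

Because every target percentile now has an entry above it, each one returns a score instead of dropping out.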

## R: Think Bayes Locomotive Problem – Posterior probabilities for different priors

In my continued reading of Think Bayes the next problem to tackle is the Locomotive problem which is defined thus:

A railroad numbers its locomotives in order 1..N.

One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.

The interesting thing about this question is that it initially seems that we don’t have enough information to come up with any sort of answer. However, we can get an estimate if we come up with a prior to work with.

The simplest prior is to assume that there’s one railroad operator with between, say, 1 and 1000 locomotives, with an equal probability of each fleet size.

We can then write code similar to that for the dice problem to update the prior based on the trains we’ve seen.

First we’ll create a data frame which captures the product of ‘number of locomotives’ and the observations of locomotives that we’ve seen (in this case we’ve only seen one locomotive, number 60):

```r
library(dplyr)

possibleValues = 1:1000
observations = c(60)

l = list(value = possibleValues, observation = observations)
df = expand.grid(l)

> df %>% head()
  value observation
1     1          60
2     2          60
3     3          60
4     4          60
5     5          60
6     6          60
```

Next we want to add a column which represents the probability that the observed locomotive could have come from a fleet of a particular size. If the fleet size is less than 60 the probability is 0, otherwise it’s 1 divided by the fleet size:

```r
prior = 1 / length(possibleValues)
df = df %>% mutate(score = ifelse(value < observation, 0, 1/value))

> df %>% sample_n(10)
     value observation       score
179    179          60 0.005586592
1001  1001          60 0.000999001
400    400          60 0.002500000
438    438          60 0.002283105
667    667          60 0.001499250
661    661          60 0.001512859
284    284          60 0.003521127
233    233          60 0.004291845
917    917          60 0.001090513
173    173          60 0.005780347
```

To find the probability of each fleet size we write the following code:

```r
weightedDf = df %>%
  group_by(value) %>%
  summarise(aggScore = prior * prod(score)) %>%
  ungroup() %>%
  mutate(weighted = aggScore / sum(aggScore))

> weightedDf %>% sample_n(10)
Source: local data frame [10 x 3]

   value     aggScore     weighted
1    906 1.102650e-06 0.0003909489
2    262 3.812981e-06 0.0013519072
3    994 1.005031e-06 0.0003563377
4    669 1.493275e-06 0.0005294465
5    806 1.239455e-06 0.0004394537
6    673 1.484400e-06 0.0005262997
7    416 2.401445e-06 0.0008514416
8    624 1.600963e-06 0.0005676277
9     40 0.000000e+00 0.0000000000
10   248 4.028230e-06 0.0014282246
```

Let’s plot the data frame to see how the probability varies for each fleet size:

```r
library(ggplot2)

ggplot(aes(x = value, y = weighted), data = weightedDf) +
  geom_line(color = "dark blue")
```

The most likely choice is a fleet size of 60 based on this diagram but an alternative would be to find the mean of the posterior which we can do like so:

```r
> weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()
[1] 333.6561
```
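As a sanity check, the same posterior mean falls out of a few lines of base R; since the uniform prior is a constant it cancels when we normalise, so it can be dropped entirely:

```r
values <- 1:1000

# posterior is proportional to prior times likelihood; the uniform prior cancels
post <- ifelse(values < 60, 0, 1 / values)
post <- post / sum(post)

sum(values * post)  # ~333.66, matching the dplyr pipeline above
```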

Now let’s create a function with all that code in so we can play around with some different priors and observations:

```r
meanOfPosterior = function(values, observations) {
  l = list(value = values, observation = observations)
  df = expand.grid(l) %>%
    mutate(score = ifelse(value < observation, 0, 1/value))

  prior = 1 / length(values)
  weightedDf = df %>%
    group_by(value) %>%
    summarise(aggScore = prior * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore))

  return(weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum())
}
```

If we update our observed railroads to have numbers 60, 30 and 90 we’d get the following means of posteriors assuming different priors:

```r
> meanOfPosterior(1:500, c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, c(60, 30, 90))
[1] 171.3382
```

At the moment the function assumes that we always want to have a uniform prior i.e. every option has an equal opportunity of being chosen, but we might want to vary the prior to see how different assumptions influence the posterior.

We can refactor the function to take in values & priors instead of calculating the priors in the function:

```r
meanOfPosterior = function(values, priors, observations) {
  priorDf = data.frame(value = values, prior = priors)
  l = list(value = priorDf$value, observation = observations)

  df = merge(expand.grid(l), priorDf, by.x = "value", by.y = "value") %>%
    mutate(score = ifelse(value < observation, 0, 1 / value))

  df %>%
    group_by(value) %>%
    summarise(aggScore = max(prior) * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore)) %>%
    mutate(mean = value * weighted) %>%
    select(mean) %>%
    sum()
}
```

Now let’s check we get the same posterior means for the uniform priors:

```r
> meanOfPosterior(1:500, 1/length(1:500), c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, 1/length(1:1000), c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, 1/length(1:2000), c(60, 30, 90))
[1] 171.3382
```

Now, instead of a uniform prior, let’s use a power law prior, where the assumption is that smaller fleets are more likely:

```r
> meanOfPosterior(1:500, sapply(1:500, function(x) x ** -1), c(60, 30, 90))
[1] 130.7085
> meanOfPosterior(1:1000, sapply(1:1000, function(x) x ** -1), c(60, 30, 90))
[1] 133.2752
> meanOfPosterior(1:2000, sapply(1:2000, function(x) x ** -1), c(60, 30, 90))
[1] 133.9975
> meanOfPosterior(1:5000, sapply(1:5000, function(x) x ** -1), c(60, 30, 90))
[1] 134.212
> meanOfPosterior(1:10000, sapply(1:10000, function(x) x ** -1), c(60, 30, 90))
[1] 134.2435
```

Now we get very similar posterior means which converge on 134 and so that’s our best prediction.

## R: Replacing for loops with data frames

In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?

To recap, this was my final solution:

```r
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      } else {
        scores[paste(name)] = scores[paste(name)] * (1.0 / name)
      }
    }
  }
  return(scores)
}

dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))

> l1 / sum(l1)
        4         6         8        12        20
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471
```

Although it works we have nested for loops which aren’t very idiomatic R so let’s try and get rid of them.

The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:

```r
scores = rep(1.0 / length(names), length(names))
df = data.frame(score = scores, name = names)
```

Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:

```r
library(dplyr)

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)

  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)))
  }
  return(df)
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20
```

Finally we’ll tidy up the scores so they’re relatively weighted against each other:

```r
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  df = data.frame(score = scores, name = names)

  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)))
  }
  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}

dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))

> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471
```

Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:

```r
likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid, l) %>%
    mutate(likelihood = 1.0 / roll,
           score = ifelse(roll < observation, 0, likelihood))
  return(obsDf)
}

dice = c(4,6,8,12,20)
l1 = likelihoods3(dice, c(6))

> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))

> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000
```

Now we need to iterate over the data frame, grouping by ‘roll’ so that we end up with one row for each one.

We’ll add a new column which stores the posterior probability for each die. This will be calculated by multiplying the prior probability by the product of the ‘score’ entries.

This is what our new likelihood function looks like:

```r
likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid, l) %>%
    mutate(likelihood = 1.0 / roll,
           score = ifelse(roll < observation, 0, likelihood))

  return(obsDf %>%
    group_by(roll) %>%
    summarise(s = (1.0/length(names)) * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = s / sum(s)) %>%
    select(roll, weighted))
}

l1 = likelihoods3(dice, c(6))

> l1
Source: local data frame [5 x 2]

  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))

> l2
Source: local data frame [5 x 2]

  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302
```

We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.

## R: Numeric keys in the nested list/dictionary

Last week I described how I’ve been creating fake dictionaries in R using lists and I found myself using the same structure while solving the dice problem in Think Bayes.

The dice problem is described as follows:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6.

What is the probability that I rolled each die?

Here’s a simple example of the nested list that I started with:

```r
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice

> priors
  4   6   8  12  20
0.2 0.2 0.2 0.2 0.2
```

I wanted to retrieve the prior for the 8-sided die, which I tried to do like this:

```r
> priors[8]
<NA>
  NA
```

That comes back with ‘NA’ because we’re actually looking up the numeric index 8 (the eighth element, which doesn’t exist) rather than the entry named ‘8’.

As far as I understand if we want to look up values by name we have to use a string so I tweaked the code to store names as strings:

```r
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = sapply(dice, paste)

> priors["8"]
  8
0.2
```

That works much better, but with some experimentation I realised I didn’t even need to run ‘dice’ through the sapply function; it already worked as it was:

```r
dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice

> priors["8"]
  8
0.2
```
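That works because names are always stored as character: assigning numeric names coerces them automatically, so `names(priors) = dice` already stores `"4"`, `"6"` and so on. A quick check (with a made-up vector):

```r
x <- c(10, 20, 30)
names(x) <- c(4, 6, 8)    # numeric names are coerced to character

is.character(names(x))    # TRUE
x["8"]                    # lookup by name finds the third element
x[2]                      # lookup by position finds the second
```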

Now that we’ve got that working we can write a likelihood function which takes in observed dice rolls and tells us how likely it was that we rolled each type of dice. We start simple by copying the above code into a function:

```r
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names
  return(scores)
}

dice = c(4, 6, 8, 12, 20)
l1 = likelihoods(dice, c(6))

> l1
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2 
```

Next we’ll update the score for a particular die to 0 if one of the observed rolls is greater than that die’s maximum value:

```r
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      }
    }
  }
  return(scores)
}

dice = c(4, 6, 8, 12, 20)
l1 = likelihoods(dice, c(6))

> l1
  4   6   8  12  20 
0.0 0.2 0.2 0.2 0.2 
```

The 4-sided die has been ruled out since we’ve rolled a 6! Now let’s put in the else condition, which updates each score by the probability of getting the observed roll with each of the valid dice, i.e. we have a 1/20 chance of rolling any given number with the 20-sided die, a 1/8 chance with the 8-sided die, and so on.

```r
likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      if(name < observation) {
        scores[paste(name)] = 0
      } else {
        scores[paste(name)] = scores[paste(name)] * (1.0 / name)
      }
    }
  }
  return(scores)
}

dice = c(4, 6, 8, 12, 20)
l1 = likelihoods(dice, c(6))

> l1
         4          6          8         12         20 
0.00000000 0.03333333 0.02500000 0.01666667 0.01000000 
```

And finally let’s normalise those scores so they’re a bit more readable:

```r
> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471 
```
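As an aside, base R’s `prop.table` function is shorthand for `x / sum(x)`, so the normalisation step could also be written like this. A sketch, rebuilding `l1` from its exact fractions:

```r
# The unnormalised scores: prior (0.2) times 1/sides,
# with the 4-sided die already ruled out
l1 = 0.2 * c(`4` = 0, `6` = 1/6, `8` = 1/8, `12` = 1/12, `20` = 1/20)

round(prop.table(l1), 7)  # same as round(l1 / sum(l1), 7)
```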

## R: non-numeric argument to binary operator

When debugging R code, given my Java background, I often find myself trying to print out the state of variables along with an appropriate piece of text like this:

```r
names = c(1, 2, 3, 4, 5, 6)

> print("names: " + names)
Error in "names: " + names : non-numeric argument to binary operator
```

We might try this next:

```r
> print("names: ", names)
[1] "names: "
```

which doesn’t actually print the names variable – only the first argument to the print function is printed.

We’ll find more success with the paste function:

```r
> print(paste("names: ", names))
[1] "names: 1" "names: 2" "names: 3" "names: 4" "names: 5" "names: 6"
```

This is an improvement but it repeats the ‘names:’ prefix multiple times which isn’t what we want. Introducing the toString function gets us over the line:

```r
> print(paste("names: ", toString(names)))
[1] "names: 1, 2, 3, 4, 5, 6"
```
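A couple of other base-R spellings get the same result: `paste` with a `collapse` argument does the joining in one step, and `sprintf` plus `toString` reads nicely too. Both are sketches of alternatives rather than anything from the original post:

```r
names = c(1, 2, 3, 4, 5, 6)

# 'collapse' joins the vector elements into a single string first
print(paste("names:", paste(names, collapse = ", ")))

# sprintf substitutes the joined string into a template
print(sprintf("names: %s", toString(names)))
```

Both print `"names: 1, 2, 3, 4, 5, 6"`.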

## R: Removing for loops

In my last blog post I showed the translation of a likelihood function from Think Bayes into R and in my first attempt at this function I used a couple of nested for loops.

```r
likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
```

```r
Names = c("Bowl 1", "Bowl 2")

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")

bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")

Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

Observations = c("vanilla", "vanilla", "vanilla", "chocolate")

l = likelihoods(Names, Mixes, Observations)

> l / sum(l)
  Bowl 1   Bowl 2 
0.627907 0.372093 
```

We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl, and the observations that we’ve made. The function tells us there’s an almost 2/3 probability that the cookies came from Bowl 1 and just over 1/3 that they came from Bowl 2.

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let’s start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

```r
likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  for(name in names) {
    scores[name] = Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1)
  }
  return(scores)
}
```

(Note that the Reduce call should fold over the `observations` parameter, not the global `Observations` variable.)

```r
l2 = likelihoods2(Names, Mixes, Observations)

> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093 
```

So that’s good: we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(
    function(name) Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1),
    names)
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3
$`Bowl 1`
  vanilla 
0.1054688 

$`Bowl 2`
vanilla 
 0.0625 
```

We end up with a list instead of a vector, which we can fix with the unlist function:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(
    function(name) Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1),
    names)
  return(unlist(scores))
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093 
```

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(
    function(name) Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1),
    names)

  result = unlist(scores)
  names(result) = names
  return(result)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093 
```
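Base R’s `setNames` helper collapses the unlist-then-rename dance into a single expression. A sketch using the (rounded) per-bowl scores the Map version produced:

```r
# The (rounded) per-bowl scores from the Map version above
scores = list("Bowl 1" = c(vanilla = 0.1054688), "Bowl 2" = c(vanilla = 0.0625))

# setNames(value, names) assigns the names on the way out
result = setNames(unlist(scores), names(scores))

round(result / sum(result), 6)
```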

A slightly cleaner alternative makes use of the sapply function:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = sapply(names, function(name)
    Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1))
  names(scores) = names
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093 
```
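One more possible step, not from the original post: the Reduce can disappear as well. Indexing a named vector with the whole observations vector repeats the matching elements, and `prod` multiplies them together, so the inner fold becomes a one-liner. `vapply` is a stricter `sapply` that checks each result is a single numeric value. `likelihoods4` is my own name for this hypothetical variant:

```r
bowl1Mix = c(vanilla = 0.75, chocolate = 0.25)
bowl2Mix = c(vanilla = 0.5, chocolate = 0.5)
mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
observations = c("vanilla", "vanilla", "vanilla", "chocolate")

likelihoods4 = function(names, mixes, observations) {
  # mixes[[name]][observations] picks one probability per observation
  vapply(names, function(name) prod(mixes[[name]][observations]), numeric(1))
}

l4 = likelihoods4(names(mixes), mixes, observations)
round(l4 / sum(l4), 6)
```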

That’s the best I’ve got for now, but I wonder if we could write a version of this using matrix operations somehow – that’s for next time!

## R: Think Bayes – More posterior probability calculations

As I mentioned in a post last week, I’ve been reading through Think Bayes and translating some of the examples from Python to R.

After my first post, Antonios suggested a more idiomatic way of writing the function in R, so I thought I’d give it a try and calculate the probability that combinations of cookies had come from each bowl.

In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:

```r
f = function(names, likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)

  # create a data frame with all the info we have
  dt = data.frame(names, priors, likelihoods)

  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)

  # specify what we want the function to return
  list(names = dt$names, priors = dt$priors,
       likelihoods = dt$likelihoods, posteriors = dt$post)
}
```

We assume a prior probability of 0.5 for each bowl.

Given the following probabilities of different cookies being in each bowl…

```python
mixes = {
  'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
}
```

…we can simulate taking one vanilla cookie with the following parameters:

```r
Likelihoods = c(0.75, 0.5)
Names = c("Bowl 1", "Bowl 2")
res = f(Names, Likelihoods)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.6
> res$posteriors[res$names == "Bowl 2"]
[1] 0.4
```

If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:

```r
Likelihoods = c((0.75 ^ 3) * (0.25 ^ 1), (0.5 ^ 3) * (0.5 ^ 1))
Names = c("Bowl 1", "Bowl 2")
res = f(Names, Likelihoods)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093
```

That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.

First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:

```r
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")

bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")

Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

> Mixes
$`Bowl 1`
  vanilla chocolate 
     0.75      0.25 

$`Bowl 2`
  vanilla chocolate 
      0.5       0.5 
```

Now let’s tweak our function to take in observations rather than likelihoods and then calculate those likelihoods internally:

```r
likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}

f = function(names, mixes, observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)

  # create a data frame with all the info we have
  dt = data.frame(names, priors)
  dt$likelihoods = likelihoods(names, mixes, observations)

  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)

  # specify what we want the function to return
  list(names = dt$names, priors = dt$priors,
       likelihoods = dt$likelihoods, posteriors = dt$post)
}
```

(Note that both functions now use their parameters rather than the global `Mixes` and `Observations` variables, and the scores vector is sized from `names` instead of being hard-coded to two bowls.)

And if we call that function:

```r
Names = c("Bowl 1", "Bowl 2")

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")

bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")

Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

Observations = c("vanilla", "vanilla", "vanilla", "chocolate")

res = f(Names, Mixes, Observations)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093
```

Exactly the same result as before! #win