Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: ggplot – Displaying multiple charts with a for loop

without comments

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

library(RNeo4j)
library(dplyr)
library(ggplot2)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
eventId = "220750415"
query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
df = cypher(graph, query, id= eventId)
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773

We’re not doing anything too clever here, just using a couple of WITH clauses to order RSVPs so we can get the earliest one for each person.

Once we’ve done that we’ll tidy up the data frame, adding columns that capture the cohort in which each member attended their first event:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

byCohort = df %>% count(earliestDate) 
 
ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))

[Bar chart: number of attendees at the event, grouped by joining cohort]

This is good and gives us the insight that most of the members attending this version of intro to graphs had only just joined the group. The event was on 7th April and most people joined in March, which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))
}

We’d call it like this for the Intro to graphs meetup:

> plotEvent("220750415")

Next I tweaked the code to look up all Intro to Graphs events and then loop through and output a chart for each event:

events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")
 
for(event in events[, 1]) {
  plotEvent(as.character(event))
}

Unfortunately that doesn’t print anything at all. ggplot objects are only rendered when they’re printed, and auto-printing is suppressed inside a for loop.
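
One quick fix (a sketch of the same loop with an explicit print call) looks like this:

for(event in events[, 1]) {
  # auto-printing doesn't happen inside a loop, so render each chart explicitly
  print(plotEvent(as.character(event)))
}

That renders the charts one at a time. To see them all at once we can instead store our plots in a list and display them together: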

library(gridExtra)
p = list()
for(i in 1:nrow(events)) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}
 
do.call(grid.arrange,p)

[Grid of bar charts, one per Intro to Graphs event]

This visualisation is probably better without any of the axes so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart, which requires a slight tweak to the query:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
            WITH e,rsvp, person
            MATCH (person)-[:RSVPD]->(otherRSVP)
 
            WITH e,person, rsvp, otherRSVP
            ORDER BY person.id, otherRSVP.time
 
            WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
            return rsvp.time, earliestRSVP.time,  person.id, e.time"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) + 
    labs(title = df$eventTime[1])
}

[Grid of bar charts with axes removed and the event date as each title]

I think this makes it a bit easier to read, although I’ve made the mistake of not putting all the charts on the same scale – one to fix for next time.
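
A rough sketch of one way to do that – the upper bound of 100 is a placeholder; ideally we’d compute the maximum cohort count across all the events first:

p = list()
for(i in 1:nrow(events)) {
  # ylim forces every chart onto the same y axis
  p[[i]] = plotEvent(as.character(events[i, 1])) + ylim(0, 100)
}
do.call(grid.arrange, p)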

We started doing the intro to graphs sessions less frequently towards the end of last year, so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them, but that doesn’t seem to be the case. Instead attendance is dominated by people who signed up close to the event.

Written by Mark Needham

May 14th, 2015 at 12:17 am

Posted in neo4j,R


R: Cohort heatmap of Neo4j London meetup

without comments

A few months ago I had a go at doing some cohort analysis of the Neo4j London meetup group, which was an interesting experiment but unfortunately resulted in a completely illegible chart.

I wasn’t sure how to progress from there but a few days ago I came across the cohort heatmap which seemed like a better way of visualising things over time.

The underlying idea is still the same – we’re comparing different cohorts of users against each other to see whether a change or intervention we made at a certain time had any impact.

However, the way we display the cohorts changes and I think for the better.

To recap, we start with the following data frame:

library(dplyr)
library(zoo)
library(ggplot2)
 
df = read.csv("/tmp/df.csv")
> df %>% sample_n(5)
        rsvp.time person.id                time    date
255  1.354277e+12  12228948 2012-11-30 12:05:08 2012-11
2475 1.407342e+12  19057581 2014-08-06 16:26:04 2014-08
3988 1.421769e+12  66122172 2015-01-20 15:58:02 2015-01
4411 1.419377e+12 165750262 2014-12-23 23:27:44 2014-12
1010 1.383057e+12  74602292 2013-10-29 14:24:32 2013-10

And we need to transform this into a data frame which is grouped by cohort (members who attended their first meetup in a particular month). The following code gets us there:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
 
countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all.x = TRUE)  
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members), members = n)  
}
 
cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(!is.na(members)) %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)
 
> cohortAttendanceWithMonthNumber %>% head(10)
Source: local data frame [10 x 7]
Groups: row_number()
 
      date n  cohort retention members row_number() monthNumber
1  2011-06 4 2011-06      1.00       4            1           0
2  2011-07 1 2011-06      0.25       1            2           1
3  2011-08 1 2011-06      0.25       1            3           2
4  2011-09 2 2011-06      0.50       2            4           3
5  2011-10 1 2011-06      0.25       1            5           4
6  2011-11 1 2011-06      0.25       1            6           5
7  2012-01 1 2011-06      0.25       1            7           7
8  2012-04 2 2011-06      0.50       2            8          10
9  2012-05 1 2011-06      0.25       1            9          11
10 2012-06 1 2011-06      0.25       1           10          12
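
As a quick sanity check, monthNumber returns the month offset (as a string) between a cohort and a later month:

> monthNumber("2011-06", "2011-09")
[1] "3"
> monthNumber("2011-06", "2012-06")
[1] "12"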

Now let’s create our first heatmap.

t <- max(cohortAttendanceWithMonthNumber$members)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=members)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")

[Cohort activity heatmap: cohort vs month, coloured by number of members attending]

‘t’ is the maximum number of members within a cohort who attended a meetup in a given month. This makes it easy to see which cohorts started with the most members but makes it difficult to compare their retention over time.

We can fix that by showing the percentage of members in the cohort who attend each month rather than using absolute values. To do that we must first add an extra column containing the percentage values:

cohortAttendanceWithMonthNumber$retentionPercentage = ifelse(!is.na(cohortAttendanceWithMonthNumber$retention),  cohortAttendanceWithMonthNumber$retention * 100, 0)
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (percentage of cohort who attended event)")

[Cohort activity heatmap coloured by percentage of the cohort attending]

This version allows us to compare cohorts against each other, but now we’ve lost the exact numbers, which means earlier cohorts can look better than they should since there are fewer people in them. We can get the best of both worlds by keeping this visualisation but showing the actual values inside each box:

t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (percentage of cohort who attended event)") + 
  geom_text(aes(label=members),size=3)

[Cohort activity heatmap coloured by percentage, with member counts overlaid]

What we can learn overall is that the majority of people seem to have a passing interest and then we have a smaller percentage who will continue to come to events.

It seems like we did a better job at retaining attendees in the middle of last year – one hypothesis is that the events we ran around then were more compelling but I need to do more analysis.

Next I’m going to drill further into some of the recent events and see what cohorts the attendees came from.

Written by Mark Needham

May 11th, 2015 at 11:16 pm

Posted in R


R: Neo4j London meetup group – How many events do people come to?

with one comment

Earlier this week the number of members in the Neo4j London meetup group crept over the 2,000 mark and I thought it’d be fun to re-explore the data that I previously imported into Neo4j.

How often do people come to meetups?

library(RNeo4j)
library(dplyr)
library(ggplot2)
 
graph = startGraph("http://localhost:7474/db/data/")
 
query = "MATCH (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(profile)-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g)
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.id, event.time + event.utc_offset AS eventTime, profile.id, membership.joined"
 
df = cypher(graph, query)
 
> df %>% head()
  event.id    eventTime profile.id membership.joined
1 20616111 1.309372e+12    6436797      1.307285e+12
2 20616111 1.309372e+12   12964956      1.307275e+12
3 20616111 1.309372e+12   14533478      1.307290e+12
4 20616111 1.309372e+12   10793775      1.307705e+12
5 24528711 1.311793e+12   10793775      1.307705e+12
6 29953071 1.314815e+12   10595297      1.308154e+12
 
byEventsAttended = df %>% count(profile.id)
 
> byEventsAttended %>% sample_n(10)
Source: local data frame [10 x 2]
 
   profile.id  n
1   128137932  2
2   126947632  1
3    98733862  2
4    20468901 11
5    48293132  5
6   144764532  1
7    95259802  1
8    14524524  3
9    80611852  2
10  134907492  2

Now let’s visualise the number of people that have attended certain number of events:

ggplot(aes(x = n), data = byEventsAttended) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))

[Bar chart: number of people by number of events attended]

Most people come to one meetup and then there’s a long tail after that with fewer and fewer people coming to lots of meetups.

The chart has lots of blank space due to the sparseness of people on the right hand side. If we exclude any people who’ve attended more than 20 events we might get a more interesting visualisation:

ggplot(aes(x = n), data = byEventsAttended %>% filter(n <= 20)) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))

[Bar chart restricted to people who attended 20 or fewer events]

Nicole suggested a more interesting visualisation would be a box plot so I decided to try that next:

ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  coord_flip()

[Box plot of the number of events attended per person]

This visualisation really emphasises that the majority of values fall between 1 and 3, and it’s much less obvious how many there are at the higher end. A quick check of the data with the summary function reveals as much:

> summary(byEventsAttended$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.837   3.000  69.000


Now to figure out how to move that box plot a bit to the right :)
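
One idea for that (just a sketch): putting the count axis on a log scale stretches out the crowded low end:

ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  scale_y_log10() +
  coord_flip()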

Written by Mark Needham

May 9th, 2015 at 10:33 pm

Posted in R


R: dplyr – Error in (list: invalid subscript type ‘double’

without comments

In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).

e.g. imagine we have the following CDF represented in a data frame:

library(dplyr)
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))

and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:

> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7

Things become more tricky if we want to return multiple percentiles in one go.

My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:

targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)
> percentilesDf %>% 
    group_by(targetPercentile) %>%
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
 
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  : 
  invalid subscript type 'double'

Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.

Instead he suggested the following variant which uses the do function:

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
targetPercentiles = c(0.05, 0.2)
 
> data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.05     7
2             0.20    12

We can then wrap this up in a function:

percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
 
  data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
}

which we call like this:

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.08     7
2             0.10     8

Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries greater than those percentiles. With a complete CDF we would, though, so the function does its job.
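
For example, with a made-up CDF that runs all the way up to the 1.0 percentile, the higher targets return rows too:

fullCdf = data.frame(score = c(5,7,8,10,12,20,25,40), 
                     percentile = c(0.05,0.1,0.15,0.2,0.25,0.5,0.75,1.0))
> percentiles(fullCdf, c(0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.50    25
2             0.80    40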

Written by Mark Needham

April 27th, 2015 at 10:34 pm

Posted in R


R: Think Bayes Locomotive Problem – Posterior probabilities for different priors

without comments

In my continued reading of Think Bayes the next problem to tackle is the Locomotive problem which is defined thus:

A railroad numbers its locomotives in order 1..N.

One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.

The interesting thing about this question is that it initially seems that we don’t have enough information to come up with any sort of answer. However, we can get an estimate if we come up with a prior to work with.

The simplest prior is to assume that there’s one railroad operator with between, say, 1 and 1000 locomotives, with each fleet size being equally likely.

We can then write similar code as with the dice problem to update the prior based on the trains we’ve seen.

First we’ll create a data frame which captures the cross product of the possible fleet sizes and the observations of locomotives that we’ve seen (in this case we’ve only seen one locomotive, number ‘60’):

library(dplyr)
 
possibleValues = 1:1000
observations = c(60)
 
l = list(value = possibleValues, observation = observations)
df = expand.grid(l) 
 
> df %>% head()
  value observation
1     1          60
2     2          60
3     3          60
4     4          60
5     5          60
6     6          60

Next we want to add a column which represents the probability that the observed locomotive could have come from a particular fleet. If the fleet size is less than 60 then we have a 0 probability, otherwise we have 1 / fleet size:

prior = 1  / length(possibleValues)
df = df %>% mutate(score = ifelse(value < observation, 0, 1/value))
 
> df %>% sample_n(10)
     value observation       score
179    179          60 0.005586592
1001  1001          60 0.000999001
400    400          60 0.002500000
438    438          60 0.002283105
667    667          60 0.001499250
661    661          60 0.001512859
284    284          60 0.003521127
233    233          60 0.004291845
917    917          60 0.001090513
173    173          60 0.005780347

To find the probability of each fleet size we write the following code:

weightedDf = df %>% 
  group_by(value) %>% 
  summarise(aggScore = prior * prod(score)) %>%
  ungroup() %>%
  mutate(weighted = aggScore / sum(aggScore))
 
> weightedDf %>% sample_n(10)
Source: local data frame [10 x 3]
 
   value     aggScore     weighted
1    906 1.102650e-06 0.0003909489
2    262 3.812981e-06 0.0013519072
3    994 1.005031e-06 0.0003563377
4    669 1.493275e-06 0.0005294465
5    806 1.239455e-06 0.0004394537
6    673 1.484400e-06 0.0005262997
7    416 2.401445e-06 0.0008514416
8    624 1.600963e-06 0.0005676277
9     40 0.000000e+00 0.0000000000
10   248 4.028230e-06 0.0014282246

Let’s plot the data frame to see how the probability varies for each fleet size:

library(ggplot2)
ggplot(aes(x = value, y = weighted), data = weightedDf) + 
  geom_line(color="dark blue")

[Line chart: posterior probability by fleet size, peaking at 60]

Based on this chart the most likely fleet size is 60, but an alternative is to find the mean of the posterior, which we can do like so:

> weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()
[1] 333.6561
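
Since the weights sum to 1, base R’s weighted.mean gives the same answer:

> weighted.mean(weightedDf$value, weightedDf$weighted)
[1] 333.6561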

Now let’s create a function with all that code in so we can play around with some different priors and observations:

meanOfPosterior = function(values, observations) {
  l = list(value = values, observation = observations)   
  df = expand.grid(l) %>% mutate(score = ifelse(value < observation, 0, 1/value))
 
  prior = 1 / length(values)
  weightedDf = df %>% 
    group_by(value) %>% 
    summarise(aggScore = prior * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore))
 
  return (weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()) 
}

If we update our observed railroads to have numbers 60, 30 and 90 we’d get the following means of posteriors for different upper bounds on the fleet size:

> meanOfPosterior(1:500, c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, c(60, 30, 90))
[1] 171.3382

At the moment the function assumes that we always want a uniform prior, i.e. every fleet size has an equal chance of being chosen, but we might want to vary the prior to see how different assumptions influence the posterior.

We can refactor the function to take in values & priors instead of calculating the priors in the function:

meanOfPosterior = function(values, priors, observations) {
  priorDf = data.frame(value = values, prior = priors)
  l = list(value = priorDf$value, observation = observations)
 
  df = merge(expand.grid(l), priorDf, by.x = "value", by.y = "value") %>% 
    mutate(score = ifelse(value < observation, 0, 1 / value))
 
  df %>% 
    group_by(value) %>% 
    summarise(aggScore = max(prior) * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore)) %>%
    mutate(mean = value * weighted) %>%
    select(mean) %>%
    sum()
}

Now let’s check we get the same posterior means for the uniform priors:

> meanOfPosterior(1:500,  1/length(1:500), c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, 1/length(1:1000), c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, 1/length(1:2000), c(60, 30, 90))
[1] 171.3382

Now, instead of a uniform prior, let’s use a power law prior where the assumption is that smaller fleets are more likely:

> meanOfPosterior(1:500,  sapply(1:500,  function(x) x ** -1), c(60, 30, 90))
[1] 130.7085
> meanOfPosterior(1:1000, sapply(1:1000, function(x) x ** -1), c(60, 30, 90))
[1] 133.2752
> meanOfPosterior(1:2000, sapply(1:2000, function(x) x ** -1), c(60, 30, 90))
[1] 133.9975
> meanOfPosterior(1:5000, sapply(1:5000, function(x) x ** -1), c(60, 30, 90))
[1] 134.212
> meanOfPosterior(1:10000, sapply(1:10000, function(x) x ** -1), c(60, 30, 90))
[1] 134.2435

Now we get very similar posterior means which converge on 134 and so that’s our best prediction.

Written by Mark Needham

April 24th, 2015 at 11:53 pm

Posted in R


R: Replacing for loops with data frames

without comments

In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?

To recap, this was my final solution:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }        
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Although it works, we have nested for loops which aren’t very idiomatic R, so let’s try to get rid of them.

The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:

scores = rep(1.0 / length(names), length(names))  
df = data.frame(score = scores, name = names)

Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:

library(dplyr)
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20

Finally we’ll tidy up the scores so that they’re weighted relative to each other:

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return(obsDf)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods3(dice, c(6))
 
> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000

Now we need to iterate over the data frame, grouping by ‘roll’ so that we end up with one row per dice.

We’ll add a new column which stores the posterior probability for each dice. This will be calculated by multiplying the prior probability by the product of the ‘score’ entries.

This is what our new likelihood function looks like:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return (obsDf %>% 
    group_by(roll) %>% 
    summarise(s = (1.0/length(names)) * prod(score) ) %>%
    ungroup() %>% 
    mutate(weighted = s / sum(s)) %>%
    select(roll, weighted))
}
 
l1 = likelihoods3(dice, c(6))
> l1
Source: local data frame [5 x 2]
 
  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
Source: local data frame [5 x 2]
 
  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302

We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.

Written by Mark Needham

April 22nd, 2015 at 10:18 pm

Posted in R


R: Numeric keys in the nested list/dictionary

without comments

Last week I described how I’ve been creating fake dictionaries in R using lists and I found myself using the same structure while solving the dice problem in Think Bayes.

The dice problem is described as follows:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?

Here’s a simple example of the nested list that I started with:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice
 
> priors
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2

I wanted to retrieve the prior for the 8 sided dice, which I tried to do like this:

> priors[8]
<NA> 
  NA

That comes back with ‘NA’ because we’re actually indexing by numeric position 8 – which doesn’t exist in a 5 element vector – rather than looking up the entry named ‘8’.
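
Indexing by position does work – the third element is the prior for the 8 sided dice:

> priors[3]
  8 
0.2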

As far as I understand if we want to look up values by name we have to use a string so I tweaked the code to store names as strings:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = sapply(dice, paste)
 
> priors["8"]
  8 
0.2

That works much better, but with some experimentation I realised I didn’t even need to run ‘dice’ through the sapply function – assigning a numeric vector to names() coerces it to character anyway:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice
 
> priors["8"]
  8 
0.2

Now that we’ve got that working we can write a likelihood function which takes in observed dice rolls and tells us how likely it was that we rolled each type of dice. We start simple by copying the above code into a function:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2

Next we’ll update the score for a particular dice to 0 if one of the observed rolls is greater than the dice’s maximum score:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0       
        }
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
  4   6   8  12  20 
0.0 0.2 0.2 0.2 0.2

The 4 sided dice has been ruled out since we’ve rolled a 6! Now let’s put in the else condition which updates our score by the probability of getting the observed roll with each of the valid dice, i.e. we have a 1/20 chance of rolling any given number with the 20 sided dice, a 1/8 chance with the 8 sided dice, etc.

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }        
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
         4          6          8         12         20 
0.00000000 0.03333333 0.02500000 0.01666667 0.01000000

And finally let’s normalise those scores so they’re a bit more readable:

> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Written by Mark Needham

April 21st, 2015 at 5:59 am

Posted in R


R: non-numeric argument to binary operator

without comments

When debugging R code, given my Java background, I often find myself trying to print out the state of variables along with an appropriate piece of text like this:

names = c(1,2,3,4,5,6)
> print("names: " + names)
Error in "names: " + names : non-numeric argument to binary operator

We might try this next:

> print("names: ", names)
[1] "names: "

which doesn’t actually print the names variable – only the first argument to the print function is printed.

We’ll find more success with the paste function:

> print(paste("names: ", names))
[1] "names:  1" "names:  2" "names:  3" "names:  4" "names:  5" "names:  6"

This is an improvement but it repeats the ‘names:’ prefix multiple times which isn’t what we want. Introducing the toString function gets us over the line:

> print(paste("names: ", toString(names)))
[1] "names:  1, 2, 3, 4, 5, 6"

Written by Mark Needham

April 19th, 2015 at 11:08 pm

Posted in R


R: Removing for loops

without comments

In my last blog post I showed the translation of a likelihood function from Think Bayes into R. In my first attempt at this function I used a couple of nested for loops:

likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        scores[name] = scores[name] *  mixes[[name]][observation]      
      }
    }  
  return(scores)
}
 
Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
l = likelihoods(Names, Mixes, Observations)
 
> l / sum(l)
  Bowl 1   Bowl 2 
0.627907 0.372093

We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl and the observations that we’ve made. The function tells us that there’s an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 of being Bowl 2.

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let’s start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  for(name in names) {
    scores[name] = Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1)
  }  
  return(scores)
}
l2 = likelihoods2(Names, Mixes, Observations)
 
> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093

So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
> l3
$`Bowl 1`
  vanilla 
0.1054688 
 
$`Bowl 2`
vanilla 
 0.0625

We end up with a list instead of a vector which we need to fix by using the unlist function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(unlist(scores))
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  result = unlist(scores)
  names(result) = names
 
  return(result)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

A slightly cleaner alternative makes use of the sapply function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = sapply(names, function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1))
  names(scores) = names
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

That’s the best I’ve got for now but I wonder if we could write a version of this using matrix operations somehow – but that’s for next time!
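
Here’s a rough sketch of how that might look – stacking the mixes into a matrix with one row per bowl and taking row products over the observation columns:

mixMatrix = rbind("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
 
# one column per observation (duplicates allowed), then multiply across each row
scores = apply(mixMatrix[, Observations, drop = FALSE], 1, prod)
 
> scores / sum(scores)
  Bowl 1   Bowl 2 
0.627907 0.372093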

Written by Mark Needham

April 18th, 2015 at 11:53 pm

Posted in R


R: Think Bayes – More posterior probability calculations

without comments

As I mentioned in a post last week I’ve been reading through Think Bayes and translating some of the examples from Python to R.

After my first post Antonios suggested a more idiomatic way of writing the function in R so I thought I’d give it a try to calculate the probability that combinations of cookies had come from each bowl.

In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:

f = function(names,likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
 
  # create a data frame with all info you have
  dt = data.frame(names,priors,likelihoods)
 
  # calculate posterior probabilities
  dt$post = dt$priors*dt$likelihoods / sum(dt$priors*dt$likelihoods)
 
  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)  
}

We assume a prior probability of 0.5 for each bowl.

Given the following probabilities of different cookies being in each bowl (shown here as the book’s Python definition)…

mixes = {
  'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
}

…we can simulate taking one vanilla cookie with the following parameters:

Likelihoods = c(0.75,0.5)
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.6
> res$posteriors[res$names == "Bowl 2"]
[1] 0.4
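
That matches Bayes’ rule computed by hand:

> (0.5 * 0.75) / (0.5 * 0.75 + 0.5 * 0.5)
[1] 0.6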

If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:

Likelihoods = c((0.75 ** 3) * (0.25 ** 1), (0.5 ** 3) * (0.5 ** 1))
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.

First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
 
> Mixes
$`Bowl 1`
  vanilla chocolate 
     0.75      0.25 
 
$`Bowl 2`
  vanilla chocolate 
      0.5       0.5

Now let’s tweak our function to take in the mixes and observations rather than precomputed likelihoods, and calculate those likelihoods internally:

likelihoods = function(names, mixes, observations) {
  scores = c(1,1)
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        scores[name] = scores[name] *  mixes[[name]][observation]      
      }
    }  
  return(scores)
}
 
f = function(names,mixes,observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
 
  # create a data frame with all info you have
  dt = data.frame(names,priors)
 
  dt$likelihoods = likelihoods(names, mixes, observations)
 
  # calculate posterior probabilities
  dt$post = dt$priors*dt$likelihoods / sum(dt$priors*dt$likelihoods)
 
  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)  
}

And if we call that function:

Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
 
res=f(Names,Mixes,Observations)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
 
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

Exactly the same result as before! #win

Written by Mark Needham

April 16th, 2015 at 8:57 pm

Posted in R
