Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Removing for loops


In my last blog post I showed the translation of a likelihood function from Think Bayes into R. In my first attempt at writing this function I used a couple of nested for loops.

likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names
 
  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
l = likelihoods(Names, Mixes, Observations)
 
> l / sum(l)
  Bowl 1   Bowl 2 
0.627907 0.372093

We pass in a vector of bowls, a nested list describing the mixes of cookies in each bowl and the observations that we’ve made. The function tells us that there’s an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 of them coming from Bowl 2.

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let’s start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  for(name in names) {
    scores[name] = Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1)
  }
  return(scores)
}
l2 = likelihoods2(Names, Mixes, Observations)
 
> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093

So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
> l3
$`Bowl 1`
  vanilla 
0.1054688 
 
$`Bowl 2`
vanilla 
 0.0625

We end up with a list instead of a vector which we need to fix by using the unlist function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(unlist(scores))
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  result = unlist(scores)
  names(result) = names
 
  return(result)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

A slightly cleaner alternative makes use of the sapply function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = sapply(names, function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1))
  names(scores) = names
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

That’s the best I’ve got for now but I wonder if we could write a version of this using matrix operations somehow – but that’s for next time!
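
In the meantime, here’s a rough sketch of one way a matrix based version might look – untested beyond the example above, and it assumes the Mixes and Observations defined earlier:

# rows = bowls, columns = flavours
mixMatrix = do.call(rbind, Mixes)
 
# how many times each flavour was observed, in the same column order as mixMatrix
obsCounts = table(factor(Observations, levels = colnames(mixMatrix)))
 
# multiply the log probabilities by their counts and exponentiate to get one likelihood per bowl
l4 = exp(log(mixMatrix) %*% as.numeric(obsCounts))[, 1]
l4 / sum(l4)

which should give the same 0.627907 / 0.372093 split as the versions above.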

Written by Mark Needham

April 18th, 2015 at 11:53 pm

Posted in R


R: Think Bayes – More posterior probability calculations


As I mentioned in a post last week I’ve been reading through Think Bayes and translating some of the examples from Python to R.

After my first post Antonios suggested a more idiomatic way of writing the function in R so I thought I’d give it a try and use it to calculate the probability that a combination of cookies had come from each bowl.

In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:

f = function(names,likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
 
  # create a data frame with all info you have
  dt = data.frame(names,priors,likelihoods)
 
  # calculate posterior probabilities
  dt$post = dt$priors*dt$likelihoods / sum(dt$priors*dt$likelihoods)
 
  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)  
}

We assume a prior probability of 0.5 for each bowl.

Given the following probabilities of different cookies being in each bowl…

mixes = {
  'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
}

…we can simulate taking one vanilla cookie with the following parameters:

Likelihoods = c(0.75,0.5)
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.6
> res$posteriors[res$names == "Bowl 2"]
[1] 0.4

If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:

Likelihoods = c((0.75 ** 3) * (0.25 ** 1), (0.5 ** 3) * (0.5 ** 1))
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.

First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
 
> Mixes
$`Bowl 1`
  vanilla chocolate 
     0.75      0.25 
 
$`Bowl 2`
  vanilla chocolate 
      0.5       0.5

Now let’s tweak our function to take in observations rather than likelihoods and then calculate those likelihoods internally:

likelihoods = function(names, mixes, observations) {
  scores = c(1, 1)
  names(scores) = names
 
  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
 
f = function(names, mixes, observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
 
  # create a data frame with all info you have
  dt = data.frame(names, priors)
 
  # calculate the likelihoods from the observations and the mixes in each bowl
  dt$likelihoods = likelihoods(names, mixes, observations)
 
  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)
 
  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)
}

And if we call that function:

Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
 
res=f(Names,Mixes,Observations)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
 
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

Exactly the same result as before! #win

Written by Mark Needham

April 16th, 2015 at 8:57 pm

Posted in R


R: Creating an object with functions to calculate conditional probability


I’ve been working through Allen Downey’s Think Bayes and I thought it’d be an interesting exercise to translate some of the code from Python to R.

The first example is a simple one about conditional probability and the author creates a class ‘Pmf’ (Probability Mass Function) to solve the following problem:

Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.

What is the probability that it came from Bowl 1?

In Python the code looks like this:

pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)
 
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)
 
pmf.Normalize()
 
print pmf.Prob('Bowl 1')

The ‘PMF’ class is defined here.

  • ‘Set’ defines the prior probability of picking each bowl i.e. in our case it’s random.
  • ‘Mult’ multiplies each prior by the likelihood of picking a vanilla biscuit from that bowl.
  • ‘Normalize’ applies a normalisation so that our posterior probabilities add up to 1.

We want to create something similar in R and the actual calculation is straightforward:

pBowl1 = 0.5
pBowl2 = 0.5
 
pVanillaGivenBowl1 = 0.75
pVanillaGivenBowl2 = 0.5
 
> (pBowl1 * pVanillaGivenBowl1) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.6
 
> (pBowl2 * pVanillaGivenBowl2) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.4

The problem is we have quite a bit of duplication and it doesn’t read as cleanly as the Python version.

I’m not sure of the idiomatic way of handling this type of problem with mutable state in R but it seems like we can achieve it using functions.

I ended up writing the following function which returns a list of other functions to call.

create.pmf = function() {
  priors <<- c()
  likelihoods <<- c()
  list(
    prior = function(option, probability) {
      l = c(probability)  
      names(l) = c(option)
      priors <<- c(priors, l)
    },
    likelihood = function(option, probability) {
      l = c(probability)  
      names(l) = c(option)
      likelihoods <<- c(likelihoods, l)
    },
    posterior = function(option) {
      names = names(priors)
      normalised = 0.0
      for(name in names) {
        normalised = normalised + (priors[name] * likelihoods[name])
      }
 
      (priors[option] * likelihoods[option]) / normalised
    }    
  )
}

I couldn’t work out how to get ‘priors’ and ‘likelihoods’ to be lexically scoped so I’ve currently got those defined as global variables. I’m using a list as a kind of dictionary following a suggestion on Stack Overflow.

The code doesn’t handle the unhappy path very well but it seems to work for the example from the book:

pmf = create.pmf()
 
pmf$prior("Bowl 1", 0.5)
pmf$prior("Bowl 2", 0.5)
 
pmf$likelihood("Bowl 1", 0.75)
pmf$likelihood("Bowl 2", 0.5)
 
> pmf$posterior("Bowl 1")
Bowl 1 
   0.6 
> pmf$posterior("Bowl 2")
Bowl 2 
   0.4
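
For what it’s worth, one way I think the lexical scoping could work (a sketch that gives the same numbers for this example, but which I haven’t tested beyond it) is to define ‘priors’ and ‘likelihoods’ with a normal assignment inside create.pmf – the <<- in the inner functions then updates create.pmf’s environment rather than the global one:

create.pmf = function() {
  # priors and likelihoods now live in create.pmf's environment, not the global one
  priors = numeric()
  likelihoods = numeric()
  list(
    prior = function(option, probability) {
      # <<- searches the enclosing environments, so this updates create.pmf's priors
      priors[option] <<- probability
    },
    likelihood = function(option, probability) {
      likelihoods[option] <<- probability
    },
    posterior = function(option) {
      normalised = sum(priors * likelihoods[names(priors)])
      (priors[option] * likelihoods[option]) / normalised
    }
  )
}

The pmf$prior / pmf$likelihood / pmf$posterior calls above would stay exactly the same.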

How would you solve this type of problem? Is there a cleaner/better way?

Written by Mark Needham

April 12th, 2015 at 7:55 am

Posted in R


R: Snakes and ladders markov chain


A few days ago I read a really cool blog post explaining how Markov chains can be used to model the possible state transitions in a game of snakes and ladders, a use of Markov chains I hadn’t even thought of!

While the example is very helpful for understanding the concept, my understanding of the code is that it works off the assumption that any roll of the dice that puts you on a score > 100 is a winning roll.

In the version of the game that I know you have to land exactly on 100 to win. e.g if you’re on square 98 and roll a 6 you would go forward 2 spaces to 100 and then bounce back 4 spaces to 96.

I thought it’d be a good exercise to tweak the code to cater for this:

n=100
 
# We add 6 extra columns to represent dice rolls that would take the final score past 100
M=matrix(0,n+1,n+1+6)
rownames(M)=0:n
colnames(M)=0:(n+6)
 
# set probabilities of landing on each square assuming that there aren't any snakes or ladders
for(i in 1:6){
  diag(M[,(i+1):(i+1+n)])=1/6
}
 
# account for 'bounce back' if a dice roll leads to a final score > 100
for(i in 96:100) {
  for(c in 102:107) {
    idx = 101 - (c - 101)  
    M[i, idx] = M[i, idx] + M[i, c]
  }  
}

We can inspect the last few rows to check whether the transition matrix is accurate:

> M[95:100,95:101]
 
   94        95        96        97        98        99       100
94  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
95  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
96  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
97  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
98  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
99  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

If we’re on the 99th square (the last row) we could roll a 1 and end up on 100, a 2 and end up on 99 (1 forward, 1 back), a 3 and end up on 98 (1 forward, 2 back), a 4 and end up on 97 (1 forward, 3 back), a 5 and end up on 96 (1 forward, 4 back) or a 6 and end up on 95 (1 forward, 5 back). i.e. we can land on 95, 96, 97, 98, 99 or 100 with 1/6 probability.

If we’re on the 96th square (the 3rd row) we could roll a 1 and end up on 97, a 2 and end up on 98, a 3 and end up on 99, a 4 and end up on 100, a 5 and end up on 99 (4 forward, 1 back) or a 6 and end up on 98 (4 forward, 2 back). i.e. we can land on 97 with 1/6 probability, 98 with 2/6 probability, 99 with 2/6 probability or 100 with 1/6 probability.

We could do a similar analysis for the other squares but it seems like the probabilities are being calculated correctly.
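
One more quick sanity check we could do (a sketch against the M built above) is to confirm that, after the bounce back adjustment, each of squares 0-99 still has its probability mass within squares 0-100 adding up to 1:

# rows 1:100 are squares 0-99, columns 1:101 are squares 0-100
# (the extra columns for scores > 100 get dropped in the next step anyway)
summary(rowSums(M[1:100, 1:101]))

Every value should come out as 1.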

Next we can update the matrix with the snakes and ladders. That code stays the same:

# get rid of the extra columns, we don't need them anymore
M=M[,1:(n+1)]
 
# add in the snakes and ladders
starting = c(4,9,17,20,28,40,51,54,62,64,63,71,93,95,92)
ending   = c(14,31,7,38,84,59,67,34,19,60,81,91,73,75,78)
 
for(i in 1:length(starting)) {
  # Retrieve current probabilities of landing on the starting square
  v=M[,starting[i]+1]  
  ind=which(v>0)
 
  # Set no probability of falling on the starting squares
  M[ind,starting[i]+1]=0
 
  # Move all existing probabilities to the ending squares
  M[ind,ending[i]+1]=M[ind,ending[i]+1]+v[ind]
}

We can also simplify the powermat function which is used to simulate what the board would look like after a certain number of dice rolls:

# original
powermat=function(P,h){
  Ph=P
  if(h>1) {
    for(k in 2:h) {
      Ph=Ph%*%P
    }
  }
  return(Ph)
}
 
#new 
library(expm)
powermat = function(P,h) {
  return (P %^% h)
}

initial=c(1,rep(0,n))
h = 1
> (initial %*% powermat(M, h))[, 1:15]
        0         1         2         3         4         5         6         7         8         9        10        11        12        13        14 
0.0000000 0.1666667 0.1666667 0.1666667 0.0000000 0.1666667 0.1666667 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1666667 

One interesting thing I noticed is that it now seems to take way more turns on average to finish the game than when you didn’t need to score exactly 100 to win. Summing the probability of still being in play after each turn gives a rough estimate of the expected number of turns:

distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n+1]
  distrib = distrib %*% M
}
plot(1 - game[1:200], type="l", lwd=2, col="red",
     ylab="Probability to be still playing")
 
> sum(1 - game)
[1] 999

[Chart: probability of still being in play over the first 200 turns]

I expected it to take longer to finish the game but not this long! I think I’ve probably made a mistake but I’m not sure where…

Update

Antonios found the mistake I’d made – once we reach the 100th square we should stay there with probability 1 (i.e. square 100 needs to be an absorbing state), so we need to update M like so:

M[101,101] = 1

Now if we visualise the probability that we’re still playing we get a more accurate curve:

distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n+1]
  distrib = distrib %*% M
}
plot(1 - game[1:200], type="l", lwd=2, col="red",
     ylab="Probability to be still playing")
[Chart: probability of still being in play over the first 200 turns, after the fix]

Written by Mark Needham

April 9th, 2015 at 10:02 pm

Posted in R


R: Markov Chain Wikipedia Example


Over the weekend I’ve been reading about Markov Chains and I thought it’d be an interesting exercise for me to translate Wikipedia’s example into R code.

But first a definition:

A Markov chain is a random process that undergoes transitions from one state to another on a state space.

It is required to possess a property that is usually characterized as “memoryless”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it.

That ‘random process’ could be moves in a Monopoly game, the next word in a sentence or, as in Wikipedia’s example, the next state of the Stock Market.

The diagram below shows the probabilities of transitioning between the various states:

[Diagram: Wikipedia’s stock market state space – Bull, Bear and Stagnant – with transition probabilities]

e.g. if we’re in a Bull Market the probability of the state of the market next week being a Bull Market is 0.9, a Bear Market is 0.075 and a Stagnant Market is 0.025.

We can model the various transition probabilities as a matrix:

M = matrix(c(0.9, 0.075, 0.025, 0.15, 0.8, 0.05, 0.25, 0.25, 0.5),
          nrow = 3,
          ncol = 3,
          byrow = TRUE)
 
> M
     [,1]  [,2]  [,3]
[1,] 0.90 0.075 0.025
[2,] 0.15 0.800 0.050
[3,] 0.25 0.250 0.500

Rows/Cols 1-3 are Bull, Bear, Stagnant respectively.
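
As an aside, if we wanted the matrices below to be a bit easier to read we could label the states (the outputs in this post keep the numeric indices, so this is purely optional):

# label rows and columns with the state names
states = c("Bull", "Bear", "Stagnant")
dimnames(M) = list(states, states)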

Now let’s say we start with a Bear market and want to find the probability of each state in 3 weeks’ time.

We can do this by multiplying our probability/transition matrix by itself 3 times and then pre-multiplying the result by a vector representing the initial Bear market state.

threeIterations = (M %*% M %*% M)
 
> threeIterations
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125
 
> c(0,1,0) %*% threeIterations
       [,1]    [,2]    [,3]
[1,] 0.3575 0.56825 0.07425

So we have a 56.825% chance of still being in a Bear Market, 35.75% chance that we’re now in a Bull Market and only a 7.425% chance of being in a stagnant market.

I found it a bit annoying having to type ‘%*% M’ multiple times but luckily the expm library allows us to apply a matrix power operation:

install.packages("expm")
library(expm)
 
> M %^% 3
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125

The nice thing about this function is that we can now easily see where the stock market will trend towards over a large number of weeks:

> M %^% 100
      [,1]   [,2]   [,3]
[1,] 0.625 0.3125 0.0625
[2,] 0.625 0.3125 0.0625
[3,] 0.625 0.3125 0.0625

i.e. in the long run we’ll spend 62.5% of weeks in a bull market, 31.25% of weeks in a bear market and 6.25% of weeks in a stagnant market.
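
Another way of getting at that long run (stationary) distribution, for what it’s worth, is to take the eigenvector of t(M) associated with the eigenvalue 1 and rescale it so it sums to 1 – a quick sketch:

# the stationary distribution is the left eigenvector of M for eigenvalue 1
e = eigen(t(M))
stationary = Re(e$vectors[, 1]) / sum(Re(e$vectors[, 1]))
 
stationary

which should come back as roughly 0.625, 0.3125 and 0.0625.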

Written by Mark Needham

April 5th, 2015 at 10:07 am

Posted in R


R/ggplot: Controlling X axis order


As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

library(dplyr)
library(ggplot2)
 
df = read.csv("/tmp/friends.csv")
top = df %>% head(20)
 
ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity")

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

[Chart: colleagues per person, in the default (alphabetical) order]

It’s in a fairly arbitrary order, but it would be quite cool if we could get the most popular people to show up from left to right.

I had the people’s names in the correct order in the data frame but annoyingly ggplot was then sorting them into alphabetical order. Luckily I came across a suggestion to use the scale_x_discrete function, which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  scale_x_discrete(limits= top$p.name)

[Chart: colleagues per person, ordered from most to least popular]
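
An alternative I’ve seen suggested (a sketch – I haven’t tried it against this exact data frame) is to reorder the factor levels inside the aes call instead:

# reorder sorts the p.name factor levels by descending colleague count
ggplot(aes(x = reorder(p.name, -colleagues), y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  xlab("p.name")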

Written by Mark Needham

February 27th, 2015 at 12:49 am

Posted in R


R: Conditionally updating rows of a data frame


In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data frame individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse function, which can process multiple rows at a time:

system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              NA, 
                              paste(round((as.yearmon(date) - as.yearmon(cohort)) * 12), sep=""))))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA
 
validRows = as.yearmon(cohortAttendance$cohort) <= as.yearmon(cohortAttendance$date)
cohortAttendance$monthNumber[validRows] = round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))[validRows] * 12)

Let’s measure that:

system.time(round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.
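
If we wanted a more robust comparison than one-off system.time calls we could repeat each expression a number of times – a sketch, assuming the microbenchmark package is installed:

library(microbenchmark)
 
# run each vectorised expression 100 times and compare the timing distributions
microbenchmark(
  ifelse = ifelse(as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date),
                  NA,
                  round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12)),
  direct = round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12),
  times = 100
)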

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Written by Mark Needham

February 26th, 2015 at 12:45 am

Posted in R


R: Cohort analysis of Neo4j meetup members


A few weeks ago I came across a blog post explaining how to apply cohort analysis to customer retention using R and I thought it’d be a fun exercise to calculate something similar for meetup attendees.

In the customer retention example we track customer purchases on a month by month basis and each customer is put into a cohort or bucket based on the first month they made a purchase in.

We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts.

In our case we aren’t selling anything so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended.

This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.

The first thing we need to do is get the data showing the events that people RSVP’d ‘yes’ to. I’ve already got the data in Neo4j so we’ll write a query to extract it as a data frame:

library(RNeo4j)
library(dplyr)
library(ggplot2)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(e),
               (e)<-[:TO]-(rsvp {response: \"yes\"})<-[:RSVPD]-(person)
         RETURN rsvp.time, person.id"
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df = cypher(graph, query)
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
> df %>% sample_n(10)
##         rsvp.time person.id                time    date
## 612  1.404857e+12  23362191 2014-07-08 22:00:29 2014-07
## 1765 1.380049e+12 112623332 2013-09-24 18:58:00 2013-09
## 1248 1.390563e+12   9746061 2014-01-24 11:24:35 2014-01
## 1541 1.390920e+12   7881615 2014-01-28 14:40:35 2014-01
## 3056 1.420670e+12  12810159 2015-01-07 22:31:04 2015-01
## 607  1.406025e+12  14329387 2014-07-22 10:34:51 2014-07
## 1634 1.391445e+12  91330472 2014-02-03 16:33:58 2014-02
## 2137 1.371453e+12  68874702 2013-06-17 07:17:10 2013-06
## 430  1.407835e+12 150265192 2014-08-12 09:15:31 2014-08
## 2957 1.417190e+12 182752269 2014-11-28 15:45:18 2014-11

Next we need to find the first meetup that a person attended – this will determine the cohort that the person is assigned to:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
 
> firstMeetup %>% sample_n(10)
## Source: local data frame [10 x 3]
## 
##    person.id          firstEvent count
## 1   13526622 2013-01-24 20:25:19     2
## 2  119400912 2014-10-03 13:09:09     2
## 3  122524352 2014-08-14 14:09:44     1
## 4   37201052 2012-05-21 10:26:24     3
## 5  137112712 2014-07-31 09:32:12     1
## 6  152448642 2014-06-20 08:32:50    17
## 7   98563682 2014-11-05 17:27:57     1
## 8  146976492 2014-05-17 00:04:42     4
## 9   12318409 2014-11-03 05:25:26     2
## 10  41280492 2014-10-16 19:02:03     5

Let’s assign each person to a cohort (month/year) and see how many people belong to each one:

firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
byMonthYear = firstMeetup %>% count(date) %>% arrange(date)
 
ggplot(aes(x=date, y = n), data = byMonthYear) + 
  geom_bar(stat="identity", fill = "dark blue") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Chart: number of people in each monthly cohort]

Next we need to track a cohort over time to see whether people keep coming to events. I wrote the following function to work it out:

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all = TRUE)
 
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members))  
}

On the first line we get the ids of all the people in the cohort so that we can filter the data frame to only include RSVPs by these people. The first call to ‘count’ makes sure that we only have one entry per person per month and the second call gives us a count of how many people attended an event in a given month.

Next we do the equivalent of a left join using the merge function to ensure we have a row representing each month even if no one from the cohort attended. This will lead to NA entries where there’s no matching row in the ‘attendance’ data frame – we replace those with a 0 for months after the cohort started (nobody attended that month), and leave the months before the cohort started as NA.

Finally we calculate the retention rate for each month for that cohort. e.g. these are some of the rows for the '2011-06' cohort:

> countsForCohort(df, firstMeetup, "2011-06") %>% sample_n(10)
      date n  cohort retention
16 2013-01 1 2011-06      0.25
5  2011-10 1 2011-06      0.25
30 2014-03 0 2011-06      0.00
29 2014-02 0 2011-06      0.00
40 2015-01 0 2011-06      0.00
31 2014-04 0 2011-06      0.00
8  2012-04 2 2011-06      0.50
39 2014-12 0 2011-06      0.00
2  2011-07 1 2011-06      0.25
19 2013-04 1 2011-06      0.25

We could then choose to plot that cohort:

ggplot(aes(x=date, y = retention, colour = cohort), data = countsForCohort(df, firstMeetup, "2011-06")) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Chart: retention by month for the 2011-06 cohort]

From this chart we can see that none of the people who first attended a Neo4j meetup in June 2011 have attended any events for the last two years.

Next we want to be able to plot multiple cohorts on the same chart which we can easily do by constructing one big data frame and passing it to ggplot:

cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
ggplot(aes(x=date, y = retention, colour = cohort), data = cohortAttendance) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Chart: retention by month for all cohorts]

This all looks a bit of a mess and at the moment we can’t easily compare cohorts as they start at different places on the x axis. We can fix that by adding a ‘monthNumber’ column to the data frame which we calculate with the following function:

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

Now let’s create a new data frame with the month field added:

cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

We’re also filtering out any ‘NA’ rows, which represent months from before the cohort started. We don’t want to plot those.

Finally let’s plot a chart containing all cohorts normalised by month number:

ggplot(aes(x=monthNumber, y = retention, colour = cohort), data = cohortAttendanceWithMonthNumber) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.background = element_blank())

[Chart: retention by month number for all cohorts]

It’s still a bit of a mess but what stands out is that when the number of people in a cohort is small the fluctuation in the retention value can be quite pronounced.

The next step is to make the cohorts a bit more coarse grained to see if it reveals some insights. I think I’ll start out with a cohort covering a 3 month period and see how that works out.
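
As a starting point, something like this might do for bucketing people into quarterly cohorts instead (a sketch against the firstMeetup data frame from above – I haven’t re-run the rest of the analysis with it yet):

# derive a year/quarter label (e.g. "2014-Q3") from each person's first event
firstMeetup$quarter = paste0(format(firstMeetup$firstEvent, "%Y"), "-Q", 
                             ceiling(as.numeric(format(firstMeetup$firstEvent, "%m")) / 3))
 
firstMeetup %>% count(quarter) %>% arrange(quarter)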

Written by Mark Needham

February 24th, 2015 at 1:19 am

Posted in R


R/dplyr: Extracting data frame column value for filtering with %in%


I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created a vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame':	3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5
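
Another option along the same lines (just an aside) is to unlist the single column data frame, which also gives us a plain numeric vector:

highScoringPeople = unlist(df %>% filter(score > 3) %>% select(userId), use.names = FALSE)
 
> str(highScoringPeople)
 num [1:3] 3 4 5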

Not so difficult after all.

Written by Mark Needham

February 22nd, 2015 at 8:58 am

Posted in R


R: Weather vs attendance at NoSQL meetups


A few weeks ago I came across a tweet by Sean Taylor asking for a weather data set with a few years’ worth of recordings and I was surprised to learn that R already has such a thing – the weatherData package.

weatherData provides a thin veneer around the wunderground API and was exactly what I’d been looking for to compare attendance at London’s NoSQL meetups against the weather conditions on the day.

The first step was to download the appropriate weather recordings and save them to a CSV file so I wouldn’t have to keep calling the API.

I thought I may as well download all the recordings available to me and wrote the following code to make that happen:

library(weatherData)
 
# London City Airport
getDetailedWeatherForYear = function(year) {
  getWeatherForDate("LCY", 
                    start_date= paste(sep="", year, "-01-01"),
                    end_date = paste(sep="", year, "-12-31"),
                    opt_detailed = FALSE,
                    opt_all_columns = TRUE)
}
 
df = rbind(getDetailedWeatherForYear(2011), 
      getDetailedWeatherForYear(2012),
      getDetailedWeatherForYear(2013),
      getDetailedWeatherForYear(2014),
      getWeatherForDate("LCY", start_date="2015-01-01",
                        end_date = "2015-01-25",
                        opt_detailed = FALSE,
                        opt_all_columns = TRUE))

I then saved that to a CSV file:

write.csv(df, 'weather/temp_data.csv', row.names = FALSE)

The first few lines of the file look like this:
"Date","GMT","Max_TemperatureC","Mean_TemperatureC","Min_TemperatureC","Dew_PointC","MeanDew_PointC","Min_DewpointC","Max_Humidity","Mean_Humidity","Min_Humidity","Max_Sea_Level_PressurehPa","Mean_Sea_Level_PressurehPa","Min_Sea_Level_PressurehPa","Max_VisibilityKm","Mean_VisibilityKm","Min_VisibilitykM","Max_Wind_SpeedKm_h","Mean_Wind_SpeedKm_h","Max_Gust_SpeedKm_h","Precipitationmm","CloudCover","Events","WindDirDegrees"
2011-01-01,"2011-1-1",7,6,4,5,3,1,93,85,76,1027,1025,1023,10,9,3,14,10,NA,0,7,"Rain",312
2011-01-02,"2011-1-2",4,3,2,1,0,-1,87,81,75,1029,1028,1027,10,10,10,11,8,NA,0,7,"",321
2011-01-03,"2011-1-3",4,2,1,0,-2,-5,87,74,56,1028,1024,1019,10,10,10,8,5,NA,0,6,"Rain-Snow",249
2011-01-04,"2011-1-4",6,3,1,3,1,-1,93,83,65,1019,1013,1008,10,10,10,21,6,NA,0,5,"Rain",224
2011-01-05,"2011-1-5",8,7,5,6,3,0,93,80,61,1008,1000,994,10,9,4,26,16,45,0,4,"Rain",200
2011-01-06,"2011-1-6",7,4,3,6,3,1,93,90,87,1002,996,993,10,9,5,13,6,NA,0,5,"Rain",281
2011-01-07,"2011-1-7",11,6,2,9,5,2,100,91,82,1003,999,996,10,7,2,24,11,NA,0,5,"Rain-Snow",124
2011-01-08,"2011-1-8",11,7,4,8,4,-1,87,77,65,1004,997,987,10,10,5,39,23,50,0,5,"Rain",230
2011-01-09,"2011-1-9",7,4,3,1,0,-1,87,74,57,1018,1012,1004,10,10,10,24,16,NA,0,NA,"",242

If we want to read that back in future we can do so with the following code:

library(dplyr)
library(ggplot2)
 
weather = read.csv("weather/temp_data.csv")
weather$Date = as.POSIXct(weather$Date)
 
> weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)
           Date Min_TemperatureC Mean_TemperatureC Max_TemperatureC
1471 2015-01-10                5                 9               14
802  2013-03-12               -2                 1                4
1274 2014-06-27               14                18               22
848  2013-04-27                5                 8               10
832  2013-04-11                6                 8               10
717  2012-12-17                6                 7                9
1463 2015-01-02                6                 9               13
1090 2013-12-25                4                 6                7
560  2012-07-13               15                18               20
1230 2014-05-14                9                14               19

The next step was to bring the weather data together with the meetup attendance data that I already had.

For simplicity’s sake I’ve got those saved in a CSV file so we can just read those in as well:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
events = read.csv("events.csv")
events$eventTime = timestampToDate(events$eventTime)
 
> events %>% sample_n(10) %>% select(event.name, rsvps, eventTime)
                                                           event.name rsvps           eventTime
36                                   London Office Hours - Old Street    10 2012-01-18 17:00:00
137                                          Enterprise Search London    34 2011-05-23 18:15:00
256                           MarkLogic User Group London: Jim Fuller    40 2014-04-29 18:30:00
117                                  Neural Networks and Data Science   171 2013-03-28 18:30:00
210                                  London Office Hours - Old Street     3 2011-09-15 17:00:00
443                                                      July social!    12 2014-07-14 19:00:00
322                                                   Intro to Graphs    39 2014-09-03 18:30:00
203                                  Vendor focus: Amazon CloudSearch    24 2013-05-16 17:30:00
17  Neo4J Tales from the Trenches: A Recommendation Engine Case Study    12 2012-04-25 18:30:00
55                                                London Office Hours    10 2013-09-18 17:00:00

Now that we’ve got our two datasets ready we can plot a simple chart of the average attendance and temperature grouped by month:

byMonth = events %>% 
  mutate(month = factor(format(eventTime, "%B"), levels=month.name)) %>%
  group_by(month) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
 
averageTemperatureByMonth = weather %>% 
  mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
  group_by(month) %>% 
  summarise(aveTemperature = mean(Mean_TemperatureC))
 
g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = count, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
library(gridExtra)
grid.arrange(g1,g2, ncol = 1)

[Charts: average temperature by month and total attendance by month]

We can see a rough inverse correlation between the temperature and attendance, particularly between April and August – as the temperature increases, total attendance decreases.

But what if we compare at a finer level of granularity such as a specific date? We can do that by adding a ‘day’ column to our events data frame and merging it with the weather one:

byDay = events %>% 
  mutate(day = as.Date(as.POSIXct(eventTime))) %>%
  group_by(day) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
weather = weather %>% mutate(day = Date)
merged = merge(weather, byDay, by = "day")

Now we can plot the attendance vs the mean temperature for individual days:

ggplot(aes(x =count, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()

[Scatter plot: mean temperature vs attendance for individual days]

Interestingly there now doesn’t seem to be any correlation between the temperature and attendance. We can confirm our suspicions by running a correlation:

> cor(merged$count, merged$Mean_TemperatureC)
[1] 0.008516294

Not even 1% correlation between the values! One way we could confirm that lack of correlation is to plot the average temperature against the average attendance rather than the total attendance:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = ave, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
grid.arrange(g1,g2, ncol = 1)

[Charts: average temperature by month and average attendance by month]

Now we can see there’s not really much of a relationship between temperature and average attendance – in fact 9 of the months have a very similar average attendance. It’s only July, December and especially August where there’s a noticeable dip.

This could suggest there’s another variable other than temperature which is influencing attendance in these months. My hypothesis is that we’d see lower attendance in the weeks of school holidays – the main ones happen in July/August, December and March/April (which interestingly don’t show the dip!)

Another interesting thing to look into is whether the dip in attendance is down not to a lack of will from attendees but rather to there not being any events to go to. Let’s plot the number of events being hosted each month against the temperature:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = events, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Events by month")
 
grid.arrange(g1,g2, ncol = 1)

[Charts: average temperature by month and number of events by month]

Here we notice there’s a big dip in events in December – organisers are hosting fewer events and we know from our earlier plot that on average fewer people are attending those events. Lots of events are hosted in the Autumn, slightly fewer in the Spring and fewer still in January, March and August in particular.

Again there’s no particular correlation between temperature and the number of events being hosted on a particular day:

ggplot(aes(x = events, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()

[Scatter plot: mean temperature vs number of events for individual days]

There isn’t any obvious correlation in this plot either, although I find it difficult to interpret plots where the values are grouped around very few points (often factor variables) on one axis and spread out (a continuous variable) on the other. Let’s confirm our suspicion by calculating the correlation between these two variables:

> cor(merged$events, merged$Mean_TemperatureC)
[1] 0.0251698
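
If we wanted to check whether either of those tiny correlations is statistically distinguishable from zero we could run cor.test, which reports a p-value alongside the estimate – a quick sketch against the merged data frame:

# Pearson correlation tests for attendance and number of events against mean temperature
cor.test(merged$count, merged$Mean_TemperatureC)
cor.test(merged$events, merged$Mean_TemperatureC)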

Back to the drawing board for my attendance prediction model then!

If you have any suggestions for doing this analysis more effectively or I’ve made any mistakes please let me know in the comments, I’m still learning how to investigate what data is actually telling us.

Written by Mark Needham

February 11th, 2015 at 7:09 am

Posted in R
