## Archive for the ‘R’ Category

## R: Removing for loops

In my last blog post I showed the translation of a likelihood function from Think Bayes into R and in my first attempt at this function I used a couple of nested for loops.

```r
likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
```

```r
Names = c("Bowl 1", "Bowl 2")

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

Observations = c("vanilla", "vanilla", "vanilla", "chocolate")

l = likelihoods(Names, Mixes, Observations)

> l / sum(l)
  Bowl 1   Bowl 2
0.627907 0.372093
```

We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl and the observations that we’ve made. The function tells us that there’s an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 of being Bowl 2.

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let’s start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

```r
likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  for(name in names) {
    scores[name] = Reduce(
      function(acc, observation) acc * mixes[[name]][observation],
      observations, 1)
  }
  return(scores)
}
```

```r
l2 = likelihoods2(Names, Mixes, Observations)

> l2 / sum(l2)
  Bowl 1   Bowl 2
0.627907 0.372093
```

So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation],
           observations, 1),
    names)
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3
$`Bowl 1`
  vanilla
0.1054688

$`Bowl 2`
vanilla
 0.0625
```

We end up with a list instead of a vector which we need to fix by using the unlist function:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation],
           observations, 1),
    names)
  return(unlist(scores))
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla
      0.627907       0.372093
```

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation],
           observations, 1),
    names)

  result = unlist(scores)
  names(result) = names
  return(result)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2
0.627907 0.372093
```

A slightly cleaner alternative makes use of the sapply function:

```r
likelihoods3 = function(names, mixes, observations) {
  scores = sapply(names, function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation],
           observations, 1))
  names(scores) = names
  return(scores)
}

l3 = likelihoods3(Names, Mixes, Observations)

> l3 / sum(l3)
  Bowl 1   Bowl 2
0.627907 0.372093
```

That’s the best I’ve got for now but I wonder if we could write a version of this using matrix operations somehow – but that’s for next time!

## R: Think Bayes – More posterior probability calculations

As I mentioned in a post last week I’ve been reading through Think Bayes and translating some of the examples from Python to R.

After my first post Antonios suggested a more idiomatic way of writing the function in R, so I thought I’d try it out by calculating the probability that different combinations of cookies had come from each bowl.

In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:

```r
f = function(names, likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)

  # create a data frame with all info you have
  dt = data.frame(names, priors, likelihoods)

  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)

  # specify what you want the function to return
  list(names = dt$names, priors = dt$priors,
       likelihoods = dt$likelihoods, posteriors = dt$post)
}
```

We assume a prior probability of 0.5 for each bowl.

Given the following probabilities of different cookies being in each bowl…

```python
mixes = {
  'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
}
```

…we can simulate taking one vanilla cookie with the following parameters:

```r
Likelihoods = c(0.75, 0.5)
Names = c("Bowl 1", "Bowl 2")

res = f(Names, Likelihoods)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.6
> res$posteriors[res$names == "Bowl 2"]
[1] 0.4
```

If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:

```r
Likelihoods = c((0.75 ** 3) * (0.25 ** 1), (0.5 ** 3) * (0.5 ** 1))
Names = c("Bowl 1", "Bowl 2")

res = f(Names, Likelihoods)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093
```

That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.

First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:

```r
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

> Mixes
$`Bowl 1`
  vanilla chocolate
     0.75      0.25

$`Bowl 2`
  vanilla chocolate
      0.5       0.5
```

Now let’s tweak our function to take in observations rather than likelihoods and then calculate those likelihoods internally:

```r
likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}

f = function(names, mixes, observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)

  # create a data frame with all info you have
  dt = data.frame(names, priors)
  dt$likelihoods = likelihoods(names, mixes, observations)

  # calculate posterior probabilities
  dt$post = dt$priors * dt$likelihoods / sum(dt$priors * dt$likelihoods)

  # specify what you want the function to return
  list(names = dt$names, priors = dt$priors,
       likelihoods = dt$likelihoods, posteriors = dt$post)
}
```

And if we call that function:

```r
Names = c("Bowl 1", "Bowl 2")

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)

Observations = c("vanilla", "vanilla", "vanilla", "chocolate")

res = f(Names, Mixes, Observations)

> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093
```

Exactly the same result as before! #win

## R: Creating an object with functions to calculate conditional probability

I’ve been working through Allen Downey’s Think Bayes and I thought it’d be an interesting exercise to translate some of the code from Python to R.

The first example is a simple one about conditional probability and the author creates a class ‘Pmf’ (Probability Mass Function) to solve the following problem:

Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.

What is the probability that it came from Bowl 1?

In Python the code looks like this:

```python
pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)

pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)

pmf.Normalize()
print pmf.Prob('Bowl 1')
```

The ‘PMF’ class is defined here.

- ‘Set’ defines the prior probability of picking a cookie from either bowl i.e. in our case it’s random.
- ‘Mult’ defines the likelihood of picking a vanilla biscuit from either bowl.
- ‘Normalize’ applies a normalisation so that our posterior probabilities add up to 1.

We want to create something similar in R and the actual calculation is straightforward:

```r
pBowl1 = 0.5
pBowl2 = 0.5

pVanillaGivenBowl1 = 0.75
pVanillaGivenBowl2 = 0.5

> (pBowl1 * pVanillaGivenBowl1) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.6

> (pBowl2 * pVanillaGivenBowl2) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.4
```

The problem is we have quite a bit of duplication and it doesn’t read as cleanly as the Python version.

I’m not sure of the idiomatic way of handling this type of problem with mutable state in R, but it seems like we can achieve it using functions.

I ended up writing the following function which returns a list of other functions to call.

```r
create.pmf = function() {
  priors <<- c()
  likelihoods <<- c()

  list(
    prior = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      priors <<- c(priors, l)
    },
    likelihood = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      likelihoods <<- c(likelihoods, l)
    },
    posterior = function(option) {
      names = names(priors)
      normalised = 0.0
      for(name in names) {
        normalised = normalised + (priors[name] * likelihoods[name])
      }
      (priors[option] * likelihoods[option]) / normalised
    }
  )
}
```

I couldn’t work out how to get ‘priors’ and ‘likelihoods’ to be lexically scoped so I’ve currently got those defined as global variables. I’m using a list as a kind of dictionary following a suggestion on Stack Overflow.
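One closure-based alternative (my own sketch, not what I used above): declare the vectors with a plain `=` inside `create.pmf`, so that `<<-` in the inner functions walks out to that enclosing binding instead of creating globals:

```r
create.pmf = function() {
  priors = c()       # local to this call of create.pmf
  likelihoods = c()
  list(
    prior = function(option, probability) {
      priors[option] <<- probability   # updates the enclosing binding
    },
    likelihood = function(option, probability) {
      likelihoods[option] <<- probability
    },
    posterior = function(option) {
      normalised = sum(priors * likelihoods[names(priors)])
      (priors[option] * likelihoods[option]) / normalised
    }
  )
}

pmf = create.pmf()
pmf$prior("Bowl 1", 0.5)
pmf$prior("Bowl 2", 0.5)
pmf$likelihood("Bowl 1", 0.75)
pmf$likelihood("Bowl 2", 0.5)
pmf$posterior("Bowl 1")  # 0.6
```

Each call to `create.pmf` now gets its own independent priors and likelihoods, so two pmf objects no longer interfere with each other.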

The code doesn’t handle the unhappy path very well but it seems to work for the example from the book:

```r
pmf = create.pmf()

pmf$prior("Bowl 1", 0.5)
pmf$prior("Bowl 2", 0.5)
pmf$likelihood("Bowl 1", 0.75)
pmf$likelihood("Bowl 2", 0.5)

> pmf$posterior("Bowl 1")
Bowl 1
   0.6
> pmf$posterior("Bowl 2")
Bowl 2
   0.4
```

How would you solve this type of problem? Is there a cleaner/better way?

## R: Snakes and ladders markov chain

A few days ago I read a really cool blog post explaining how Markov chains can be used to model the possible state transitions in a game of snakes and ladders, a use of Markov chains I hadn’t even thought of!

While the example is very helpful for understanding the concept, my understanding of the code is that it works off the assumption that any roll of the dice that puts you on a score > 100 is a winning roll.

In the version of the game that I know you have to land exactly on 100 to win. e.g. if you’re on square 98 and roll a 6 you would go forward 2 spaces to 100 and then bounce back 4 spaces to 96.
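The bounce-back rule itself is easy to express as a little helper function (my own sketch, not part of the matrix code below):

```r
# final square after rolling `roll` from `square`, bouncing back off 100
nextSquare = function(square, roll) {
  landed = square + roll
  if (landed > 100) 100 - (landed - 100) else landed
}

nextSquare(98, 6)  # 96: 2 forward to 100, then 4 back
nextSquare(99, 1)  # 100: an exact winning roll
```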

I thought it’d be a good exercise to tweak the code to cater for this:

```r
n = 100

# We have 6 extra columns because we want to represent throws of the dice
# which result in a final square > 100
M = matrix(0, n + 1, n + 1 + 6)
rownames(M) = 0:n
colnames(M) = 0:(n + 6)

# set probabilities of landing on each square assuming
# that there aren't any snakes or ladders
for(i in 1:6) {
  diag(M[, (i + 1):(i + 1 + n)]) = 1/6
}

# account for 'bounce back' if a dice roll leads to a final score > 100
for(i in 96:100) {
  for(c in 102:107) {
    idx = 101 - (c - 101)
    M[i, idx] = M[i, idx] + M[i, c]
  }
}
```

We can inspect the last few rows to check that the transition matrix is accurate:

```r
> M[95:100, 95:101]
   94        95        96        97        98        99       100
94  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
95  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
96  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
97  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
98  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
99  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
```

If we’re on the 99th square (the last row) we could roll a 1 and end up on 100, a 2 and end up on 99 (1 forward, 1 back), a 3 and end up on 98 (1 forward, 2 back), a 4 and end up on 97 (1 forward, 3 back), a 5 and end up on 96 (1 forward, 4 back) or a 6 and end up on 95 (1 forward, 5 back). i.e. we can land on 95, 96, 97, 98, 99 or 100 with 1/6 probability.

If we’re on the 96th square (the 3rd row) we could roll a 1 and end up on 97, a 2 and end up on 98, a 3 and end up on 99, a 4 and end up on 100, a 5 and end up on 99 (4 forward, 1 back) or a 6 and end up on 98 (4 forward, 2 back). i.e. we can land on 97 with 1/6 probability, 98 with 2/6 probability, 99 with 2/6 probability or 100 with 1/6 probability.

We could do a similar analysis for the other squares but it seems like the probabilities are being calculated correctly.
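Rather than eyeballing every square, we can also check that each row of the trimmed matrix sums to 1, i.e. each square’s outgoing probabilities are fully accounted for. A self-contained sketch rebuilding the matrix (trimming the extra columns first, as the next step does anyway):

```r
n = 100
M = matrix(0, n + 1, n + 1 + 6)
for (i in 1:6) {
  diag(M[, (i + 1):(i + 1 + n)]) = 1/6
}
for (i in 96:100) {
  for (c in 102:107) {
    idx = 101 - (c - 101)
    M[i, idx] = M[i, idx] + M[i, c]
  }
}
M = M[, 1:(n + 1)]

# every square except the final one should have total outgoing probability 1
all(abs(rowSums(M)[1:n] - 1) < 1e-9)  # TRUE
sum(M[n + 1, ])                       # 0 - this row becomes relevant in the update below
```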

Next we can update the matrix with the snakes and ladders. That code stays the same:

```r
# get rid of the extra columns, we don't need them anymore
M = M[, 1:(n + 1)]

# add in the snakes and ladders
starting = c(4, 9, 17, 20, 28, 40, 51, 54, 62, 64, 63, 71, 93, 95, 92)
ending   = c(14, 31, 7, 38, 84, 59, 67, 34, 19, 60, 81, 91, 73, 75, 78)

for(i in 1:length(starting)) {
  # Retrieve current probabilities of landing on the starting square
  v = M[, starting[i] + 1]
  ind = which(v > 0)

  # Set no probability of falling on the starting squares
  M[ind, starting[i] + 1] = 0

  # Move all existing probabilities to the ending squares
  M[ind, ending[i] + 1] = M[ind, ending[i] + 1] + v[ind]
}
```

We can also simplify the powermat function which is used to simulate what the board would look like after a certain number of dice rolls:

```r
# original
powermat = function(P, h) {
  Ph = P
  if(h > 1) {
    for(k in 2:h) {
      Ph = Ph %*% P
    }
  }
  return(Ph)
}

# new
library(expm)
powermat = function(P, h) {
  return(P %^% h)
}
```

```r
initial = c(1, rep(0, n))
h = 1

> (initial %*% powermat(M, h))[1:15]
 [1] 0.0000000 0.1666667 0.1666667 0.1666667 0.0000000 0.1666667 0.1666667
 [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
[15] 0.1666667
```

One interesting thing I noticed is that it now seems to take way more turns on average to finish the game than when you didn’t need to score exactly 100 to win:

```r
distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n + 1]
  distrib = distrib %*% M
}

plot(1 - game[1:200], type = "l", lwd = 2, col = "red",
     ylab = "Probability to be still playing")

> sum(1 - game)
[1] 999
```

I expected it to take longer to finish the game but not this long! I think I’ve probably made a mistake but I’m not sure where…

### Update

Antonios found the mistake I’d made – once we’re on the 100th square we should stay there with probability 1, i.e. it’s an absorbing state. We need to update M like so:

```r
M[101, 101] = 1
```

Now if we visualise the probability that we’re still playing we get a more accurate curve:

```r
distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n + 1]
  distrib = distrib %*% M
}

plot(1 - game[1:200], type = "l", lwd = 2, col = "red",
     ylab = "Probability to be still playing")
```

## R: Markov Chain Wikipedia Example

Over the weekend I’ve been reading about Markov Chains and I thought it’d be an interesting exercise for me to translate Wikipedia’s example into R code.

But first a definition:

A Markov chain is a random process that undergoes transitions from one state to another on a state space.

It is required to possess a property that is usually characterized as “memoryless”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it.

That ‘random process’ could be moves in a Monopoly game, the next word in a sentence or, as in Wikipedia’s example, the next state of the stock market.

The diagram below shows the probabilities of transitioning between the various states:

e.g. if we’re in a Bull Market the probability of the state of the market next week being a Bull Market is 0.9, a Bear Market is 0.075 and a Stagnant Market is 0.025.

We can model the various transition probabilities as a matrix:

```r
M = matrix(c(0.90, 0.075, 0.025,
             0.15, 0.800, 0.050,
             0.25, 0.250, 0.500),
           nrow = 3, ncol = 3, byrow = TRUE)

> M
     [,1]  [,2]  [,3]
[1,] 0.90 0.075 0.025
[2,] 0.15 0.800 0.050
[3,] 0.25 0.250 0.500
```

Rows/Cols 1-3 are Bull, Bear, Stagnant respectively.

Now let’s say we start with a Bear market and want to find the probability of each state in 3 weeks time.

We can do this by multiplying our probability/transition matrix by itself 3 times and then pre-multiplying the result by a vector representing the initial Bear market state.

```r
threeIterations = (M %*% M %*% M)

> threeIterations
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125

> c(0, 1, 0) %*% threeIterations
       [,1]    [,2]    [,3]
[1,] 0.3575 0.56825 0.07425
```

So we have a 56.825% chance of still being in a Bear Market, 35.75% chance that we’re now in a Bull Market and only a 7.425% chance of being in a stagnant market.

I found it a bit annoying having to type ‘%*% M’ multiple times but luckily the expm library allows us to apply a Matrix power operation:

```r
install.packages("expm")
library(expm)

> M %^% 3
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125
```

The nice thing about this function is that we can now easily see where the stock market will trend towards over a large number of weeks:

```r
> M %^% 100
      [,1]   [,2]   [,3]
[1,] 0.625 0.3125 0.0625
[2,] 0.625 0.3125 0.0625
[3,] 0.625 0.3125 0.0625
```

i.e. 62.5% of weeks will be in a bull market, 31.25% of weeks will be in a bear market and 6.25% of weeks will be stagnant.
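We can cross-check that limit without powering the matrix: the long-run (stationary) distribution is the left eigenvector of M associated with eigenvalue 1. A sketch:

```r
M = matrix(c(0.90, 0.075, 0.025,
             0.15, 0.800, 0.050,
             0.25, 0.250, 0.500), nrow = 3, byrow = TRUE)

# left eigenvectors of M are the eigenvectors of t(M)
e = eigen(t(M))
stationary = Re(e$vectors[, which.max(Re(e$values))])
stationary = stationary / sum(stationary)  # normalise to a probability vector

round(stationary, 4)  # 0.6250 0.3125 0.0625
```

This matches the rows of `M %^% 100` above, as it should.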

## R/ggplot: Controlling X axis order

As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

```r
library(dplyr)
library(ggplot2)

df = read.csv("/tmp/friends.csv")
top = df %>% head(20)

ggplot(aes(x = p.name, y = colleagues), data = top) +
  geom_bar(fill = "dark blue", stat = "identity")
```

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

It’s in a fairly arbitrary order when it would be quite cool if we could get the most popular people to show from left to right.

I had the people’s names in the correct order in the data frame but annoyingly ggplot then sorted them into alphabetical order. Luckily I came across the scale_x_discrete function which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

```r
ggplot(aes(x = p.name, y = colleagues), data = top) +
  geom_bar(fill = "dark blue", stat = "identity") +
  scale_x_discrete(limits = top$p.name)
```
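Another option is to let `reorder` sort the factor levels by the value being plotted, so the aesthetic becomes `x = reorder(p.name, -colleagues)`. A sketch with made-up names showing just the reordering:

```r
top = data.frame(p.name = c("Alice", "Bob", "Carol"),
                 colleagues = c(5, 20, 12))

# levels are sorted by descending colleague count, which is the
# order a discrete ggplot x axis will use
ordered = reorder(top$p.name, -top$colleagues)
levels(ordered)  # "Bob"  "Carol" "Alice"
```

With the real data that would read `ggplot(aes(x = reorder(p.name, -colleagues), y = colleagues), data = top) + ...`.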

## R: Conditionally updating rows of a data frame

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

```r
library(zoo)
library(dplyr)

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)

  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

cohortAttendance %>%
  group_by(row_number()) %>%
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>%
  arrange(monthNumber)
```

If we time this function using system.time we’ll see that it’s not very snappy:

```r
system.time(
  cohortAttendance %>%
    group_by(row_number()) %>%
    mutate(monthNumber = monthNumber(cohort, date)) %>%
    filter(monthNumber != "NA") %>%
    filter(monthNumber != "0") %>%
    mutate(monthNumber = as.numeric(monthNumber)) %>%
    arrange(monthNumber))

   user  system elapsed
  1.968   0.019   2.016
```

The reason for the poor performance is that we process each row of the data frame individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse function, which can process multiple rows at a time:

```r
system.time(
  cohortAttendance %>%
    mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date),
                                NA,
                                paste(round((as.yearmon(date) - as.yearmon(cohort)) * 12), sep=""))))

   user  system elapsed
  0.026   0.000   0.026
```

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

```r
cohortAttendance$monthNumber = NA

validRows = as.yearmon(cohortAttendance$cohort) <= as.yearmon(cohortAttendance$date)
cohortAttendance$monthNumber[validRows] = paste(
  round((as.yearmon(cohortAttendance$date[validRows]) -
         as.yearmon(cohortAttendance$cohort[validRows])) * 12), sep="")
```

Let’s measure that:

```r
system.time(
  paste(round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12), sep=""))

   user  system elapsed
  0.013   0.000   0.013
```

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

## R: Cohort analysis of Neo4j meetup members

A few weeks ago I came across a blog post explaining how to apply cohort analysis to customer retention using R and I thought it’d be a fun exercise to calculate something similar for meetup attendees.

In the customer retention example we track customer purchases on a month by month basis and each customer is put into a cohort or bucket based on the first month they made a purchase in.

We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts.

In our case we aren’t selling anything so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended.

This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.

The first thing we need to do is get the data showing the events that people RSVP’d ‘yes’ to. I’ve already got the data in Neo4j so we’ll write a query to extract it as a data frame:

```r
library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")

query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(e),
               (e)<-[:TO]-(rsvp {response: \"yes\"})<-[:RSVPD]-(person)
         RETURN rsvp.time, person.id"

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")

df = cypher(graph, query)
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
```

```r
> df %>% head()
##         rsvp.time person.id                time    date
## 612  1.404857e+12  23362191 2014-07-08 22:00:29 2014-07
## 1765 1.380049e+12 112623332 2013-09-24 18:58:00 2013-09
## 1248 1.390563e+12   9746061 2014-01-24 11:24:35 2014-01
## 1541 1.390920e+12   7881615 2014-01-28 14:40:35 2014-01
## 3056 1.420670e+12  12810159 2015-01-07 22:31:04 2015-01
## 607  1.406025e+12  14329387 2014-07-22 10:34:51 2014-07
## 1634 1.391445e+12  91330472 2014-02-03 16:33:58 2014-02
## 2137 1.371453e+12  68874702 2013-06-17 07:17:10 2013-06
## 430  1.407835e+12 150265192 2014-08-12 09:15:31 2014-08
## 2957 1.417190e+12 182752269 2014-11-28 15:45:18 2014-11
```

Next we need to find the first meetup that a person attended – this will determine the cohort that the person is assigned to:

```r
firstMeetup = df %>%
  group_by(person.id) %>%
  summarise(firstEvent = min(time), count = n()) %>%
  arrange(desc(count))

> firstMeetup
## Source: local data frame [10 x 3]
##
##    person.id          firstEvent count
## 1   13526622 2013-01-24 20:25:19     2
## 2  119400912 2014-10-03 13:09:09     2
## 3  122524352 2014-08-14 14:09:44     1
## 4   37201052 2012-05-21 10:26:24     3
## 5  137112712 2014-07-31 09:32:12     1
## 6  152448642 2014-06-20 08:32:50    17
## 7   98563682 2014-11-05 17:27:57     1
## 8  146976492 2014-05-17 00:04:42     4
## 9   12318409 2014-11-03 05:25:26     2
## 10  41280492 2014-10-16 19:02:03     5
```

Let’s assign each person to a cohort (month/year) and see how many people belong to each one:

```r
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
byMonthYear = firstMeetup %>% count(date) %>% arrange(date)

ggplot(aes(x = date, y = n), data = byMonthYear) +
  geom_bar(stat = "identity", fill = "dark blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Next we need to track a cohort over time to see whether people keep coming to events. I wrote the following function to work it out:

```r
countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id

  attendance = df %>%
    filter(person.id %in% members) %>%
    count(person.id, date) %>%
    ungroup() %>%
    count(date)

  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all = TRUE)

  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members))
}
```

On the first line we get the ids of all the people in the cohort so that we can filter the data frame to only include RSVPs by these people. The first call to ‘count’ makes sure that we only have one entry per person per month and the second call gives us a count of how many people attended an event in a given month.

Next we do the equivalent of a left join using the merge function to ensure we have a row representing each month even if no one from the cohort attended. This will lead to NA entries if there’s no matching row in the ‘attendance’ data frame – we’ll replace those with a 0 for months after the cohort started and leave the earlier ones as NA.
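The merge behaviour is easy to see on a toy example (made-up months and counts) – `all = TRUE` keeps the unmatched month and leaves its count as NA:

```r
allMonths  = data.frame(date = c("2014-01", "2014-02", "2014-03"))
attendance = data.frame(date = c("2014-01", "2014-03"), n = c(5, 2))

# full outer join on date: every month survives, missing counts become NA
merged = merge(allMonths, attendance, by = "date", all = TRUE)
merged
#      date  n
# 1 2014-01  5
# 2 2014-02 NA
# 3 2014-03  2
```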

Finally we calculate the retention rate for each month for that cohort. e.g. these are some of the rows for the ‘2011-06’ cohort:

```r
> countsForCohort(df, firstMeetup, "2011-06") %>% sample_n(10)
      date n  cohort retention
16 2013-01 1 2011-06      0.25
5  2011-10 1 2011-06      0.25
30 2014-03 0 2011-06      0.00
29 2014-02 0 2011-06      0.00
40 2015-01 0 2011-06      0.00
31 2014-04 0 2011-06      0.00
8  2012-04 2 2011-06      0.50
39 2014-12 0 2011-06      0.00
2  2011-07 1 2011-06      0.25
19 2013-04 1 2011-06      0.25
```

We could then choose to plot that cohort:

```r
ggplot(aes(x = date, y = retention, colour = cohort),
       data = countsForCohort(df, firstMeetup, "2011-06")) +
  geom_line(aes(group = cohort)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

From this chart we can see that none of the people who first attended a Neo4j meetup in June 2011 have attended any events for the last two years.

Next we want to be able to plot multiple cohorts on the same chart which we can easily do by constructing one big data frame and passing it to ggplot:

```r
cohorts = collect(df %>% select(date) %>% unique())[,1]

cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,
                           countsForCohort(df, firstMeetup, cohort))
}

ggplot(aes(x = date, y = retention, colour = cohort), data = cohortAttendance) +
  geom_line(aes(group = cohort)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

This all looks a bit of a mess and at the moment we can’t easily compare cohorts as they start at different places on the x axis. We can fix that by adding a ‘monthNumber’ column to the data frame which we calculate with the following function:

```r
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)

  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
```

Now let’s create a new data frame with the month field added:

```r
cohortAttendanceWithMonthNumber = cohortAttendance %>%
  group_by(row_number()) %>%
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>%
  arrange(monthNumber)
```

We’re also filtering out any ‘NA’ rows, which would represent months from before the cohort started. We don’t want to plot those.

Finally, let’s plot a chart containing all cohorts normalised by month number:

```r
ggplot(aes(x = monthNumber, y = retention, colour = cohort),
       data = cohortAttendanceWithMonthNumber) +
  geom_line(aes(group = cohort)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_blank())
```

It’s still a bit of a mess but what stands out is that when the number of people in a cohort is small the fluctuation in the retention value can be quite pronounced.

The next step is to make the cohorts a bit more coarse grained to see if it reveals some insights. I think I’ll start out with a cohort covering a 3 month period and see how that works out.
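One way to do that bucketing, sketched here on made-up dates with base R only (the real code would apply it to firstMeetup$firstEvent):

```r
dates = as.Date(c("2014-01-15", "2014-02-20", "2014-04-03"))

# map each first-event date to a quarterly cohort label, e.g. "2014-Q1"
quarter = (as.integer(format(dates, "%m")) - 1) %/% 3 + 1
quarterCohort = paste0(format(dates, "%Y"), "-Q", quarter)

quarterCohort  # "2014-Q1" "2014-Q1" "2014-Q2"
```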

## R/dplyr: Extracting data frame column value for filtering with %in%

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

```r
library(dplyr)

df = data.frame(userId = c(1, 2, 3, 4, 5), score = c(2, 3, 4, 5, 5))
```

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

```r
highScoringPeople = df %>% filter(score > 3) %>% select(userId)

> highScoringPeople
  userId
1      3
2      4
3      5
```

And then filtered the data frame expecting to get back those 3 people:

```r
> df %>% filter(userId %in% highScoringPeople)
[1] userId score
<0 rows> (or 0-length row.names)
```

No rows! I created a vector with the numbers 3-5 to make sure that worked:

```r
> df %>% filter(userId %in% c(3, 4, 5))
  userId score
1      3     4
2      4     5
3      5     5
```

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

```r
> str(c(3, 4, 5))
 num [1:3] 3 4 5

> str(highScoringPeople)
'data.frame':	3 obs. of  1 variable:
 $ userId: num  3 4 5
```

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

```r
highScoringPeople = (df %>% filter(score > 3))$userId

> str(highScoringPeople)
 num [1:3] 3 4 5

> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5
```

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

```r
highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]

> str(highScoringPeople)
 num [1:3] 3 4 5
```

Not so difficult after all.
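For comparison, plain base R subsetting gets the same vector in one step without dplyr:

```r
df = data.frame(userId = c(1, 2, 3, 4, 5), score = c(2, 3, 4, 5, 5))

# logical indexing on the score column, returning the userId vector directly
highScoringPeople = df$userId[df$score > 3]
highScoringPeople  # 3 4 5
```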

## R: Weather vs attendance at NoSQL meetups

A few weeks ago I came across a tweet by Sean Taylor asking for a weather data set with a few years’ worth of recordings and I was surprised to learn that R already has such a thing – the weatherData package.

> Winner is: @UTVilla!
>
> `library(weatherData)`
> `df <- getWeatherForYear("SFO", 2013)`
> `ggplot(df, aes(x=Date, y = Mean_TemperatureF)) + geom_line()`
>
> — Sean J. Taylor (@seanjtaylor) January 22, 2015

weatherData provides a thin veneer around the wunderground API and was exactly what I’d been looking for to compare attendance at London’s NoSQL meetups against the weather conditions on the day.

The first step was to download the appropriate weather recordings and save them to a CSV file so I wouldn’t have to keep calling the API.

I thought I may as well download all the recordings available to me and wrote the following code to make that happen:

```r
library(weatherData)

# London City Airport
getDetailedWeatherForYear = function(year) {
  getWeatherForDate("LCY",
                    start_date = paste(sep="", year, "-01-01"),
                    end_date = paste(sep="", year, "-12-31"),
                    opt_detailed = FALSE,
                    opt_all_columns = TRUE)
}

df = rbind(getDetailedWeatherForYear(2011),
           getDetailedWeatherForYear(2012),
           getDetailedWeatherForYear(2013),
           getDetailedWeatherForYear(2014),
           getWeatherForDate("LCY",
                             start_date = "2015-01-01",
                             end_date = "2015-01-25",
                             opt_detailed = FALSE,
                             opt_all_columns = TRUE))
```

I then saved that to a CSV file:

```r
write.csv(df, 'weather/temp_data.csv', row.names = FALSE)
```

"Date","GMT","Max_TemperatureC","Mean_TemperatureC","Min_TemperatureC","Dew_PointC","MeanDew_PointC","Min_DewpointC","Max_Humidity","Mean_Humidity","Min_Humidity","Max_Sea_Level_PressurehPa","Mean_Sea_Level_PressurehPa","Min_Sea_Level_PressurehPa","Max_VisibilityKm","Mean_VisibilityKm","Min_VisibilitykM","Max_Wind_SpeedKm_h","Mean_Wind_SpeedKm_h","Max_Gust_SpeedKm_h","Precipitationmm","CloudCover","Events","WindDirDegrees" 2011-01-01,"2011-1-1",7,6,4,5,3,1,93,85,76,1027,1025,1023,10,9,3,14,10,NA,0,7,"Rain",312 2011-01-02,"2011-1-2",4,3,2,1,0,-1,87,81,75,1029,1028,1027,10,10,10,11,8,NA,0,7,"",321 2011-01-03,"2011-1-3",4,2,1,0,-2,-5,87,74,56,1028,1024,1019,10,10,10,8,5,NA,0,6,"Rain-Snow",249 2011-01-04,"2011-1-4",6,3,1,3,1,-1,93,83,65,1019,1013,1008,10,10,10,21,6,NA,0,5,"Rain",224 2011-01-05,"2011-1-5",8,7,5,6,3,0,93,80,61,1008,1000,994,10,9,4,26,16,45,0,4,"Rain",200 2011-01-06,"2011-1-6",7,4,3,6,3,1,93,90,87,1002,996,993,10,9,5,13,6,NA,0,5,"Rain",281 2011-01-07,"2011-1-7",11,6,2,9,5,2,100,91,82,1003,999,996,10,7,2,24,11,NA,0,5,"Rain-Snow",124 2011-01-08,"2011-1-8",11,7,4,8,4,-1,87,77,65,1004,997,987,10,10,5,39,23,50,0,5,"Rain",230 2011-01-09,"2011-1-9",7,4,3,1,0,-1,87,74,57,1018,1012,1004,10,10,10,24,16,NA,0,NA,"",242 |

If we want to read that back in future we can do so with the following code:

```r
weather = read.csv("weather/temp_data.csv")
weather$Date = as.POSIXct(weather$Date)

> weather %>% sample_n(10) %>%
    select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)

           Date Min_TemperatureC Mean_TemperatureC Max_TemperatureC
1471 2015-01-10                5                 9               14
802  2013-03-12               -2                 1                4
1274 2014-06-27               14                18               22
848  2013-04-27                5                 8               10
832  2013-04-11                6                 8               10
717  2012-12-17                6                 7                9
1463 2015-01-02                6                 9               13
1090 2013-12-25                4                 6                7
560  2012-07-13               15                18               20
1230 2014-05-14                9                14               19
```

The next step was to bring the weather data together with the meetup attendance data that I already had.

For simplicity's sake I've got those saved in a CSV file too, so we can just read them in as well:

```r
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")

events = read.csv("events.csv")
events$eventTime = timestampToDate(events$eventTime)

> events %>% sample_n(10) %>% select(event.name, rsvps, eventTime)

                                                           event.name rsvps           eventTime
36                                   London Office Hours - Old Street    10 2012-01-18 17:00:00
137                                          Enterprise Search London    34 2011-05-23 18:15:00
256                           MarkLogic User Group London: Jim Fuller    40 2014-04-29 18:30:00
117                                  Neural Networks and Data Science   171 2013-03-28 18:30:00
210                                  London Office Hours - Old Street     3 2011-09-15 17:00:00
443                                                      July social!    12 2014-07-14 19:00:00
322                                                   Intro to Graphs    39 2014-09-03 18:30:00
203                                  Vendor focus: Amazon CloudSearch    24 2013-05-16 17:30:00
17  Neo4J Tales from the Trenches: A Recommendation Engine Case Study    12 2012-04-25 18:30:00
55                                                London Office Hours    10 2013-09-18 17:00:00
```

Now that we’ve got our two datasets ready we can plot a simple chart of the average attendance and temperature grouped by month:

```r
byMonth = events %>%
  mutate(month = factor(format(eventTime, "%B"), levels = month.name)) %>%
  group_by(month) %>%
  summarise(events = n(), count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))

averageTemperatureByMonth = weather %>%
  mutate(month = factor(format(Date, "%B"), levels = month.name)) %>%
  group_by(month) %>%
  summarise(aveTemperature = mean(Mean_TemperatureC))

g1 = ggplot(aes(x = month, y = aveTemperature, group = 1), data = averageTemperatureByMonth) +
  geom_line() +
  ggtitle("Temperature by month")

g2 = ggplot(aes(x = month, y = count, group = 1), data = byMonth) +
  geom_bar(stat = "identity", fill = "dark blue") +
  ggtitle("Attendance by month")

library(gridExtra)
grid.arrange(g1, g2, ncol = 1)
```

We can see a rough inverse correlation between the temperature and attendance, particularly between April and August – as the temperature increases, total attendance decreases.
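One way to quantify that monthly relationship is to join the two monthly summaries and compute the correlation between average temperature and total attendance. A sketch with made-up numbers standing in for the real byMonth and averageTemperatureByMonth data frames built above:

```r
# toy stand-ins for the byMonth and averageTemperatureByMonth summaries
byMonth = data.frame(month = month.name[1:6],
                     count = c(300, 280, 260, 220, 180, 150))
averageTemperatureByMonth = data.frame(month = month.name[1:6],
                                       aveTemperature = c(4, 5, 8, 11, 14, 17))

# join on month and correlate total attendance with average temperature
monthly = merge(byMonth, averageTemperatureByMonth, by = "month")
cor(monthly$count, monthly$aveTemperature)
```

A strongly negative value here would back up the visual impression from the two charts.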

But what about if we compare at a finer level of granularity such as a specific date? We can do that by adding a ‘day’ column to our events data frame and merging it with the weather one:

```r
byDay = events %>%
  mutate(day = as.Date(as.POSIXct(eventTime))) %>%
  group_by(day) %>%
  summarise(events = n(), count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))

weather = weather %>% mutate(day = Date)
merged = merge(weather, byDay, by = "day")
```

Now we can plot the attendance vs the mean temperature for individual days:

```r
ggplot(aes(x = count, y = Mean_TemperatureC, group = day), data = merged) +
  geom_point()
```

Interestingly there now doesn’t seem to be any correlation between the temperature and attendance. We can confirm our suspicions by running a correlation:

```r
> cor(merged$count, merged$Mean_TemperatureC)
[1] 0.008516294
```

Not even 1% correlation between the values! One way we could confirm that non-correlation is to plot the average temperature against the average attendance rather than the total attendance:

```r
g1 = ggplot(aes(x = month, y = aveTemperature, group = 1), data = averageTemperatureByMonth) +
  geom_line() +
  ggtitle("Temperature by month")

g2 = ggplot(aes(x = month, y = ave, group = 1), data = byMonth) +
  geom_bar(stat = "identity", fill = "dark blue") +
  ggtitle("Attendance by month")

grid.arrange(g1, g2, ncol = 1)
```

Now we can see there isn't really much variation in attendance across the months – in fact 9 of them have a very similar average attendance. It's only July, December and especially August where there's a noticeable dip.

This could suggest there’s another variable other than temperature which is influencing attendance in these months. My hypothesis is that we’d see lower attendance in the weeks of school holidays – the main ones happen in July/August, December and March/April (which interestingly don’t show the dip!)
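One way to start testing that hypothesis would be to flag the months dominated by the main school holidays and compare average attendance across the two groups. A rough sketch with made-up attendance figures – the holidayMonths vector is my assumption, not something derived from the data:

```r
library(dplyr)

# toy monthly average attendance figures standing in for the real byMonth summary
byMonth = data.frame(month = month.name,
                     ave = c(35, 36, 34, 35, 36, 34, 30, 25, 35, 36, 35, 29))

# assumption: the months dominated by the main English school holidays
holidayMonths = c("July", "August", "December")

attendanceByHolidayFlag = byMonth %>%
  mutate(holiday = month %in% holidayMonths) %>%
  group_by(holiday) %>%
  summarise(aveAttendance = mean(ave))

attendanceByHolidayFlag
```

A clearly lower mean for the holiday group would be (weak) evidence in favour of the school-holiday explanation.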

Another interesting thing to look into is whether the reason for the dip in attendance isn’t through lack of will from attendees but rather because there aren’t actually any events to go to. Let’s plot the number of events being hosted each month against the temperature:

```r
g1 = ggplot(aes(x = month, y = aveTemperature, group = 1), data = averageTemperatureByMonth) +
  geom_line() +
  ggtitle("Temperature by month")

g2 = ggplot(aes(x = month, y = events, group = 1), data = byMonth) +
  geom_bar(stat = "identity", fill = "dark blue") +
  ggtitle("Events by month")

grid.arrange(g1, g2, ncol = 1)
```

Here we notice there's a big dip in events in December – organisers are hosting fewer events, and we know from our earlier plot that on average fewer people are attending them. Lots of events are hosted in the autumn, slightly fewer in the spring, and fewer still in January, March and August in particular.

Again there’s no particular correlation between temperature and the number of events being hosted on a particular day:

```r
ggplot(aes(x = events, y = Mean_TemperatureC, group = day), data = merged) +
  geom_point()
```

There's no obvious correlation in this plot, although I find it difficult to interpret plots where the values on one axis are all grouped around a few points (often a factor variable) while the other axis is spread out (a continuous variable). Let's confirm our suspicion by calculating the correlation between these two variables:

```r
> cor(merged$events, merged$Mean_TemperatureC)
[1] 0.0251698
```
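Rather than just eyeballing how close these coefficients are to zero, `cor.test` will also tell us whether they're statistically distinguishable from zero. A sketch on simulated data, since running it for real needs the merged data frame from above:

```r
set.seed(42)

# simulated stand-ins for merged$events and merged$Mean_TemperatureC
events = rpois(100, lambda = 2)
temperature = rnorm(100, mean = 10, sd = 5)

# cor.test reports the coefficient alongside a p-value and confidence
# interval for the null hypothesis of zero correlation
result = cor.test(events, temperature)
result$estimate
result$p.value
```

A large p-value here would mean a tiny coefficient like ours is entirely consistent with there being no relationship at all.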

Back to the drawing board for my attendance prediction model then!

If you have any suggestions for doing this analysis more effectively, or if I've made any mistakes, please let me know in the comments – I'm still learning how to investigate what the data is actually telling us.