# Mark Needham

Thoughts on Software Development

## R: data.table – Finding the maximum row

In my continued playing around with the R data.table package I wanted to find the maximum row based on one of the columns, grouped by another column, and then return the whole row.

We’ll use the following data table to illustrate:

```
> blogDT = data.table(name = c("Property 1","Property 1","Property 1",
                               "Property 2","Property 2","Property 2"),
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000),
                      date = c("Day 1", "Day 7", "Day 10", "Day 3", "Day 5", "Day 12"))

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
> blogDT[, diff := price - lag.price]

> blogDT
         name   price   date lag.price   diff
1: Property 1   10000  Day 1        NA     NA
2: Property 1   12500  Day 7     10000   2500
3: Property 1   18000 Day 10     12500   5500
4: Property 2  245000  Day 3        NA     NA
5: Property 2  512000  Day 5    245000 267000
6: Property 2 1000000 Day 12    512000 488000
```

I wanted to find the biggest difference between ‘price’ and ‘lag.price’, grouped by the ‘name’ column.

If we just want to get the max ‘diff’ grouped by ‘name’ it’s quite easy:

```
> blogDT[!is.na(diff), .(max = max(diff)), keyby = name]
         name    max
1: Property 1   5500
2: Property 2 488000
```

However, now we’ve lost the information about which ‘date’ that was on and what the ‘price’ and ‘lag.price’ were, which is a bit annoying.

If we only want to keep the rows which have the highest ‘diff’ grouped by ‘name’, one way to go about this is to add a ‘max.diff’ column to each row and then filter appropriately. e.g.

```
> maxDT = blogDT[!is.na(diff)][, max := max(diff), by = name]

> maxDT
         name   price   date lag.price   diff    max
1: Property 1   12500  Day 7     10000   2500   5500
2: Property 1   18000 Day 10     12500   5500   5500
3: Property 2  512000  Day 5    245000 267000 488000
4: Property 2 1000000 Day 12    512000 488000 488000

> maxDT[diff == max]
         name   price   date lag.price   diff    max
1: Property 1   18000 Day 10     12500   5500   5500
2: Property 2 1000000 Day 12    512000 488000 488000
```
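Another option which seems to work is data.table’s `.SD` idiom, which picks the maximal row per group directly and avoids the temporary column – a sketch, as I haven’t explored `.SD` much yet:

```
> blogDT[!is.na(diff)][, .SD[which.max(diff)], by = name]
         name   price   date lag.price   diff
1: Property 1   18000 Day 10     12500   5500
2: Property 2 1000000 Day 12    512000 488000
```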

I’ve only been playing with data.table for a few days so if there’s a better way do let me know in the comments.

Written by Mark Needham

October 2nd, 2015 at 6:42 pm

Posted in R


## R: data.table – Comparing adjacent rows

As part of my exploration of the Land Registry price paid data set I wanted to compare the difference between consecutive sales of properties.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row unless it’s the first sale in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

```
> blogDT = data.table(name = c("Property 1","Property 1","Property 1",
                               "Property 2","Property 2","Property 2"),
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))

> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000
```

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

```
> blogDT[, lag.price := c(NA, price[-.N]), by = name]

> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000
```
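As an aside, newer versions of data.table (1.9.6 onwards, if I remember correctly) provide a `shift` function which expresses the lag a little more directly – a sketch of what should be the equivalent call:

```
> blogDT[, lag.price := shift(price), by = name]
```

`shift` defaults to a lag of 1 with `NA` fill, which is exactly what the `c(NA, price[-.N])` trick does by hand.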

Next let’s calculate the difference between the two prices:

```
> blogDT[, diff := price - lag.price]

> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000
```

Finally let’s order the data table by the biggest price gains:

```
> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA
```

Written by Mark Needham

September 27th, 2015 at 10:02 pm

Posted in R


## R: Date for given week/year

As I mentioned in my last couple of blog posts I’ve been looking at the data behind this blog and I wanted to plot a chart showing the number of posts per week since the blog started.

I started out with a data frame with posts and publication date:

```
> library(dplyr)
> library(lubridate)
> df = read.csv("posts.csv")
> df$date = ymd_hms(df$date)

> df %>% sample_n(10)
                                                                                title                date
538                                    Nygard Big Data Model: The Investigation Stage 2012-10-10 00:00:36
341                                                            The read-only database 2011-08-29 23:32:26
1112                                  CSS in Internet Explorer - Some lessons learned 2008-10-31 15:24:51
143                                                       Coding: Mutating parameters 2010-08-26 07:47:23
433  Scala: Counting number of inversions (via merge sort) for an unsorted collection 2012-03-20 06:53:18
618                                    neo4j/cypher: SQL style GROUP BY functionality 2013-02-17 21:05:27
1111                                 Testing Hibernate mappings: Setting up test data 2008-10-30 13:24:14
462                                       neo4j: What question do you want to answer? 2012-05-05 13:20:41
1399                                       Book Club: Design Sense (Michael Feathers) 2009-09-29 14:42:29
494                                    Bash Shell: Reusing parts of previous commands 2012-07-05 23:42:35
```

The first step was to add a couple of columns representing the week and year for the publication date. The ‘lubridate’ library came in handy here:

```
byWeek = df %>%
  mutate(year = year(date), week = week(date)) %>%
  group_by(week, year) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))

> byWeek
Source: local data frame [352 x 3]

   week year  n
1    33 2008 14
2    35 2008 11
3    53 2012 11
4     9 2013 10
5    12 2013  9
6    21 2009  9
7    22 2009  9
8    38 2013  9
9    40 2008  9
10   48 2012  9
..  ...  ... ..
```

The next step is to calculate the start date of each of those weeks so that we can plot the counts on a continuous date scale. I spent a while searching for how to do this before realising that the ‘week’ function I used before can also set the week for a given date. Let’s get to work:

```
calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}

> calculate_start_of_week(c(1,2,3), c(2015,2014,2013))
[1] "2015-01-01 UTC" "2014-01-08 UTC" "2013-01-15 UTC"
```

And now let’s transform our data frame and plot the counts:

```
ggplot(aes(x=start_of_week, y=n, group=1),
       data = byWeek %>% mutate(start_of_week = calculate_start_of_week(week, year))) +
  geom_line()
```

It’s a bit erratic, as you can see. Some of this can be explained by the fact that I do post in an erratic way, while some of it is explained by the fact that a week which starts on the 29th or later only contains a few days.

Written by Mark Needham

July 10th, 2015 at 10:01 pm

Posted in R


## R: dplyr – Error: cannot modify grouping variable

I’ve been doing some exploration of the posts made on this blog and I thought I’d start with answering a simple question – on which dates did I write the most posts?

I started with a data frame containing each post and the date it was published:

```
> library(dplyr)
> df %>% sample_n(5)
                                                 title                date
1148  Taiichi Ohno's Workplace Management: Book Review 2008-12-08 14:14:48
158      Rails: Faking a delete method with 'form_for' 2010-09-20 18:52:15
331            Retrospectives: The 4 L's Retrospective 2011-07-25 21:00:30
1035        msbuild - Use OutputPath instead of OutDir 2008-08-14 18:54:03
1181                 The danger of commenting out code 2009-01-17 06:02:33
```

To find the most popular days for blog posts we can write the following aggregation function:

```
> df %>% mutate(day = as.Date(date)) %>% count(day) %>% arrange(desc(n))

Source: local data frame [1,140 x 2]

          day n
1  2012-12-31 6
2  2014-05-31 6
3  2008-08-08 5
4  2013-01-27 5
5  2009-08-24 4
6  2012-06-24 4
7  2012-09-30 4
8  2012-10-27 4
9  2012-11-24 4
10 2013-02-28 4
```

So we can see a couple of days with 6 posts, a couple with 5 posts, a few more with 4 posts and then presumably loads of days with 1 post.

I thought it’d be cool if we could plot a histogram with the number of posts on the x axis and, on the y axis, the number of days on which that many posts occurred e.g. for an x value of 6 (posts) we’d have a y value of 2 (occurrences).

```
> df %>% mutate(day = as.Date(date)) %>% count(day) %>% count(n)
Error: cannot modify grouping variable
```

Unfortunately that isn’t allowed. I tried ungrouping and then counting again:

```
> df %>% mutate(day = as.Date(date)) %>% count(day) %>% ungroup() %>% count(n)
Error: cannot modify grouping variable
```

Still no luck. I did a bit of googling and came across a post which suggested using a combination of group_by + mutate or group_by + summarize.

I tried the mutate approach first:

```
> df %>% mutate(day = as.Date(date)) %>%
+   group_by(day) %>% mutate(n = n()) %>% ungroup() %>% sample_n(5)

Source: local data frame [5 x 4]

                                    title                date        day n
1 QCon London 2009: DDD & BDD - Dan North 2009-03-13 15:28:04 2009-03-13 2
2        Onboarding: Sketch the landscape 2013-02-15 07:36:06 2013-02-15 1
3                            Ego Depletion 2013-06-04 23:16:29 2013-06-04 1
4                 Clean Code: Book Review 2008-09-15 09:52:33 2008-09-15 1
5            Dreyfus Model: More thoughts 2009-08-10 10:36:51 2009-08-10 1
```

That keeps around the ‘title’ which is a bit annoying. We can get rid of it using a distinct on ‘day’ if we want and if we also implement the second part of the function we end up with the following:

```
> df %>% mutate(day = as.Date(date)) %>%
    group_by(day) %>% mutate(n = n()) %>%
    distinct(day) %>% ungroup() %>%
    group_by(n) %>% mutate(c = n()) %>%
    distinct(n)

Source: local data frame [6 x 5]
Groups: n

                                                title                date        day n   c
1       Functional C#: Writing a 'partition' function 2010-02-01 23:34:02 2010-02-01 1 852
2                            Willed vs Forced designs 2010-02-08 22:48:05 2010-02-08 2 235
3                            TDD: Testing collections 2010-07-28 06:05:25 2010-07-28 3  41
4  Creating a Samba share between Ubuntu and Mac OS X 2012-06-24 00:40:35 2012-06-24 4   8
5            Gamification and Software: Some thoughts 2012-12-31 10:57:19 2012-12-31 6   2
6 Python/numpy: Selecting specific column in 2D array 2013-01-27 02:10:10 2013-01-27 5   2
```

Annoyingly we’ve still got the ‘title’, ‘date’ and ‘day’ columns hanging around which we’d need to get rid of with a call to ‘select’. The code also feels quite icky, especially the use of distinct in a couple of places.

In fact we can simplify the code if we use summarize instead of mutate:

```
> df %>% mutate(day = as.Date(date)) %>%
    group_by(day) %>% summarize(n = n()) %>%
    ungroup() %>%
    group_by(n) %>% summarize(c = n())

Source: local data frame [6 x 2]

  n   c
1 1 852
2 2 235
3 3  41
4 4   8
5 5   2
6 6   2
```

And we’ve also got rid of the extra columns into the bargain, which is great! Now we can plot our histogram:

```
> library(ggplot2)
> post_frequencies = df %>% mutate(day = as.Date(date)) %>%
    group_by(day) %>% summarize(n = n()) %>%
    ungroup() %>%
    group_by(n) %>% summarize(c = n())
> ggplot(aes(x = n, y = c), data = post_frequencies) + geom_bar(stat = "identity")
```

In this case we don’t actually need to do the second grouping to create the bar chart since ggplot will do it for us if we feed it the following data:

```
> ggplot(aes(x = n),
         data = df %>% mutate(day = as.Date(date)) %>%
           group_by(day) %>% summarize(n = n()) %>% ungroup()) +
    geom_bar(binwidth = 1) +
    scale_x_continuous(limits=c(1, 6))
```

Still, it’s good to know how!
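As one final aside, base R can produce the same count-of-counts in a single line with nested `table` calls – a sketch assuming the same ‘df’ as above:

```
> table(table(as.Date(df$date)))

  1   2   3   4   5   6
852 235  41   8   2   2
```

The inner `table` counts posts per day and the outer one counts how many days had each frequency, which matches the summarize version.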

Written by Mark Needham

July 9th, 2015 at 5:55 am

Posted in R


## R: ggplot geom_density – Error in exists(name, envir = env, mode = mode) : argument “env” is missing, with no default

Continuing on from yesterday’s blog post where I worked out how to clean up the Think Bayes Price is Right data set, the next task was to plot a distribution of the prices of show case items.

To recap, this is what the data frame we’re working with looks like:

```
library(dplyr)

df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))
df2011 = df2011 %>% na.omit()

> df2011 %>% head()
              X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3    Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4    Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6         Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7         Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9  Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
10 Difference 2   11429  -25839    8186   -6572    -680    5812    9373     437    9125    8819  14952
...
```

So our goal is to plot the density of the ‘Showcase 1’ items. Unfortunately those aren’t currently stored in a way that makes this easy for us. We need to flip the data frame so that we have a row for each date/price type/price:

```
PriceType   Date     Price
Showcase 1  Sep..19  50969
Showcase 2  Sep..19  45429
...
Showcase 1  Sep..20  21901
Showcase 2  Sep..20  34061
```

The reshape library’s melt function is our friend here:

```
library(reshape)
meltedDf = melt(df2011, id=c("X"))

> meltedDf %>% sample_n(10)
                X variable value
643    Showcase 1  Feb..24 27883
224    Showcase 2  Nov..10 34089
1062 Difference 2   Jun..4  9962
770    Showcase 2  Mar..28 39620
150  Difference 2  Oct..24  9137
431  Difference 1   Jan..4  7516
345         Bid 1  Dec..12 21569
918  Difference 2    May.1 -2093
536    Showcase 2  Jan..31 30918
502         Bid 2  Jan..23 27000
```

Now we need to plug this into ggplot. We’ll start by just plotting all the prices for showcase 1:

```
> ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
    geom_density()

Error in exists(name, envir = env, mode = mode) :
  argument "env" is missing, with no default
```

This error usually means that you’ve passed an empty data set to ggplot which isn’t the case here, but if we extract the values column we can see the problem:

```
> meltedDf$value[1:10]
 [1] "50969" "45429" "42000" "34000" "8969"  "11429" "21901" "34061" "14000" "59900"
```

They are all strings, which makes it very difficult to plot a density curve since that relies on the data being continuous. Let’s fix that and try again:

```
meltedDf$value = as.numeric(meltedDf$value)

ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
  geom_density()
```
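One caveat worth noting (an assumption on my part, since it depends on how the CSV was parsed): if ‘value’ had come through as a factor rather than a character vector, `as.numeric` would silently return the underlying level codes rather than the prices. Converting via character first avoids that:

```
f = factor(c("50969", "45429"))
as.numeric(f)                 # level codes, not the prices
as.numeric(as.character(f))   # the actual numbers
```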

If we want to show the curves for both showcases we can tweak our code slightly:

```
ggplot(meltedDf %>% filter(grepl("Showcase", X)),
       aes(x = value, colour = X)) +
  geom_density() +
  theme(legend.position="top")
```

Et voila!

Written by Mark Needham

June 3rd, 2015 at 5:52 am

Posted in R


## R: Think Bayes Euro Problem

I’ve got back to working my way through Think Bayes after a month’s break and started out with the one euro coin problem in Chapter 4:

A statistical statement appeared in “The Guardian” on Friday January 4, 2002:

When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me,’ said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased, the chance of getting a result as extreme as that would be less than 7%.’

But do these data give evidence that the coin is biased rather than fair?

We’re going to create a data frame with each row representing the probability that heads shows up that often. We need one row for each value between 0 (no heads) and 100 (all heads) and we’ll start with the assumption that each value can be chosen equally (a uniform prior):

```
library(dplyr)

values = seq(0, 100)
scores = rep(1.0 / length(values), length(values))
df = data.frame(score = scores, value = values)

> df %>% sample_n(10)
         score value
60  0.00990099    59
101 0.00990099   100
10  0.00990099     9
41  0.00990099    40
2   0.00990099     1
83  0.00990099    82
44  0.00990099    43
97  0.00990099    96
100 0.00990099    99
12  0.00990099    11
```

Now we need to feed in our observations. We need to create a vector containing 140 heads and 110 tails. The ‘rep’ function comes in handy here:

```
observations = c(rep("T", times = 110), rep("H", times = 140))

> observations
  [1] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [29] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [57] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [85] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "H" "H"
[113] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[141] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[169] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[197] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[225] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
```

Now we need to iterate over each of the observations and update our data frame appropriately.

```
for(observation in observations) {
  if(observation == "H") {
    df = df %>% mutate(score = score * (value / 100.0))
  } else {
    df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
  }
}

df = df %>% mutate(weighted = score / sum(score))
```
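Since the posterior doesn’t depend on the order of the observations, the loop can also be collapsed into a single vectorised update – a sketch which should be equivalent:

```
heads = sum(observations == "H")
tails = sum(observations == "T")

df = df %>%
  mutate(score = score * (value / 100.0) ^ heads * (1.0 - (value / 100.0)) ^ tails) %>%
  mutate(weighted = score / sum(score))
```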

Now that we’ve done that we can calculate the maximum likelihood, mean, median and credible interval. We’ll create a ‘percentile’ function to help us out:

```
percentile = function(df, p) {
  df %>% filter(cumsum(weighted) > p) %>% head(1) %>% select(value) %>% as.numeric
}
```

And now let’s calculate the values:

```
# Maximum likelihood
> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56

# Mean
> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.95238

# Median
> percentile(df, 0.5)
[1] 56

# Credible Interval
percentage = 90
prob = (1 - percentage / 100.0) / 2

# lower
> percentile(df, prob)
[1] 51

# upper
> percentile(df, 1 - prob)
[1] 61
```

This all wraps up nicely into a function:

```
euro = function(values, priors, observations) {
  df = data.frame(score = priors, value = values)

  for(observation in observations) {
    if(observation == "H") {
      df = df %>% mutate(score = score * (value / 100.0))
    } else {
      df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
    }
  }

  return(df %>% mutate(weighted = score / sum(score)))
}
```

which we can call like so:

```
values = seq(0,100)
priors = rep(1.0 / length(values), length(values))
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)
```

The next part of the problem requires us to change the prior distribution to be more weighted to values close to 50%. We can tweak the parameters we pass into the function accordingly:

```
values = seq(0,100)
priors = sapply(values, function(x) ifelse(x < 50, x, 100 - x))
priors = priors / sum(priors)
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)
```

In fact, even with the adjusted priors we still end up with much the same posterior distribution:

```
> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56

> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.7435

> percentile(df, 0.5)
[1] 56

> percentile(df, 0.05)
[1] 51

> percentile(df, 0.95)
[1] 61
```

The book describes this phenomenon as follows:

This is an example of swamping the priors: with enough data, people who start with different priors will tend to converge on the same posterior.

Written by Mark Needham

May 31st, 2015 at 11:11 pm

Posted in R


## R: ggplot – Displaying multiple charts with a for loop

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

```
library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")

eventId = "220750415"
query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
               (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
         WITH rsvp, person
         MATCH (person)-[:RSVPD]->(otherRSVP)

         WITH person, rsvp, otherRSVP
         ORDER BY person.id, otherRSVP.time

         WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
         return rsvp.time, earliestRSVP.time, person.id"

df = cypher(graph, query, id = eventId)

> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773
```

We’re not doing anything too clever here, just using a couple of WITH clauses to order the RSVPs so we can get the earliest one for each person.

Once we’ve done that we’ll tidy up the data frame so that it contains columns containing the cohort in which the member attended their first event:

```
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")

df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03
```

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

```
byCohort = df %>% count(earliestDate)

ggplot(aes(x = earliestDate, y = n), data = byCohort) +
  geom_bar(stat="identity", fill = "dark blue") +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=1))
```

This is good and gives us the insight that most of the members attending this version of intro to graphs just joined the group. The event was on 7th April and most people joined in March which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

```
plotEvent = function(eventId) {
  query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                 (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
           WITH rsvp, person
           MATCH (person)-[:RSVPD]->(otherRSVP)

           WITH person, rsvp, otherRSVP
           ORDER BY person.id, otherRSVP.time

           WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
           return rsvp.time, earliestRSVP.time, person.id"

  df = cypher(graph, query, id = eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

  byCohort = df %>% count(earliestDate)

  ggplot(aes(x = earliestDate, y = n), data = byCohort) +
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=1))
}
```

We’d call it like this for the Intro to graphs meetup:

`> plotEvent("220750415")`

Next I tweaked the code to look up all the Intro to Graphs events and then loop through and output a chart for each one:

```
events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")

for(event in events$n) {
  plotEvent(as.character(event))
}
```

Unfortunately that doesn’t print anything at all – ggplot objects aren’t automatically printed inside a for loop – which we can fix by storing our plots in a list and then displaying them afterwards:

```
library(gridExtra)

p = list()
for(i in 1:count(events)$n) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}

do.call(grid.arrange, p)
```
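The explicit counter can also be avoided with `lapply`, which builds the list of plots directly – a sketch assuming the same ‘events’ data frame:

```
p = lapply(events[, 1], function(event) plotEvent(as.character(event)))
do.call(grid.arrange, p)
```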

This visualisation is probably better without any of the axes, so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart, which will require a slight tweak to the query:

```
plotEvent = function(eventId) {
  query = "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                 (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person)
           WITH e, rsvp, person
           MATCH (person)-[:RSVPD]->(otherRSVP)

           WITH e, person, rsvp, otherRSVP
           ORDER BY person.id, otherRSVP.time

           WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
           return rsvp.time, earliestRSVP.time, person.id, e.time"

  df = cypher(graph, query, id = eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")

  byCohort = df %>% count(earliestDate)

  ggplot(aes(x = earliestDate, y = n), data = byCohort) +
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    labs(title = df$eventTime[1])
}
```

I think this makes it a bit easier to read although I’ve made the mistake of not having all the charts representing the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

Written by Mark Needham

May 14th, 2015 at 12:17 am

Posted in neo4j,R


## R: Cohort heatmap of Neo4j London meetup

A few months ago I had a go at doing some cohort analysis of the Neo4j London meetup group which was an interesting experiment but unfortunately resulted in a chart that was completely illegible.

I wasn’t sure how to progress from there but a few days ago I came across the cohort heatmap which seemed like a better way of visualising things over time.

The underlying idea is still the same – we’re comparing different cohorts of users against each other to see whether a change or intervention we made at a certain time had any impact.

However, the way we display the cohorts changes and I think for the better.

```
df = read.csv("/tmp/df.csv")

> df %>% sample_n(5)
        rsvp.time person.id                time    date
255  1.354277e+12  12228948 2012-11-30 12:05:08 2012-11
2475 1.407342e+12  19057581 2014-08-06 16:26:04 2014-08
3988 1.421769e+12  66122172 2015-01-20 15:58:02 2015-01
4411 1.419377e+12 165750262 2014-12-23 23:27:44 2014-12
1010 1.383057e+12  74602292 2013-10-29 14:24:32 2013-10
```

And we need to transform this into a data frame which is grouped by cohort (members who attended their first meetup in a particular month). The following code gets us there:

```
library(zoo)   # for as.yearmon

firstMeetup = df %>%
  group_by(person.id) %>%
  summarise(firstEvent = min(time), count = n()) %>%
  arrange(desc(count))
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id

  attendance = df %>%
    filter(person.id %in% members) %>%
    count(person.id, date) %>%
    ungroup() %>%
    count(date)

  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all.x = TRUE)
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members), members = n)
}

cohorts = collect(df %>% select(date) %>% unique())[,1]

cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance, countsForCohort(df, firstMeetup, cohort))
}

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)

  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

cohortAttendanceWithMonthNumber = cohortAttendance %>%
  group_by(row_number()) %>%
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(!is.na(members)) %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>%
  arrange(monthNumber)

> cohortAttendanceWithMonthNumber %>% head(10)
Source: local data frame [10 x 7]
Groups: row_number()

      date n  cohort retention members row_number() monthNumber
1  2011-06 4 2011-06      1.00       4            1           0
2  2011-07 1 2011-06      0.25       1            2           1
3  2011-08 1 2011-06      0.25       1            3           2
4  2011-09 2 2011-06      0.50       2            4           3
5  2011-10 1 2011-06      0.25       1            5           4
6  2011-11 1 2011-06      0.25       1            6           5
7  2012-01 1 2011-06      0.25       1            7           7
8  2012-04 2 2011-06      0.50       2            8          10
9  2012-05 1 2011-06      0.25       1            9          11
10 2012-06 1 2011-06      0.25       1           10          12
```

Now let’s create our first heatmap.

```
t <- max(cohortAttendanceWithMonthNumber$members)

cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=members)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols,
                       limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom',
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")
```

‘t’ is the maximum number of members within a cohort who attended a meetup in a given month. This makes it easy to see which cohorts started with the most members but makes it difficult to compare their retention over time.

We can fix that by showing the percentage of members in the cohort who attend each month rather than using absolute values. To do that we must first add an extra column containing the percentage values:

```
cohortAttendanceWithMonthNumber$retentionPercentage =
  ifelse(!is.na(cohortAttendanceWithMonthNumber$retention),
         cohortAttendanceWithMonthNumber$retention * 100, 0)
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)

cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols,
                       limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom',
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")
```

This version allows us to compare cohorts against each other, but now we've lost the exact numbers, which means earlier cohorts will look better since there are fewer people in them. We can get the best of both worlds by keeping this visualisation but showing the actual values inside each box:

```
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)

cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom',
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (% of cohort members who attended event)") +
  geom_text(aes(label=members), size=3)
```

What we can learn overall is that the majority of people seem to have a passing interest, with a smaller percentage who continue to come to events.

It seems like we did a better job at retaining attendees in the middle of last year – one hypothesis is that the events we ran around then were more compelling but I need to do more analysis.
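One quick way to check that hypothesis would be to average the retention percentage across cohorts for each month. This is just a sketch with made-up data since the real `cohortAttendanceWithMonthNumber` data frame isn't reproduced here; the column names match those used above:

```r
library(dplyr)

# Made-up stand-in for cohortAttendanceWithMonthNumber: two cohorts
# observed over the same two months
cohorts = data.frame(
  cohort = c("2014-01", "2014-01", "2014-06", "2014-06"),
  date = c("2014-06", "2014-07", "2014-06", "2014-07"),
  retentionPercentage = c(20, 10, 100, 40)
)

# Average retention across cohorts, per month - months where we ran
# more compelling events should stand out
monthlyRetention = cohorts %>%
  group_by(date) %>%
  summarise(avgRetention = mean(retentionPercentage))
```

A month with unusually high `avgRetention` would support the idea that its events were more compelling, although cohort sizes would need taking into account too.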

Next I’m going to drill further into some of the recent events and see what cohorts the attendees came from.

Written by Mark Needham

May 11th, 2015 at 11:16 pm

Posted in R


## R: Neo4j London meetup group – How many events do people come to?


Earlier this week the number of members in the Neo4j London meetup group crept over the 2,000 mark and I thought it’d be fun to re-explore the data that I previously imported into Neo4j.

How often do people come to meetups?

```
library(RNeo4j)
library(dplyr)

graph = startGraph("http://localhost:7474/db/data/")

query = "MATCH (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(profile)-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g)
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.id, event.time + event.utc_offset AS eventTime, profile.id, membership.joined"

df = cypher(graph, query)

> df %>% head()
  event.id    eventTime profile.id membership.joined
1 20616111 1.309372e+12    6436797      1.307285e+12
2 20616111 1.309372e+12   12964956      1.307275e+12
3 20616111 1.309372e+12   14533478      1.307290e+12
4 20616111 1.309372e+12   10793775      1.307705e+12
5 24528711 1.311793e+12   10793775      1.307705e+12
6 29953071 1.314815e+12   10595297      1.308154e+12
```
```
byEventsAttended = df %>% count(profile.id)

> byEventsAttended %>% sample_n(10)
Source: local data frame [10 x 2]

   profile.id  n
1   128137932  2
2   126947632  1
3    98733862  2
4    20468901 11
5    48293132  5
6   144764532  1
7    95259802  1
8    14524524  3
9    80611852  2
10  134907492  2
```

Now let’s visualise the number of people that have attended a certain number of events:

```
ggplot(aes(x = n), data = byEventsAttended) +
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0, 750, by = 50))
```

Most people come to one meetup and then there’s a long tail after that with fewer and fewer people coming to lots of meetups.
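We can quantify that long tail with a couple of one-liners. The counts below are a small made-up sample rather than the real `byEventsAttended$n` column, which isn't reproduced in full here:

```r
# Hypothetical sample of per-member attendance counts (the real column
# is byEventsAttended$n, with one entry per member)
n = c(1, 1, 1, 2, 2, 3, 5, 11, 69)

# proportion of members who only ever came to one event
oneAndDone = mean(n == 1)

# share of all attendances accounted for by the single keenest member
topShare = max(n) / sum(n)
```

On the real data these two numbers would tell us how dominant the one-off visitors are and how much the heaviest attendees skew the totals.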

The chart has lots of blank space due to the sparseness of people on the right hand side. If we exclude any people who’ve attended more than 20 events we might get a more interesting visualisation:

```
ggplot(aes(x = n), data = byEventsAttended %>% filter(n <= 20)) +
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0, 750, by = 50))
```

Nicole suggested a more interesting visualisation would be a box plot so I decided to try that next:

```
ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  coord_flip()
```

This visualisation really emphasises that the majority of values are between 1 and 3, while it’s much less obvious how many values there are at the higher end. A quick check of the data with the summary function reveals as much:

```
> summary(byEventsAttended$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   2.837   3.000  69.000
```

Now to figure out how to move that box plot a bit to the right 🙂
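One option (a sketch, not something from the post) is to put the counts on a log scale, so the 1–3 range no longer hugs the left edge and the outliers stop squashing the box:

```r
library(ggplot2)

# Made-up stand-in for byEventsAttended, which has one row per member
# with n = number of events attended
byEventsAttended = data.frame(n = c(1, 1, 1, 2, 2, 3, 5, 11, 69))

# Same box plot as before but with a log10 y axis, which spreads the
# 1-3 range out and pulls the long tail in
p = ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  scale_y_log10() +
  coord_flip()
```

Whether a log scale is a kindness or a distortion depends on the audience; for a skewed count distribution like this one it at least makes the box readable.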

Written by Mark Needham

May 9th, 2015 at 10:33 pm

Posted in R


## R: dplyr – Error in (list: invalid subscript type ‘double’

In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).

e.g. imagine we have the following CDF represented in a data frame:

```
library(dplyr)

df = data.frame(score = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))
```

and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:

```
> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7
```

Things become more tricky if we want to return multiple percentiles in one go.

My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:

```
targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)

> percentilesDf %>%
    group_by(targetPercentile) %>%
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)

Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  :
  invalid subscript type 'double'
```

Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.

Instead he suggested the following variant, which uses the ‘do’ function:

```
df = data.frame(score = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))
targetPercentiles = c(0.05, 0.2)

> data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>%
    filter(percentile > targetPercentile) %>%
    slice(1) %>%
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.05     7
2             0.20    12
```

We can then wrap this up in a function:

```
percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)

  data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>%
    filter(percentile > targetPercentile) %>%
    slice(1) %>%
    select(targetPercentile, score)
}
```

which we call like this:

```
df = data.frame(score = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))

> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile

  targetPercentile score
1             0.08     7
2             0.10     8
```

Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries greater than those percentiles. With a proper CDF we would, though, so the function does its job.
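For comparison, here is a base R sketch of the same lookup that avoids mixing the two dplyr pipelines entirely. It's my own alternative, not from the post, and it differs slightly in that targets beyond the CDF come back as NA rows rather than being dropped:

```r
# For each target percentile, return the first score whose percentile
# strictly exceeds it (NA if no such row exists)
percentilesBase = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df[order(df$percentile), ]
  scores = sapply(targetPercentiles, function(p) {
    matching = df$score[df$percentile > p]
    if (length(matching) == 0) NA else matching[1]
  })
  data.frame(targetPercentile = targetPercentiles, score = scores)
}

df = data.frame(score = c(5, 7, 8, 10, 12, 20),
                percentile = c(0.05, 0.1, 0.15, 0.20, 0.25, 0.5))

result = percentilesBase(df, c(0.08, 0.10, 0.50))
```

Because each target is handled by its own `sapply` iteration there's no grouped pipeline for dplyr to get confused about, at the cost of losing the `do()` idiom.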

Written by Mark Needham

April 27th, 2015 at 10:34 pm

Posted in R
