Mark Needham

Thoughts on Software Development

Archive for the ‘rstats’ tag

R: data.table – Finding the maximum row

with 2 comments

In my continued playing around with the R data.table package I wanted to find the maximum row based on one of the columns, grouped by another column, and then return back the whole row.

We’ll use the following data table to illustrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                    price = c(10000, 12500, 18000, 245000, 512000, 1000000),
                    date = c("Day 1", "Day 7", "Day 10", "Day 3", "Day 5", "Day 12"))
 
> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT[, diff := price - lag.price]
 
> blogDT
         name   price   date lag.price   diff
1: Property 1   10000  Day 1        NA     NA
2: Property 1   12500  Day 7     10000   2500
3: Property 1   18000 Day 10     12500   5500
4: Property 2  245000  Day 3        NA     NA
5: Property 2  512000  Day 5    245000 267000
6: Property 2 1000000 Day 12    512000 488000

I wanted to find the biggest difference in ‘price’ and ‘lag.price’ grouped by the ‘name’ column.

If we just want to get the max ‘diff’ grouped by ‘name’ it’s quite easy:

> blogDT[!is.na(diff), .(max = max(diff)), keyby = name]
         name    max
1: Property 1   5500
2: Property 2 488000

However now we’ve lost the information about which ‘date’ that was on and what the ‘price’ and ‘lag.price’ were which is a bit annoying.

If we only want to keep the rows which have the highest ‘diff’ grouped by ‘name’, one way to go about this is to add a ‘max.diff’ column to each row and then filter appropriately. e.g.

> maxDT = blogDT[!is.na(diff)][, max := max(diff), by = name]
 
> maxDT
         name   price   date lag.price   diff    max
1: Property 1   12500  Day 7     10000   2500   5500
2: Property 1   18000 Day 10     12500   5500   5500
3: Property 2  512000  Day 5    245000 267000 488000
4: Property 2 1000000 Day 12    512000 488000 488000
 
> maxDT[diff == max]
         name   price   date lag.price   diff    max
1: Property 1   18000 Day 10     12500   5500   5500
2: Property 2 1000000 Day 12    512000 488000 488000

I’ve only been playing with data.table for a few days so if there’s a better way do let me know in the comments.

Written by Mark Needham

October 2nd, 2015 at 6:42 pm

Posted in R

Tagged with , ,

R: data.table – Comparing adjacent rows

with 2 comments


As part of my exploration of the Land Registry price paid data set I wanted to compare the difference between consecutive sales of properties.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row unless it’s the first sale in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))
 
> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000

Next let’s calculate the difference between the two prices:

> blogDT[, diff := price - lag.price]
> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000

Finally let’s order the data table by the biggest price gains:

> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA

Written by Mark Needham

September 27th, 2015 at 10:02 pm

Posted in R

Tagged with ,

R: Date for given week/year

without comments

As I mentioned in my last couple of blog posts I’ve been looking at the data behind this blog and I wanted to plot a chart showing the number of posts per week since the blog started.

I started out with a data frame with posts and publication date:

> library(dplyr)
> df = read.csv("posts.csv")
> df$date = ymd_hms(df$date)
 
> df %>% sample_n(10)
                                                                                title                date
538                                    Nygard Big Data Model: The Investigation Stage 2012-10-10 00:00:36
341                                                            The read-only database 2011-08-29 23:32:26
1112                                  CSS in Internet Explorer - Some lessons learned 2008-10-31 15:24:51
143                                                       Coding: Mutating parameters 2010-08-26 07:47:23
433  Scala: Counting number of inversions (via merge sort) for an unsorted collection 2012-03-20 06:53:18
618                                    neo4j/cypher: SQL style GROUP BY functionality 2013-02-17 21:05:27
1111                                 Testing Hibernate mappings: Setting up test data 2008-10-30 13:24:14
462                                       neo4j: What question do you want to answer? 2012-05-05 13:20:41
1399                                       Book Club: Design Sense (Michael Feathers) 2009-09-29 14:42:29
494                                    Bash Shell: Reusing parts of previous commands 2012-07-05 23:42:35

The first step was to add a couple of columns representing the week and year for the publication date. The ‘lubridate’ library came in handy here:

byWeek = df %>% 
  mutate(year = year(date), week = week(date)) %>% 
  group_by(week, year) %>% summarise(n = n()) %>% 
  ungroup() %>% arrange(desc(n))
 
> byWeek
Source: local data frame [352 x 3]
 
   week year  n
1    33 2008 14
2    35 2008 11
3    53 2012 11
4     9 2013 10
5    12 2013  9
6    21 2009  9
7    22 2009  9
8    38 2013  9
9    40 2008  9
10   48 2012  9
..  ...  ... ..

The next step is to calculate the start date of each of those weeks so that we can plot the counts on a continuous date scale. I spent a while searching how to do this before realising that the ‘week’ function I used before can set the week for a given data as well. Let’s get to work:

calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
> calculate_start_of_week(c(1,2,3), c(2015,2014,2013))
[1] "2015-01-01 UTC" "2014-01-08 UTC" "2013-01-15 UTC"

And now let’s transform our data frame and plot the counts:

ggplot(aes(x=start_of_week, y=n, group=1), 
       data = byWeek %>% mutate(start_of_week = calculate_start_of_week(week, year))) + 
  geom_line()

2015 07 10 22 43 54

It’s a bit erratic as you can see. Some of this can be explained by the fact that I do in fact post in an erratic way while some of it is explained by the fact that some weeks only have a few days if they start on the 29th onwards.

Written by Mark Needham

July 10th, 2015 at 10:01 pm

Posted in R

Tagged with ,

R: dplyr – Error: cannot modify grouping variable

without comments

I’ve been doing some exploration of the posts made on this blog and I thought I’d start with answering a simple question – on which dates did I write the most posts?

I started with a data frame containing each post and the date it was published:

> library(dplyr)
> df %>% sample_n(5)
                                                title                date
1148 Taiichi Ohno's Workplace Management: Book Review 2008-12-08 14:14:48
158     Rails: Faking a delete method with 'form_for' 2010-09-20 18:52:15
331           Retrospectives: The 4 L's Retrospective 2011-07-25 21:00:30
1035       msbuild - Use OutputPath instead of OutDir 2008-08-14 18:54:03
1181                The danger of commenting out code 2009-01-17 06:02:33

To find the most popular days for blog posts we can write the following aggregation function:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% arrange(desc(n))
 
Source: local data frame [1,140 x 2]
 
          day n
1  2012-12-31 6
2  2014-05-31 6
3  2008-08-08 5
4  2013-01-27 5
5  2009-08-24 4
6  2012-06-24 4
7  2012-09-30 4
8  2012-10-27 4
9  2012-11-24 4
10 2013-02-28 4

So we can see a couple of days with 6 posts, a couple with 5 posts, a few more with 4 posts and then presumably loads of days with 1 post.

I thought it’d be cool if we could blog a histogram which had on the x axis the number of posts and on the y axis how many days that number of posts occurred e.g. for an x value of 6 (posts) we’d have a y value of 2 (occurrences).

My initial attempt was this:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% count(n)
Error: cannot modify grouping variable

Unfortunately that isn’t allowed. I tried ungrouping and then counting again:

 df %>% mutate(day = as.Date(date)) %>% count(day) %>% ungroup() %>% count(n)
Error: cannot modify grouping variable

Still no luck. I did a bit of googlign around and came across a post which suggested using a combination of group_by + mutate or group_by + summarize.

I tried the mutate approach first:

> df %>% mutate(day = as.Date(date)) %>% 
+     group_by(day) %>% mutate(n = n()) %>% ungroup() %>% sample_n(5)
                                                        title                Source: local data frame [5 x 4]
 
                                    title                date        day n
1 QCon London 2009: DDD & BDD - Dan North 2009-03-13 15:28:04 2009-03-13 2
2        Onboarding: Sketch the landscape 2013-02-15 07:36:06 2013-02-15 1
3                           Ego Depletion 2013-06-04 23:16:29 2013-06-04 1
4                 Clean Code: Book Review 2008-09-15 09:52:33 2008-09-15 1
5            Dreyfus Model: More thoughts 2009-08-10 10:36:51 2009-08-10 1

That keeps around the ‘title’ which is a bit annoying. We can get rid of it using a distinct on ‘day’ if we want and if we also implement the second part of the function we end up with the following:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% mutate(n = n()) %>% distinct(day) %>% ungroup() %>% 
    group_by(n) %>%
    mutate(c = n()) %>%
    distinct(n)  
 
Source: local data frame [6 x 5]
Groups: n
 
                                                title                date        day n   c
1       Functional C#: Writing a 'partition' function 2010-02-01 23:34:02 2010-02-01 1 852
2                            Willed vs Forced designs 2010-02-08 22:48:05 2010-02-08 2 235
3                            TDD: Testing collections 2010-07-28 06:05:25 2010-07-28 3  41
4  Creating a Samba share between Ubuntu and Mac OS X 2012-06-24 00:40:35 2012-06-24 4   8
5            Gamification and Software: Some thoughts 2012-12-31 10:57:19 2012-12-31 6   2
6 Python/numpy: Selecting specific column in 2D array 2013-01-27 02:10:10 2013-01-27 5   2

Annoyingly we’ve still got the ‘title’, ‘date’ and ‘day’ columns hanging around which we’d need to get rid of with a call to ‘select’. The code also feels quite icky, especially the use of distinct in a couple of places.

In fact we can simplify the code if we use summarize instead of mutate:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
 
 
Source: local data frame [6 x 2]
 
  n   c
1 1 852
2 2 235
3 3  41
4 4   8
5 5   2
6 6   2

And we’ve got also rid of the extra columns in the bargain which is great! And now we can plot our histogram:

> library(ggplot2)
> post_frequencies = df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
> ggplot(aes(x = n, y = c), data = post_frequencies) + geom_bar(stat = "identity")

2015 07 09 06 44 47

In this case we don’t actually need to do the second grouping to create the bar chart since ggplot will do it for us if we feed it the following data:

. ggplot(aes(x = n), 
         data = df %>% mutate(day = as.Date(date)) %>% group_by(day) %>% summarize(n = n()) %>% ungroup()) +
    geom_bar(binwidth = 1) +
    scale_x_continuous(limits=c(1, 6))
2015 07 09 06 55 12

Still, it’s good to know how!

Written by Mark Needham

July 9th, 2015 at 5:55 am

Posted in R

Tagged with , ,

R: ggplot geom_density – Error in exists(name, envir = env, mode = mode) : argument “env” is missing, with no default

without comments

Continuing on from yesterday’s blog post where I worked out how to clean up the Think Bayes Price is Right data set, the next task was to plot a distribution of the prices of show case items.

To recap, this is what the data frame we’re working with looks like:

library(dplyr)
 
df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))
df2011 = df2011 %>% na.omit()
 
> df2011 %>% head()
              X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3    Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4    Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6         Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7         Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9  Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
10 Difference 2   11429  -25839    8186   -6572    -680    5812    9373     437    9125    8819  14952
...

So our goal is to plot the density of the ‘Showcase 1’ items. Unfortunately those aren’t currently stored in a way that makes this easy for us. We need to flip the data frame so that we have a row for each date/price type/price:

PriceType  Date     Price
Showcase 1 Sep..19  50969
Showcase 2 Sep..19  21901
...
Showcase 1 Sep..20  45429
Showcase 2 Sep..20  34061

The reshape library’s melt function is our friend here:

library(reshape)
meltedDf = melt(df2011, id=c("X"))
 
> meltedDf %>% sample_n(10)
                X variable value
643    Showcase 1  Feb..24 27883
224    Showcase 2  Nov..10 34089
1062 Difference 2   Jun..4  9962
770    Showcase 2  Mar..28 39620
150  Difference 2  Oct..24  9137
431  Difference 1   Jan..4  7516
345         Bid 1  Dec..12 21569
918  Difference 2    May.1 -2093
536    Showcase 2  Jan..31 30918
502         Bid 2  Jan..23 27000

Now we need to plug this into ggplot. We’ll start by just plotting all the prices for showcase 1:

> ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
    geom_density()
 
Error in exists(name, envir = env, mode = mode) : 
  argument "env" is missing, with no default


This error usually means that you’ve passed an empty data set to ggplot which isn’t the case here, but if we extract the values column we can see the problem:

> meltedDf$value[1:10]
 [1] "50969" "45429" "42000" "34000" "8969"  "11429" "21901" "34061" "14000" "59900"

They are all strings! Making it very difficult to plot a density curve which relies on the data being continuous. Let’s fix that and try again:

meltedDf$value = as.numeric(meltedDf$value)

ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
  geom_density()

2015 06 03 06 46 48

If we want to show the curves for both showcases we can tweak our code slightly:

ggplot(meltedDf %>% filter(grepl("Showcase", X)), aes(x = value, colour = X)) + 
  geom_density() + 
  theme(legend.position="top")
2015 06 03 06 50 35

Et voila!

Written by Mark Needham

June 3rd, 2015 at 5:52 am

Posted in R

Tagged with ,

R: Think Bayes Euro Problem

without comments

I’ve got back to working my way through Think Bayes after a month’s break and started out with the one euro coin problem in Chapter 4:

A statistical statement appeared in “The Guardian” on Friday January 4, 2002:

When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me,’ said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased, the chance of getting a result as extreme as that would be less than 7%.’

But do these data give evidence that the coin is biased rather than fair?

We’re going to create a data frame with each row representing the probability that heads shows up that often. We need one row for each value between 0 (no heads) and 100 (all heads) and we’ll start with the assumption that each value can be chosen equally (a uniform prior):

library(dplyr)
 
values = seq(0, 100)
scores = rep(1.0 / length(values), length(values))  
df = data.frame(score = scores, value = values)
 
> df %>% sample_n(10)
         score value
60  0.00990099    59
101 0.00990099   100
10  0.00990099     9
41  0.00990099    40
2   0.00990099     1
83  0.00990099    82
44  0.00990099    43
97  0.00990099    96
100 0.00990099    99
12  0.00990099    11

Now we need to feed in our observations. We need to create a vector containing 140 heads and 110 tails. The ‘rep’ function comes in handy here:

observations = c(rep("T", times = 110), rep("H", times = 140))
> observations
  [1] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [29] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [57] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [85] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "H" "H"
[113] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[141] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[169] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[197] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[225] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"

Now we need to iterate over each of the observations and update our data frame appropriately.

for(observation in observations) {
  if(observation == "H") {
    df = df %>% mutate(score = score * (value / 100.0))
  } else {
    df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
  }    
}
 
df = df %>% mutate(weighted = score / sum(score))

Now that we’ve done that we can calculate the maximum likelihood, mean, median and credible interval. We’ll create a ‘percentile’ function to help us out:

percentile = function(df, p) {
    df %>% filter(cumsum(weighted) > p) %>% head(1) %>% select(value) %>% as.numeric
}

And now let’s calculate the values:

# Maximum likelihood
> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56
 
# Mean
> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.95238
 
# Median
> percentile(df, 0.5)
[1] 56
 
# Credible Interval
percentage = 90
prob = (1 - percentage / 100.0) / 2
 
# lower
> percentile(df, prob)
[1] 51
 
# upper
> percentile(df, 1 - prob)
[1] 61

This all wraps up nicely into a function:

euro = function(values, priors, observations) {
  df = data.frame(score = priors, value = values)
 
  for(observation in observations) {
    if(observation == "H") {
      df = df %>% mutate(score = score * (value / 100.0))
    } else {
      df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
    }    
  }
 
  return(df %>% mutate(weighted = score / sum(score)))
}

which we can call like so:

values = seq(0,100)
priors = rep(1.0 / length(values), length(values))
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)

The next part of the problem requires us to change the prior distribution to be more weighted to values close to 50%. We can tweak the parameters we pass into the function accordingly:

values = seq(0,100)
priors = sapply(values, function(x) ifelse(x < 50, x, 100 - x))
priors = priors / sum(priors)
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)

In fact even with the adjusted priors we still end up with the same posterior distribution:

> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56
 
> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.7435
 
> percentile(df, 0.5)
[1] 56
 
> percentile(df, 0.05)
[1] 51
 
> percentile(df, 0.95)
[1] 61

The book describes this phenemenom as follows:

This is an example of swamping the priors: with enough data, people who start with different priors will tend to converge on the same posterior.

Written by Mark Needham

May 31st, 2015 at 11:11 pm

Posted in R

Tagged with ,

R: ggplot – Displaying multiple charts with a for loop

without comments

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
eventId = "220750415"
query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
df = cypher(graph, query, id= eventId)
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773

We’re not doing anything too clever here, just using a couple of WITH clauses to order RSVPs so we can get the earlier one for each person.

Once we’ve done that we’ll tidy up the data frame so that it contains columns containing the cohort in which the member attended their first event:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

byCohort = df %>% count(earliestDate) 
 
ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))

2015 05 13 00 30 59

This is good and gives us the insight that most of the members attending this version of intro to graphs just joined the group. The event was on 7th April and most people joined in March which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))
}

We’d call it like this for the Intro to graphs meetup:

> plotEvent("220750415")

Next I tweaked the code to look up all Into to graphs events and then loop through and output a chart for each event:

events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")
 
for(event in events$n) {
  plotEvent(as.character(event))
}

Unfortunately that doesn’t print anything at all which we can fix by storing our plots in a list and then displaying it afterwards:

library(gridExtra)
p = list()
for(i in 1:count(events)$n) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}
 
do.call(grid.arrange,p)

2015 05 14 00 57 10

This visualisation is probably better without any of the axis so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart which will require a slight tweak of the query:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
            WITH e,rsvp, person
            MATCH (person)-[:RSVPD]->(otherRSVP)
 
            WITH e,person, rsvp, otherRSVP
            ORDER BY person.id, otherRSVP.time
 
            WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
            return rsvp.time, earliestRSVP.time,  person.id, e.time"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) + 
    labs(title = df$eventTime[1])
}
2015 05 14 01 08 54

I think this makes it a bit easier to read although I’ve made the mistake of not having all the charts representing the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

Written by Mark Needham

May 14th, 2015 at 12:17 am

Posted in neo4j,R

Tagged with , ,

R: Cohort heatmap of Neo4j London meetup

without comments

A few months ago I had a go at doing some cohort analysis of the Neo4j London meetup group which was an interesting experiment but unfortunately resulted in a chart that was completely illegible.

I wasn’t sure how to progress from there but a few days ago I came across the cohort heatmap which seemed like a better way of visualising things over time.

The underlying idea is still the same – we’ve comparing different cohorts of users against each other to see whether a change or intervention we did at a certain time had any impact.

However, the way we display the cohorts changes and I think for the better.

To recap, we start with the following data frame:

df = read.csv("/tmp/df.csv")
> df %>% sample_n(5)
        rsvp.time person.id                time    date
255  1.354277e+12  12228948 2012-11-30 12:05:08 2012-11
2475 1.407342e+12  19057581 2014-08-06 16:26:04 2014-08
3988 1.421769e+12  66122172 2015-01-20 15:58:02 2015-01
4411 1.419377e+12 165750262 2014-12-23 23:27:44 2014-12
1010 1.383057e+12  74602292 2013-10-29 14:24:32 2013-10

And we need to transform this into a data frame which is grouped by cohort (members who attended their first meetup in a particular month). The following code gets us there:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
 
countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all.x = TRUE)  
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members), members = n)  
}
 
cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(!is.na(members)) %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)
 
> cohortAttendanceWithMonthNumber %>% head(10)
Source: local data frame [10 x 7]
Groups: row_number()
 
      date n  cohort retention members row_number() monthNumber
1  2011-06 4 2011-06      1.00       4            1           0
2  2011-07 1 2011-06      0.25       1            2           1
3  2011-08 1 2011-06      0.25       1            3           2
4  2011-09 2 2011-06      0.50       2            4           3
5  2011-10 1 2011-06      0.25       1            5           4
6  2011-11 1 2011-06      0.25       1            6           5
7  2012-01 1 2011-06      0.25       1            7           7
8  2012-04 2 2011-06      0.50       2            8          10
9  2012-05 1 2011-06      0.25       1            9          11
10 2012-06 1 2011-06      0.25       1           10          12

Now let’s create our first heatmap.

t <- max(cohortAttendanceWithMonthNumber$members)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=members)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")

2015 05 11 23 55 56

‘t’ is the maximum number of members within a cohort who attended a meetup in a given month. This makes it easy to see which cohorts started with the most members but makes it difficult to compare their retention over time.

We can fix that by showing the percentage of members in the cohort who attend each month rather than using absolute values. To do that we must first add an extra column containing the percentage values:

cohortAttendanceWithMonthNumber$retentionPercentage = ifelse(!is.na(cohortAttendanceWithMonthNumber$retention),  cohortAttendanceWithMonthNumber$retention * 100, 0)
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")

2015 05 12 00 01 55

This version allows us to compare cohorts against each other but now we don’t have the exact numbers which means earlier cohorts will look better since there are less people in them. We can get the best of both worlds by keeping this visualisation but showing the actual values inside each box:

t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)") + 
  geom_text(aes(label=members),size=3)

2015 05 12 00 04 31

What we can learn overall is that the majority of people seem to have a passing interest and then we have a smaller percentage who will continue to come to events.

It seems like we did a better job at retaining attendees in the middle of last year – one hypothesis is that the events we ran around then were more compelling but I need to do more analysis.

Next I’m going to drill further into some of the recent events and see what cohorts the attendees came from.

Written by Mark Needham

May 11th, 2015 at 11:16 pm

Posted in R

Tagged with ,

R: Neo4j London meetup group – How many events do people come to?

with one comment

Earlier this week the number of members in the Neo4j London meetup group creeped over the 2,000 mark and I thought it’d be fun to re-explore the data that I previously imported into Neo4j.

How often do people come to meetups?

library(RNeo4j)
library(dplyr)
 
graph = startGraph("http://localhost:7474/db/data/")
 
query = "MATCH (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(profile)-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g)
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.id, event.time + event.utc_offset AS eventTime, profile.id, membership.joined"
 
df = cypher(graph, query)
 
> df %>% head()
  event.id    eventTime profile.id membership.joined
1 20616111 1.309372e+12    6436797      1.307285e+12
2 20616111 1.309372e+12   12964956      1.307275e+12
3 20616111 1.309372e+12   14533478      1.307290e+12
4 20616111 1.309372e+12   10793775      1.307705e+12
5 24528711 1.311793e+12   10793775      1.307705e+12
6 29953071 1.314815e+12   10595297      1.308154e+12
byEventsAttended = df %>% count(profile.id)
 
> byEventsAttended %>% sample_n(10)
Source: local data frame [10 x 2]
 
   profile.id  n
1   128137932  2
2   126947632  1
3    98733862  2
4    20468901 11
5    48293132  5
6   144764532  1
7    95259802  1
8    14524524  3
9    80611852  2
10  134907492  2

Now let’s visualise the number of people that have attended certain number of events:

ggplot(aes(x = n), data = byEventsAttended) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))

2015 05 09 01 15 02

Most people come to one meetup and then there’s a long tail after that with fewer and fewer people coming to lots of meetups.

The chart has lots of blank space due to the sparseness of people on the right hand side. If we exclude any people who’ve attended more than 20 events we might get a more interesting visualisation:

ggplot(aes(x = n), data = byEventsAttended %>% filter(n <= 20)) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))
2015 05 09 01 15 36

Nicole suggested a more interesting visualisation would be a box plot so I decided to try that next:

ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  coord_flip()

2015 05 09 22 31 20

This visualisation really emphasises that the majority are between 1 and 3 and it’s much less obvious how many values there are at the higher end. A quick check of the data with the summary function reveals as much:

> summary(byEventsAttended$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.837   3.000  69.000


Now to figure out how to move that box plot a bit to the right 🙂

Written by Mark Needham

May 9th, 2015 at 10:33 pm

Posted in R

Tagged with ,

R: dplyr – Error in (list: invalid subscript type ‘double’

without comments

In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).

e.g. imagine we have the following CDF represented in a data frame:

library(dplyr)
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))

and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:

> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7

Things become more tricky if we want to return multiple percentiles in one go.

My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:

targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)
> percentilesDf %>% 
    group_by(targetPercentile) %>%
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
 
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  : 
  invalid subscript type 'double'

Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.

Instead he suggested the following variant which uses the do function

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
targetPercentiles = c(0.05, 0.2)
 
> data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.05     7
2             0.20    12

We can then wrap this up in a function:

percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
 
  data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
}

which we call like this:

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.08     7
2             0.10     8

Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries greater than those percentiles. With a proper CDF we would though so the function does its job.

Written by Mark Needham

April 27th, 2015 at 10:34 pm

Posted in R

Tagged with ,