Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: ggplot – Plotting multiple variables on a line chart

without comments

In my continued playing around with meetup data I wanted to plot the number of members who join the Neo4j group over time.

I started off with the variable ‘byWeek’ which shows how many members joined the group each week:

> head(byWeek)
Source: local data frame [6 x 2]
 
        week n
1 2011-06-02 8
2 2011-06-09 4
3 2011-06-30 2
4 2011-07-14 1
5 2011-07-21 1
6 2011-08-18 1

I wanted to plot the actual count alongside a rolling average for which I created the following data frame:

library(zoo)
joinsByWeek = data.frame(actual = byWeek$n, 
                         week = byWeek$week,
                         rolling = rollmean(byWeek$n, 4, fill = NA, align=c("right")))
> head(joinsByWeek, 10)
   actual       week rolling
1       8 2011-06-02      NA
2       4 2011-06-09      NA
3       2 2011-06-30      NA
4       1 2011-07-14    3.75
5       1 2011-07-21    2.00
6       1 2011-08-18    1.25
7       1 2011-10-13    1.00
8       2 2011-11-24    1.25
9       1 2012-01-05    1.25
10      3 2012-01-12    1.75

The next step was to work out how to plot both ‘rolling’ and ‘actual’ on the same line chart. The easiest way is to make two calls to ‘geom_line’, like so:

ggplot(joinsByWeek, aes(x = week)) + 
  geom_line(aes(y = rolling), colour="blue") + 
  geom_line(aes(y = actual), colour = "grey") + 
  ylab(label="Number of new members") + 
  xlab("Week")
2014 09 16 21 57 14

Alternatively we can make use of the ‘melt’ function from the reshape library…

library(reshape)
meltedJoinsByWeek = melt(joinsByWeek, id = 'week')
> head(meltedJoinsByWeek, 20)
   week variable value
1     1   actual     8
2     2   actual     4
3     3   actual     2
4     4   actual     1
5     5   actual     1
6     6   actual     1
7     7   actual     1
8     8   actual     2
9     9   actual     1
10   10   actual     3
11   11   actual     1
12   12   actual     2
13   13   actual     4
14   14   actual     2
15   15   actual     3
16   16   actual     5
17   17   actual     1
18   18   actual     2
19   19   actual     1
20   20   actual     2

…which then means we can plot the chart with a single call to geom_line:

ggplot(meltedJoinsByWeek, aes(x = week, y = value, colour = variable)) + 
  geom_line() + 
  ylab(label="Number of new members") + 
  xlab("Week Number") + 
  scale_colour_manual(values=c("grey", "blue"))

2014 09 16 22 17 40

Written by Mark Needham

September 16th, 2014 at 4:59 pm

Posted in R

Tagged with ,

R: ggplot – Plotting a single variable line chart (geom_line requires the following missing aesthetics: y)

without comments

I’ve been learning how to do moving averages in R and having done that calculation I wanted to plot these variables on a line chart using ggplot.

The vector of rolling averages looked like this:

> rollmean(byWeek$n, 4)
  [1]  3.75  2.00  1.25  1.00  1.25  1.25  1.75  1.75  1.75  2.50  2.25  2.75  3.50  2.75  2.75
 [16]  2.25  1.50  1.50  2.00  2.00  2.00  2.00  1.25  1.50  2.25  2.50  3.00  3.25  2.75  4.00
 [31]  4.25  5.25  7.50  6.50  5.75  5.00  3.50  4.00  5.75  6.25  6.25  6.00  5.25  6.25  7.25
 [46]  7.75  7.00  4.75  2.75  1.75  2.00  4.00  5.25  5.50 11.50 11.50 12.75 14.50 12.50 11.75
 [61] 11.00  9.25  5.25  4.50  3.25  4.75  7.50  8.50  9.25 10.50  9.75 15.25 16.00 15.25 15.00
 [76] 10.00  8.50  6.50  4.25  3.00  4.25  4.75  7.50 11.25 11.00 11.50 10.00  6.75 11.25 12.50
 [91] 12.00 11.50  6.50  8.75  8.50  8.25  9.50  8.50  8.75  9.50  8.00  4.25  4.50  7.50  9.00
[106] 12.00 19.00 19.00 22.25 23.50 22.25 21.75 19.50 20.75 22.75 22.75 24.25 28.00 23.00 26.00
[121] 24.25 21.50 26.00 24.00 28.25 25.50 24.25 31.50 31.50 35.75 35.75 29.00 28.50 27.25 25.50
[136] 27.50 26.00 23.75

I initially tried to plot a line chart like this:

library(ggplot2)
library(zoo)
rollingMean = rollmean(byWeek$n, 4)
qplot(rollingMean) + geom_line()

which resulted in this error:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Error: geom_line requires the following missing aesthetics: y

It turns out we need to provide an x and y value if we want to draw a line chart. In this case we’ll generate the ‘x’ value – we only care that the y values get plotted in order from left to right:

qplot(1:length(rollingMean), rollingMean, xlab ="Week Number") + geom_line()
2014 09 13 16 58 57

If we want to use the ‘ggplot’ function then we need to put everything into a data frame first and then plot it:

ggplot(data.frame(week = 1:length(rollingMean), rolling = rollingMean),
       aes(x = week, y = rolling)) +
  geom_line()

2014 09 13 17 11 13

Written by Mark Needham

September 13th, 2014 at 11:41 am

Posted in R

Tagged with

R: Calculating rolling or moving averages

with one comment

I’ve been playing around with some time series data in R and since there’s a bit of variation between consecutive points I wanted to smooth the data out by calculating the moving average.

I struggled to find an in built function to do this but came across Didier Ruedin’s blog post which described the following function to do the job:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}

I tried plugging in some numbers to understand how it works:

> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA 4.333333 5.000000       NA

Here I was trying to do a rolling average which took into account the last 3 numbers so I expected to get just two numbers back – 4.333333 and 5 – and if there were going to be NA values I thought they’d be at the beginning of the sequence.

In fact it turns out this is what the ‘sides’ parameter controls:

sides	
for convolution filters only. If sides = 1 the filter coefficients are for past values only; if sides = 2 they 
are centred around lag 0. In this case the length of the filter should be odd, but if it is even, more of the 
filter is forward in time than backward.

So in our ‘mav’ function the rolling average looks both sides of the current value rather than just at past values. We can tweak that to get the behaviour we want:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)}
> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA       NA 4.333333 5.000000

The NA values are annoying for any plotting we want to do so let’s get rid of them:

> na.omit(mav(c(4,5,4,6), 3))
Time Series:
Start = 3 
End = 4 
Frequency = 1 
[1] 4.333333 5.000000

Having got to this point I noticed that Didier had referenced the zoo package in the comments and it has a built in function to take care of all this:

> library(zoo)
> rollmean(c(4,5,4,6), 3)
[1] 4.333333 5.000000

I also realised I can list all the functions in a package with the ‘ls’ function so I’ll be scanning zoo’s list of functions next time I need to do something time series related – there’ll probably already be a function for it!

> ls("package:zoo")
  [1] "as.Date"              "as.Date.numeric"      "as.Date.ts"          
  [4] "as.Date.yearmon"      "as.Date.yearqtr"      "as.yearmon"          
  [7] "as.yearmon.default"   "as.yearqtr"           "as.yearqtr.default"  
 [10] "as.zoo"               "as.zoo.default"       "as.zooreg"           
 [13] "as.zooreg.default"    "autoplot.zoo"         "cbind.zoo"           
 [16] "coredata"             "coredata.default"     "coredata<-"          
 [19] "facet_free"           "format.yearqtr"       "fortify.zoo"         
 [22] "frequency<-"          "ifelse.zoo"           "index"               
 [25] "index<-"              "index2char"           "is.regular"          
 [28] "is.zoo"               "make.par.list"        "MATCH"               
 [31] "MATCH.default"        "MATCH.times"          "median.zoo"          
 [34] "merge.zoo"            "na.aggregate"         "na.aggregate.default"
 [37] "na.approx"            "na.approx.default"    "na.fill"             
 [40] "na.fill.default"      "na.locf"              "na.locf.default"     
 [43] "na.spline"            "na.spline.default"    "na.StructTS"         
 [46] "na.trim"              "na.trim.default"      "na.trim.ts"          
 [49] "ORDER"                "ORDER.default"        "panel.lines.its"     
 [52] "panel.lines.tis"      "panel.lines.ts"       "panel.lines.zoo"     
 [55] "panel.plot.custom"    "panel.plot.default"   "panel.points.its"    
 [58] "panel.points.tis"     "panel.points.ts"      "panel.points.zoo"    
 [61] "panel.polygon.its"    "panel.polygon.tis"    "panel.polygon.ts"    
 [64] "panel.polygon.zoo"    "panel.rect.its"       "panel.rect.tis"      
 [67] "panel.rect.ts"        "panel.rect.zoo"       "panel.segments.its"  
 [70] "panel.segments.tis"   "panel.segments.ts"    "panel.segments.zoo"  
 [73] "panel.text.its"       "panel.text.tis"       "panel.text.ts"       
 [76] "panel.text.zoo"       "plot.zoo"             "quantile.zoo"        
 [79] "rbind.zoo"            "read.zoo"             "rev.zoo"             
 [82] "rollapply"            "rollapplyr"           "rollmax"             
 [85] "rollmax.default"      "rollmaxr"             "rollmean"            
 [88] "rollmean.default"     "rollmeanr"            "rollmedian"          
 [91] "rollmedian.default"   "rollmedianr"          "rollsum"             
 [94] "rollsum.default"      "rollsumr"             "scale_x_yearmon"     
 [97] "scale_x_yearqtr"      "scale_y_yearmon"      "scale_y_yearqtr"     
[100] "Sys.yearmon"          "Sys.yearqtr"          "time<-"              
[103] "write.zoo"            "xblocks"              "xblocks.default"     
[106] "xtfrm.zoo"            "yearmon"              "yearmon_trans"       
[109] "yearqtr"              "yearqtr_trans"        "zoo"                 
[112] "zooreg"

Written by Mark Needham

September 13th, 2014 at 8:15 am

Posted in R

Tagged with ,

R: ggplot – Cumulative frequency graphs

without comments

In my continued playing around with ggplot I wanted to create a chart showing the cumulative growth of the number of members of the Neo4j London meetup group.

My initial data frame looked like this:

> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear       week dayMonthYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01 2013-08-15   2013-08-15
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01 2013-09-12   2013-09-18
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01 2012-10-04   2012-10-05
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01 2013-10-24   2013-10-30
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01 2013-06-20   2013-06-26
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01 2012-02-23   2012-02-26

The first step was to transform the data so that I had a data frame where a row represented a day where a member joined the group. There would then be a count of how many members joined on that date.

We can do this with dplyr like so:

library(dplyr)
> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()))
Source: local data frame [6 x 2]
 
  dayMonthYear n
1   2011-06-05 7
2   2011-06-07 1
3   2011-06-10 1
4   2011-06-12 1
5   2011-06-13 1
6   2011-06-15 1

To turn that into a chart we can plug it into ggplot and use the cumsum function to generate a line showing the cumulative total:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line(aes(y = cumsum(n)))
2014 08 31 22 58 42

Alternatively we could bring the call to cumsum forward and generate a data frame which has the cumulative total:

> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  dayMonthYear  n
1   2011-06-05  7
2   2011-06-07  8
3   2011-06-10  9
4   2011-06-12 10
5   2011-06-13 11
6   2011-06-15 12

And if we plug that into ggplot we’ll get the same curve as before:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line()

If we want the curve to be a bit smoother we can group it by quarter rather than by day:

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  quarterYear   n
1  2011-04-01  13
2  2011-07-01  18
3  2011-10-01  21
4  2012-01-01  43
5  2012-04-01  60
6  2012-07-01 122

Now let’s plug that into ggplot:

ggplot(data = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = quarterYear, y = n)) + 
    ylab("Number of members") +
    xlab("Date") +
    geom_line()
2014 08 31 23 08 24

Written by Mark Needham

August 31st, 2014 at 10:10 pm

Posted in R

Tagged with

R: dplyr – group_by dynamic or programmatic field / variable (Error: index out of bounds)

without comments

In my last blog post I showed how to group timestamp based data by week, month and quarter and by the end we had the following code samples using dplyr and zoo:

library(RNeo4j)
library(zoo)
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
meetupMembers %.% group_by(week) %.% summarise(n = n())
meetupMembers %.% group_by(monthYear) %.% summarise(n = n())
meetupMembers %.% group_by(quarterYear) %.% summarise(n = n())

As you can see there’s quite a bit of duplication going on – the only thing that changes in the last 3 lines is the name of the field that we want to group by.

I wanted to pull this code out into a function and my first attempt was this:

groupMembersBy = function(field) {
  meetupMembers %.% group_by(field) %.% summarise(n = n())
}

And now if we try to group by week:

> groupMembersBy("week")
 Show Traceback
 
 Rerun with Debug
 Error: index out of bounds

It turns out if we want to do this then we actually want the regroup function rather than group_by:

groupMembersBy = function(field) {
  meetupMembers %.% regroup(list(field)) %.% summarise(n = n())
}

And now if we group by week:

> head(groupMembersBy("week"), 20)
Source: local data frame [20 x 2]
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

Much better!

Written by Mark Needham

August 29th, 2014 at 9:13 am

Posted in R

Tagged with

R: Grouping by week, month, quarter

without comments

In my continued playing around with R and meetup data I wanted to have a look at when people joined the London Neo4j group based on week, month or quarter of the year to see when they were most likely to do so.

I started with the following query to get back the join timestamps:

library(RNeo4j)
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
> head(meetupMembers)
      joinTimestamp
1 1.376572e+12
2 1.379491e+12
3 1.349454e+12
4 1.383127e+12
5 1.372239e+12
6 1.330295e+12

The first step was to get joinDate into a nicer format that we can use in R more easily:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
 
> head(meetupMembers)
  joinTimestamp            joinDate
1  1.376572e+12 2013-08-15 13:13:40
2  1.379491e+12 2013-09-18 07:55:11
3  1.349454e+12 2012-10-05 16:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 09:27:40
6  1.330295e+12 2012-02-26 22:27:00

Much better!

I started off with grouping by month and quarter and came across the excellent zoo library which makes it really easy to transform dates:

library(zoo)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01

The next step was to create a new data frame which grouped the data by those fields. I’ve been learning dplyr as part of Udacity’s EDA course so I thought I’d try and use that:

> head(meetupMembers %.% group_by(monthYear) %.% summarise(n = n()), 20)
 
    monthYear  n
1  2011-06-01 13
2  2011-07-01  4
3  2011-08-01  1
4  2011-10-01  1
5  2011-11-01  2
6  2012-01-01  4
7  2012-02-01  7
8  2012-03-01 11
9  2012-04-01  3
10 2012-05-01  9
11 2012-06-01  5
12 2012-07-01 16
13 2012-08-01 32
14 2012-09-01 14
15 2012-10-01 28
16 2012-11-01 31
17 2012-12-01  7
18 2013-01-01 52
19 2013-02-01 49
20 2013-03-01 22
> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()), 20)
 
   quarterYear   n
1   2011-04-01  13
2   2011-07-01   5
3   2011-10-01   3
4   2012-01-01  22
5   2012-04-01  17
6   2012-07-01  62
7   2012-10-01  66
8   2013-01-01 123
9   2013-04-01 139
10  2013-07-01 117
11  2013-10-01  94
12  2014-01-01 266
13  2014-04-01 359
14  2014-07-01 216

Grouping by week number is a bit trickier but we can do it with a bit of transformation on our initial timestamp:

meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7))
 
> head(meetupMembers %.% group_by(week) %.% summarise(n = n()), 20)
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

We can then plug that data frame into ggplot if we want to track membership sign up over time at different levels of granularity and create some bar charts of scatter plots depending on what we feel like!

Written by Mark Needham

August 29th, 2014 at 12:25 am

Posted in R

Tagged with

R: Rook – Hello world example – ‘Cannot find a suitable app in file’

without comments

I’ve been playing around with the Rook library and struggled a bit getting a basic Hello World application up and running so I thought I should document it for future me.

I wanted to spin up a web server using Rook and serve a page with the text ‘Hello World’. I started with the following code:

library(Rook)
s <- Rhttpd$new()
 
s$add(name='MyApp',app='helloworld.R')
s$start()
s$browse("MyApp")

where helloWorld.R contained the following code:

function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Unfortunately that failed on the ‘s$add’ line with the following error message:

> s$add(name='MyApp',app='helloworld.R')
Error in .Object$initialize(...) : 
  Cannot find a suitable app in file helloworld.R

I hadn’t realised that you actually need to assign that function to a variable ‘app’ in order for it to be picked up:

app <- function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Once I fixed that everything seemed to work as expected:s

> s
Server started on 127.0.0.1:27120
[1] MyApp http://127.0.0.1:27120/custom/MyApp
 
Call browse() with an index number or name to run an application.

Written by Mark Needham

August 22nd, 2014 at 11:05 am

Posted in R

Tagged with

Where does r studio install packages/libraries?

without comments

As a newbie to R I wanted to look at the source code of some of the libraries/packages that I’d installed via R Studio which I initially struggled to do as I wasn’t sure where the packages had been installed.

I eventually came across a StackOverflow post which described the .libPaths function which tells us where that is:

> .libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

If we want to see which libraries are installed we can use the list.files function:

> list.files("/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
 [1] "alr3"         "assertthat"   "base"         "bitops"       "boot"         "brew"        
 [7] "car"          "class"        "cluster"      "codetools"    "colorspace"   "compiler"    
[13] "data.table"   "datasets"     "devtools"     "dichromat"    "digest"       "dplyr"       
[19] "evaluate"     "foreign"      "formatR"      "Formula"      "gclus"        "ggplot2"     
[25] "graphics"     "grDevices"    "grid"         "gridExtra"    "gtable"       "hflights"    
[31] "highr"        "Hmisc"        "httr"         "KernSmooth"   "knitr"        "labeling"    
[37] "Lahman"       "lattice"      "latticeExtra" "magrittr"     "manipulate"   "markdown"    
[43] "MASS"         "Matrix"       "memoise"      "methods"      "mgcv"         "mime"        
[49] "munsell"      "nlme"         "nnet"         "openintro"    "parallel"     "plotrix"     
[55] "plyr"         "proto"        "RColorBrewer" "Rcpp"         "RCurl"        "reshape2"    
[61] "RJSONIO"      "RNeo4j"       "Rook"         "rpart"        "rstudio"      "scales"      
[67] "seriation"    "spatial"      "splines"      "stats"        "stats4"       "stringr"     
[73] "survival"     "swirl"        "tcltk"        "testthat"     "tools"        "translations"
[79] "TSP"          "utils"        "whisker"      "xts"          "yaml"         "zoo"

We can then drill into those directories to find the appropriate file – in this case I wanted to look at one of the Rook examples:

$ cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rook/exampleApps/helloworld.R
app <- function(env){
    req <- Rook::Request$new(env)
    res <- Rook::Response$new()
    friend <- 'World'
    if (!is.null(req$GET()[['friend']]))
	friend <- req$GET()[['friend']]
    res$write(paste('<h1>Hello',friend,'</h1>\n'))
    res$write('What is your name?\n')
    res$write('<form method="GET">\n')
    res$write('<input type="text" name="friend">\n')
    res$write('<input type="submit" name="Submit">\n</form>\n<br>')
    res$finish()
}

Written by Mark Needham

August 14th, 2014 at 10:24 am

Posted in R

Tagged with

R: Grouping by two variables

without comments

In my continued playing around with R and meetup data I wanted to group a data table by two variables – day and event – so I could see the most popular day of the week for meetups and which events we’d held on those days.

I started off with the following data table:

> head(eventsOf2014, 20)
      eventTime                                              event.name rsvps            datetime       day monthYear
16 1.393351e+12                                         Intro to Graphs    38 2014-02-25 18:00:00   Tuesday   02-2014
17 1.403635e+12                                         Intro to Graphs    44 2014-06-24 18:30:00   Tuesday   06-2014
19 1.404844e+12                                         Intro to Graphs    38 2014-07-08 18:30:00   Tuesday   07-2014
28 1.398796e+12                                         Intro to Graphs    45 2014-04-29 18:30:00   Tuesday   04-2014
31 1.395772e+12                                         Intro to Graphs    56 2014-03-25 18:30:00   Tuesday   03-2014
41 1.406054e+12                                         Intro to Graphs    12 2014-07-22 18:30:00   Tuesday   07-2014
49 1.395167e+12                                         Intro to Graphs    45 2014-03-18 18:30:00   Tuesday   03-2014
50 1.401907e+12                                         Intro to Graphs    35 2014-06-04 18:30:00 Wednesday   06-2014
51 1.400006e+12                                         Intro to Graphs    31 2014-05-13 18:30:00   Tuesday   05-2014
54 1.392142e+12                                         Intro to Graphs    35 2014-02-11 18:00:00   Tuesday   02-2014
59 1.400611e+12                                         Intro to Graphs    53 2014-05-20 18:30:00   Tuesday   05-2014
61 1.390932e+12                                         Intro to Graphs    22 2014-01-28 18:00:00   Tuesday   01-2014
70 1.397587e+12                                         Intro to Graphs    47 2014-04-15 18:30:00   Tuesday   04-2014
7  1.402425e+12       Hands On Intro to Cypher - Neo4j's Query Language    38 2014-06-10 18:30:00   Tuesday   06-2014
25 1.397155e+12       Hands On Intro to Cypher - Neo4j's Query Language    28 2014-04-10 18:30:00  Thursday   04-2014
44 1.404326e+12       Hands On Intro to Cypher - Neo4j's Query Language    43 2014-07-02 18:30:00 Wednesday   07-2014
46 1.398364e+12       Hands On Intro to Cypher - Neo4j's Query Language    30 2014-04-24 18:30:00  Thursday   04-2014
65 1.400783e+12       Hands On Intro to Cypher - Neo4j's Query Language    26 2014-05-22 18:30:00  Thursday   05-2014
5  1.403203e+12 Hands on build your first Neo4j app for Java developers    34 2014-06-19 18:30:00  Thursday   06-2014
34 1.399574e+12 Hands on build your first Neo4j app for Java developers    28 2014-05-08 18:30:00  Thursday   05-2014

I was able to work out the average number of RSVPs per day with the following code using plyr:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day count rsvps rsvpsPerEvent
1  Thursday     5   146      29.20000
2   Tuesday    13   504      38.76923
3 Wednesday     2    78      39.00000

The next step was to show the names of events that happened on those days next to the row for that day. To do this we can make use of the paste function like so:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        events = paste(unique(event.name), collapse = ","),
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                                                                                    events count rsvps rsvpsPerEvent
1  Thursday Hands On Intro to Cypher - Neo4j's Query Language,Hands on build your first Neo4j app for Java developers     5   146      29.20000
2   Tuesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language    13   504      38.76923
3 Wednesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language     2    78      39.00000

If we wanted to drill down further and see the number of RSVPs per day per event type then we could instead group by the day and event name:

> ddply(eventsOf2014, .(day=format(datetime, "%A"), event.name), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                              event.name count rsvps rsvpsPerEvent
1  Thursday Hands on build your first Neo4j app for Java developers     2    62      31.00000
2  Thursday       Hands On Intro to Cypher - Neo4j's Query Language     3    84      28.00000
3   Tuesday       Hands On Intro to Cypher - Neo4j's Query Language     1    38      38.00000
4   Tuesday                                         Intro to Graphs    12   466      38.83333
5 Wednesday       Hands On Intro to Cypher - Neo4j's Query Language     1    43      43.00000
6 Wednesday                                         Intro to Graphs     1    35      35.00000

There are too few data points for some of those to make any decisions but as we gather more data hopefully we’ll see if there’s a trend for people to come to events on certain days or not.

Written by Mark Needham

August 11th, 2014 at 4:47 pm

Posted in R

Tagged with

R: ggplot – Plotting back to back charts using facet_wrap

without comments

Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid.

I wanted to find a better way.

To recap, I came up with the following charts showing the RSVPs to Neo4j London meetup events using this code:

2014 07 20 17 42 40

The first thing we need to do to simplify chart generation is to return ‘yes’ and ‘no’ responses in the same cypher query, like so:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         WITH e, COLLECT(response) AS yeses
         MATCH (e)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         WITH e, COLLECT(response) + yeses AS responses
         UNWIND responses AS response
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime, response.response AS response"
allRSVPs = cypher(graph, query)
allRSVPs$time = timestampToDate(allRSVPs$time)
allRSVPs$eventTime = timestampToDate(allRSVPs$eventTime)
allRSVPs$difference = as.numeric(allRSVPs$eventTime - allRSVPs$time, units="days")

The query is a bit because we want to capture the ‘no’ responses when they initially said yes which is why we check for a ‘NEXT’ relationship when looking for the negative responses.

Let’s inspect allRSVPs:

> allRSVPs[1:10,]
                  time           eventTime response difference
1  2014-06-13 21:49:20 2014-07-22 18:30:00       no   38.86157
2  2014-07-02 22:24:06 2014-07-22 18:30:00      yes   19.83743
3  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
4  2014-06-23 21:07:11 2014-07-22 18:30:00      yes   28.89084
5  2014-06-06 15:09:29 2014-07-22 18:30:00      yes   46.13925
6  2014-05-31 13:03:09 2014-07-22 18:30:00      yes   52.22698
7  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
8  2014-07-02 12:28:22 2014-07-22 18:30:00      yes   20.25113
9  2014-06-30 23:44:39 2014-07-22 18:30:00      yes   21.78149
10 2014-06-06 15:35:53 2014-07-22 18:30:00      yes   46.12091

We’ve returned the actual response with each row so that we can distinguish between responses. It will also come in useful for pivoting our single chart later on.

The next step is to get ggplot to generate our side by side charts. I started off by plotting both types of response on the same chart:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1)

2014 07 25 22 14 28

This one stacks the ‘yes’ and ‘no’ responses on top of each other which isn’t what we want as it’s difficult to compare the two.

What we need is the facet_wrap function which allows us to generate multiple charts grouped by key. We’ll group by ‘response’:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1) + 
  facet_wrap(~ response, nrow=2, ncol=1)

2014 07 25 22 34 46

The only thing we’re missing now is the red and green colours which is where the scale_fill_manual function comes in handy:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1)

2014 07 25 22 39 56

If we want to show the ‘yes’ chart on top we can pass in an extra parameter to facet_wrap to change where it places the highest value:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1, as.table = FALSE)

2014 07 25 22 43 29

We could go one step further and group by response and day. First let’s add a ‘day’ column to our data frame:

allRSVPs$dayOfWeek = format(allRSVPs$eventTime, "%A")

And now let’s plot the charts using both columns:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response + dayOfWeek, as.table = FALSE)

2014 07 25 22 49 57

The distribution of dropouts looks fairly similar for all the days – Thursday is just at an order of magnitude below the other days because we haven’t run many events on Thursdays so far.

At a glance it doesn’t appear that so many people sign up for Thursday events on the day or one day before.

One potential hypothesis is that people have things planned for Thursday whereas they decide more last minute what to do on the other days.

We’ll have to run some more events on Thursdays to see whether that trend holds out.

The code is on github if you want to play with it

Written by Mark Needham

July 25th, 2014 at 9:57 pm

Posted in R

Tagged with