Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Linear models with the lm function, NA values and Collinearity

without comments

In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant.

On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models which has a whole slide explaining its meaning:

2014 10 17 06 21 07

So in this case ‘z’ doesn’t help us in predicting Fertility since it doesn’t give us any more information that we can’t already get from ‘Agriculture’ and ‘Education’.

Although in this case we know why ‘z’ doesn’t have a coefficient sometimes it may not be clear which other variable the NA one is highly correlated with.

Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.

In that situation we can make use of the alias function to explain the collinearity as suggested in this StackOverflow post:

library(datasets); data(swiss); require(stats); require(graphics)
z <- swiss$Agriculture + swiss$Education
fit = lm(Fertility ~ . + z, data = swiss)
> alias(fit)
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + z
 
Complete :
  (Intercept) Agriculture Examination Education Catholic Infant.Mortality
z 0           1           0           1         0        0

In this case we can see that ‘z’ is highly correlated with both Agriculture and Education which makes sense given its the sum of those two variables.

When we notice that there’s an NA coefficient in our model we can choose to exclude that variable and the model will still have the same coefficients as before:

> require(dplyr)
> summary(lm(Fertility ~ . + z, data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03
> summary(lm(Fertility ~ ., data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

If we call alias now we won’t see any output:

> alias(lm(Fertility ~ ., data = swiss))
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality

Written by Mark Needham

October 18th, 2014 at 6:35 am

Posted in R

Tagged with ,

R: A first attempt at linear regression

without comments

I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.

I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started with the following code to build a data frame containing some potential predictors:

library(RNeo4j)
officeEventsQuery = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(),
                           (event)-[:HELD_AT]->(venue)
                     WHERE (event.time + event.utc_offset) < timestamp() AND venue.name IN [\"Neo Technology\", \"OpenCredo\"]
                     RETURN event.time + event.utc_offset AS eventTime,event.announced_at AS announcedAt, event.name, COUNT(*) AS rsvps"
 
events = subset(cypher(graph, officeEventsQuery), !is.na(announcedAt))
events$eventTime <- timestampToDate(events$eventTime)
events$day <- format(events$eventTime, "%A")
events$monthYear <- format(events$eventTime, "%m-%Y")
events$month <- format(events$eventTime, "%m")
events$year <- format(events$eventTime, "%Y")
events$announcedAt<- timestampToDate(events$announcedAt)
events$timeDiff = as.numeric(events$eventTime - events$announcedAt, units = "days")

If we preview ‘events’ it contains the following columns:

> head(events)
            eventTime         announcedAt                                        event.name rsvps       day monthYear month year  timeDiff
1 2013-01-29 18:00:00 2012-11-30 11:30:57                                   Intro to Graphs    24   Tuesday   01-2013    01 2013 60.270174
2 2014-06-24 18:30:00 2014-06-18 19:11:19                                   Intro to Graphs    43   Tuesday   06-2014    06 2014  5.971308
3 2014-06-18 18:30:00 2014-06-08 07:03:13                         Neo4j World Cup Hackathon    24 Wednesday   06-2014    06 2014 10.476933
4 2014-05-20 18:30:00 2014-05-14 18:56:06                                   Intro to Graphs    53   Tuesday   05-2014    05 2014  5.981875
5 2014-02-11 18:00:00 2014-02-05 19:11:03                                   Intro to Graphs    35   Tuesday   02-2014    02 2014  5.950660
6 2014-09-04 18:30:00 2014-08-26 06:34:01 Hands On Intro to Cypher - Neo4j's Query Language    20  Thursday   09-2014    09 2014  9.497211

We want to predict ‘rsvps’ from the other columns so I started off by creating a linear model which took all the other columns into account:

> summary(lm(rsvps ~., data = events))
 
Call:
lm(formula = rsvps ~ ., data = events)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-8.2582 -1.1538  0.0000  0.4158 10.5803 
 
Coefficients: (14 not defined because of singularities)
                                                                    Estimate Std. Error t value Pr(>|t|)   
(Intercept)                                                       -9.365e+03  3.009e+03  -3.113  0.00897 **
eventTime                                                          3.609e-06  2.951e-06   1.223  0.24479   
announcedAt                                                        3.278e-06  2.553e-06   1.284  0.22339   
event.nameGraph Modelling - Do's and Don'ts                        4.884e+01  1.140e+01   4.286  0.00106 **
event.nameHands on build your first Neo4j app for Java developers  3.735e+01  1.048e+01   3.562  0.00391 **
event.nameHands On Intro to Cypher - Neo4j's Query Language        2.560e+01  9.713e+00   2.635  0.02177 * 
event.nameIntro to Graphs                                          2.238e+01  8.726e+00   2.564  0.02480 * 
event.nameIntroduction to Graph Database Modeling                 -1.304e+02  4.835e+01  -2.696  0.01946 * 
event.nameLunch with Neo4j's CEO, Emil Eifrem                      3.920e+01  1.113e+01   3.523  0.00420 **
event.nameNeo4j Clojure Hackathon                                 -3.063e+00  1.195e+01  -0.256  0.80203   
event.nameNeo4j Python Hackathon with py2neo's Nigel Small         2.128e+01  1.070e+01   1.989  0.06998 . 
event.nameNeo4j World Cup Hackathon                                5.004e+00  9.622e+00   0.520  0.61251   
dayTuesday                                                         2.068e+01  5.626e+00   3.676  0.00317 **
dayWednesday                                                       2.300e+01  5.522e+00   4.165  0.00131 **
monthYear01-2014                                                  -2.350e+02  7.377e+01  -3.185  0.00784 **
monthYear02-2013                                                  -2.526e+01  1.376e+01  -1.836  0.09130 . 
monthYear02-2014                                                  -2.325e+02  7.763e+01  -2.995  0.01118 * 
monthYear03-2013                                                  -4.605e+01  1.683e+01  -2.736  0.01805 * 
monthYear03-2014                                                  -2.371e+02  8.324e+01  -2.848  0.01468 * 
monthYear04-2013                                                  -6.570e+01  2.309e+01  -2.845  0.01477 * 
monthYear04-2014                                                  -2.535e+02  8.746e+01  -2.899  0.01336 * 
monthYear05-2013                                                  -8.672e+01  2.845e+01  -3.049  0.01011 * 
monthYear05-2014                                                  -2.802e+02  9.420e+01  -2.975  0.01160 * 
monthYear06-2013                                                  -1.022e+02  3.283e+01  -3.113  0.00897 **
monthYear06-2014                                                  -2.996e+02  1.003e+02  -2.988  0.01132 * 
monthYear07-2014                                                  -3.123e+02  1.054e+02  -2.965  0.01182 * 
monthYear08-2013                                                  -1.326e+02  4.323e+01  -3.067  0.00976 **
monthYear08-2014                                                  -3.060e+02  1.107e+02  -2.763  0.01718 * 
monthYear09-2013                                                          NA         NA      NA       NA   
monthYear09-2014                                                  -3.465e+02  1.164e+02  -2.976  0.01158 * 
monthYear10-2012                                                   2.602e+01  1.959e+01   1.328  0.20886   
monthYear10-2013                                                  -1.728e+02  5.678e+01  -3.044  0.01020 * 
monthYear11-2012                                                   2.717e+01  1.509e+01   1.800  0.09704 . 
month02                                                                   NA         NA      NA       NA   
month03                                                                   NA         NA      NA       NA   
month04                                                                   NA         NA      NA       NA   
month05                                                                   NA         NA      NA       NA   
month06                                                                   NA         NA      NA       NA   
month07                                                                   NA         NA      NA       NA   
month08                                                                   NA         NA      NA       NA   
month09                                                                   NA         NA      NA       NA   
month10                                                                   NA         NA      NA       NA   
month11                                                                   NA         NA      NA       NA   
year2013                                                                  NA         NA      NA       NA   
year2014                                                                  NA         NA      NA       NA   
timeDiff                                                                  NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 5.287 on 12 degrees of freedom
Multiple R-squared:  0.9585,	Adjusted R-squared:  0.8512 
F-statistic: 8.934 on 31 and 12 DF,  p-value: 0.0001399

As I understand it we can look at the R-squared value to understand how much of the variance in the data has been explained by the model – in this case it’s 85%.

A lot of the coefficients seem to be based around specific event names which seems a bit too specific to me so I wanted to see what would happen if I derived a feature which indicated whether a session was practical:

events$practical = grepl("Hackathon|Hands on|Hands On", events$event.name)

We can now run the model again with the new column having excluded ‘event.name’ field:

> summary(lm(rsvps ~., data = subset(events, select = -c(event.name))))
 
Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name)))
 
Residuals:
    Min      1Q  Median      3Q     Max 
-18.647  -2.311   0.000   2.908  23.218 
 
Coefficients: (13 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -3.980e+03  4.752e+03  -0.838   0.4127  
eventTime         2.907e-06  3.873e-06   0.751   0.4621  
announcedAt       3.336e-08  3.559e-06   0.009   0.9926  
dayTuesday        7.547e+00  6.080e+00   1.241   0.2296  
dayWednesday      2.442e+00  7.046e+00   0.347   0.7327  
monthYear01-2014 -9.562e+01  1.187e+02  -0.806   0.4303  
monthYear02-2013 -4.230e+00  2.289e+01  -0.185   0.8553  
monthYear02-2014 -9.156e+01  1.254e+02  -0.730   0.4742  
monthYear03-2013 -1.633e+01  2.808e+01  -0.582   0.5676  
monthYear03-2014 -8.094e+01  1.329e+02  -0.609   0.5496  
monthYear04-2013 -2.249e+01  3.785e+01  -0.594   0.5595  
monthYear04-2014 -9.230e+01  1.401e+02  -0.659   0.5180  
monthYear05-2013 -3.237e+01  4.654e+01  -0.696   0.4952  
monthYear05-2014 -1.015e+02  1.509e+02  -0.673   0.5092  
monthYear06-2013 -3.947e+01  5.355e+01  -0.737   0.4701  
monthYear06-2014 -1.081e+02  1.604e+02  -0.674   0.5084  
monthYear07-2014 -1.110e+02  1.678e+02  -0.661   0.5163  
monthYear08-2013 -5.144e+01  6.988e+01  -0.736   0.4706  
monthYear08-2014 -1.023e+02  1.784e+02  -0.573   0.5731  
monthYear09-2013 -6.057e+01  7.893e+01  -0.767   0.4523  
monthYear09-2014 -1.260e+02  1.874e+02  -0.672   0.5094  
monthYear10-2012  9.557e+00  2.873e+01   0.333   0.7430  
monthYear10-2013 -6.450e+01  9.169e+01  -0.703   0.4903  
monthYear11-2012  1.689e+01  2.316e+01   0.729   0.4748  
month02                  NA         NA      NA       NA  
month03                  NA         NA      NA       NA  
month04                  NA         NA      NA       NA  
month05                  NA         NA      NA       NA  
month06                  NA         NA      NA       NA  
month07                  NA         NA      NA       NA  
month08                  NA         NA      NA       NA  
month09                  NA         NA      NA       NA  
month10                  NA         NA      NA       NA  
month11                  NA         NA      NA       NA  
year2013                 NA         NA      NA       NA  
year2014                 NA         NA      NA       NA  
timeDiff                 NA         NA      NA       NA  
practicalTRUE    -9.388e+00  5.289e+00  -1.775   0.0919 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 10.21 on 19 degrees of freedom
Multiple R-squared:  0.7546,	Adjusted R-squared:  0.4446 
F-statistic: 2.434 on 24 and 19 DF,  p-value: 0.02592

Now we’re only accounting for 44% of the variance and none of our coefficients are significant so this wasn’t such a good change.

I also noticed that we’ve got a bit of overlap in the date related features – we’ve got one column for monthYear and then separate ones for month and year. Let’s strip out the combined one:

> summary(lm(rsvps ~., data = subset(events, select = -c(event.name, monthYear))))
 
Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name, 
    monthYear)))
 
Residuals:
     Min       1Q   Median       3Q      Max 
-16.5745  -4.0507  -0.1042   3.6586  24.4715 
 
Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -1.573e+03  4.315e+03  -0.364   0.7185  
eventTime      3.320e-06  3.434e-06   0.967   0.3425  
announcedAt   -2.149e-06  2.201e-06  -0.976   0.3379  
dayTuesday     4.713e+00  5.871e+00   0.803   0.4294  
dayWednesday  -2.253e-01  6.685e+00  -0.034   0.9734  
month02        3.164e+00  1.285e+01   0.246   0.8075  
month03        1.127e+01  1.858e+01   0.607   0.5494  
month04        4.148e+00  2.581e+01   0.161   0.8736  
month05        1.979e+00  3.425e+01   0.058   0.9544  
month06       -1.220e-01  4.271e+01  -0.003   0.9977  
month07        1.671e+00  4.955e+01   0.034   0.9734  
month08        8.849e+00  5.940e+01   0.149   0.8827  
month09       -5.496e+00  6.782e+01  -0.081   0.9360  
month10       -5.066e+00  7.893e+01  -0.064   0.9493  
month11        4.255e+00  8.697e+01   0.049   0.9614  
year2013      -1.799e+01  1.032e+02  -0.174   0.8629  
year2014      -3.281e+01  2.045e+02  -0.160   0.8738  
timeDiff              NA         NA      NA       NA  
practicalTRUE -9.816e+00  5.084e+00  -1.931   0.0645 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 10.19 on 26 degrees of freedom
Multiple R-squared:  0.666,	Adjusted R-squared:  0.4476 
F-statistic: 3.049 on 17 and 26 DF,  p-value: 0.005187

Again none of the coefficients are statistically significant which is disappointing. I think the main problem may be that I have very few data points (only 42) making it difficult to come up with a general model.

I think my next step is to look for some other features that could impact the number of RSVPs e.g. other events on that day, the weather.

I’m a novice at this but trying to learn more so if you have any ideas of what I should do next please let me know.

Written by Mark Needham

September 30th, 2014 at 10:20 pm

Posted in R

Tagged with ,

R: Deriving a new data frame column based on containing string

without comments

I’ve been playing around with R data frames a bit more and one thing I wanted to do was derive a new column based on the text contained in the existing column.

I started with something like this:

> x = data.frame(name = c("Java Hackathon", "Intro to Graphs", "Hands on Cypher"))
> x
             name
1  Java Hackathon
2 Intro to Graphs
3 Hands on Cypher

And I wanted to derive a new column based on whether or not the session was a practical one. The grepl function seemed to be the best tool for the job:

> grepl("Hackathon|Hands on|Hands On", x$name)
[1]  TRUE FALSE  TRUE

We can then add a column to our data frame with that output:

x$practical = grepl("Hackathon|Hands on|Hands On", x$name)

And we end up with the following:

> x
             name practical
1  Java Hackathon      TRUE
2 Intro to Graphs     FALSE
3 Hands on Cypher      TRUE

Not too tricky but it took me a bit too long to figure it out so I thought I’d save future Mark some time!

Written by Mark Needham

September 29th, 2014 at 9:37 pm

Posted in R

Tagged with ,

R: Filtering data frames by column type (‘x’ must be numeric)

without comments

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pair wise correlation matrix of variables in a data frame.

The exercise uses the ‘Carseats’ data set which can be imported like so:

> install.packages("ISLR")
> library(ISLR)
> head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US
1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes
2 11.22       111     48          16        260    83      Good  65        10   Yes Yes
3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes
4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes
5  4.15       141     64           3        340   128       Bad  38        13   Yes  No
6 10.81       124    113          13        501    72       Bad  78        16    No Yes

filter the categorical variables from a data frame and

If we try to run the ‘cor‘ function on the data frame we’ll get the following error:

> cor(Carseats)
Error in cor(Carseats) : 'x' must be numeric

As the error message suggests, we can’t pass non numeric variables to this function so we need to remove the categorical variables from our data frame.

But first we need to work out which columns those are:

> sapply(Carseats, class)
      Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"    "factor"   "numeric"   "numeric" 
      Urban          US 
   "factor"    "factor"

We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily:

> sapply(Carseats, is.factor)
      Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education 
      FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE 
      Urban          US 
       TRUE        TRUE

Now we can remove those columns from our data frame and create the correlation matrix:

> cor(Carseats[sapply(Carseats, function(x) !is.factor(x))])
                  Sales   CompPrice       Income  Advertising   Population       Price          Age    Education
Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073 -0.231815440 -0.051955242
CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777 -0.100238817  0.025197050
Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422
Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687 -0.004557497 -0.033594307
Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362 -0.042663355 -0.106378231
Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000 -0.102176839  0.011746599
Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684  1.000000000  0.006488032
Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660  0.006488032  1.000000000

Written by Mark Needham

September 29th, 2014 at 5:46 am

Posted in R

Tagged with ,

R: ggplot – Plotting multiple variables on a line chart

without comments

In my continued playing around with meetup data I wanted to plot the number of members who join the Neo4j group over time.

I started off with the variable ‘byWeek’ which shows how many members joined the group each week:

> head(byWeek)
Source: local data frame [6 x 2]
 
        week n
1 2011-06-02 8
2 2011-06-09 4
3 2011-06-30 2
4 2011-07-14 1
5 2011-07-21 1
6 2011-08-18 1

I wanted to plot the actual count alongside a rolling average for which I created the following data frame:

library(zoo)
joinsByWeek = data.frame(actual = byWeek$n, 
                         week = byWeek$week,
                         rolling = rollmean(byWeek$n, 4, fill = NA, align=c("right")))
> head(joinsByWeek, 10)
   actual       week rolling
1       8 2011-06-02      NA
2       4 2011-06-09      NA
3       2 2011-06-30      NA
4       1 2011-07-14    3.75
5       1 2011-07-21    2.00
6       1 2011-08-18    1.25
7       1 2011-10-13    1.00
8       2 2011-11-24    1.25
9       1 2012-01-05    1.25
10      3 2012-01-12    1.75

The next step was to work out how to plot both ‘rolling’ and ‘actual’ on the same line chart. The easiest way is to make two calls to ‘geom_line’, like so:

ggplot(joinsByWeek, aes(x = week)) + 
  geom_line(aes(y = rolling), colour="blue") + 
  geom_line(aes(y = actual), colour = "grey") + 
  ylab(label="Number of new members") + 
  xlab("Week")
2014 09 16 21 57 14

Alternatively we can make use of the ‘melt’ function from the reshape library…

library(reshape)
meltedJoinsByWeek = melt(joinsByWeek, id = 'week')
> head(meltedJoinsByWeek, 20)
   week variable value
1     1   actual     8
2     2   actual     4
3     3   actual     2
4     4   actual     1
5     5   actual     1
6     6   actual     1
7     7   actual     1
8     8   actual     2
9     9   actual     1
10   10   actual     3
11   11   actual     1
12   12   actual     2
13   13   actual     4
14   14   actual     2
15   15   actual     3
16   16   actual     5
17   17   actual     1
18   18   actual     2
19   19   actual     1
20   20   actual     2

…which then means we can plot the chart with a single call to geom_line:

ggplot(meltedJoinsByWeek, aes(x = week, y = value, colour = variable)) + 
  geom_line() + 
  ylab(label="Number of new members") + 
  xlab("Week Number") + 
  scale_colour_manual(values=c("grey", "blue"))

2014 09 16 22 17 40

Written by Mark Needham

September 16th, 2014 at 4:59 pm

Posted in R

Tagged with ,

R: ggplot – Plotting a single variable line chart (geom_line requires the following missing aesthetics: y)

without comments

I’ve been learning how to do moving averages in R and having done that calculation I wanted to plot these variables on a line chart using ggplot.

The vector of rolling averages looked like this:

> rollmean(byWeek$n, 4)
  [1]  3.75  2.00  1.25  1.00  1.25  1.25  1.75  1.75  1.75  2.50  2.25  2.75  3.50  2.75  2.75
 [16]  2.25  1.50  1.50  2.00  2.00  2.00  2.00  1.25  1.50  2.25  2.50  3.00  3.25  2.75  4.00
 [31]  4.25  5.25  7.50  6.50  5.75  5.00  3.50  4.00  5.75  6.25  6.25  6.00  5.25  6.25  7.25
 [46]  7.75  7.00  4.75  2.75  1.75  2.00  4.00  5.25  5.50 11.50 11.50 12.75 14.50 12.50 11.75
 [61] 11.00  9.25  5.25  4.50  3.25  4.75  7.50  8.50  9.25 10.50  9.75 15.25 16.00 15.25 15.00
 [76] 10.00  8.50  6.50  4.25  3.00  4.25  4.75  7.50 11.25 11.00 11.50 10.00  6.75 11.25 12.50
 [91] 12.00 11.50  6.50  8.75  8.50  8.25  9.50  8.50  8.75  9.50  8.00  4.25  4.50  7.50  9.00
[106] 12.00 19.00 19.00 22.25 23.50 22.25 21.75 19.50 20.75 22.75 22.75 24.25 28.00 23.00 26.00
[121] 24.25 21.50 26.00 24.00 28.25 25.50 24.25 31.50 31.50 35.75 35.75 29.00 28.50 27.25 25.50
[136] 27.50 26.00 23.75

I initially tried to plot a line chart like this:

library(ggplot2)
library(zoo)
rollingMean = rollmean(byWeek$n, 4)
qplot(rollingMean) + geom_line()

which resulted in this error:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Error: geom_line requires the following missing aesthetics: y

It turns out we need to provide an x and y value if we want to draw a line chart. In this case we’ll generate the ‘x’ value – we only care that the y values get plotted in order from left to right:

qplot(1:length(rollingMean), rollingMean, xlab ="Week Number") + geom_line()
2014 09 13 16 58 57

If we want to use the ‘ggplot’ function then we need to put everything into a data frame first and then plot it:

ggplot(data.frame(week = 1:length(rollingMean), rolling = rollingMean),
       aes(x = week, y = rolling)) +
  geom_line()

2014 09 13 17 11 13

Written by Mark Needham

September 13th, 2014 at 11:41 am

Posted in R

Tagged with

R: Calculating rolling or moving averages

with one comment

I’ve been playing around with some time series data in R and since there’s a bit of variation between consecutive points I wanted to smooth the data out by calculating the moving average.

I struggled to find an in built function to do this but came across Didier Ruedin’s blog post which described the following function to do the job:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}

I tried plugging in some numbers to understand how it works:

> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA 4.333333 5.000000       NA

Here I was trying to do a rolling average which took into account the last 3 numbers so I expected to get just two numbers back – 4.333333 and 5 – and if there were going to be NA values I thought they’d be at the beginning of the sequence.

In fact it turns out this is what the ‘sides’ parameter controls:

sides	
for convolution filters only. If sides = 1 the filter coefficients are for past values only; if sides = 2 they 
are centred around lag 0. In this case the length of the filter should be odd, but if it is even, more of the 
filter is forward in time than backward.

So in our ‘mav’ function the rolling average looks both sides of the current value rather than just at past values. We can tweak that to get the behaviour we want:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)}
> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA       NA 4.333333 5.000000

The NA values are annoying for any plotting we want to do so let’s get rid of them:

> na.omit(mav(c(4,5,4,6), 3))
Time Series:
Start = 3 
End = 4 
Frequency = 1 
[1] 4.333333 5.000000

Having got to this point I noticed that Didier had referenced the zoo package in the comments and it has a built in function to take care of all this:

> library(zoo)
> rollmean(c(4,5,4,6), 3)
[1] 4.333333 5.000000

I also realised I can list all the functions in a package with the ‘ls’ function so I’ll be scanning zoo’s list of functions next time I need to do something time series related – there’ll probably already be a function for it!

> ls("package:zoo")
  [1] "as.Date"              "as.Date.numeric"      "as.Date.ts"          
  [4] "as.Date.yearmon"      "as.Date.yearqtr"      "as.yearmon"          
  [7] "as.yearmon.default"   "as.yearqtr"           "as.yearqtr.default"  
 [10] "as.zoo"               "as.zoo.default"       "as.zooreg"           
 [13] "as.zooreg.default"    "autoplot.zoo"         "cbind.zoo"           
 [16] "coredata"             "coredata.default"     "coredata<-"          
 [19] "facet_free"           "format.yearqtr"       "fortify.zoo"         
 [22] "frequency<-"          "ifelse.zoo"           "index"               
 [25] "index<-"              "index2char"           "is.regular"          
 [28] "is.zoo"               "make.par.list"        "MATCH"               
 [31] "MATCH.default"        "MATCH.times"          "median.zoo"          
 [34] "merge.zoo"            "na.aggregate"         "na.aggregate.default"
 [37] "na.approx"            "na.approx.default"    "na.fill"             
 [40] "na.fill.default"      "na.locf"              "na.locf.default"     
 [43] "na.spline"            "na.spline.default"    "na.StructTS"         
 [46] "na.trim"              "na.trim.default"      "na.trim.ts"          
 [49] "ORDER"                "ORDER.default"        "panel.lines.its"     
 [52] "panel.lines.tis"      "panel.lines.ts"       "panel.lines.zoo"     
 [55] "panel.plot.custom"    "panel.plot.default"   "panel.points.its"    
 [58] "panel.points.tis"     "panel.points.ts"      "panel.points.zoo"    
 [61] "panel.polygon.its"    "panel.polygon.tis"    "panel.polygon.ts"    
 [64] "panel.polygon.zoo"    "panel.rect.its"       "panel.rect.tis"      
 [67] "panel.rect.ts"        "panel.rect.zoo"       "panel.segments.its"  
 [70] "panel.segments.tis"   "panel.segments.ts"    "panel.segments.zoo"  
 [73] "panel.text.its"       "panel.text.tis"       "panel.text.ts"       
 [76] "panel.text.zoo"       "plot.zoo"             "quantile.zoo"        
 [79] "rbind.zoo"            "read.zoo"             "rev.zoo"             
 [82] "rollapply"            "rollapplyr"           "rollmax"             
 [85] "rollmax.default"      "rollmaxr"             "rollmean"            
 [88] "rollmean.default"     "rollmeanr"            "rollmedian"          
 [91] "rollmedian.default"   "rollmedianr"          "rollsum"             
 [94] "rollsum.default"      "rollsumr"             "scale_x_yearmon"     
 [97] "scale_x_yearqtr"      "scale_y_yearmon"      "scale_y_yearqtr"     
[100] "Sys.yearmon"          "Sys.yearqtr"          "time<-"              
[103] "write.zoo"            "xblocks"              "xblocks.default"     
[106] "xtfrm.zoo"            "yearmon"              "yearmon_trans"       
[109] "yearqtr"              "yearqtr_trans"        "zoo"                 
[112] "zooreg"

Written by Mark Needham

September 13th, 2014 at 8:15 am

Posted in R

Tagged with ,

R: ggplot – Cumulative frequency graphs

without comments

In my continued playing around with ggplot I wanted to create a chart showing the cumulative growth of the number of members of the Neo4j London meetup group.

My initial data frame looked like this:

> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear       week dayMonthYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01 2013-08-15   2013-08-15
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01 2013-09-12   2013-09-18
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01 2012-10-04   2012-10-05
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01 2013-10-24   2013-10-30
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01 2013-06-20   2013-06-26
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01 2012-02-23   2012-02-26

The first step was to transform the data so that I had a data frame where a row represented a day where a member joined the group. There would then be a count of how many members joined on that date.

We can do this with dplyr like so:

library(dplyr)
> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()))
Source: local data frame [6 x 2]
 
  dayMonthYear n
1   2011-06-05 7
2   2011-06-07 1
3   2011-06-10 1
4   2011-06-12 1
5   2011-06-13 1
6   2011-06-15 1

To turn that into a chart we can plug it into ggplot and use the cumsum function to generate a line showing the cumulative total:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line(aes(y = cumsum(n)))
2014 08 31 22 58 42

Alternatively we could bring the call to cumsum forward and generate a data frame which has the cumulative total:

> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  dayMonthYear  n
1   2011-06-05  7
2   2011-06-07  8
3   2011-06-10  9
4   2011-06-12 10
5   2011-06-13 11
6   2011-06-15 12

And if we plug that into ggplot we’ll get the same curve as before:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line()

If we want the curve to be a bit smoother we can group it by quarter rather than by day:

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  quarterYear   n
1  2011-04-01  13
2  2011-07-01  18
3  2011-10-01  21
4  2012-01-01  43
5  2012-04-01  60
6  2012-07-01 122

Now let’s plug that into ggplot:

ggplot(data = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = quarterYear, y = n)) + 
    ylab("Number of members") +
    xlab("Date") +
    geom_line()
2014 08 31 23 08 24

Written by Mark Needham

August 31st, 2014 at 10:10 pm

Posted in R

Tagged with

R: dplyr – group_by dynamic or programmatic field / variable (Error: index out of bounds)

without comments

In my last blog post I showed how to group timestamp based data by week, month and quarter and by the end we had the following code samples using dplyr and zoo:

library(RNeo4j)
library(zoo)
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
meetupMembers %.% group_by(week) %.% summarise(n = n())
meetupMembers %.% group_by(monthYear) %.% summarise(n = n())
meetupMembers %.% group_by(quarterYear) %.% summarise(n = n())

As you can see there’s quite a bit of duplication going on – the only thing that changes in the last 3 lines is the name of the field that we want to group by.

I wanted to pull this code out into a function and my first attempt was this:

groupMembersBy = function(field) {
  meetupMembers %.% group_by(field) %.% summarise(n = n())
}

And now if we try to group by week:

> groupMembersBy("week")
 Show Traceback
 
 Rerun with Debug
 Error: index out of bounds

It turns out if we want to do this then we actually want the regroup function rather than group_by:

groupMembersBy = function(field) {
  meetupMembers %.% regroup(list(field)) %.% summarise(n = n())
}

And now if we group by week:

> head(groupMembersBy("week"), 20)
Source: local data frame [20 x 2]
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

Much better!

Written by Mark Needham

August 29th, 2014 at 9:13 am

Posted in R

Tagged with

R: Grouping by week, month, quarter

without comments

In my continued playing around with R and meetup data I wanted to have a look at when people joined the London Neo4j group based on week, month or quarter of the year to see when they were most likely to do so.

I started with the following query to get back the join timestamps:

library(RNeo4j)
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
> head(meetupMembers)
      joinTimestamp
1 1.376572e+12
2 1.379491e+12
3 1.349454e+12
4 1.383127e+12
5 1.372239e+12
6 1.330295e+12

The first step was to get joinDate into a nicer format that we can use in R more easily:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
 
> head(meetupMembers)
  joinTimestamp            joinDate
1  1.376572e+12 2013-08-15 13:13:40
2  1.379491e+12 2013-09-18 07:55:11
3  1.349454e+12 2012-10-05 16:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 09:27:40
6  1.330295e+12 2012-02-26 22:27:00

Much better!

I started off with grouping by month and quarter and came across the excellent zoo library which makes it really easy to transform dates:

library(zoo)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01

The next step was to create a new data frame which grouped the data by those fields. I’ve been learning dplyr as part of Udacity’s EDA course so I thought I’d try and use that:

> head(meetupMembers %.% group_by(monthYear) %.% summarise(n = n()), 20)
 
    monthYear  n
1  2011-06-01 13
2  2011-07-01  4
3  2011-08-01  1
4  2011-10-01  1
5  2011-11-01  2
6  2012-01-01  4
7  2012-02-01  7
8  2012-03-01 11
9  2012-04-01  3
10 2012-05-01  9
11 2012-06-01  5
12 2012-07-01 16
13 2012-08-01 32
14 2012-09-01 14
15 2012-10-01 28
16 2012-11-01 31
17 2012-12-01  7
18 2013-01-01 52
19 2013-02-01 49
20 2013-03-01 22
> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()), 20)
 
   quarterYear   n
1   2011-04-01  13
2   2011-07-01   5
3   2011-10-01   3
4   2012-01-01  22
5   2012-04-01  17
6   2012-07-01  62
7   2012-10-01  66
8   2013-01-01 123
9   2013-04-01 139
10  2013-07-01 117
11  2013-10-01  94
12  2014-01-01 266
13  2014-04-01 359
14  2014-07-01 216

Grouping by week number is a bit trickier but we can do it with a bit of transformation on our initial timestamp:

meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7))
 
> head(meetupMembers %.% group_by(week) %.% summarise(n = n()), 20)
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

We can then plug that data frame into ggplot if we want to track membership sign up over time at different levels of granularity and create some bar charts of scatter plots depending on what we feel like!

Written by Mark Needham

August 29th, 2014 at 12:25 am

Posted in R

Tagged with