## Archive for the ‘R’ Category

## R: Linear models with the lm function, NA values and Collinearity

In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant.

On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models which has a whole slide explaining its meaning:

So in this case ‘z’ doesn’t help us in predicting Fertility since it doesn’t give us any more information that we can’t already get from ‘Agriculture’ and ‘Education’.

Although in this case we know why ‘z’ doesn’t have a coefficient sometimes it may not be clear which other variable the NA one is highly correlated with.

Multicollinearity(alsocollinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.

In that situation we can make use of the alias function to explain the collinearity as suggested in this StackOverflow post:

library(datasets); data(swiss); require(stats); require(graphics) z <- swiss$Agriculture + swiss$Education fit = lm(Fertility ~ . + z, data = swiss) |

> alias(fit) Model : Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality + z Complete : (Intercept) Agriculture Examination Education Catholic Infant.Mortality z 0 1 0 1 0 0 |

In this case we can see that ‘z’ is highly correlated with both Agriculture and Education which makes sense given its the sum of those two variables.

When we notice that there’s an NA coefficient in our model we can choose to exclude that variable and the model will still have the same coefficients as before:

> require(dplyr) > summary(lm(Fertility ~ . + z, data = swiss))$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07 Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02 Examination -0.2580082 0.25387820 -1.016268 3.154617e-01 Education -0.8709401 0.18302860 -4.758492 2.430605e-05 Catholic 0.1041153 0.03525785 2.952969 5.190079e-03 Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03 > summary(lm(Fertility ~ ., data = swiss))$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07 Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02 Examination -0.2580082 0.25387820 -1.016268 3.154617e-01 Education -0.8709401 0.18302860 -4.758492 2.430605e-05 Catholic 0.1041153 0.03525785 2.952969 5.190079e-03 Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03 |

If we call alias now we won’t see any output:

> alias(lm(Fertility ~ ., data = swiss)) Model : Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality |

## R: A first attempt at linear regression

I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.

I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started with the following code to build a data frame containing some potential predictors:

library(RNeo4j) officeEventsQuery = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(), (event)-[:HELD_AT]->(venue) WHERE (event.time + event.utc_offset) < timestamp() AND venue.name IN [\"Neo Technology\", \"OpenCredo\"] RETURN event.time + event.utc_offset AS eventTime,event.announced_at AS announcedAt, event.name, COUNT(*) AS rsvps" events = subset(cypher(graph, officeEventsQuery), !is.na(announcedAt)) events$eventTime <- timestampToDate(events$eventTime) events$day <- format(events$eventTime, "%A") events$monthYear <- format(events$eventTime, "%m-%Y") events$month <- format(events$eventTime, "%m") events$year <- format(events$eventTime, "%Y") events$announcedAt<- timestampToDate(events$announcedAt) events$timeDiff = as.numeric(events$eventTime - events$announcedAt, units = "days") |

If we preview ‘events’ it contains the following columns:

> head(events) eventTime announcedAt event.name rsvps day monthYear month year timeDiff 1 2013-01-29 18:00:00 2012-11-30 11:30:57 Intro to Graphs 24 Tuesday 01-2013 01 2013 60.270174 2 2014-06-24 18:30:00 2014-06-18 19:11:19 Intro to Graphs 43 Tuesday 06-2014 06 2014 5.971308 3 2014-06-18 18:30:00 2014-06-08 07:03:13 Neo4j World Cup Hackathon 24 Wednesday 06-2014 06 2014 10.476933 4 2014-05-20 18:30:00 2014-05-14 18:56:06 Intro to Graphs 53 Tuesday 05-2014 05 2014 5.981875 5 2014-02-11 18:00:00 2014-02-05 19:11:03 Intro to Graphs 35 Tuesday 02-2014 02 2014 5.950660 6 2014-09-04 18:30:00 2014-08-26 06:34:01 Hands On Intro to Cypher - Neo4j's Query Language 20 Thursday 09-2014 09 2014 9.497211 |

We want to predict ‘rsvps’ from the other columns so I started off by creating a linear model which took all the other columns into account:

> summary(lm(rsvps ~., data = events)) Call: lm(formula = rsvps ~ ., data = events) Residuals: Min 1Q Median 3Q Max -8.2582 -1.1538 0.0000 0.4158 10.5803 Coefficients: (14 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -9.365e+03 3.009e+03 -3.113 0.00897 ** eventTime 3.609e-06 2.951e-06 1.223 0.24479 announcedAt 3.278e-06 2.553e-06 1.284 0.22339 event.nameGraph Modelling - Do's and Don'ts 4.884e+01 1.140e+01 4.286 0.00106 ** event.nameHands on build your first Neo4j app for Java developers 3.735e+01 1.048e+01 3.562 0.00391 ** event.nameHands On Intro to Cypher - Neo4j's Query Language 2.560e+01 9.713e+00 2.635 0.02177 * event.nameIntro to Graphs 2.238e+01 8.726e+00 2.564 0.02480 * event.nameIntroduction to Graph Database Modeling -1.304e+02 4.835e+01 -2.696 0.01946 * event.nameLunch with Neo4j's CEO, Emil Eifrem 3.920e+01 1.113e+01 3.523 0.00420 ** event.nameNeo4j Clojure Hackathon -3.063e+00 1.195e+01 -0.256 0.80203 event.nameNeo4j Python Hackathon with py2neo's Nigel Small 2.128e+01 1.070e+01 1.989 0.06998 . event.nameNeo4j World Cup Hackathon 5.004e+00 9.622e+00 0.520 0.61251 dayTuesday 2.068e+01 5.626e+00 3.676 0.00317 ** dayWednesday 2.300e+01 5.522e+00 4.165 0.00131 ** monthYear01-2014 -2.350e+02 7.377e+01 -3.185 0.00784 ** monthYear02-2013 -2.526e+01 1.376e+01 -1.836 0.09130 . monthYear02-2014 -2.325e+02 7.763e+01 -2.995 0.01118 * monthYear03-2013 -4.605e+01 1.683e+01 -2.736 0.01805 * monthYear03-2014 -2.371e+02 8.324e+01 -2.848 0.01468 * monthYear04-2013 -6.570e+01 2.309e+01 -2.845 0.01477 * monthYear04-2014 -2.535e+02 8.746e+01 -2.899 0.01336 * monthYear05-2013 -8.672e+01 2.845e+01 -3.049 0.01011 * monthYear05-2014 -2.802e+02 9.420e+01 -2.975 0.01160 * monthYear06-2013 -1.022e+02 3.283e+01 -3.113 0.00897 ** monthYear06-2014 -2.996e+02 1.003e+02 -2.988 0.01132 * monthYear07-2014 -3.123e+02 1.054e+02 -2.965 0.01182 * monthYear08-2013 -1.326e+02 4.323e+01 -3.067 0.00976 ** monthYear08-2014 -3.060e+02 1.107e+02 -2.763 0.01718 * monthYear09-2013 NA NA NA NA monthYear09-2014 -3.465e+02 1.164e+02 -2.976 0.01158 * monthYear10-2012 2.602e+01 1.959e+01 1.328 0.20886 monthYear10-2013 -1.728e+02 5.678e+01 -3.044 0.01020 * monthYear11-2012 2.717e+01 1.509e+01 1.800 0.09704 . month02 NA NA NA NA month03 NA NA NA NA month04 NA NA NA NA month05 NA NA NA NA month06 NA NA NA NA month07 NA NA NA NA month08 NA NA NA NA month09 NA NA NA NA month10 NA NA NA NA month11 NA NA NA NA year2013 NA NA NA NA year2014 NA NA NA NA timeDiff NA NA NA NA --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.287 on 12 degrees of freedom Multiple R-squared: 0.9585, Adjusted R-squared: 0.8512 F-statistic: 8.934 on 31 and 12 DF, p-value: 0.0001399 |

As I understand it we can look at the R-squared value to understand how much of the variance in the data has been explained by the model – in this case it’s 85%.

A lot of the coefficients seem to be based around specific event names which seems a bit too specific to me so I wanted to see what would happen if I derived a feature which indicated whether a session was practical:

events$practical = grepl("Hackathon|Hands on|Hands On", events$event.name) |

We can now run the model again with the new column having excluded ‘event.name’ field:

> summary(lm(rsvps ~., data = subset(events, select = -c(event.name)))) Call: lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name))) Residuals: Min 1Q Median 3Q Max -18.647 -2.311 0.000 2.908 23.218 Coefficients: (13 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -3.980e+03 4.752e+03 -0.838 0.4127 eventTime 2.907e-06 3.873e-06 0.751 0.4621 announcedAt 3.336e-08 3.559e-06 0.009 0.9926 dayTuesday 7.547e+00 6.080e+00 1.241 0.2296 dayWednesday 2.442e+00 7.046e+00 0.347 0.7327 monthYear01-2014 -9.562e+01 1.187e+02 -0.806 0.4303 monthYear02-2013 -4.230e+00 2.289e+01 -0.185 0.8553 monthYear02-2014 -9.156e+01 1.254e+02 -0.730 0.4742 monthYear03-2013 -1.633e+01 2.808e+01 -0.582 0.5676 monthYear03-2014 -8.094e+01 1.329e+02 -0.609 0.5496 monthYear04-2013 -2.249e+01 3.785e+01 -0.594 0.5595 monthYear04-2014 -9.230e+01 1.401e+02 -0.659 0.5180 monthYear05-2013 -3.237e+01 4.654e+01 -0.696 0.4952 monthYear05-2014 -1.015e+02 1.509e+02 -0.673 0.5092 monthYear06-2013 -3.947e+01 5.355e+01 -0.737 0.4701 monthYear06-2014 -1.081e+02 1.604e+02 -0.674 0.5084 monthYear07-2014 -1.110e+02 1.678e+02 -0.661 0.5163 monthYear08-2013 -5.144e+01 6.988e+01 -0.736 0.4706 monthYear08-2014 -1.023e+02 1.784e+02 -0.573 0.5731 monthYear09-2013 -6.057e+01 7.893e+01 -0.767 0.4523 monthYear09-2014 -1.260e+02 1.874e+02 -0.672 0.5094 monthYear10-2012 9.557e+00 2.873e+01 0.333 0.7430 monthYear10-2013 -6.450e+01 9.169e+01 -0.703 0.4903 monthYear11-2012 1.689e+01 2.316e+01 0.729 0.4748 month02 NA NA NA NA month03 NA NA NA NA month04 NA NA NA NA month05 NA NA NA NA month06 NA NA NA NA month07 NA NA NA NA month08 NA NA NA NA month09 NA NA NA NA month10 NA NA NA NA month11 NA NA NA NA year2013 NA NA NA NA year2014 NA NA NA NA timeDiff NA NA NA NA practicalTRUE -9.388e+00 5.289e+00 -1.775 0.0919 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 10.21 on 19 degrees of freedom Multiple R-squared: 0.7546, Adjusted R-squared: 0.4446 F-statistic: 2.434 on 24 and 19 DF, p-value: 0.02592 |

Now we’re only accounting for 44% of the variance and none of our coefficients are significant so this wasn’t such a good change.

I also noticed that we’ve got a bit of overlap in the date related features – we’ve got one column for monthYear and then separate ones for month and year. Let’s strip out the combined one:

> summary(lm(rsvps ~., data = subset(events, select = -c(event.name, monthYear)))) Call: lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name, monthYear))) Residuals: Min 1Q Median 3Q Max -16.5745 -4.0507 -0.1042 3.6586 24.4715 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -1.573e+03 4.315e+03 -0.364 0.7185 eventTime 3.320e-06 3.434e-06 0.967 0.3425 announcedAt -2.149e-06 2.201e-06 -0.976 0.3379 dayTuesday 4.713e+00 5.871e+00 0.803 0.4294 dayWednesday -2.253e-01 6.685e+00 -0.034 0.9734 month02 3.164e+00 1.285e+01 0.246 0.8075 month03 1.127e+01 1.858e+01 0.607 0.5494 month04 4.148e+00 2.581e+01 0.161 0.8736 month05 1.979e+00 3.425e+01 0.058 0.9544 month06 -1.220e-01 4.271e+01 -0.003 0.9977 month07 1.671e+00 4.955e+01 0.034 0.9734 month08 8.849e+00 5.940e+01 0.149 0.8827 month09 -5.496e+00 6.782e+01 -0.081 0.9360 month10 -5.066e+00 7.893e+01 -0.064 0.9493 month11 4.255e+00 8.697e+01 0.049 0.9614 year2013 -1.799e+01 1.032e+02 -0.174 0.8629 year2014 -3.281e+01 2.045e+02 -0.160 0.8738 timeDiff NA NA NA NA practicalTRUE -9.816e+00 5.084e+00 -1.931 0.0645 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 10.19 on 26 degrees of freedom Multiple R-squared: 0.666, Adjusted R-squared: 0.4476 F-statistic: 3.049 on 17 and 26 DF, p-value: 0.005187 |

Again none of the coefficients are statistically significant which is disappointing. I think the main problem may be that I have very few data points (only 42) making it difficult to come up with a general model.

I think my next step is to look for some other features that could impact the number of RSVPs e.g. other events on that day, the weather.

I’m a novice at this but trying to learn more so if you have any ideas of what I should do next please let me know.

## R: Deriving a new data frame column based on containing string

I’ve been playing around with R data frames a bit more and one thing I wanted to do was derive a new column based on the text contained in the existing column.

I started with something like this:

> x = data.frame(name = c("Java Hackathon", "Intro to Graphs", "Hands on Cypher")) > x name 1 Java Hackathon 2 Intro to Graphs 3 Hands on Cypher |

And I wanted to derive a new column based on whether or not the session was a practical one. The grepl function seemed to be the best tool for the job:

> grepl("Hackathon|Hands on|Hands On", x$name) [1] TRUE FALSE TRUE |

We can then add a column to our data frame with that output:

x$practical = grepl("Hackathon|Hands on|Hands On", x$name) |

And we end up with the following:

> x name practical 1 Java Hackathon TRUE 2 Intro to Graphs FALSE 3 Hands on Cypher TRUE |

Not too tricky but it took me a bit too long to figure it out so I thought I’d save future Mark some time!

## R: Filtering data frames by column type (‘x’ must be numeric)

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pair wise correlation matrix of variables in a data frame.

The exercise uses the ‘Carseats’ data set which can be imported like so:

> install.packages("ISLR") > library(ISLR) > head(Carseats) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US 1 9.50 138 73 11 276 120 Bad 42 17 Yes Yes 2 11.22 111 48 16 260 83 Good 65 10 Yes Yes 3 10.06 113 35 10 269 80 Medium 59 12 Yes Yes 4 7.40 117 100 4 466 97 Medium 55 14 Yes Yes 5 4.15 141 64 3 340 128 Bad 38 13 Yes No 6 10.81 124 113 13 501 72 Bad 78 16 No Yes |

filter the categorical variables from a data frame and

If we try to run the ‘cor‘ function on the data frame we’ll get the following error:

> cor(Carseats) Error in cor(Carseats) : 'x' must be numeric |

As the error message suggests, we can’t pass non numeric variables to this function so we need to remove the categorical variables from our data frame.

But first we need to work out which columns those are:

> sapply(Carseats, class) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor" "numeric" "numeric" Urban US "factor" "factor" |

We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily:

> sapply(Carseats, is.factor) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE Urban US TRUE TRUE |

Now we can remove those columns from our data frame and create the correlation matrix:

> cor(Carseats[sapply(Carseats, function(x) !is.factor(x))]) Sales CompPrice Income Advertising Population Price Age Education Sales 1.00000000 0.06407873 0.151950979 0.269506781 0.050470984 -0.44495073 -0.231815440 -0.051955242 CompPrice 0.06407873 1.00000000 -0.080653423 -0.024198788 -0.094706516 0.58484777 -0.100238817 0.025197050 Income 0.15195098 -0.08065342 1.000000000 0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422 Advertising 0.26950678 -0.02419879 0.058994706 1.000000000 0.265652145 0.04453687 -0.004557497 -0.033594307 Population 0.05047098 -0.09470652 -0.007876994 0.265652145 1.000000000 -0.01214362 -0.042663355 -0.106378231 Price -0.44495073 0.58484777 -0.056698202 0.044536874 -0.012143620 1.00000000 -0.102176839 0.011746599 Age -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684 1.000000000 0.006488032 Education -0.05195524 0.02519705 -0.056855422 -0.033594307 -0.106378231 0.01174660 0.006488032 1.000000000 |

## R: ggplot – Plotting multiple variables on a line chart

In my continued playing around with meetup data I wanted to plot the number of members who join the Neo4j group over time.

I started off with the variable ‘byWeek’ which shows how many members joined the group each week:

> head(byWeek) Source: local data frame [6 x 2] week n 1 2011-06-02 8 2 2011-06-09 4 3 2011-06-30 2 4 2011-07-14 1 5 2011-07-21 1 6 2011-08-18 1 |

I wanted to plot the actual count alongside a rolling average for which I created the following data frame:

library(zoo) joinsByWeek = data.frame(actual = byWeek$n, week = byWeek$week, rolling = rollmean(byWeek$n, 4, fill = NA, align=c("right"))) |

> head(joinsByWeek, 10) actual week rolling 1 8 2011-06-02 NA 2 4 2011-06-09 NA 3 2 2011-06-30 NA 4 1 2011-07-14 3.75 5 1 2011-07-21 2.00 6 1 2011-08-18 1.25 7 1 2011-10-13 1.00 8 2 2011-11-24 1.25 9 1 2012-01-05 1.25 10 3 2012-01-12 1.75 |

The next step was to work out how to plot both ‘rolling’ and ‘actual’ on the same line chart. The easiest way is to make two calls to ‘geom_line’, like so:

ggplot(joinsByWeek, aes(x = week)) + geom_line(aes(y = rolling), colour="blue") + geom_line(aes(y = actual), colour = "grey") + ylab(label="Number of new members") + xlab("Week") |

Alternatively we can make use of the ‘melt’ function from the reshape library…

library(reshape) meltedJoinsByWeek = melt(joinsByWeek, id = 'week') |

> head(meltedJoinsByWeek, 20) week variable value 1 1 actual 8 2 2 actual 4 3 3 actual 2 4 4 actual 1 5 5 actual 1 6 6 actual 1 7 7 actual 1 8 8 actual 2 9 9 actual 1 10 10 actual 3 11 11 actual 1 12 12 actual 2 13 13 actual 4 14 14 actual 2 15 15 actual 3 16 16 actual 5 17 17 actual 1 18 18 actual 2 19 19 actual 1 20 20 actual 2 |

…which then means we can plot the chart with a single call to geom_line:

ggplot(meltedJoinsByWeek, aes(x = week, y = value, colour = variable)) + geom_line() + ylab(label="Number of new members") + xlab("Week Number") + scale_colour_manual(values=c("grey", "blue")) |

## R: ggplot – Plotting a single variable line chart (geom_line requires the following missing aesthetics: y)

I’ve been learning how to do moving averages in R and having done that calculation I wanted to plot these variables on a line chart using ggplot.

The vector of rolling averages looked like this:

> rollmean(byWeek$n, 4) [1] 3.75 2.00 1.25 1.00 1.25 1.25 1.75 1.75 1.75 2.50 2.25 2.75 3.50 2.75 2.75 [16] 2.25 1.50 1.50 2.00 2.00 2.00 2.00 1.25 1.50 2.25 2.50 3.00 3.25 2.75 4.00 [31] 4.25 5.25 7.50 6.50 5.75 5.00 3.50 4.00 5.75 6.25 6.25 6.00 5.25 6.25 7.25 [46] 7.75 7.00 4.75 2.75 1.75 2.00 4.00 5.25 5.50 11.50 11.50 12.75 14.50 12.50 11.75 [61] 11.00 9.25 5.25 4.50 3.25 4.75 7.50 8.50 9.25 10.50 9.75 15.25 16.00 15.25 15.00 [76] 10.00 8.50 6.50 4.25 3.00 4.25 4.75 7.50 11.25 11.00 11.50 10.00 6.75 11.25 12.50 [91] 12.00 11.50 6.50 8.75 8.50 8.25 9.50 8.50 8.75 9.50 8.00 4.25 4.50 7.50 9.00 [106] 12.00 19.00 19.00 22.25 23.50 22.25 21.75 19.50 20.75 22.75 22.75 24.25 28.00 23.00 26.00 [121] 24.25 21.50 26.00 24.00 28.25 25.50 24.25 31.50 31.50 35.75 35.75 29.00 28.50 27.25 25.50 [136] 27.50 26.00 23.75 |

I initially tried to plot a line chart like this:

library(ggplot2) library(zoo) rollingMean = rollmean(byWeek$n, 4) qplot(rollingMean) + geom_line() |

which resulted in this error:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this. Error: geom_line requires the following missing aesthetics: y |

It turns out we need to provide an x and y value if we want to draw a line chart. In this case we’ll generate the ‘x’ value – we only care that the y values get plotted in order from left to right:

qplot(1:length(rollingMean), rollingMean, xlab ="Week Number") + geom_line() |

If we want to use the ‘ggplot’ function then we need to put everything into a data frame first and then plot it:

ggplot(data.frame(week = 1:length(rollingMean), rolling = rollingMean), aes(x = week, y = rolling)) + geom_line() |

## R: Calculating rolling or moving averages

I’ve been playing around with some time series data in R and since there’s a bit of variation between consecutive points I wanted to smooth the data out by calculating the moving average.

I struggled to find an in built function to do this but came across Didier Ruedin’s blog post which described the following function to do the job:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)} |

I tried plugging in some numbers to understand how it works:

> mav(c(4,5,4,6), 3) Time Series: Start = 1 End = 4 Frequency = 1 [1] NA 4.333333 5.000000 NA |

Here I was trying to do a rolling average which took into account the last 3 numbers so I expected to get just two numbers back – 4.333333 and 5 – and if there were going to be NA values I thought they’d be at the beginning of the sequence.

In fact it turns out this is what the ‘sides’ parameter controls:

sides for convolution filters only. If sides = 1 the filter coefficients are for past values only; if sides = 2 they are centred around lag 0. In this case the length of the filter should be odd, but if it is even, more of the filter is forward in time than backward. |

So in our ‘mav’ function the rolling average looks both sides of the current value rather than just at past values. We can tweak that to get the behaviour we want:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)} |

> mav(c(4,5,4,6), 3) Time Series: Start = 1 End = 4 Frequency = 1 [1] NA NA 4.333333 5.000000 |

The NA values are annoying for any plotting we want to do so let’s get rid of them:

> na.omit(mav(c(4,5,4,6), 3)) Time Series: Start = 3 End = 4 Frequency = 1 [1] 4.333333 5.000000 |

Having got to this point I noticed that Didier had referenced the zoo package in the comments and it has a built in function to take care of all this:

> library(zoo) > rollmean(c(4,5,4,6), 3) [1] 4.333333 5.000000 |

I also realised I can list all the functions in a package with the ‘ls’ function so I’ll be scanning zoo’s list of functions next time I need to do something time series related – there’ll probably already be a function for it!

> ls("package:zoo") [1] "as.Date" "as.Date.numeric" "as.Date.ts" [4] "as.Date.yearmon" "as.Date.yearqtr" "as.yearmon" [7] "as.yearmon.default" "as.yearqtr" "as.yearqtr.default" [10] "as.zoo" "as.zoo.default" "as.zooreg" [13] "as.zooreg.default" "autoplot.zoo" "cbind.zoo" [16] "coredata" "coredata.default" "coredata<-" [19] "facet_free" "format.yearqtr" "fortify.zoo" [22] "frequency<-" "ifelse.zoo" "index" [25] "index<-" "index2char" "is.regular" [28] "is.zoo" "make.par.list" "MATCH" [31] "MATCH.default" "MATCH.times" "median.zoo" [34] "merge.zoo" "na.aggregate" "na.aggregate.default" [37] "na.approx" "na.approx.default" "na.fill" [40] "na.fill.default" "na.locf" "na.locf.default" [43] "na.spline" "na.spline.default" "na.StructTS" [46] "na.trim" "na.trim.default" "na.trim.ts" [49] "ORDER" "ORDER.default" "panel.lines.its" [52] "panel.lines.tis" "panel.lines.ts" "panel.lines.zoo" [55] "panel.plot.custom" "panel.plot.default" "panel.points.its" [58] "panel.points.tis" "panel.points.ts" "panel.points.zoo" [61] "panel.polygon.its" "panel.polygon.tis" "panel.polygon.ts" [64] "panel.polygon.zoo" "panel.rect.its" "panel.rect.tis" [67] "panel.rect.ts" "panel.rect.zoo" "panel.segments.its" [70] "panel.segments.tis" "panel.segments.ts" "panel.segments.zoo" [73] "panel.text.its" "panel.text.tis" "panel.text.ts" [76] "panel.text.zoo" "plot.zoo" "quantile.zoo" [79] "rbind.zoo" "read.zoo" "rev.zoo" [82] "rollapply" "rollapplyr" "rollmax" [85] "rollmax.default" "rollmaxr" "rollmean" [88] "rollmean.default" "rollmeanr" "rollmedian" [91] "rollmedian.default" "rollmedianr" "rollsum" [94] "rollsum.default" "rollsumr" "scale_x_yearmon" [97] "scale_x_yearqtr" "scale_y_yearmon" "scale_y_yearqtr" [100] "Sys.yearmon" "Sys.yearqtr" "time<-" [103] "write.zoo" "xblocks" "xblocks.default" [106] "xtfrm.zoo" "yearmon" "yearmon_trans" [109] "yearqtr" "yearqtr_trans" "zoo" [112] "zooreg" |

## R: ggplot – Cumulative frequency graphs

In my continued playing around with ggplot I wanted to create a chart showing the cumulative growth of the number of members of the Neo4j London meetup group.

My initial data frame looked like this:

> head(meetupMembers) joinTimestamp joinDate monthYear quarterYear week dayMonthYear 1 1.376572e+12 2013-08-15 13:13:40 2013-08-01 2013-07-01 2013-08-15 2013-08-15 2 1.379491e+12 2013-09-18 07:55:11 2013-09-01 2013-07-01 2013-09-12 2013-09-18 3 1.349454e+12 2012-10-05 16:28:04 2012-10-01 2012-10-01 2012-10-04 2012-10-05 4 1.383127e+12 2013-10-30 09:59:03 2013-10-01 2013-10-01 2013-10-24 2013-10-30 5 1.372239e+12 2013-06-26 09:27:40 2013-06-01 2013-04-01 2013-06-20 2013-06-26 6 1.330295e+12 2012-02-26 22:27:00 2012-02-01 2012-01-01 2012-02-23 2012-02-26 |

The first step was to transform the data so that I had a data frame where a row represented a day where a member joined the group. There would then be a count of how many members joined on that date.

We can do this with dplyr like so:

library(dplyr) > head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n())) Source: local data frame [6 x 2] dayMonthYear n 1 2011-06-05 7 2 2011-06-07 1 3 2011-06-10 1 4 2011-06-12 1 5 2011-06-13 1 6 2011-06-15 1 |

To turn that into a chart we can plug it into ggplot and use the cumsum function to generate a line showing the cumulative total:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()), aes(x = dayMonthYear, y = n)) + ylab("Number of members") + xlab("Date") + geom_line(aes(y = cumsum(n))) |

Alternatively we could bring the call to cumsum forward and generate a data frame which has the cumulative total:

> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n))) Source: local data frame [6 x 2] dayMonthYear n 1 2011-06-05 7 2 2011-06-07 8 3 2011-06-10 9 4 2011-06-12 10 5 2011-06-13 11 6 2011-06-15 12 |

And if we plug that into ggplot we’ll get the same curve as before:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), aes(x = dayMonthYear, y = n)) + ylab("Number of members") + xlab("Date") + geom_line() |

If we want the curve to be a bit smoother we can group it by quarter rather than by day:

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n))) Source: local data frame [6 x 2] quarterYear n 1 2011-04-01 13 2 2011-07-01 18 3 2011-10-01 21 4 2012-01-01 43 5 2012-04-01 60 6 2012-07-01 122 |

Now let’s plug that into ggplot:

ggplot(data = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), aes(x = quarterYear, y = n)) + ylab("Number of members") + xlab("Date") + geom_line() |

## R: dplyr – group_by dynamic or programmatic field / variable (Error: index out of bounds)

In my last blog post I showed how to group timestamp based data by week, month and quarter and by the end we had the following code samples using dplyr and zoo:

library(RNeo4j) library(zoo) timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT") query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"}) RETURN membership.joined AS joinTimestamp" meetupMembers = cypher(graph, query) meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp) meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate)) meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate)) meetupMembers %.% group_by(week) %.% summarise(n = n()) meetupMembers %.% group_by(monthYear) %.% summarise(n = n()) meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) |

As you can see there’s quite a bit of duplication going on – the only thing that changes in the last 3 lines is the name of the field that we want to group by.

I wanted to pull this code out into a function and my first attempt was this:

groupMembersBy = function(field) { meetupMembers %.% group_by(field) %.% summarise(n = n()) } |

And now if we try to group by week:

> groupMembersBy("week") Show Traceback Rerun with Debug Error: index out of bounds |

It turns out if we want to do this then we actually want the regroup function rather than group_by:

groupMembersBy = function(field) { meetupMembers %.% regroup(list(field)) %.% summarise(n = n()) } |

And now if we group by week:

> head(groupMembersBy("week"), 20) Source: local data frame [20 x 2] week n 1 2011-06-02 8 2 2011-06-09 4 3 2011-06-16 1 4 2011-06-30 2 5 2011-07-14 1 6 2011-07-21 1 7 2011-08-18 1 8 2011-10-13 1 9 2011-11-24 2 10 2012-01-05 1 11 2012-01-12 3 12 2012-02-09 1 13 2012-02-16 2 14 2012-02-23 4 15 2012-03-01 2 16 2012-03-08 3 17 2012-03-15 5 18 2012-03-29 1 19 2012-04-05 2 20 2012-04-19 1 |

Much better!

## R: Grouping by week, month, quarter

In my continued playing around with R and meetup data I wanted to have a look at when people joined the London Neo4j group based on week, month or quarter of the year to see when they were most likely to do so.

I started with the following query to get back the join timestamps:

library(RNeo4j) query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"}) RETURN membership.joined AS joinTimestamp" meetupMembers = cypher(graph, query) > head(meetupMembers) joinTimestamp 1 1.376572e+12 2 1.379491e+12 3 1.349454e+12 4 1.383127e+12 5 1.372239e+12 6 1.330295e+12 |

The first step was to get joinDate into a nicer format that we can use in R more easily:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT") meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp) > head(meetupMembers) joinTimestamp joinDate 1 1.376572e+12 2013-08-15 13:13:40 2 1.379491e+12 2013-09-18 07:55:11 3 1.349454e+12 2012-10-05 16:28:04 4 1.383127e+12 2013-10-30 09:59:03 5 1.372239e+12 2013-06-26 09:27:40 6 1.330295e+12 2012-02-26 22:27:00 |

Much better!

I started off with grouping by month and quarter and came across the excellent zoo library which makes it really easy to transform dates:

library(zoo) meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate)) meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate)) > head(meetupMembers) joinTimestamp joinDate monthYear quarterYear 1 1.376572e+12 2013-08-15 13:13:40 2013-08-01 2013-07-01 2 1.379491e+12 2013-09-18 07:55:11 2013-09-01 2013-07-01 3 1.349454e+12 2012-10-05 16:28:04 2012-10-01 2012-10-01 4 1.383127e+12 2013-10-30 09:59:03 2013-10-01 2013-10-01 5 1.372239e+12 2013-06-26 09:27:40 2013-06-01 2013-04-01 6 1.330295e+12 2012-02-26 22:27:00 2012-02-01 2012-01-01 |

The next step was to create a new data frame which grouped the data by those fields. I’ve been learning dplyr as part of Udacity’s EDA course so I thought I’d try and use that:

> head(meetupMembers %.% group_by(monthYear) %.% summarise(n = n()), 20) monthYear n 1 2011-06-01 13 2 2011-07-01 4 3 2011-08-01 1 4 2011-10-01 1 5 2011-11-01 2 6 2012-01-01 4 7 2012-02-01 7 8 2012-03-01 11 9 2012-04-01 3 10 2012-05-01 9 11 2012-06-01 5 12 2012-07-01 16 13 2012-08-01 32 14 2012-09-01 14 15 2012-10-01 28 16 2012-11-01 31 17 2012-12-01 7 18 2013-01-01 52 19 2013-02-01 49 20 2013-03-01 22 |

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()), 20) quarterYear n 1 2011-04-01 13 2 2011-07-01 5 3 2011-10-01 3 4 2012-01-01 22 5 2012-04-01 17 6 2012-07-01 62 7 2012-10-01 66 8 2013-01-01 123 9 2013-04-01 139 10 2013-07-01 117 11 2013-10-01 94 12 2014-01-01 266 13 2014-04-01 359 14 2014-07-01 216 |

Grouping by week number is a bit trickier but we can do it with a bit of transformation on our initial timestamp:

meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7)) > head(meetupMembers %.% group_by(week) %.% summarise(n = n()), 20) week n 1 2011-06-02 8 2 2011-06-09 4 3 2011-06-16 1 4 2011-06-30 2 5 2011-07-14 1 6 2011-07-21 1 7 2011-08-18 1 8 2011-10-13 1 9 2011-11-24 2 10 2012-01-05 1 11 2012-01-12 3 12 2012-02-09 1 13 2012-02-16 2 14 2012-02-23 4 15 2012-03-01 2 16 2012-03-08 3 17 2012-03-15 5 18 2012-03-29 1 19 2012-04-05 2 20 2012-04-19 1 |

We can then plug that data frame into ggplot if we want to track membership sign up over time at different levels of granularity and create some bar charts of scatter plots depending on what we feel like!