R: ggplot2 – Each group consist of only one observation. Do you need to adjust the group aesthetic?

I’ve been playing around with some weather data over the last couple of days which I aggregated down to the average temperature per month over the last 4 years and stored in a CSV file.

This is what the file looks like:

$ cat /tmp/averageTemperatureByMonth.csv
"month","aveTemperature"
"January",6.02684563758389
"February",5.89380530973451
"March",7.54838709677419
"April",10.875
"May",13.3064516129032
"June",15.9666666666667
"July",18.8387096774194
"August",18.3709677419355
"September",16.2583333333333
"October",13.4596774193548
"November",9.19166666666667
"December",7.01612903225806

I wanted to create a simple line chart which would show the months of the year in ascending order with the appropriate temperature.

My first attempt was the following:

df = read.csv("/tmp/averageTemperatureByMonth.csv")
df$month = factor(df$month, month.name)
 
ggplot(aes(x = month, y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which resulted in the following error:

geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

My understanding is that the points don’t get joined up by default because the variable on the x axis is not a continuous one but rather a factor variable.

One way to work around this problem is to make it numeric, like so:

ggplot(aes(x = as.numeric(month), y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which results in the following chart:

2015 01 30 00 25 18

This isn’t bad but it’d be much nicer if we could have the month names along the bottom instead.

It turns out we can but we need to specify a group that each point belongs to. ggplot will then connects points which belong to the same group.

In this case we don’t really have one so we’ll define a dummy one instead:

ggplot(aes(x = month, y = aveTemperature, group=1), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

And now we get the visualisation we want:

2015 01 29 23 28 23

R: Featuring engineering for a linear model

I previously wrote about a linear model I created to predict how many people would RSVP ‘yes’ to a meetup event and having not found much correlation between any of my independent variables and RSVPs was a bit stuck.

As luck would have it I bumped into Antonios at a meetup a month ago and he offered to take a look at what I’d tried so far and give me some tips on how to progress.

The first thing he pointed out is that that all my features were related to date/time and that I should try and generate some other features. He suggested I start with the following:

  • info about organisers (quantify popularity of organisers, how many people work for them)
  • info about the venue (how many people fit there, how far it is from the centre of the city)
  • number of tweets for the event, during X days before the event

I’d read a lot on Kaggle forums about how feature engineering was the most important part of building statistical models but it didn’t click what that meant until Antonios pointed it out.

The first thing I decided to do was bring in the data for all London’s NoSQL meetups rather than just the Neo4j one to give myself a bit more data to work with.

Group Membership

Having done that, it seemed from visual inspection that the meetup groups with the most members (i.e. Data Science London, Big Data London) seemed to get the biggest turnouts.

I thought it’d be interesting to see what the correlation was between group membership and RSVPs so this was the first new feature I added.

I generated this feature by a combination of a Neo4j query and R code which resulted in this data frame as CSV file.

We can quickly preview it to see some of the events and the group membership at that time:

> df = read.csv("/tmp/membersWithGroupCounts.csv")
> df$eventTime = as.POSIXct(df$eventTime)
> df %>% sample_n(10) %>% select(event.name, g.name, eventTime, groupMembers, rsvps)
 
                                                                  event.name                                   g.name           eventTime groupMembers rsvps
23  Scoring Models, Apache Drill for querying structured & unstructured data                      Data Science London 2014-09-18 18:30:00         3466   159
421                                                      London Office Hours                London MongoDB User Group 2012-08-22 17:00:00          468     6
304                            MongoDB University Study Group London Meet up                London MongoDB User Group 2014-07-16 17:00:00         1256    23
43                                                           December Meetup          London ElasticSearch User Group 2014-12-10 18:30:00          721   126
222                                                          Intro to Graphs                Neo4j - London User Group 2014-09-03 18:30:00         1453    39
207                              Intro to Machine Learning with Scikit-Learn                            Women in Data 2014-11-11 18:15:00          574    41
168                                        NoSQL panel and LevelDB + Node.js                             London NoSQL 2014-04-15 18:30:00          183    51
443                                                      London Office Hours                London MongoDB User Group 2012-11-29 17:00:00          590     3
79                                  Apache Cassandra 1.2 with Jonathan Ellis                         Cassandra London 2013-03-06 19:00:00          399    95
362                                                          Span conference Span: scalable and distributed computing 2014-10-28 09:00:00           67    13

One thing I found difficult was finding features specific to an event – I’m not sure how much that matters. I generated features for the venue or group much more easily.

First let’s see if there’s actually any correlation between these two variables by plotting them:

ggplot(aes(x = groupMembers, y = rsvps), data = df) + 
  geom_point()
2014 12 28 21 02 21

It looks like there’s a positive correlation between these two variables but let’s create a single variable linear model to see how much of the variation is explained:

> fit = lm(rsvps ~ groupMembers, data = df)
> fit$coef
 (Intercept) groupMembers 
 20.03579637   0.05382738

Our linear model equation is therefore:

rsvps = 20.03579637 + 0.05382738(groupMembers)

Let’s see how well correlated our predicted RSVPs and actual RSVPs are:

> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.6263096

Not too bad! There is quite a strong correlation between these variables although it’s not perfect.

Hours into the day

In my first model I’d treated time as a categorical variable but Antonios pointed out that it’s often easier to understand the relationship between variables if they’re both continuos so I transformed the event time like so:

df$hoursIntoDay = as.numeric(df$eventTime - trunc(df$eventTime, "day"), units="hours")

Let’s see how that plots against the RSVP count:

ggplot(aes(x = hoursIntoDay, y = rsvps), data = df) + 
  geom_point()
2014 12 28 21 27 48

It’s a bit more difficult to see a trend here as there are quite discrete times at which events happen and the majority start at 6.30 or 7.00. Nevertheless let’s build a linear model with just this variable:

> fit = lm(rsvps ~ hoursIntoDay, data = df)
> fit$coef
 (Intercept) hoursIntoDay 
   -18.79895      4.12984 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.181472

Distance from the centre of London

Next up I tried a feature based on the location of the venue that the events were held at. The hypothesis was that if a venue was closer to the centre of London then people would be more likely to attend.

To calculate this distance I used the distHaversine function from the geosphere package as shown in a previous blog post.

Let’s have a look at the graph for that variable:

ggplot(aes(x = distanceFromCentre, y = rsvps), data = df) + 
  geom_point()
2014 12 28 21 37 41

It’s hard to tell much from this plot, mainly because a majority of the points are clustered around the 2,500 metre mark which represents Shoreditch venues. Let’s plug it into a linear model and see what we come up with:

> fit = lm(rsvps ~ distanceFromCentre, data = df)
> fit$coef
       (Intercept) distanceFromCentre 
      57.243646619       -0.001310492 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.02999708

Interestingly there’s barely any correlation here which was surprising to me. I tried combining this variable in a multiple variable model with the others but it still didn’t have much impact so I think we’ll park this one for now.

This is as much as I’ve done at the moment and despite spending quite a bit of time on it I still haven’t really explained very much of the variation in RSVP rates!

I have managed to identify some ways that I was able to come up with new features to try out though:

  • Read what other people are doing e.g. I have some ideas for lag variables (e.g. how many people went to your previous meetup) having read about this baseball linear model
  • Talk to other people about your model – they often have ideas you wouldn’t think of being too deep into the problem.
  • Look at what data you already have and try and incorporate that and see where it leads

The next avenue I started exploring is topic modelling as I have a hypothesis that people RSVP for events based on the content of talks but I’m not sure of the best way to go about that.

My current thinking is to pull out some topics/terms by following the example from Chapter 6 of Machine Learning for Hackers.

R: Vectorising all the things

After my last post about finding the distance a date/time is from the weekend Hadley Wickham suggested I could improve the function by vectorising it…

…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.

I found the following articles useful for explaining vectorisation and why you might want to do it:

Let’s get started.

Distance from the weekend

We want to find out how many hours away from the weekend i.e. nearest Saturday/Sunday a particular date/time is. We’ll be using the following libraries and set of date/times:

library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)
 
times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
                 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00

Let’s have a look at the non vectorised version first:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}

Now let’s run it against our data frame:

> system.time(
    data %>% mutate(ind = row_number()) %>% group_by(ind) %>% mutate(dist = distanceFromWeekend(time))  
    )
   user  system elapsed 
  1.837   0.020   1.884

And now for Hadley’s vectorised version:

distanceFromWeekendVectorised = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}
 
> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
   user  system elapsed 
  0.020   0.001   0.023

Extracting start date

My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.

We’ll use this data frame:

googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
                        week score
1  Worldwide; 2004 - present      
2         Interest over time      
3                       Week neo4j
4    2004-01-04 - 2004-01-10     0
5    2004-01-11 - 2004-01-17     0
6    2004-01-18 - 2004-01-24     0
7    2004-01-25 - 2004-01-31     0
8    2004-02-01 - 2004-02-07     0
9    2004-02-08 - 2004-02-14     0
10   2004-02-15 - 2004-02-21     0

The non vectorised version looked like this:

> system.time(
    googleTrends %>% 
      mutate(ind = row_number()) %>% 
      group_by(ind) %>%
      mutate(dates = strsplit(week, " - "),
             start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
    )
   user  system elapsed 
  0.215   0.000   0.214

In this case it’s actually not possible to vectorise the code using the strsplit so we need to use something else. Antonios showed me how to do so using substr:

> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
   user  system elapsed 
  0.018   0.000   0.017

Calculating haversine distance

I wanted to work out the great circular distance from a collection of venues to a centre point in London. I started out with this data frame:

centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")
 
> venues %>% head()
                       venue   lat      lon
1              Skills Matter 51.52 -0.09911
2                   Skinkers 51.50 -0.08387
3          Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5               The Guardian 51.53 -0.12234
6            White Bear Yard 51.52 -0.10980

My non vectorised version looked like this:

> system.time(venues %>% 
    mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
    )
   user  system elapsed 
  0.034   0.000   0.033

It’s pretty quick but we can do better – the distHaversine function allows us to calculate multiple distances if the first argument ot it is a matrix of lon/lat values rather than a vector:

> system.time(
    venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
    )
   user  system elapsed 
  0.001   0.000   0.001

One I can’t figure out…

And finally I have a function which I can’t figure out how to vectorise but maybe someone with more R skillz than me can?

I have a data frame containing the cumulative member counts of various NoSQL London groups:

cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")
> cumulativeMeetupMembers %>% sample_n(10)
                               g.name dayMonthYear    n
4734            Hadoop Users Group UK   2013-10-26 1144
4668            Hadoop Users Group UK   2013-08-03  979
4936            Hadoop Users Group UK   2014-07-31 1644
5150                      Hive London   2012-10-15  109
8020        Neo4j - London User Group   2014-03-15  826
7666        Neo4j - London User Group   2012-08-06   78
1030                  Big Data London   2013-03-01 1416
6500        London MongoDB User Group   2013-09-21  952
8290 Oracle Big Data 4 the Enterprise   2012-06-04   61
2584              Data Science London   2012-03-20  285

And I want to find out the number of members for a group on a specific date. e.g. given the following data…

> cumulativeMeetupMembers %>% head(10)
                                          g.name dayMonthYear  n
1  Big Data / Data Science / Data Analytics Jobs   2013-01-29  1
2  Big Data / Data Science / Data Analytics Jobs   2013-02-06 15
3  Big Data / Data Science / Data Analytics Jobs   2013-02-07 28
4  Big Data / Data Science / Data Analytics Jobs   2013-02-10 31
5  Big Data / Data Science / Data Analytics Jobs   2013-02-18 33
6  Big Data / Data Science / Data Analytics Jobs   2013-03-27 38
7  Big Data / Data Science / Data Analytics Jobs   2013-04-16 41
8  Big Data / Data Science / Data Analytics Jobs   2013-07-17 53
9  Big Data / Data Science / Data Analytics Jobs   2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs   2013-11-11 63

…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.

I created this data frame of groups and random dates:

dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)
 
groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')
 
> groupsOnDate %>% sample_n(10)
          date                                            name
156 2014-10-06                                 GridGain London
153 2014-09-15                                 GridGain London
70  2014-11-03                                Couchbase London
185 2014-09-29                           Hadoop Users Group UK
105 2014-09-29                             Data Science London
137 2014-10-13            Equal Experts Technical Meetup Group
360 2014-11-03                        Scale Warriors of London
82  2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15                 London ElasticSearch User Group
84  2014-09-22 Data Science & Business Analytics London Meetup

The non vectorised version looks like this:

memberCount = function(meetupMembers) {
  function(groupName, date) {
    (meetupMembers %>% 
       filter(g.name == groupName & dayMonthYear < date) %>% do(tail(., 1)))$n    
  }  
} 
 
findMemberCount = memberCount(cumulativeMeetupMembers)
 
> system.time(groupsOnDate %>% mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) { 
          findMemberCount(row$name, as.character(row$date))
        }) %>% 
        cbind() %>% 
        as.vector() ))
   user  system elapsed 
  2.259   0.005   2.269

The output looks like this:

          date                                     name groupMembers
116 2014-10-06                      DeNormalised London          157
322 2014-09-08                 OpenCredo Tech Workshops            7
71  2014-09-01                  Data Enthusiasts London             
233 2014-09-15          London ElasticSearch User Group          614
171 2014-09-01 HPC & GPU Supercomputing Group of London           80
109 2014-10-27                      Data Science London         3632
20  2014-11-03            Big Data Developers in London          708
42  2014-09-08              Big Data Week London Meetup           96
127 2014-10-13          Enterprise Search London Meetup          575
409 2014-10-27                            Women in Data          548

I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass in all the rows to memberCount and calculate the count for each row in one go.

Any ideas/advice/hints welcome!

R: Time to/from the weekend

In my last post I showed some examples using R’s lubridate package and another problem it made really easy to solve was working out how close a particular date time was to the weekend.

I wanted to write a function which would return the previous Sunday or upcoming Saturday depending on which was closer.

lubridate’s floor_date and ceiling_date functions make this quite simple.

e.g. if we want to round the 18th December down to the beginning of the week and up to the beginning of the next week we could do the following:

> library(lubridate)
> floor_date(ymd("2014-12-18"), "week")
[1] "2014-12-14 UTC"
 
> ceiling_date(ymd("2014-12-18"), "week")
[1] "2014-12-21 UTC"

For the date in the future we actually want to grab the Saturday rather than the Sunday so we’ll subtract one day from that:

> ceiling_date(ymd("2014-12-18"), "week") - days(1)
[1] "2014-12-20 UTC"

Now let’s put that together into a function which finds the closest weekend for a given date:

findClosestWeekendDay = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  if((dateToLookup - before) < (after - dateToLookup)) {
    before  
  } else {
    after  
  }
}
 
> findClosestWeekendDay(ymd_hms("2014-12-13 13:33:29"))
[1] "2014-12-13 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-14 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-15 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 11:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 13:33:29"))
[1] "2014-12-20 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-19 13:33:29"))
[1] "2014-12-20 UTC"

I’ve set the Sunday date at 23:59:59 so that I can use this date in the next step where we want to calculate how how many hours it is from the current date to the nearest weekend.

I ended up with this function:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}
 
> distanceFromWeekend(ymd_hms("2014-12-13 13:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-14 18:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-15 18:33:29"))
[1] 18.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 11:33:29"))
[1] 59.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 13:33:29"))
[1] 58.44194
 
> distanceFromWeekend(ymd_hms("2014-12-19 13:33:29"))
[1] 10.44194

While this works it’s quite slow when you run it over a data frame which contains a lot of rows.

There must be a clever R way of doing the same thing (perhaps using matrices) which I haven’t figured out yet so if you know how to speed it up do let me know.

R: Numeric representation of date time

I’ve been playing around with date times in R recently and I wanted to derive a numeric representation for a given value to make it easier to see the correlation between time and another variable.

e.g. December 13th 2014 17:30 should return 17.5 since it’s 17.5 hours since midnight.

Using the standard R libraries we would write the following code:

> december13 = as.POSIXlt("2014-12-13 17:30:00")
> as.numeric(december13 - trunc(december13, "day"), units="hours")
[1] 17.5

That works pretty well but Antonios recently introduced me to the lubridate so I thought I’d give that a try as well.

The first nice thing about lubridate is that we can use the date we created earlier and call the floor_date function rather than truncate:

> (december13 - floor_date(december13, "day"))
Time difference of 17.5 hours

That gives us back a difftime

> class((december13 - floor_date(december13, "day")))
[1] "difftime"

…which we can divide by different units to get the granularity we want:

> diff = (december13 - floor_date(december13, "day"))
> diff / dhours(1)
[1] 17.5
 
> diff / ddays(1)
[1] 0.7291667
 
> diff / dminutes(1)
[1] 1050

Pretty neat!

lubridate also has some nice functions for creating dates/date times. e.g.

> ymd_hms("2014-12-13 17:00:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_hm("2014-12-13 17:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_h("2014-12-13 17")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd("2014-12-13")
[1] "2014-12-13 UTC"

And if you want a different time zone that’s pretty easy too:

> with_tz(ymd("2014-12-13"), "GMT")
[1] "2014-12-13 GMT"

R: data.table/dplyr/lubridate – Error in wday(date, label = TRUE, abbr = FALSE) : unused arguments (label = TRUE, abbr = FALSE)

I spent a couple of hours playing around with data.table this evening and tried changing some code written using a data frame to use a data table instead.

I started off by building a data frame which contains all the weekends between 2010 and 2015…

> library(lubridate)
 
> library(dplyr)
 
> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))

…which works fine:

> dates %>% head()
         date
1: 2010-01-02
2: 2010-01-03
3: 2010-01-09
4: 2010-01-10
5: 2010-01-16
6: 2010-01-17

I then tried to change the code to use a data table instead which led to the following error:

> library(data.table)
 
> dates = data.table(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(date, label = TRUE, abbr = FALSE) : 
  unused arguments (label = TRUE, abbr = FALSE)

I wasn’t sure what was going on so I went back to the data frame version to check if that still worked…

> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(c(1262304000, 1262390400, 1262476800, 1262563200, 1262649600,  : 
  unused arguments (label = TRUE, abbr = FALSE)

…except it now didn’t work either! I decided to check what wday was referring to…

Help on topic ‘wday’ was found in the following packages:

Integer based date class
(in package data.table in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)
Get/set days component of a date-time.
(in package lubridate in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)

…and realised that data.table has its own wday function – I’d been caught out by R’s global scoping of all the things!

We can probably work around that by the order in which we require the various libraries but for now I’m just prefixing the call to wday and all is well:

dates = dates %>% filter(lubridate::wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))

R: Cleaning up and plotting Google Trends data

I recently came across an excellent article written by Stian Haklev in which he describes things he wishes he’d been told before starting out with R, one being to do all data clean up in code which I thought I’d give a try.

My goal is to leave the raw data completely unchanged, and do all the transformation in code, which can be rerun at any time.

While I’m writing the scripts, I’m often jumping around, selectively executing individual lines or code blocks, running commands to inspect the data in the REPL (read-evaluate-print-loop, where each command is executed as soon as you type enter, in the picture above it’s the pane to the right), etc.

But I try to make sure that when I finish up, the script is runnable by itself.

I thought the Google Trends data set would be an interesting one to play around with as it gives you a CSV containing several different bits of data of which I’m only interested in ‘interest over time’.

It’s not very easy to automate the download of the CSV file so I did that bit manually and automated everything from there onwards.

The first step was to read the CSV file and explore some of the rows to see what it contained:

> library(dplyr)
 
> googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
 
> googleTrends %>% head()
##                   row.names Web.Search.interest..neo4j
## 1 Worldwide; 2004 - present                           
## 2        Interest over time                           
## 3                      Week                      neo4j
## 4   2004-01-04 - 2004-01-10                          0
## 5   2004-01-11 - 2004-01-17                          0
## 6   2004-01-18 - 2004-01-24                          0
 
> googleTrends %>% sample_n(10)
##                   row.names Web.Search.interest..neo4j
## 109 2006-01-08 - 2006-01-14                          0
## 113 2006-02-05 - 2006-02-11                          0
## 267 2009-01-18 - 2009-01-24                          0
## 199 2007-09-30 - 2007-10-06                          0
## 522 2013-12-08 - 2013-12-14                         88
## 265 2009-01-04 - 2009-01-10                          0
## 285 2009-05-24 - 2009-05-30                          0
## 318 2010-01-10 - 2010-01-16                          0
## 495 2013-06-02 - 2013-06-08                         79
## 28  2004-06-20 - 2004-06-26                          0
 
> googleTrends %>% tail()
##                row.names Web.Search.interest..neo4j
## 658        neo4j example                   Breakout
## 659 neo4j graph database                   Breakout
## 660           neo4j java                   Breakout
## 661           neo4j node                   Breakout
## 662           neo4j rest                   Breakout
## 663       neo4j tutorial                   Breakout

We only want to keep the rows which contain (week, interest) pairs so the first thing we’ll do is rename the columns:

names(googleTrends) = c("week", "score")

Now we want to strip out the rows which don’t contain (week, interest) pairs. The easiest way to do this is to look for rows which don’t contain date values in the ‘week’ column.

First we need to split the start and end dates in that column by using the strsplit function.

I found it much easier to apply the function to each row individually rather than passing in a list of values so I created a dummy column with a row number in to allow me to do that (a trick Antonios showed me):

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                        week score ind    dates      start        end
## 1 Worldwide; 2004 - present     1   1 <chr[2]>         NA         NA
## 2        Interest over time     1   2 <chr[1]>         NA         NA
## 3                      Week    90   3 <chr[1]>         NA         NA
## 4   2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 5   2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 6   2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24

Now we need to get rid of the rows which have an NA value for ‘start’ or ‘end':

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) | !is.na(end)) %>% 
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                      week score ind    dates      start        end
## 1 2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 2 2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 3 2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24
## 4 2004-01-25 - 2004-01-31     3   7 <chr[2]> 2004-01-25 2004-01-31
## 5 2004-02-01 - 2004-02-07     3   8 <chr[2]> 2004-02-01 2004-02-07
## 6 2004-02-08 - 2004-02-14     3   9 <chr[2]> 2004-02-08 2004-02-14

Next we’ll get rid of ‘week’, ‘ind’ and ‘dates’ as we aren’t going to need those anymore:

> cleanGoogleTrends = googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) | !is.na(end)) %>%
    ungroup() %>%
    select(-c(ind, dates, week))
 
> cleanGoogleTrends %>% head()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1     3 2004-01-04 2004-01-10
## 2     3 2004-01-11 2004-01-17
## 3     3 2004-01-18 2004-01-24
## 4     3 2004-01-25 2004-01-31
## 5     3 2004-02-01 2004-02-07
## 6     3 2004-02-08 2004-02-14
 
> cleanGoogleTrends %>% sample_n(10)
## Source: local data frame [10 x 3]
## 
##    score      start        end
## 1      8 2010-09-26 2010-10-02
## 2     73 2013-11-17 2013-11-23
## 3     52 2012-07-01 2012-07-07
## 4      3 2005-06-19 2005-06-25
## 5      3 2004-12-12 2004-12-18
## 6      3 2009-09-06 2009-09-12
## 7     71 2014-09-14 2014-09-20
## 8      3 2004-12-26 2005-01-01
## 9     62 2013-03-03 2013-03-09
## 10     3 2006-03-19 2006-03-25
 
> cleanGoogleTrends %>% tail()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1    80 2014-10-19 2014-10-25
## 2    80 2014-10-26 2014-11-01
## 3    84 2014-11-02 2014-11-08
## 4    81 2014-11-09 2014-11-15
## 5    83 2014-11-16 2014-11-22
## 6     2 2014-11-23 2014-11-29

Ok now we’re ready to plot. This was my first attempt:

> library(ggplot2)
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
2014 12 09 17 57 49

As you can see, not too successful! The first mistake I’ve made is not telling ggplot that the ‘start’ column is a date and so it can use that ordering when plotting:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(start =  as.Date(start))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 00 03

My next mistake is that ‘score’ is not being treated as a continuous variable and so we’re ending up with this very strange looking chart. We can see that if we call the class function:

> class(cleanGoogleTrends$score)
## [1] "factor"

Let’s fix that and plot again:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(score = as.numeric(score))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 02 39

That’s much better but there is quite a bit of noise in the week to week scores which we can flatten a bit by plotting a rolling mean of the last 4 weeks instead:

> library(zoo)
> cleanGoogleTrends = cleanGoogleTrends %>% 
    mutate(rolling = rollmean(score, 4, fill = NA, align=c("right")),
           start =  as.Date(start))
 
> ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 05 26

Here’s the full code if you want to reproduce:

library(dplyr)
library(zoo)
library(ggplot2)
 
googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
 
cleanGoogleTrends = googleTrends %>% 
  mutate(ind = row_number()) %>% 
  group_by(ind) %>%
  mutate(dates = strsplit(week, " - "),
         start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
         end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
  filter(!is.na(start) | !is.na(end)) %>%
  ungroup() %>%
  select(-c(ind, dates, week)) %>%
  mutate(start =  as.Date(start),
         score = as.numeric(score),
         rolling = rollmean(score, 4, fill = NA, align=c("right")))
 
ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
  geom_line(size = 0.5)

My next step is to plot the Google Trends scores against my meetup data set to see if there’s any interesting correlations going on.

As an aside I made use of knitr while putting together this post – it works really well for checking that you’ve included all the steps and that it actually works!

R: dplyr – mutate with strptime (incompatible size/wrong result size)

Having worked out how to translate a string into a date or NA if it wasn’t the appropriate format the next thing I wanted to do was store the result of the transformation in my data frame.

I started off with this:

data = data.frame(x = c("2014-01-01", "2014-02-01", "foo"))
> data
           x
1 2014-01-01
2 2014-02-01
3        foo

And when I tried to do the date translation ran into the following error:

> data %>% mutate(y = strptime(x, "%Y-%m-%d"))
Error: wrong result size (11), expected 3 or 1

As I understand it this error is telling us that we are trying to put a value into the data frame which represents 11 rows rather than 3 rows or 1 row.

It turns out that storing POSIXlts in a data frame isn’t such a good idea! In this case we can use the as.character function to create a character vector which can be stored in the data frame:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character())
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
3        foo       <NA>

We can then get rid of the NA row by using the is.na function:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character()) %>% filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01

And a final tweak so that we have 100% pipelining goodness:

> data %>% 
    mutate(y = x %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01

R: String to Date or NA

I’ve been trying to clean up a CSV file which contains some rows with dates and some not – I only want to keep the cells which do have dates so I’ve been trying to work out how to do that.

My first thought was that I’d try and find a function which would convert the contents of the cell into a date if it was in date format and NA if not. I could then filter out the NA values using the is.na function.

I started out with the as.Date function…

> as.Date("2014-01-01")
[1] "2014-01-01"
 
> as.Date("foo")
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

…but that throws an error if we have a non date value so it’s not so useful in this case.

Instead we can make use of the strptime function which does exactly what we want:

> strptime("2014-01-01", "%Y-%m-%d")
[1] "2014-01-01 GMT"
 
> strptime("foo", "%Y-%m-%d")
[1] NA

We can then feed those values into is.na..

> strptime("2014-01-01", "%Y-%m-%d") %>% is.na()
[1] FALSE
 
> strptime("foo", "%Y-%m-%d") %>% is.na()
[1] TRUE

…and we have exactly the behaviour we were looking for.

R: Applying a function to every row of a data frame

In my continued exploration of London’s meetups I wanted to calculate the distance from meetup venues to a centre point in London.

I’ve created a gist containing the coordinates of some of the venues that host NoSQL meetups in London town if you want to follow along:

library(dplyr)
 
# https://gist.github.com/mneedham/7e926a213bf76febf5ed
venues = read.csv("/tmp/venues.csv")
 
venues %>% head()
##                        venue      lat       lon
## 1              Skills Matter 51.52482 -0.099109
## 2                   Skinkers 51.50492 -0.083870
## 3          Theodore Bullfrog 51.50878 -0.123749
## 4 The Skills Matter eXchange 51.52452 -0.099231
## 5               The Guardian 51.53373 -0.122340
## 6            White Bear Yard 51.52227 -0.109804

Now to do the calculation. I’ve chosen the Centre Point building in Tottenham Court Road as our centre point. We can use the distHaversine function in the geosphere library allows us to do the calculation:

options("scipen"=100, "digits"=4)
library(geosphere)
 
centre = c(-0.129581, 51.516578)
aVenue = venues %>% slice(1)
aVenue
##           venue   lat      lon
## 1 Skills Matter 51.52 -0.09911

Now we can calculate the distance from Skillsmatter to our centre point:

distHaversine(c(aVenue$lon, aVenue$lat), centre)
## [1] 2302

That works pretty well so now we want to apply it to every row in the venues data frame and add an extra column containing that value.

This was my first attempt…

venues %>% mutate(distHaversine(c(lon,lat),centre))
## Error in .pointsToMatrix(p1): Wrong length for a vector, should be 2

…which didn’t work quite as I’d imagined!

I eventually found my way to the by function which allows you to ‘apply a function to a data frame split by factors’. In this case I wouldn’t be grouping rows by a factor – I’d apply the function to each row separately.

I wired everything up like so:

distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  })
 
distanceFromCentre %>% head()
## 1:nrow(venues)
##      1      2      3      4      5      6 
## 2301.6 3422.6  957.5 2280.6 1974.1 1509.5

We can now add the distances to our venues data frame:

venuesWithCentre = venues %>% 
  mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
 
venuesWithCentre %>% head()
##                        venue   lat      lon distanceFromCentre
## 1              Skills Matter 51.52 -0.09911             2301.6
## 2                   Skinkers 51.50 -0.08387             3422.6
## 3          Theodore Bullfrog 51.51 -0.12375              957.5
## 4 The Skills Matter eXchange 51.52 -0.09923             2280.6
## 5               The Guardian 51.53 -0.12234             1974.1
## 6            White Bear Yard 51.52 -0.10980             1509.5

Et voila!