Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R/ggplot: Controlling X axis order

without comments

As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

library(dplyr)
library(ggplot2)

df = read.csv("/tmp/friends.csv")
top = df %>% head(20)

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity")

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

[bar chart: number of friends per person, in the default order]

The bars appear in a fairly arbitrary order when it would be much nicer to show the most popular people from left to right.

I had the people's names in the correct order in the data frame but annoyingly ggplot then sorted them into alphabetical order. Luckily I came across the scale_x_discrete function, which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  scale_x_discrete(limits= top$p.name)

[bar chart: number of friends per person, ordered from most popular to least]
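
An alternative approach (an aside, not from the original post) is to encode the desired order in the factor levels before plotting, since ggplot orders a discrete axis by the factor's levels:

# make the factor levels follow the data frame's existing row order
top$p.name = factor(top$p.name, levels = top$p.name)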

Written by Mark Needham

February 27th, 2015 at 12:49 am

Posted in R


R: Conditionally updating rows of a data frame

without comments

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data frame individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse function, which can process multiple rows at a time:

system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              NA, 
                              paste(round((as.yearmon(date) - as.yearmon(cohort)) * 12), sep=""))))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA
 
# update only the rows where the event month is on or after the cohort month
validRows = as.yearmon(cohortAttendance$cohort) <= as.yearmon(cohortAttendance$date)
cohortAttendance$monthNumber[validRows] = paste(round((as.yearmon(cohortAttendance$date[validRows]) - as.yearmon(cohortAttendance$cohort[validRows])) * 12), sep="")

Let’s measure that:

system.time(paste(round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Written by Mark Needham

February 26th, 2015 at 12:45 am

Posted in R


R: Cohort analysis of Neo4j meetup members

without comments

A few weeks ago I came across a blog post explaining how to apply cohort analysis to customer retention using R and I thought it’d be a fun exercise to calculate something similar for meetup attendees.

In the customer retention example we track customer purchases on a month by month basis and each customer is put into a cohort or bucket based on the first month they made a purchase in.

We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts.

In our case we aren’t selling anything so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended.

This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.

The first thing we need to do is get the data showing the events that people RSVP’d ‘yes’ to. I’ve already got the data in Neo4j so we’ll write a query to extract it as a data frame:

library(RNeo4j)
library(dplyr)
library(ggplot2)
library(zoo)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(e),
               (e)<-[:TO]-(rsvp {response: \"yes\"})<-[:RSVPD]-(person)
         RETURN rsvp.time, person.id"
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df = cypher(graph, query)
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
> df %>% sample_n(10)
##         rsvp.time person.id                time    date
## 612  1.404857e+12  23362191 2014-07-08 22:00:29 2014-07
## 1765 1.380049e+12 112623332 2013-09-24 18:58:00 2013-09
## 1248 1.390563e+12   9746061 2014-01-24 11:24:35 2014-01
## 1541 1.390920e+12   7881615 2014-01-28 14:40:35 2014-01
## 3056 1.420670e+12  12810159 2015-01-07 22:31:04 2015-01
## 607  1.406025e+12  14329387 2014-07-22 10:34:51 2014-07
## 1634 1.391445e+12  91330472 2014-02-03 16:33:58 2014-02
## 2137 1.371453e+12  68874702 2013-06-17 07:17:10 2013-06
## 430  1.407835e+12 150265192 2014-08-12 09:15:31 2014-08
## 2957 1.417190e+12 182752269 2014-11-28 15:45:18 2014-11

Next we need to find the first meetup that a person attended – this will determine the cohort that the person is assigned to:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
 
> firstMeetup %>% sample_n(10)
## Source: local data frame [10 x 3]
## 
##    person.id          firstEvent count
## 1   13526622 2013-01-24 20:25:19     2
## 2  119400912 2014-10-03 13:09:09     2
## 3  122524352 2014-08-14 14:09:44     1
## 4   37201052 2012-05-21 10:26:24     3
## 5  137112712 2014-07-31 09:32:12     1
## 6  152448642 2014-06-20 08:32:50    17
## 7   98563682 2014-11-05 17:27:57     1
## 8  146976492 2014-05-17 00:04:42     4
## 9   12318409 2014-11-03 05:25:26     2
## 10  41280492 2014-10-16 19:02:03     5

Let’s assign each person to a cohort (month/year) and see how many people belong to each one:

firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
byMonthYear = firstMeetup %>% count(date) %>% arrange(date)
 
ggplot(aes(x=date, y = n), data = byMonthYear) + 
  geom_bar(stat="identity", fill = "dark blue") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[bar chart: number of people in each monthly cohort]

Next we need to track a cohort over time to see whether people keep coming to events. I wrote the following function to work it out:

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all = TRUE)
 
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members))  
}

On the first line we get the ids of all the people in the cohort so that we can filter the data frame to only include RSVPs by these people. The first call to ‘count’ makes sure that we only have one entry per person per month and the second call gives us a count of how many people attended an event in a given month.
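
To make the double count concrete, here's a toy illustration with made-up data. Note that some dplyr versions treat an existing 'n' column as a weight inside count, so n_distinct is an unambiguous way of expressing the same idea:

# person 1 RSVPs to two events in 2014-01, person 2 to one
toy = data.frame(person.id = c(1, 1, 2), date = c("2014-01", "2014-01", "2014-01"))

toy %>% group_by(date) %>% summarise(attendees = n_distinct(person.id))
# =>    date attendees
# => 2014-01         2     (person 1's two RSVPs only count once)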

Next we do the equivalent of a left join using the merge function to ensure we have a row representing each month even if no one from the cohort attended. This will lead to NA entries where there's no matching row in the 'attendance' data frame – we replace those with a 0 for months after the cohort started and leave the months before the cohort as they are.
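
(Strictly speaking, merge with all = TRUE is a full outer join – dplyr's left_join(allCohorts, attendance, by = "date") would express the intent more directly – but since every month in 'attendance' also appears in 'allCohorts' the result is the same here.)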

Finally we calculate the retention rate for each month for that cohort. e.g. these are some of the rows for the '2011-06' cohort:

> countsForCohort(df, firstMeetup, "2011-06") %>% sample_n(10)
      date n  cohort retention
16 2013-01 1 2011-06      0.25
5  2011-10 1 2011-06      0.25
30 2014-03 0 2011-06      0.00
29 2014-02 0 2011-06      0.00
40 2015-01 0 2011-06      0.00
31 2014-04 0 2011-06      0.00
8  2012-04 2 2011-06      0.50
39 2014-12 0 2011-06      0.00
2  2011-07 1 2011-06      0.25
19 2013-04 1 2011-06      0.25

We could then choose to plot that cohort:

ggplot(aes(x=date, y = retention, colour = cohort), data = countsForCohort(df, firstMeetup, "2011-06")) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[line chart: monthly retention rate for the 2011-06 cohort]

From this chart we can see that none of the people who first attended a Neo4j meetup in June 2011 have attended any events for the last two years.

Next we want to be able to plot multiple cohorts on the same chart which we can easily do by constructing one big data frame and passing it to ggplot:

cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
ggplot(aes(x=date, y = retention, colour = cohort), data = cohortAttendance) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[line chart: retention curves for all cohorts]

This all looks a bit of a mess and at the moment we can’t easily compare cohorts as they start at different places on the x axis. We can fix that by adding a ‘monthNumber’ column to the data frame which we calculate with the following function:

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

Now let’s create a new data frame with the month field added:

cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

We're also filtering out any 'NA' entries, which would represent months from before the cohort started. We don't want to plot those.

Finally let's plot a chart containing all cohorts normalised by month number:

ggplot(aes(x=monthNumber, y = retention, colour = cohort), data = cohortAttendanceWithMonthNumber) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.background = element_blank())

[line chart: retention by month number for all cohorts]

It’s still a bit of a mess but what stands out is that when the number of people in a cohort is small the fluctuation in the retention value can be quite pronounced.

The next step is to make the cohorts a bit more coarse grained to see if it reveals some insights. I think I’ll start out with a cohort covering a 3 month period and see how that works out.
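
A sketch of how those coarser cohorts might be derived (an assumption on my part, using zoo's as.yearqtr to roll each month up into its quarter):

# "2011-06" becomes "2011 Q2"
firstMeetup$cohortQuarter = format(as.yearqtr(as.Date(paste0(firstMeetup$date, "-01"))), "%Y Q%q")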

Written by Mark Needham

February 24th, 2015 at 1:19 am

Posted in R


R/dplyr: Extracting data frame column value for filtering with %in%

with one comment

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created a vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame':	3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.
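
As an aside for anyone on a newer version of dplyr: later releases added a pull function which extracts a column as a vector directly, so the same thing can be written as:

highScoringPeople = df %>% filter(score > 3) %>% pull(userId)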

Written by Mark Needham

February 22nd, 2015 at 8:58 am

Posted in R


R: Weather vs attendance at NoSQL meetups

without comments

A few weeks ago I came across a tweet by Sean Taylor asking for a weather data set with a few years' worth of recordings and I was surprised to learn that R already has such a thing – the weatherData package.

weatherData provides a thin veneer around the wunderground API and was exactly what I'd been looking for to compare attendance at London's NoSQL meetups against the weather conditions on the day.

The first step was to download the appropriate weather recordings and save them to a CSV file so I wouldn’t have to keep calling the API.

I thought I may as well download all the recordings available to me and wrote the following code to make that happen:

library(weatherData)
library(dplyr)
library(ggplot2)
 
# London City Airport
getDetailedWeatherForYear = function(year) {
  getWeatherForDate("LCY", 
                    start_date= paste(sep="", year, "-01-01"),
                    end_date = paste(sep="", year, "-12-31"),
                    opt_detailed = FALSE,
                    opt_all_columns = TRUE)
}
 
df = rbind(getDetailedWeatherForYear(2011), 
      getDetailedWeatherForYear(2012),
      getDetailedWeatherForYear(2013),
      getDetailedWeatherForYear(2014),
      getWeatherForDate("LCY", start_date="2015-01-01",
                        end_date = "2015-01-25",
                        opt_detailed = FALSE,
                        opt_all_columns = TRUE))

I then saved that to a CSV file:

write.csv(df, 'weather/temp_data.csv', row.names = FALSE)

The file starts like this:

"Date","GMT","Max_TemperatureC","Mean_TemperatureC","Min_TemperatureC","Dew_PointC","MeanDew_PointC","Min_DewpointC","Max_Humidity","Mean_Humidity","Min_Humidity","Max_Sea_Level_PressurehPa","Mean_Sea_Level_PressurehPa","Min_Sea_Level_PressurehPa","Max_VisibilityKm","Mean_VisibilityKm","Min_VisibilitykM","Max_Wind_SpeedKm_h","Mean_Wind_SpeedKm_h","Max_Gust_SpeedKm_h","Precipitationmm","CloudCover","Events","WindDirDegrees"
2011-01-01,"2011-1-1",7,6,4,5,3,1,93,85,76,1027,1025,1023,10,9,3,14,10,NA,0,7,"Rain",312
2011-01-02,"2011-1-2",4,3,2,1,0,-1,87,81,75,1029,1028,1027,10,10,10,11,8,NA,0,7,"",321
2011-01-03,"2011-1-3",4,2,1,0,-2,-5,87,74,56,1028,1024,1019,10,10,10,8,5,NA,0,6,"Rain-Snow",249
2011-01-04,"2011-1-4",6,3,1,3,1,-1,93,83,65,1019,1013,1008,10,10,10,21,6,NA,0,5,"Rain",224
2011-01-05,"2011-1-5",8,7,5,6,3,0,93,80,61,1008,1000,994,10,9,4,26,16,45,0,4,"Rain",200
2011-01-06,"2011-1-6",7,4,3,6,3,1,93,90,87,1002,996,993,10,9,5,13,6,NA,0,5,"Rain",281
2011-01-07,"2011-1-7",11,6,2,9,5,2,100,91,82,1003,999,996,10,7,2,24,11,NA,0,5,"Rain-Snow",124
2011-01-08,"2011-1-8",11,7,4,8,4,-1,87,77,65,1004,997,987,10,10,5,39,23,50,0,5,"Rain",230
2011-01-09,"2011-1-9",7,4,3,1,0,-1,87,74,57,1018,1012,1004,10,10,10,24,16,NA,0,NA,"",242

If we want to read that back in future we can do so with the following code:

weather = read.csv("weather/temp_data.csv")
weather$Date = as.POSIXct(weather$Date)
 
> weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)
           Date Min_TemperatureC Mean_TemperatureC Max_TemperatureC
1471 2015-01-10                5                 9               14
802  2013-03-12               -2                 1                4
1274 2014-06-27               14                18               22
848  2013-04-27                5                 8               10
832  2013-04-11                6                 8               10
717  2012-12-17                6                 7                9
1463 2015-01-02                6                 9               13
1090 2013-12-25                4                 6                7
560  2012-07-13               15                18               20
1230 2014-05-14                9                14               19

The next step was to bring the weather data together with the meetup attendance data that I already had.

For simplicity's sake I've got those saved in a CSV file so we can just read them in as well:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
events = read.csv("events.csv")
events$eventTime = timestampToDate(events$eventTime)
 
> events %>% sample_n(10) %>% select(event.name, rsvps, eventTime)
                                                           event.name rsvps           eventTime
36                                   London Office Hours - Old Street    10 2012-01-18 17:00:00
137                                          Enterprise Search London    34 2011-05-23 18:15:00
256                           MarkLogic User Group London: Jim Fuller    40 2014-04-29 18:30:00
117                                  Neural Networks and Data Science   171 2013-03-28 18:30:00
210                                  London Office Hours - Old Street     3 2011-09-15 17:00:00
443                                                      July social!    12 2014-07-14 19:00:00
322                                                   Intro to Graphs    39 2014-09-03 18:30:00
203                                  Vendor focus: Amazon CloudSearch    24 2013-05-16 17:30:00
17  Neo4J Tales from the Trenches: A Recommendation Engine Case Study    12 2012-04-25 18:30:00
55                                                London Office Hours    10 2013-09-18 17:00:00

Now that we’ve got our two datasets ready we can plot a simple chart of the average attendance and temperature grouped by month:

byMonth = events %>% 
  mutate(month = factor(format(eventTime, "%B"), levels=month.name)) %>%
  group_by(month) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
 
averageTemperatureByMonth = weather %>% 
  mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
  group_by(month) %>% 
  summarise(aveTemperature = mean(Mean_TemperatureC))
 
g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = count, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
library(gridExtra)
grid.arrange(g1,g2, ncol = 1)

[charts: average temperature by month (top) and total attendance by month (bottom)]

We can see a rough inverse correlation between the temperature and attendance, particularly between April and August – as the temperature increases, total attendance decreases.

But what about if we compare at a finer level of granularity such as a specific date? We can do that by adding a ‘day’ column to our events data frame and merging it with the weather one:

byDay = events %>% 
  mutate(day = as.Date(as.POSIXct(eventTime))) %>%
  group_by(day) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
weather = weather %>% mutate(day = Date)
merged = merge(weather, byDay, by = "day")

Now we can plot the attendance vs the mean temperature for individual days:

ggplot(aes(x =count, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()

[scatter plot: daily attendance vs mean temperature]

Interestingly there now doesn’t seem to be any correlation between the temperature and attendance. We can confirm our suspicions by running a correlation:

> cor(merged$count, merged$Mean_TemperatureC)
[1] 0.008516294

Not even 1% correlation between the values! One way to explore that non-correlation further is to plot the average temperature against the average attendance rather than the total attendance:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = ave, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
grid.arrange(g1,g2, ncol = 1)

[charts: average temperature by month (top) and average attendance by month (bottom)]

Now we can see there's not really much of a correlation between temperature and attendance – in fact 9 of the months have a very similar average attendance. It's only July, December and especially August where there's a noticeable dip.

This could suggest there’s another variable other than temperature which is influencing attendance in these months. My hypothesis is that we’d see lower attendance in the weeks of school holidays – the main ones happen in July/August, December and March/April (which interestingly don’t show the dip!)

Another interesting thing to look into is whether the dip in attendance is down not to lack of will from attendees but to there not actually being any events to go to. Let's plot the number of events being hosted each month against the temperature:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = events, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Events by month")
 
grid.arrange(g1,g2, ncol = 1)

[charts: average temperature by month (top) and number of events by month (bottom)]

Here we notice there's a big dip in events in December – organisers are hosting fewer events and we know from our earlier plot that on average fewer people attend the ones that do run. Lots of events are hosted in the Autumn, slightly fewer in the Spring, and fewer again in January, March and August in particular.

Again there’s no particular correlation between temperature and the number of events being hosted on a particular day:

ggplot(aes(x = events, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()

[scatter plot: number of events per day vs mean temperature]

There's no obvious correlation in this plot either, although I find it difficult to interpret plots where the values are all grouped around very few points on one axis (often a factor variable) and spread out on the other (a continuous variable). Let's confirm our suspicion by calculating the correlation between these two variables:

> cor(merged$events, merged$Mean_TemperatureC)
[1] 0.0251698

Back to the drawing board for my attendance prediction model then!

If you have any suggestions for doing this analysis more effectively or I’ve made any mistakes please let me know in the comments, I’m still learning how to investigate what data is actually telling us.

Written by Mark Needham

February 11th, 2015 at 7:09 am

Posted in R


R: ggplot2 – Each group consist of only one observation. Do you need to adjust the group aesthetic?

with one comment

I’ve been playing around with some weather data over the last couple of days which I aggregated down to the average temperature per month over the last 4 years and stored in a CSV file.

This is what the file looks like:

$ cat /tmp/averageTemperatureByMonth.csv
"month","aveTemperature"
"January",6.02684563758389
"February",5.89380530973451
"March",7.54838709677419
"April",10.875
"May",13.3064516129032
"June",15.9666666666667
"July",18.8387096774194
"August",18.3709677419355
"September",16.2583333333333
"October",13.4596774193548
"November",9.19166666666667
"December",7.01612903225806

I wanted to create a simple line chart which would show the months of the year in ascending order with the appropriate temperature.

My first attempt was the following:

library(ggplot2)

df = read.csv("/tmp/averageTemperatureByMonth.csv")
df$month = factor(df$month, month.name)
 
ggplot(aes(x = month, y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which resulted in the following error:

geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

My understanding is that the points don’t get joined up by default because the variable on the x axis is not a continuous one but rather a factor variable.

One way to work around this problem is to make it numeric, like so:

ggplot(aes(x = as.numeric(month), y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which results in the following chart:

[line chart: temperature by month with a numeric x axis]

This isn’t bad but it’d be much nicer if we could have the month names along the bottom instead.

It turns out we can, but we need to specify a group that each point belongs to. ggplot will then connect points which belong to the same group.

In this case we don’t really have one so we’ll define a dummy one instead:

ggplot(aes(x = month, y = aveTemperature, group=1), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

And now we get the visualisation we want:

[line chart: temperature by month with month names along the x axis]

Written by Mark Needham

January 30th, 2015 at 12:27 am

Posted in R


R: Feature engineering for a linear model

without comments

I previously wrote about a linear model I created to predict how many people would RSVP 'yes' to a meetup event and, having not found much correlation between any of my independent variables and RSVPs, I was a bit stuck.

As luck would have it I bumped into Antonios at a meetup a month ago and he offered to take a look at what I’d tried so far and give me some tips on how to progress.

The first thing he pointed out is that all my features were related to date/time and that I should try to generate some other features. He suggested I start with the following:

  • info about organisers (quantify popularity of organisers, how many people work for them)
  • info about the venue (how many people fit there, how far it is from the centre of the city)
  • number of tweets for the event, during X days before the event

I’d read a lot on Kaggle forums about how feature engineering was the most important part of building statistical models but it didn’t click what that meant until Antonios pointed it out.

The first thing I decided to do was bring in the data for all London’s NoSQL meetups rather than just the Neo4j one to give myself a bit more data to work with.

Group Membership

Having done that, it seemed from visual inspection that the meetup groups with the most members (e.g. Data Science London, Big Data London) got the biggest turnouts.

I thought it’d be interesting to see what the correlation was between group membership and RSVPs so this was the first new feature I added.

I generated this feature with a combination of a Neo4j query and R code, which resulted in this data frame, saved as a CSV file.

We can quickly preview it to see some of the events and the group membership at that time:

> library(dplyr)
> library(ggplot2)
> df = read.csv("/tmp/membersWithGroupCounts.csv")
> df$eventTime = as.POSIXct(df$eventTime)
> df %>% sample_n(10) %>% select(event.name, g.name, eventTime, groupMembers, rsvps)
 
                                                                  event.name                                   g.name           eventTime groupMembers rsvps
23  Scoring Models, Apache Drill for querying structured & unstructured data                      Data Science London 2014-09-18 18:30:00         3466   159
421                                                      London Office Hours                London MongoDB User Group 2012-08-22 17:00:00          468     6
304                            MongoDB University Study Group London Meet up                London MongoDB User Group 2014-07-16 17:00:00         1256    23
43                                                           December Meetup          London ElasticSearch User Group 2014-12-10 18:30:00          721   126
222                                                          Intro to Graphs                Neo4j - London User Group 2014-09-03 18:30:00         1453    39
207                              Intro to Machine Learning with Scikit-Learn                            Women in Data 2014-11-11 18:15:00          574    41
168                                        NoSQL panel and LevelDB + Node.js                             London NoSQL 2014-04-15 18:30:00          183    51
443                                                      London Office Hours                London MongoDB User Group 2012-11-29 17:00:00          590     3
79                                  Apache Cassandra 1.2 with Jonathan Ellis                         Cassandra London 2013-03-06 19:00:00          399    95
362                                                          Span conference Span: scalable and distributed computing 2014-10-28 09:00:00           67    13

One thing I found difficult was coming up with features specific to an individual event – features for the venue or group were much easier to generate. I'm not sure how much that matters.

First let’s see if there’s actually any correlation between these two variables by plotting them:

ggplot(aes(x = groupMembers, y = rsvps), data = df) + 
  geom_point()

[scatter plot: group members vs RSVPs]

It looks like there’s a positive correlation between these two variables but let’s create a single variable linear model to see how much of the variation is explained:

> fit = lm(rsvps ~ groupMembers, data = df)
> fit$coef
 (Intercept) groupMembers 
 20.03579637   0.05382738

Our linear model equation is therefore:

rsvps = 20.03579637 + 0.05382738(groupMembers)
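
So, for example, a group with 1,000 members would be predicted to draw roughly 20.04 + 0.0538 * 1000 ≈ 74 RSVPs.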

Let’s see how well correlated our predicted RSVPs and actual RSVPs are:

> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.6263096

Not too bad! There is quite a strong correlation between these variables although it’s not perfect.

Hours into the day

In my first model I'd treated time as a categorical variable but Antonios pointed out that it's often easier to understand the relationship between variables if they're both continuous, so I transformed the event time like so:

df$hoursIntoDay = as.numeric(df$eventTime - trunc(df$eventTime, "day"), units="hours")

Let’s see how that plots against the RSVP count:

ggplot(aes(x = hoursIntoDay, y = rsvps), data = df) + 
  geom_point()

[scatter plot: hours into the day vs RSVPs]

It's a bit more difficult to see a trend here as events happen at quite discrete times, with the majority starting at 6.30 or 7.00. Nevertheless let's build a linear model with just this variable:

> fit = lm(rsvps ~ hoursIntoDay, data = df)
> fit$coef
 (Intercept) hoursIntoDay 
   -18.79895      4.12984 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.181472

Distance from the centre of London

Next up I tried a feature based on the location of the venue that the events were held at. The hypothesis was that if a venue was closer to the centre of London then people would be more likely to attend.

To calculate this distance I used the distHaversine function from the geosphere package as shown in a previous blog post.
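
The actual code is in that earlier post; a minimal sketch of the idea looks like this (the venue lon/lat column names here are assumptions):

library(geosphere)

# centre of London as a (longitude, latitude) pair
centre = c(-0.129581, 51.516578)

# distHaversine accepts a matrix of lon/lat pairs, so the whole
# column can be calculated in one call
df$distanceFromCentre = distHaversine(cbind(df$venueLon, df$venueLat), centre)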

Let’s have a look at the graph for that variable:

ggplot(aes(x = distanceFromCentre, y = rsvps), data = df) + 
  geom_point()

[scatter plot: distance from centre vs RSVPs]

It’s hard to tell much from this plot, mainly because a majority of the points are clustered around the 2,500 metre mark which represents Shoreditch venues. Let’s plug it into a linear model and see what we come up with:

> fit = lm(rsvps ~ distanceFromCentre, data = df)
> fit$coef
       (Intercept) distanceFromCentre 
      57.243646619       -0.001310492 
> 
> df$predictedRSVPs = predict(fit, df)
> with(df, cor(rsvps, predictedRSVPs))
[1] 0.02999708

There's barely any correlation here, which surprised me. I tried combining this variable with the others in a multiple variable model but it still didn't have much impact so I think we'll park this one for now.

This is as much as I’ve done at the moment and despite spending quite a bit of time on it I still haven’t really explained very much of the variation in RSVP rates!

I have managed to identify some ways that I was able to come up with new features to try out though:

  • Read what other people are doing e.g. I have some ideas for lag variables (e.g. how many people went to your previous meetup) having read about this baseball linear model
  • Talk to other people about your model – they often have ideas you wouldn’t think of being too deep into the problem.
  • Look at what data you already have and try and incorporate that and see where it leads

The next avenue I started exploring is topic modelling as I have a hypothesis that people RSVP for events based on the content of talks but I’m not sure of the best way to go about that.

My current thinking is to pull out some topics/terms by following the example from Chapter 6 of Machine Learning for Hackers.
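
In case it's a useful starting point, here's a hedged sketch of what that term extraction might look like with the tm package (using event.name as the text source is my assumption):

library(tm)

# build a corpus from the event titles and normalise it
corpus = Corpus(VectorSource(as.character(df$event.name)))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))

# terms that appear in at least 10 event titles
dtm = DocumentTermMatrix(corpus)
findFreqTerms(dtm, lowfreq = 10)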

Written by Mark Needham

December 28th, 2014 at 9:55 pm

Posted in R


R: Vectorising all the things

without comments

After my last post about finding the distance a date/time is from the weekend Hadley Wickham suggested I could improve the function by vectorising it…

…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.

I found a few articles useful for explaining vectorisation and why you might want to do it.

Let’s get started.

Distance from the weekend

We want to find out how many hours away from the weekend (i.e. the nearest Saturday/Sunday) a particular date/time is. We'll be using the following libraries and set of date/times:

library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)
 
times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
                 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00

Let’s have a look at the non vectorised version first:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}

Now let’s run it against our data frame:

> system.time(
    data %>% mutate(ind = row_number()) %>% group_by(ind) %>% mutate(dist = distanceFromWeekend(time))  
    )
   user  system elapsed 
  1.837   0.020   1.884

And now for Hadley’s vectorised version:

distanceFromWeekendVectorised = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}
 
> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
   user  system elapsed 
  0.020   0.001   0.023

Extracting start date

My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.

We’ll use this data frame:

googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
                        week score
1  Worldwide; 2004 - present      
2         Interest over time      
3                       Week neo4j
4    2004-01-04 - 2004-01-10     0
5    2004-01-11 - 2004-01-17     0
6    2004-01-18 - 2004-01-24     0
7    2004-01-25 - 2004-01-31     0
8    2004-02-01 - 2004-02-07     0
9    2004-02-08 - 2004-02-14     0
10   2004-02-15 - 2004-02-21     0

The non vectorised version looked like this:

> system.time(
    googleTrends %>% 
      mutate(ind = row_number()) %>% 
      group_by(ind) %>%
      mutate(dates = strsplit(week, " - "),
             start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
    )
   user  system elapsed 
  0.215   0.000   0.214

In this case it's not possible to vectorise the code using strsplit – it returns a list rather than a vector – so we need to use something else. Antonios showed me how to do so using substr:

> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
   user  system elapsed 
  0.018   0.000   0.017

Calculating haversine distance

I wanted to work out the great circle distance from a collection of venues to a centre point in London. I started out with this data frame:

centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")
 
> venues %>% head()
                       venue   lat      lon
1              Skills Matter 51.52 -0.09911
2                   Skinkers 51.50 -0.08387
3          Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5               The Guardian 51.53 -0.12234
6            White Bear Yard 51.52 -0.10980

My non vectorised version looked like this:

> system.time(venues %>% 
    mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
    )
   user  system elapsed 
  0.034   0.000   0.033

It's pretty quick but we can do better – the distHaversine function allows us to calculate multiple distances if the first argument to it is a matrix of lon/lat values rather than a vector:

> system.time(
    venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
    )
   user  system elapsed 
  0.001   0.000   0.001

One I can’t figure out…

And finally I have a function which I can’t figure out how to vectorise but maybe someone with more R skillz than me can?

I have a data frame containing the cumulative member counts of various NoSQL London groups:

cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")
> cumulativeMeetupMembers %>% sample_n(10)
                               g.name dayMonthYear    n
4734            Hadoop Users Group UK   2013-10-26 1144
4668            Hadoop Users Group UK   2013-08-03  979
4936            Hadoop Users Group UK   2014-07-31 1644
5150                      Hive London   2012-10-15  109
8020        Neo4j - London User Group   2014-03-15  826
7666        Neo4j - London User Group   2012-08-06   78
1030                  Big Data London   2013-03-01 1416
6500        London MongoDB User Group   2013-09-21  952
8290 Oracle Big Data 4 the Enterprise   2012-06-04   61
2584              Data Science London   2012-03-20  285

And I want to find out the number of members for a group on a specific date. e.g. given the following data…

> cumulativeMeetupMembers %>% head(10)
                                          g.name dayMonthYear  n
1  Big Data / Data Science / Data Analytics Jobs   2013-01-29  1
2  Big Data / Data Science / Data Analytics Jobs   2013-02-06 15
3  Big Data / Data Science / Data Analytics Jobs   2013-02-07 28
4  Big Data / Data Science / Data Analytics Jobs   2013-02-10 31
5  Big Data / Data Science / Data Analytics Jobs   2013-02-18 33
6  Big Data / Data Science / Data Analytics Jobs   2013-03-27 38
7  Big Data / Data Science / Data Analytics Jobs   2013-04-16 41
8  Big Data / Data Science / Data Analytics Jobs   2013-07-17 53
9  Big Data / Data Science / Data Analytics Jobs   2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs   2013-11-11 63

…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.

I created this data frame of groups and random dates:

dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)
 
groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')
 
> groupsOnDate %>% sample_n(10)
          date                                            name
156 2014-10-06                                 GridGain London
153 2014-09-15                                 GridGain London
70  2014-11-03                                Couchbase London
185 2014-09-29                           Hadoop Users Group UK
105 2014-09-29                             Data Science London
137 2014-10-13            Equal Experts Technical Meetup Group
360 2014-11-03                        Scale Warriors of London
82  2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15                 London ElasticSearch User Group
84  2014-09-22 Data Science & Business Analytics London Meetup

The non vectorised version looks like this:

memberCount = function(meetupMembers) {
  function(groupName, date) {
    (meetupMembers %>% 
       filter(g.name == groupName & dayMonthYear < date) %>% do(tail(., 1)))$n    
  }  
} 
 
findMemberCount = memberCount(cumulativeMeetupMembers)
 
> system.time(groupsOnDate %>% mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) { 
          findMemberCount(row$name, as.character(row$date))
        }) %>% 
        cbind() %>% 
        as.vector() ))
   user  system elapsed 
  2.259   0.005   2.269

The output looks like this:

          date                                     name groupMembers
116 2014-10-06                      DeNormalised London          157
322 2014-09-08                 OpenCredo Tech Workshops            7
71  2014-09-01                  Data Enthusiasts London             
233 2014-09-15          London ElasticSearch User Group          614
171 2014-09-01 HPC & GPU Supercomputing Group of London           80
109 2014-10-27                      Data Science London         3632
20  2014-11-03            Big Data Developers in London          708
42  2014-09-08              Big Data Week London Meetup           96
127 2014-10-13          Enterprise Search London Meetup          575
409 2014-10-27                            Women in Data          548

I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass in all the rows to memberCount and calculate the count for each row in one go.

Any ideas/advice/hints welcome!
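
One possible approach (an untested sketch on my part – it assumes dayMonthYear and date parse cleanly with as.Date): sort each group's counts by date and use findInterval to locate the latest count recorded strictly before each lookup date. It still works group by group, but it vectorises over all of a group's dates in one go:

memberCountsFor = function(meetupMembers, groupName, dates) {
  counts = meetupMembers %>% 
    filter(g.name == groupName) %>% 
    arrange(dayMonthYear)
  # findInterval returns, for each lookup date, how many count dates
  # are <= it; subtracting one day makes the comparison strict
  idx = findInterval(as.Date(dates) - 1, as.Date(counts$dayMonthYear))
  c(NA, counts$n)[idx + 1]  # idx of 0 means no earlier count -> NA
}

groupsOnDate %>% 
  group_by(name) %>% 
  mutate(groupMembers = memberCountsFor(cumulativeMeetupMembers, as.character(name[1]), date)) %>% 
  ungroup()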

Written by Mark Needham

December 22nd, 2014 at 11:46 am

Posted in R


R: Time to/from the weekend

without comments

In my last post I showed some examples using R’s lubridate package and another problem it made really easy to solve was working out how close a particular date time was to the weekend.

I wanted to write a function which would return the previous Sunday or upcoming Saturday depending on which was closer.

lubridate’s floor_date and ceiling_date functions make this quite simple.

e.g. if we want to round the 18th December down to the beginning of the week and up to the beginning of the next week we could do the following:

> library(lubridate)
> floor_date(ymd("2014-12-18"), "week")
[1] "2014-12-14 UTC"
 
> ceiling_date(ymd("2014-12-18"), "week")
[1] "2014-12-21 UTC"

For the date in the future we actually want to grab the Saturday rather than the Sunday so we’ll subtract one day from that:

> ceiling_date(ymd("2014-12-18"), "week") - days(1)
[1] "2014-12-20 UTC"

Now let’s put that together into a function which finds the closest weekend for a given date:

findClosestWeekendDay = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  if((dateToLookup - before) < (after - dateToLookup)) {
    before  
  } else {
    after  
  }
}
 
> findClosestWeekendDay(ymd_hms("2014-12-13 13:33:29"))
[1] "2014-12-13 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-14 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-15 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 11:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 13:33:29"))
[1] "2014-12-20 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-19 13:33:29"))
[1] "2014-12-20 UTC"

I've set the Sunday date at 23:59:59 so that I can use this date in the next step where we want to calculate how many hours it is from the current date to the nearest weekend.

I ended up with this function:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}
 
> distanceFromWeekend(ymd_hms("2014-12-13 13:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-14 18:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-15 18:33:29"))
[1] 18.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 11:33:29"))
[1] 59.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 13:33:29"))
[1] 58.44194
 
> distanceFromWeekend(ymd_hms("2014-12-19 13:33:29"))
[1] 10.44194

While this works it’s quite slow when you run it over a data frame which contains a lot of rows.

There must be a clever R way of doing the same thing (perhaps using matrices) which I haven’t figured out yet so if you know how to speed it up do let me know.
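
(Update: Hadley Wickham later suggested a vectorised version using pmin/pmax – see the 'R: Vectorising all the things' post earlier in this archive.)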

Written by Mark Needham

December 13th, 2014 at 8:38 pm

Posted in R


R: Numeric representation of date time

without comments

I’ve been playing around with date times in R recently and I wanted to derive a numeric representation for a given value to make it easier to see the correlation between time and another variable.

e.g. December 13th 2014 17:30 should return 17.5 since it’s 17.5 hours since midnight.

Using the standard R libraries we would write the following code:

> december13 = as.POSIXlt("2014-12-13 17:30:00")
> as.numeric(december13 - trunc(december13, "day"), units="hours")
[1] 17.5

That works pretty well but Antonios recently introduced me to the lubridate package so I thought I'd give that a try as well.

The first nice thing about lubridate is that we can use the date we created earlier and call the floor_date function rather than trunc:

> (december13 - floor_date(december13, "day"))
Time difference of 17.5 hours

That gives us back a difftime…

> class((december13 - floor_date(december13, "day")))
[1] "difftime"

…which we can divide by different units to get the granularity we want:

> diff = (december13 - floor_date(december13, "day"))
> diff / dhours(1)
[1] 17.5
 
> diff / ddays(1)
[1] 0.7291667
 
> diff / dminutes(1)
[1] 1050

Pretty neat!

lubridate also has some nice functions for creating dates/date times. e.g.

> ymd_hms("2014-12-13 17:00:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_hm("2014-12-13 17:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_h("2014-12-13 17")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd("2014-12-13")
[1] "2014-12-13 UTC"

And if you want a different time zone that’s pretty easy too:

> with_tz(ymd("2014-12-13"), "GMT")
[1] "2014-12-13 GMT"

Written by Mark Needham

December 13th, 2014 at 7:58 pm

Posted in R
