Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: dplyr – “Variables not shown”

without comments

I recently ran into a problem where the result of applying some operations to a data frame wasn’t being output the way I wanted.

I started with this data frame:

# build a vector of random 'words', each lengthOfWord characters long
words = function(numberOfWords, lengthOfWord) {
  w = character(numberOfWords)
  for(i in 1:numberOfWords) {
    w[i] = paste(sample(letters, lengthOfWord, replace = TRUE), collapse = "")
  }
  w
}
 
numberOfRows = 100
# a and b have only 10 values each, so R recycles them to match the 100
# names, giving us repeated (a, b) pairs to group by later
df = data.frame(a = sample(1:numberOfRows, 10, replace = TRUE),
                b = sample(1:numberOfRows, 10, replace = TRUE),
                name = words(numberOfRows, 10))

I wanted to group the data frame by a and b and output a comma-separated list of the associated names. I started with this:

> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5)
 
Source: local data frame [5 x 4]
Groups: a
 
   a  b  n
1 19 90 10
2 24 36 10
3 29 20 10
4 29 80 10
5 62 54 10
Variables not shown: words (chr)

Unfortunately the words column has been excluded. I came across a Stack Overflow post which suggested that the print.tbl_df function was responsible for filtering the columns shown.

Browsing the docs I found a couple of ways to override this behaviour:

> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5) %>%
    print(width = Inf)

or

> options(dplyr.width = Inf)
> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5)

And now we see this output instead:

Source: local data frame [5 x 4]
Groups: a
 
   a  b  n                                                                                                         words
1 19 90 10 dfhtcgymxt,zpemxbpnri,rfmkksuavp,jxaarxzdzd,peydpxjizc,trdzchaxiy,arthnxbaeg,kjbpdvvghm,kpvsddlsua,xmysfcynxw
2 24 36 10 wtokzdfecx,eprsvpsdcp,kzgxtwnqli,jbyuicevrn,klriuenjzu,qzgtmkljoy,bonbhmqfaz,uauoybprrl,rzummfbkbx,icyeorwzxl
3 29 20 10 ebubytlosp,vtligdgvqw,ejlqonhuit,jwidjvtark,kmdzcalblg,qzrlewxcsr,eckfgjnkys,vfdaeqbfqi,rumblliqmn,fvezcdfiaz
4 29 80 10 wputpwgayx,lpawiyhzuh,ufykwguynu,nyqnwjallh,abaxicpixl,uirudflazn,wyynsikwcl,usescualww,bkvsowfaab,gfhyifzepx
5 62 54 10 beuegfzssp,gfmegjtrys,wkubhvnkkk,rkhgprxttb,cwsrzulnpo,hzkvjbiywc,gbmiupnlbw,gffovxwtok,uxadfrjvdn,aojjfhxygs

Much better!
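
As a third option (my addition, a sketch): piping through as.data.frame falls back to base R’s printing, which shows every column:

> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5) %>%
    as.data.frame()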

Written by Mark Needham

November 23rd, 2014 at 1:02 am

Posted in R


R: ggmap – Overlay shapefile with filled polygon of regions

with one comment

I’ve been playing around with plotting maps in R over the last week and got to the point where I wanted to have a Google map in the background with filled shapefile polygons in the foreground.

The first bit is reasonably simple – we can just import the ggmap library and make a call to get_map:

> library(ggmap)
> sfMap = get_map(location = 'San Francisco', zoom = 12)
[Image: Google map of San Francisco]

Next I wanted to show the outlines of the different San Francisco zip codes and came across a blog post by Paul Bidanset on Baltimore neighbourhoods which I was able to adapt.

I downloaded a shapefile of San Francisco’s zip codes from the DataSF website and then loaded it into R using the readOGR and spTransform functions from the rgdal package:

> library(rgdal)
> library(ggplot2)
> sfn = readOGR(".","sfzipcodes") %>% spTransform(CRS("+proj=longlat +datum=WGS84"))
> ggplot(data = sfn, aes(x = long, y = lat, group = group)) + geom_path()
[Image: outlines of the San Francisco zip codes drawn with geom_path]

sfn is a spatial data frame…

> class(sfn)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"

…but we need a normal data frame to be able to easily merge other data onto the map and then plot it. We can use ggplot2’s fortify command to do this:

> names(sfn)
[1] "OBJECTID" "ZIP_CODE" "ID"   
 
> sfn.f = sfn %>% fortify(region = 'ZIP_CODE')
 
> SFNeighbourhoods = merge(sfn.f, sfn@data, by.x = 'id', by.y = 'ZIP_CODE')
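
One thing worth guarding against (my addition, not from the original post): merge doesn’t promise to preserve row order, and polygons render garbled if their vertices get shuffled, so it’s safest to re-sort by the order column that fortify created:

# put the vertices of each polygon back into drawing order
> SFNeighbourhoods = SFNeighbourhoods[order(SFNeighbourhoods$order), ]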

I then made up some fake values for each zip code so that we could have a different colour shading for each one on the visualisation:

> library(dplyr) 
 
> postcodes = SFNeighbourhoods %>% select(id) %>% distinct()
 
# one random value between 5 and 25 for each zip code
> values = data.frame(id = c(postcodes),
                      value = c(runif(postcodes %>% count() %>% unlist(), 5.0, 25.0)))

I then merged those values onto SFNeighbourhoods:

> sf = merge(SFNeighbourhoods, values, by.x='id')
 
> sf %>% group_by(id) %>% do(head(., 1)) %>% head(10)
Source: local data frame [10 x 10]
Groups: id
 
      id      long      lat order  hole piece   group OBJECTID    ID     value
1  94102 -122.4193 37.77515     1 FALSE     1 94102.1       14 94102  6.184814
2  94103 -122.4039 37.77006   106 FALSE     1 94103.1       12 94103 21.659752
3  94104 -122.4001 37.79030   255 FALSE     1 94104.1       10 94104  5.173199
4  94105 -122.3925 37.79377   293 FALSE     1 94105.1        2 94105 15.723456
5  94107 -122.4012 37.78202   504 FALSE     1 94107.1        1 94107  8.402726
6  94108 -122.4042 37.79169  2232 FALSE     1 94108.1       11 94108  8.632652
7  94109 -122.4139 37.79046  2304 FALSE     1 94109.1        8 94109 20.129402
8  94110 -122.4217 37.73181  2794 FALSE     1 94110.1       16 94110 12.410610
9  94111 -122.4001 37.79369  3067 FALSE     1 94111.1        9 94111 10.185054
10 94112 -122.4278 37.73469  3334 FALSE     1 94112.1       18 94112 24.297588

Now we can easily plot those colours onto our shapefile by calling geom_polygon instead of geom_path:

> ggplot(sf, aes(long, lat, group = group)) + 
    geom_polygon(aes(fill = value))

[Image: zip code polygons shaded by value]

And finally let’s wire it up to our Google map:

> ggmap(sfMap) + 
    geom_polygon(aes(fill = value, x = long, y = lat, group = group), 
                 data = sf,
                 alpha = 0.8, 
                 color = "black",
                 size = 0.2)
[Image: shaded zip code polygons overlaid on the Google map]

I spent way too long on this last plot with the alpha value set to ‘0’, wondering why I wasn’t seeing any shading, so don’t make that mistake!

Written by Mark Needham

November 17th, 2014 at 12:53 am

Posted in R


R: dplyr – Sum for group_by multiple columns

with 2 comments

Over the weekend I was playing around with dplyr and had the following data frame grouped by both columns:

> library(dplyr)
 
> data = data.frame(
    letter = sample(LETTERS, 50000, replace = TRUE),
    number = sample (1:10, 50000, replace = TRUE)
    )
 
> data %>% count(letter, number) %>% head(20)
Source: local data frame [20 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196
11      B      1 212
12      B      2 198
13      B      3 194
14      B      4 181
15      B      5 203
16      B      6 234
17      B      7 221
18      B      8 179
19      B      9 182
20      B     10 170

I wanted to add an extra column showing what percentage of that letter’s total each number accounted for.

To do that we first need the total count for each letter, which computed standalone looks like this:

> data %>% count(letter)
Source: local data frame [26 x 2]
 
   letter    n
1       A 1902
2       B 1974
3       C 1911
4       D 1948
5       E 1888
6       F 1975
7       G 1914
8       H 1886
9       I 1910
10      J 1924
11      K 1974
12      L 1891
13      M 1894
14      N 1946
15      O 1956
16      P 1934
17      Q 1865
18      R 1992
19      S 1946
20      T 1854
21      U 1919
22      V 1913
23      W 1928
24      X 1934
25      Y 1908
26      Z 1914

We can also get those totals by summing up the n values of the individual (letter, number) pairs for each letter. The ungroup function lets us reset the grouping so that we can then group by letter alone:

> data %>% count(letter, number) %>% ungroup %>% group_by(letter) %>% mutate(sum.n = sum(n)) %>% head(20)
Source: local data frame [20 x 4]
Groups: letter
 
   letter number   n sum.n
1       A      1 184  1902
2       A      2 192  1902
3       A      3 183  1902
4       A      4 193  1902
5       A      5 214  1902
6       A      6 172  1902
7       A      7 196  1902
8       A      8 198  1902
9       A      9 174  1902
10      A     10 196  1902
11      B      1 212  1974
12      B      2 198  1974
13      B      3 194  1974
14      B      4 181  1974
15      B      5 203  1974
16      B      6 234  1974
17      B      7 221  1974
18      B      8 179  1974
19      B      9 182  1974
20      B     10 170  1974
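
From here the percentage column I originally wanted is just one more mutate away (my addition, a sketch):

> data %>% 
    count(letter, number) %>% 
    ungroup() %>% 
    group_by(letter) %>%
    # sum(n) is evaluated per letter because of the group_by
    mutate(percentage = n / sum(n) * 100) %>%
    head(5)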

Pretty neat!

Written by Mark Needham

November 11th, 2014 at 12:17 am

Posted in R


R: dplyr – Maximum value row in each group

without comments

In my continued work with R’s dplyr I wanted to be able to group a data frame by some columns and then find the maximum value for each group.

We’ll continue with my favourite dummy data set:

> library(dplyr)
 
> data = data.frame(
    letter = sample(LETTERS, 50000, replace = TRUE),
    number = sample (1:10, 50000, replace = TRUE)
    )

I started with the following code to count how many occurrences of each (letter, number) pair there were:

> data %>% count(letter, number)
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196
..    ...    ... ...

I wanted to find the top number for each letter and with a bit of help from Stack Overflow I ended up with the following:

> data %>% count(letter, number) %>% filter(n == max(n))
Source: local data frame [26 x 3]
Groups: letter
 
   letter number   n
1       A      5 214
2       B      6 234
3       C      8 213
4       D      6 211
5       E      9 208
6       F      2 219
7       G      1 213
8       H      2 208
9       I      9 220
10      J      7 213
11      K      3 228
12      L      2 206
13      M      3 212
14      N      4 222
15      O      1 211
16      P      7 211
17      Q      9 210
18      R      5 224
19      S      2 211
20      T      9 204
21      U      8 217
22      V      6 220
23      W      2 213
24      X      2 214
25      Y      3 211
26      Z      3 206

Here we’re filtering over the rows grouped by letter and only selecting the row in each group which has the max value. We can see more clearly what’s happening if we filter the data frame to only contain one letter:

> letterA = data %>% filter(letter == "A") %>% count(letter, number)
> letterA
Source: local data frame [10 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196

And now let’s apply the filter command while also printing out information about each row:

> letterA %>% filter(print(n), print(n == max(n)), n == max(n))
 [1] 184 192 183 193 214 172 196 198 174 196
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
Source: local data frame [1 x 3]
Groups: letter
 
  letter number   n
1      A      5 214

If instead of getting the top number by letter we wanted to get the top letter by number, we just need to reverse the order of the field names in the call to count:

> data %>% count(number, letter) %>% filter(n == max(n))
Source: local data frame [10 x 3]
Groups: number
 
   number letter   n
1       1      G 213
2       2      F 219
3       3      K 228
4       4      N 222
5       5      R 224
6       6      B 234
7       7      B 221
8       8      U 217
9       9      I 220
10     10      O 209

And if we want to see the letter that shows up least frequently for each number, we can simply swap the call to ‘max’ for ‘min’:

> data %>% count(number, letter) %>% filter(n == min(n))
Source: local data frame [11 x 3]
Groups: number
 
   number letter   n
1       1      H 166
2       2      O 160
3       3      E 156
4       4      R 169
5       5      L 169
6       6      I 164
7       7      H 170
8       7      I 170
9       8      Q 166
10      9      W 162
11     10      J 168
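
Note the tie at number 7: both H and I turned up 170 times, which is why we get 11 rows rather than 10. As an aside (my addition, assuming dplyr 0.3 or later where top_n exists), top_n expresses the max version of this query more tersely and keeps ties in the same way:

# top 1 row per group, ranked by n
> data %>% count(number, letter) %>% top_n(1, n)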

Written by Mark Needham

November 10th, 2014 at 10:06 pm

Posted in R


R: dplyr – Ordering by count after multiple column group_by

without comments

I was recently trying to group a data frame by two columns and then sort by the count using dplyr, but it wasn’t sorting in the way I expected, which was initially very confusing.

I started with this data frame:

library(dplyr)
 
data = data.frame(
  letter = sample(LETTERS, 50000, replace = TRUE),
  number = sample (1:10, 50000, replace = TRUE)
  )

And I wanted to find out how many occurrences of each (letter, number) pair exist in the data set. I started with the following code:

> data %>% count(letter, number, sort = TRUE)
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

As you can see it’s only showing A’s, which is interesting as I wouldn’t expect there to be a bias towards that letter. Let’s filter out the A’s:

> data %>% filter(letter != "A") %>% count(letter, number, sort = TRUE)
Source: local data frame [250 x 3]
Groups: letter
 
   letter number   n
1       B      8 222
2       B      9 212
3       B      5 207
4       B      6 201
5       B     10 200
6       B      7 192
7       B      2 189
8       B      3 189
9       B      1 187
10      B      4 181
..    ...    ... ...

Now all we see are B’s, and we can see that both (B,8) and (B,9) have a higher ‘n’ value than any of the A’s.

I put the code back into its more verbose form to see whether it was the count function that was behaving unexpectedly:

> data %>% group_by(letter, number) %>% summarise(n = n()) %>% arrange(desc(n))
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

Nope, still the same behaviour. The result is still grouped by letter, so arrange appears to be sorting within each letter group rather than across the whole data frame.

At this point I vaguely remembered there being a function called ungroup which I hadn’t used before and wondered if now was the time.

> data %>% group_by(letter, number) %>% summarise(n = n()) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]
 
   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...

Indeed it was, and now we can go back to our original version of the code using count and handle the sorting afterwards:

> data %>% count(letter, number) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]
 
   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...

Written by Mark Needham

November 9th, 2014 at 9:30 am

Posted in R


R: Refactoring to dplyr

without comments

I’ve been looking back over some of the early code I wrote using R before I knew about the dplyr library and thought it’d be an interesting exercise to refactor some of the snippets.

We’ll use the following data frame for each of the examples:

library(dplyr)
 
data = data.frame(
  letter = sample(LETTERS, 50000, replace = TRUE),
  number = sample (1:10, 50000, replace = TRUE)
  )

Take {n} rows

> data[1:5,]
  letter number
1      R      7
2      Q      3
3      B      8
4      R      3
5      U      2

becomes:

> data %>% head(5)
  letter number
1      R      7
2      Q      3
3      B      8
4      R      3
5      U      2

Order by numeric value descending

> data[order(-(data$number)),][1:5,]
   letter number
14      H     10
17      G     10
63      L     10
66      W     10
73      R     10

becomes:

> data %>% arrange(desc(number)) %>% head(5)
  letter number
1      H     10
2      G     10
3      L     10
4      W     10
5      R     10

Count number of items

> length(data[,1])
[1] 50000

becomes:

> data %>% count()
Source: local data frame [1 x 1]
 
      n
1 50000

Filter by column value

> length(subset(data, number == 1)[, 1])
[1] 4928

becomes:

> data %>% filter(number == 1) %>% count()
Source: local data frame [1 x 1]
 
     n
1 4928

Group by variable and count

> aggregate(data, by = list(data$number), function(x) length(x))
   Group.1 letter number
1        1   4928   4928
2        2   5045   5045
3        3   5064   5064
4        4   4823   4823
5        5   5032   5032
6        6   5163   5163
7        7   4945   4945
8        8   5077   5077
9        9   5025   5025
10      10   4898   4898

becomes:

> data %>% count(number)
Source: local data frame [10 x 2]
 
   number    n
1       1 4928
2       2 5045
3       3 5064
4       4 4823
5       5 5032
6       6 5163
7       7 4945
8       8 5077
9       9 5025
10     10 4898

Select a range of rows

> data[4:5,]
  letter number
4      R      3
5      U      2

becomes:

> data %>% slice(4:5)
  letter number
1      R      3
2      U      2

There’s certainly more code in some of the dplyr examples, but I find it easier to remember how the dplyr code works when I come back to it and hence tend to favour that approach.

Written by Mark Needham

November 9th, 2014 at 12:11 am

Posted in R


R: dplyr – Group by field dynamically (‘regroup’ is deprecated / no applicable method for ‘as.lazy’ applied to an object of class “list”)

without comments

A few months ago I wrote a blog post explaining how to dynamically/programmatically group a data frame by a field using dplyr, but that approach has been deprecated in the latest version.

To recap, the original function looked like this:

library(dplyr)
 
groupBy = function(df, field) {
  df %.% regroup(list(field)) %.% summarise(n = n())
}

And if we execute that with a sample data frame we’ll see the following:

> data = data.frame(
      letter = sample(LETTERS, 50000, replace = TRUE),
      number = sample (1:10, 50000, replace = TRUE)
  )
 
> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
 
  letter    n
1      A 1951
2      B 1903
3      C 1954
4      D 1923
5      E 1886
Warning messages:
1: %.% is deprecated. Please use %>% 
2: %.% is deprecated. Please use %>% 
3: 'regroup' is deprecated.
Use 'group_by_' instead.
See help("Deprecated")

I replaced each of the deprecated operators and ended up with this function:

groupBy = function(df, field) {
  df %>% group_by_(list(field)) %>% summarise(n = n())
}

Now if we run that:

> groupBy(data, 'letter') %>% head(5)
Error in UseMethod("as.lazy") : 
  no applicable method for 'as.lazy' applied to an object of class "list"

It turns out the ‘group_by_’ function doesn’t want to receive a list of fields, so let’s remove the call to list:

groupBy = function(df, field) {
  df %>% group_by_(field) %>% summarise(n = n())
}

And now if we run that:

> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
 
  letter    n
1      A 1951
2      B 1903
3      C 1954
4      D 1923
5      E 1886

Good times! We get the correct result and no deprecation messages.

If we want to group by multiple fields we can just pass in the field names like so:

groupBy = function(df, field1, field2) {
  df %>% group_by_(field1, field2) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

Or with this simpler version:

groupBy = function(df, ...) {
  df %>% group_by_(...) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203
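
If the fields arrive as a character vector rather than as separate arguments, we can use the .dots parameter instead (my addition; a sketch assuming dplyr 0.3’s standard evaluation API):

groupBy = function(df, fields) {
  # .dots accepts a character vector of column names
  df %>% group_by_(.dots = fields) %>% summarise(n = n())
}
> groupBy(data, c("letter", "number")) %>% head(5)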

I then realised that we can actually just use group_by itself and pass in the field names without quotes, something I couldn’t get to work in earlier versions:

groupBy = function(df, ...) {
  df %>% group_by(...) %>% summarise(n = n())
}
> groupBy(data, letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

We could even get a bit of pipelining going on if we fancied it:

> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

And as of dplyr 0.3 we can simplify our groupBy function to make use of the new count function which combines group_by and summarise:

groupBy = function(df, ...) {
  df %>% count(...)
}
> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203
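
And since groupBy just forwards its arguments to count, count’s sort parameter comes along for free (a small sketch of my own):

> data %>% groupBy(letter, number, sort = TRUE) %>% head(5)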

Written by Mark Needham

November 8th, 2014 at 10:29 pm

Posted in R


R: Joining multiple data frames

without comments

I’ve been looking through the code from Martin Eastwood’s excellent talk ‘Predicting Football Using R’ and was intrigued by the code which reshaped the data into the format expected by the glm function.

The original looks like this:

df <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv')
 
# munge data into format compatible with glm function
df <- apply(df, 1, function(row){
  data.frame(team=c(row['HomeTeam'], row['AwayTeam']),
             opponent=c(row['AwayTeam'], row['HomeTeam']),
             goals=c(row['FTHG'], row['FTAG']),
             home=c(1, 0))
})
df <- do.call(rbind, df)

The initial data frame looks like this:

> library(dplyr)
> df %>% select(HomeTeam, AwayTeam, FTHG, FTAG) %>% head(1)
  HomeTeam    AwayTeam FTHG FTAG
1  Arsenal Aston Villa    1    3

And we want to get it to look like this:

> head(df, 2)
                team    opponent goals home
HomeTeam     Arsenal Aston Villa     1    1
AwayTeam Aston Villa     Arsenal     3    0

So for each row in the initial data frame we want to have two rows: one representing each team, how many goals they scored in the match and whether they were playing at home or away.

I really like dplyr’s pipelining so I thought I’d try to translate Martin’s code to use that and other dplyr functions.

I ended up with the following two sets of function calls:

df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1)
df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)

I’m doing pretty much the same thing as Martin except I’ve used dplyr’s select and mutate functions to transform the data frame.

The next step was to join those two data frames together and with Nicole’s help I realised that there are many ways we can do this.

The functions that will do the job are rbind, union, join (from the plyr package) and merge.

We decided to benchmark them to see which was able to transform the data frame the fastest:

# load data into data.frame
library(plyr)   # provides the join function; load before dplyr to avoid masking
library(dplyr)
dfOrig <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv')
 
original = function(df) {
  df <- apply(df, 1, function(row){
    data.frame(team=c(row['HomeTeam'], row['AwayTeam']),
               opponent=c(row['AwayTeam'], row['HomeTeam']),
               goals=c(row['FTHG'], row['FTAG']),
               home=c(1, 0))
  })
  do.call(rbind, df)
}
 
newRBind = function(df) {
  rbind(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
        df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0))  
}
 
newUnion = function(df) {
  union(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
        df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0))  
}
 
newJoin = function(df) {
  join(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
       df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0),
      type = "full")  
}
 
newMerge = function(df) {
  merge(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
       df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0),
       all = TRUE)  
}
> library(microbenchmark)
 
> microbenchmark(original(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq max neval
 original(dfOrig) 189.4 196.8 202.5    201 205.5 284   100
 
> microbenchmark(newRBind(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq   max neval
 newRBind(dfOrig) 2.197 2.274 2.396  2.309 2.377 4.526   100
 
> microbenchmark(newUnion(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median   uq   max neval
 newUnion(dfOrig) 2.156 2.223 2.377  2.264 2.34 4.597   100
 
> microbenchmark(newJoin(dfOrig))
 
Unit: milliseconds
            expr   min    lq  mean median   uq   max neval
 newJoin(dfOrig) 5.961 6.132 6.817  6.253 6.65 11.95   100
 
> microbenchmark(newMerge(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq   max neval
 newMerge(dfOrig) 7.121 7.413 8.037  7.541 7.934 13.32   100

Using rbind or union gives us an 85-90 times speed up over the original function, whereas with merge or join it’s around 25-30 times quicker.

In this case using merge or join is a bit misleading because we’re not actually connecting the data frames together based on any particular field – we are just appending one to the other.
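
Since we’re only appending rows, dplyr’s own rbind_all would be another candidate (my addition, a sketch assuming a dplyr version around 0.3 where rbind_all exists; it wasn’t part of the original benchmark):

newRBindAll = function(df) {
  # rbind_all takes a list of data frames and appends them row-wise
  rbind_all(list(
    df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
    df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)))
}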

The code’s available as a gist if you want to have a play.

Written by Mark Needham

November 7th, 2014 at 1:29 am

Posted in R


R: Converting a named vector to a data frame

without comments

I’ve been playing around with igraph’s page rank function to see which nodes are the most central in the London NoSQL scene, and I wanted to put the result in a data frame to make the data easier to work with.

I started off with a data frame containing pairs of people and the number of events that they’d both RSVP’d ‘yes’ to:

> library(dplyr)
> data %>% arrange(desc(times)) %>% head(10)
       p.name     other.name times
1  Amit Nandi Anish Mohammed    51
2  Amit Nandi Enzo Martoglio    49
3       louis          zheng    46
4       louis     Raja Kolli    45
5  Raja Kolli Enzo Martoglio    43
6  Amit Nandi     Raja Kolli    42
7       zheng Anish Mohammed    42
8  Raja Kolli          Rohit    41
9  Amit Nandi          zheng    40
10      louis          Rohit    40

I actually had nearly a million such rows in the data frame:

> length(data[,1])
[1] 985664

I ran page rank over the data set like so:

library(igraph)
 
g = graph.data.frame(data, directed = F)
pr = page.rank(g)$vector

If we evaluate pr we can see the person’s name and their page rank:

> head(pr)
Ioanna Eirini          Mjay       Baktash      madhuban    Karl Prior   Keith Bolam 
    0.0002190     0.0001206     0.0001524     0.0008819     0.0001240     0.0005702

I initially tried to convert this to a data frame with the following code…

> head(data.frame(pr))
                     pr
Ioanna Eirini 0.0002190
Mjay          0.0001206
Baktash       0.0001524
madhuban      0.0008819
Karl Prior    0.0001240
Keith Bolam   0.0005702

…which unfortunately didn’t create a column for the person’s name.

> colnames(data.frame(pr))
[1] "pr"

Nicole pointed out that I actually had a named vector and would need to explicitly extract the names from that vector into the data frame. I ended up with this:

> prDf = data.frame(name = names(pr), rank = pr)
> head(prDf)
                       name      rank
Ioanna Eirini Ioanna Eirini 0.0002190
Mjay                   Mjay 0.0001206
Baktash             Baktash 0.0001524
madhuban           madhuban 0.0008819
Karl Prior       Karl Prior 0.0001240
Keith Bolam     Keith Bolam 0.0005702
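
The names also linger as row names, duplicating the name column; they can be reset after the fact (a small tweak of my own):

# replace the name-based row names with the default 1..n sequence
> rownames(prDf) = NULL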

We can now sort the data frame to find the most central people on the NoSQL London scene based on meetup attendance:

> prDf %>%
+   arrange(desc(rank)) %>%
+   head(10)
             name     rank
1           louis 0.001708
2       Kannappan 0.001657
3           zheng 0.001514
4    Peter Morgan 0.001492
5      Ricki Long 0.001437
6      Raja Kolli 0.001416
7      Amit Nandi 0.001411
8  Enzo Martoglio 0.001396
9           Chris 0.001327
10          Rohit 0.001305

Written by Mark Needham

October 31st, 2014 at 11:47 pm

Posted in R


R: Linear models with the lm function, NA values and Collinearity

without comments

In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant.

On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models course, which has a whole slide explaining its meaning:

[Slide: fitting a linear model on the swiss data set with an extra variable z = Agriculture + Education produces an NA coefficient for z]

So in this case ‘z’ doesn’t help us in predicting Fertility since it doesn’t give us any information that we can’t already get from ‘Agriculture’ and ‘Education’.

Although in this case we know why ‘z’ doesn’t have a coefficient, sometimes it may not be clear which other variables the NA one is highly correlated with.

Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.

In that situation we can make use of the alias function to explain the collinearity, as suggested in a Stack Overflow post:

library(datasets); data(swiss); require(stats); require(graphics)
z <- swiss$Agriculture + swiss$Education
fit = lm(Fertility ~ . + z, data = swiss)
> alias(fit)
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + z
 
Complete :
  (Intercept) Agriculture Examination Education Catholic Infant.Mortality
z 0           1           0           1         0        0

In this case we can see that ‘z’ is highly correlated with both Agriculture and Education, which makes sense given it’s the sum of those two variables.
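
As an aside (my addition): summary()$coefficients silently drops aliased terms, which is why z won’t appear in the summary output below; printing the coefficients directly shows the NA itself:

# coef keeps aliased terms, so z shows up with an NA estimate
> coef(fit)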

When we notice that there’s an NA coefficient in our model we can choose to exclude that variable, and the model will still have the same coefficients as before:

> require(dplyr)
> summary(lm(Fertility ~ . + z, data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03
> summary(lm(Fertility ~ ., data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

If we call alias now we won’t see a ‘Complete’ section in the output since there’s no longer a linear dependency:

> alias(lm(Fertility ~ ., data = swiss))
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality

Written by Mark Needham

October 18th, 2014 at 6:35 am

Posted in R
