Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Filling in missing dates with 0s


I wanted to plot a chart showing the number of blog posts published by month and started with the following code which makes use of zoo’s ‘as.yearmon’ function to add the appropriate column and grouping:

> library(zoo)
> library(dplyr)
> library(ggplot2)
> df %>% sample_n(5)
                                                  title                date
888        R: Converting a named vector to a data frame 2014-10-31 23:47:26
144  Rails: Populating a dropdown list using 'form_for' 2010-08-31 01:22:14
615                    Onboarding: Sketch the landscape 2013-02-15 07:36:06
28                        Javascript: The 'new' keyword 2010-03-06 15:16:02
1290                Coding Dojo #16: Reading SUnit code 2009-05-28 23:23:19
 
> posts_by_date  = df %>% mutate(year_mon = as.Date(as.yearmon(date))) %>% count(year_mon)
> posts_by_date %>% head(5)
 
    year_mon  n
1 2006-08-01  1
2 2006-09-01  4
3 2008-02-01  4
4 2008-07-01  2
5 2008-08-01 38

I then plugged the new data frame into ggplot to get the chart:

> ggplot(aes(x = year_mon, y = n), data = posts_by_date) + geom_line()

[Chart: number of posts per month]

The problem with this chart is that it makes it look like I published 4 posts per month for every month between September 2006 and February 2008, even though I didn’t write anything at all in that period! It does the same thing between February 2008 and July 2008 too.

We can fix that by filling in the gaps with 0s.

First we’ll create a vector containing every month in the date range covered by our data frame:

> all_dates = seq(as.Date(as.yearmon(min(df$date))), as.Date(as.yearmon(max(df$date))), by="month")
 
> all_dates
  [1] "2006-08-01" "2006-09-01" "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-02-01" "2007-03-01"
  [9] "2007-04-01" "2007-05-01" "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01" "2007-10-01" "2007-11-01"
 [17] "2007-12-01" "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01" "2008-06-01" "2008-07-01"
 [25] "2008-08-01" "2008-09-01" "2008-10-01" "2008-11-01" "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01"
 [33] "2009-04-01" "2009-05-01" "2009-06-01" "2009-07-01" "2009-08-01" "2009-09-01" "2009-10-01" "2009-11-01"
 [41] "2009-12-01" "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01" "2010-06-01" "2010-07-01"
 [49] "2010-08-01" "2010-09-01" "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
 [57] "2011-04-01" "2011-05-01" "2011-06-01" "2011-07-01" "2011-08-01" "2011-09-01" "2011-10-01" "2011-11-01"
 [65] "2011-12-01" "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01" "2012-06-01" "2012-07-01"
 [73] "2012-08-01" "2012-09-01" "2012-10-01" "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
 [81] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01" "2013-11-01"
 [89] "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01"
 [97] "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01"
[105] "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01"

Now we need to create a data frame containing those dates and merge it with the original:

posts_by_date_clean = merge(data.frame(date = all_dates),
                            posts_by_date,
                            by.x='date',
                            by.y='year_mon',
                            all.x=T,
                            all.y=T)
 
> posts_by_date_clean %>% head()
        date  n
1 2006-08-01  1
2 2006-09-01  4
3 2006-10-01 NA
4 2006-11-01 NA
5 2006-12-01 NA
6 2007-01-01 NA

We’ve still got some ‘NA’ values in there which won’t plot so well. Let’s set those to 0 and then try and plot our chart again:

> posts_by_date_clean$n[is.na(posts_by_date_clean$n)] = 0
> ggplot(aes(x = date, y = n), data = posts_by_date_clean) + geom_line()
[Chart: number of posts per month, with missing months filled in as 0]

Much better!
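As an aside, the same gap-filling can be expressed as a single dplyr pipeline. This is just a sketch of an alternative, assuming the ‘all_dates’ and ‘posts_by_date’ objects created above:

# join the complete month sequence onto the counts, then replace the NAs with 0
posts_by_date_clean = data.frame(date = all_dates) %>%
  left_join(posts_by_date, by = c("date" = "year_mon")) %>%
  mutate(n = ifelse(is.na(n), 0, n))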

Written by Mark Needham

July 12th, 2015 at 8:30 am

Posted in R


R: Date for given week/year


As I mentioned in my last couple of blog posts I’ve been looking at the data behind this blog and I wanted to plot a chart showing the number of posts per week since the blog started.

I started out with a data frame with posts and publication date:

> library(dplyr)
> library(lubridate)
> df = read.csv("posts.csv")
> df$date = ymd_hms(df$date)
 
> df %>% sample_n(10)
                                                                                title                date
538                                    Nygard Big Data Model: The Investigation Stage 2012-10-10 00:00:36
341                                                            The read-only database 2011-08-29 23:32:26
1112                                  CSS in Internet Explorer - Some lessons learned 2008-10-31 15:24:51
143                                                       Coding: Mutating parameters 2010-08-26 07:47:23
433  Scala: Counting number of inversions (via merge sort) for an unsorted collection 2012-03-20 06:53:18
618                                    neo4j/cypher: SQL style GROUP BY functionality 2013-02-17 21:05:27
1111                                 Testing Hibernate mappings: Setting up test data 2008-10-30 13:24:14
462                                       neo4j: What question do you want to answer? 2012-05-05 13:20:41
1399                                       Book Club: Design Sense (Michael Feathers) 2009-09-29 14:42:29
494                                    Bash Shell: Reusing parts of previous commands 2012-07-05 23:42:35

The first step was to add a couple of columns representing the week and year for the publication date. The ‘lubridate’ library came in handy here:

byWeek = df %>% 
  mutate(year = year(date), week = week(date)) %>% 
  group_by(week, year) %>% summarise(n = n()) %>% 
  ungroup() %>% arrange(desc(n))
 
> byWeek
Source: local data frame [352 x 3]
 
   week year  n
1    33 2008 14
2    35 2008 11
3    53 2012 11
4     9 2013 10
5    12 2013  9
6    21 2009  9
7    22 2009  9
8    38 2013  9
9    40 2008  9
10   48 2012  9
..  ...  ... ..

The next step is to calculate the start date of each of those weeks so that we can plot the counts on a continuous date scale. I spent a while searching for how to do this before realising that the ‘week’ function I used before can also set the week for a given date. Let’s get to work:

calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
> calculate_start_of_week(c(1,2,3), c(2015,2014,2013))
[1] "2015-01-01 UTC" "2014-01-08 UTC" "2013-01-15 UTC"

And now let’s transform our data frame and plot the counts:

ggplot(aes(x=start_of_week, y=n, group=1), 
       data = byWeek %>% mutate(start_of_week = calculate_start_of_week(week, year))) + 
  geom_line()

[Chart: number of posts per week]

It’s a bit erratic, as you can see. Some of this is explained by the fact that I do post in an erratic way, while some of it is explained by the fact that the final ‘week’ of each year only contains a day or two, so those weeks can never score highly.
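To make that concrete (my own illustration, not from the original post), lubridate’s week function counts complete seven-day periods since January 1st, so the 53rd ‘week’ of a year only ever holds one or two days:

library(lubridate)
 
# week() is (yday - 1) %/% 7 + 1, so in 2012 (a leap year)
# only December 30th and 31st fall into week 53
week(ymd("2012-12-29"))  # 52
week(ymd("2012-12-31"))  # 53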

Written by Mark Needham

July 10th, 2015 at 10:01 pm

Posted in R


R: dplyr – Error: cannot modify grouping variable


I’ve been doing some exploration of the posts made on this blog and I thought I’d start with answering a simple question – on which dates did I write the most posts?

I started with a data frame containing each post and the date it was published:

> library(dplyr)
> df %>% sample_n(5)
                                                title                date
1148 Taiichi Ohno's Workplace Management: Book Review 2008-12-08 14:14:48
158     Rails: Faking a delete method with 'form_for' 2010-09-20 18:52:15
331           Retrospectives: The 4 L's Retrospective 2011-07-25 21:00:30
1035       msbuild - Use OutputPath instead of OutDir 2008-08-14 18:54:03
1181                The danger of commenting out code 2009-01-17 06:02:33

To find the most popular days for blog posts we can write the following aggregation:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% arrange(desc(n))
 
Source: local data frame [1,140 x 2]
 
          day n
1  2012-12-31 6
2  2014-05-31 6
3  2008-08-08 5
4  2013-01-27 5
5  2009-08-24 4
6  2012-06-24 4
7  2012-09-30 4
8  2012-10-27 4
9  2012-11-24 4
10 2013-02-28 4

So we can see a couple of days with 6 posts, a couple with 5 posts, a few more with 4 posts and then presumably loads of days with 1 post.

I thought it’d be cool if we could plot a histogram with the number of posts on the x axis and, on the y axis, the number of days on which that many posts were published, e.g. for an x value of 6 (posts) we’d have a y value of 2 (occurrences).

My initial attempt was this:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% count(n)
Error: cannot modify grouping variable

Unfortunately that isn’t allowed. I tried ungrouping and then counting again:

 df %>% mutate(day = as.Date(date)) %>% count(day) %>% ungroup() %>% count(n)
Error: cannot modify grouping variable

Still no luck. I did a bit of googling around and came across a post which suggested using a combination of group_by + mutate or group_by + summarize.

I tried the mutate approach first:

> df %>% mutate(day = as.Date(date)) %>% 
+     group_by(day) %>% mutate(n = n()) %>% ungroup() %>% sample_n(5)
Source: local data frame [5 x 4]
 
                                    title                date        day n
1 QCon London 2009: DDD & BDD - Dan North 2009-03-13 15:28:04 2009-03-13 2
2        Onboarding: Sketch the landscape 2013-02-15 07:36:06 2013-02-15 1
3                           Ego Depletion 2013-06-04 23:16:29 2013-06-04 1
4                 Clean Code: Book Review 2008-09-15 09:52:33 2008-09-15 1
5            Dreyfus Model: More thoughts 2009-08-10 10:36:51 2009-08-10 1

That keeps the ‘title’ column around, which is a bit annoying. We can get rid of it using a distinct on ‘day’ if we want, and if we also implement the second grouping we end up with the following:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% mutate(n = n()) %>% distinct(day) %>% ungroup() %>% 
    group_by(n) %>%
    mutate(c = n()) %>%
    distinct(n)  
 
Source: local data frame [6 x 5]
Groups: n
 
                                                title                date        day n   c
1       Functional C#: Writing a 'partition' function 2010-02-01 23:34:02 2010-02-01 1 852
2                            Willed vs Forced designs 2010-02-08 22:48:05 2010-02-08 2 235
3                            TDD: Testing collections 2010-07-28 06:05:25 2010-07-28 3  41
4  Creating a Samba share between Ubuntu and Mac OS X 2012-06-24 00:40:35 2012-06-24 4   8
5            Gamification and Software: Some thoughts 2012-12-31 10:57:19 2012-12-31 6   2
6 Python/numpy: Selecting specific column in 2D array 2013-01-27 02:10:10 2013-01-27 5   2

Annoyingly we’ve still got the ‘title’, ‘date’ and ‘day’ columns hanging around which we’d need to get rid of with a call to ‘select’. The code also feels quite icky, especially the use of distinct in a couple of places.

In fact we can simplify the code if we use summarize instead of mutate:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
 
 
Source: local data frame [6 x 2]
 
  n   c
1 1 852
2 2 235
3 3  41
4 4   8
5 5   2
6 6   2

And we’ve also got rid of the extra columns into the bargain, which is great! Now we can plot our histogram:

> library(ggplot2)
> post_frequencies = df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
> ggplot(aes(x = n, y = c), data = post_frequencies) + geom_bar(stat = "identity")

[Chart: number of days (y) for each posts-per-day value (x)]

In this case we don’t actually need to do the second grouping to create the bar chart since ggplot will do it for us if we feed it the following data:

> ggplot(aes(x = n), 
         data = df %>% mutate(day = as.Date(date)) %>% group_by(day) %>% summarize(n = n()) %>% ungroup()) +
    geom_bar(binwidth = 1) +
    scale_x_continuous(limits=c(1, 6))
[Chart: the same histogram built by letting ggplot do the counting]

Still, it’s good to know how!
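As a footnote, and as a workaround of my own rather than anything from the post, the original error can also be sidestepped by renaming the count column before counting again, so that the second count isn’t asked to modify its own grouping variable (behaviour may vary between dplyr versions):

df %>% mutate(day = as.Date(date)) %>%
  count(day) %>%
  ungroup() %>%
  rename(posts = n) %>%  # free up the name 'n' for the second count to use
  count(posts)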

Written by Mark Needham

July 9th, 2015 at 5:55 am

Posted in R


R: Wimbledon – How do the seeds get on?


Continuing on with the Wimbledon data set I’ve been playing with I wanted to do some exploration on how the seeded players have fared over the years.

Taking the last 10 years’ worth of data, there have always been 32 seeds, and with the following function we can feed in a seeding and get back the round that player would be expected to reach:

expected_round = function(seeding) {  
  if(seeding == 1) {
    return("Winner")
  } else if(seeding == 2) {
    return("Finals") 
  } else if(seeding <= 4) {
    return("Semi-Finals")
  } else if(seeding <= 8) {
    return("Quarter-Finals")
  } else if(seeding <= 16) {
    return("Round of 16")
  } else {
    return("Round of 32")
  }
}
 
> expected_round(1)
[1] "Winner"
 
> expected_round(4)
[1] "Semi-Finals"
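One thing worth noting (my observation rather than the post’s) is that expected_round only handles a single seeding at a time since the if conditions aren’t vectorised. That’s fine inside the grouped mutate used below, but to apply it to a whole vector of seedings you’d want to wrap it, for example with sapply:

# a small sketch: vectorise the helper
expected_rounds = function(seedings) {
  sapply(seedings, expected_round)
}
 
expected_rounds(c(1, 4, 20))
# returns "Winner", "Semi-Finals", "Round of 32"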

We can then have a look at each of the Wimbledon tournaments and work out how far they actually got.

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
 
seeds = function(matches_to_consider) {
  winners =  matches_to_consider %>% filter(!is.na(winner_seeding)) %>% 
    select(name = winner, seeding =  winner_seeding) %>% distinct()
  losers = matches_to_consider %>% filter( !is.na(loser_seeding)) %>% 
    select(name = loser, seeding =  loser_seeding) %>% distinct()
 
  return(rbind(winners, losers) %>% distinct() %>% mutate(name = as.character(name)))
}

Let’s have a look at how the seeds got on last year:

matches_to_consider = main_matches %>% filter(year == 2014)
 
result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
result$round = factor(result$round, levels = rounds, ordered = TRUE)
result$expected = factor(result$expected, levels = rounds, ordered = TRUE) 
 
> result %>% head(10)
Source: local data frame [10 x 4]
 
             name seeding       expected          round
1  Novak Djokovic       1         Winner         Winner
2    Rafael Nadal       2         Finals    Round of 16
3     Andy Murray       3    Semi-Finals Quarter-Finals
4   Roger Federer       4    Semi-Finals         Finals
5   Stan Wawrinka       5 Quarter-Finals Quarter-Finals
6   Tomas Berdych       6 Quarter-Finals    Round of 32
7    David Ferrer       7 Quarter-Finals    Round of 64
8    Milos Raonic       8 Quarter-Finals    Semi-Finals
9      John Isner       9    Round of 16    Round of 32
10  Kei Nishikori      10    Round of 16    Round of 16

We’ll wrap all of that code into the following function:

expectations = function(y, matches) {
  matches_to_consider = matches %>% filter(year == y)  
 
  result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
  result$round = factor(result$round, levels = rounds, ordered = TRUE)
  result$expected = factor(result$expected, levels = rounds, ordered = TRUE)  
 
  return(result)
}

Next, instead of showing the round names it’d be cool to come up with a numerical value indicating how well the player did:

  • -1 would mean they lost one round earlier than their seeding suggested e.g. seed 2 loses in the Semi-Finals
  • 2 would mean they got 2 rounds further than they should have e.g. seed 7 reaches the Final

The unclass function comes to our rescue here:

# expectations plot
years = 2005:2014
exp = data.frame()
for(y in years) {
  differences = expectations(y, main_matches) %>% 
    mutate(expected_n = as.numeric(unclass(expected)), 
           round_n = as.numeric(unclass(round)), 
           difference = round_n - expected_n,
           year = y) %>% 
    select(name, seeding, expected_n, round_n, difference, year)
  exp = rbind(exp, differences)
}
 
> exp %>% sample_n(10)
Source: local data frame [10 x 6]
 
              name seeding expected_n round_n difference year
1    Tomas Berdych       6          6       5         -1 2011
2    Tomas Berdych       7          6       6          0 2013
3     Rafael Nadal       2          8       5         -3 2014
4    Fabio Fognini      16          5       4         -1 2014
5  Robin Soderling      13          5       5          0 2009
6    Jurgen Melzer      16          5       5          0 2010
7  Nicolas Almagro      19          4       2         -2 2010
8    Stan Wawrinka      14          5       3         -2 2011
9     David Ferrer       7          6       5         -1 2011
10 Mikhail Youzhny      14          5       5          0 2007

We can then group by the ‘difference’ column to see how seeds are getting on as a whole:

> exp %>% count(difference)
Source: local data frame [9 x 2]
 
  difference  n
1         -5  2
2         -4  7
3         -3 24
4         -2 70
5         -1 66
6          0 85
7          1 43
8          2 17
9          3  4
 
library(ggplot2)
ggplot(aes(x = difference, y = n), data = exp %>% count(difference)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(limits = c(min(exp$difference), max(exp$difference) + 1))
[Chart: number of seeds (y) for each difference from expected round (x)]

So from this visualisation we can see that the most common outcome for a seed is that they reach the round they were expected to reach. There are still a decent number of seeds who do 1 or 2 rounds worse than expected as well though.

Antonios suggested doing some analysis of how the seeds fared on a year by year basis – we’ll start by looking at what % of them exactly achieved their seeding:

exp$correct_pred = 0
exp$correct_pred[exp$difference == 0] = 1
 
exp %>% group_by(year) %>% 
  summarise(MeanDiff = mean(difference),
            PrcCorrect = mean(correct_pred),
            N=n())
 
Source: local data frame [10 x 4]
 
   year   MeanDiff PrcCorrect  N
1  2005 -0.6562500  0.2187500 32
2  2006 -0.8125000  0.2812500 32
3  2007 -0.4838710  0.4193548 31
4  2008 -0.9677419  0.2580645 31
5  2009 -0.3750000  0.2500000 32
6  2010 -0.7187500  0.4375000 32
7  2011 -0.7187500  0.0937500 32
8  2012 -0.7500000  0.2812500 32
9  2013 -0.9375000  0.2500000 32
10 2014 -0.7187500  0.1875000 32

Some years are better than others – we can use a chisq test to see whether there are any significant differences between the years:

tbl = table(exp$year, exp$correct_pred)
tbl
 
> chisq.test(tbl)
 
	Pearson's Chi-squared test
 
data:  tbl
X-squared = 14.9146, df = 9, p-value = 0.09331

This looks for at least one statistically significant difference between the years, although it doesn’t look like there is one. We can also try comparing each year against all the others:

> pairwise.prop.test(tbl)
 
	Pairwise comparisons using Pairwise comparison of proportions 
 
data:  tbl 
 
     2005 2006 2007 2008 2009 2010 2011 2012 2013
2006 1.00 -    -    -    -    -    -    -    -   
2007 1.00 1.00 -    -    -    -    -    -    -   
2008 1.00 1.00 1.00 -    -    -    -    -    -   
2009 1.00 1.00 1.00 1.00 -    -    -    -    -   
2010 1.00 1.00 1.00 1.00 1.00 -    -    -    -   
2011 1.00 1.00 0.33 1.00 1.00 0.21 -    -    -   
2012 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -    -   
2013 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -   
2014 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 
P value adjustment method: holm


2007/2011 and 2010/2011 show the biggest differences but they’re still not significant. Since we have so few data points in each bucket there would have to be a really massive difference for it to show up as significant.
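If you wanted to poke at that a bit more, one option (my own addition, using the correct-prediction counts implied by the table above: 13/31 in 2007 and 3/32 in 2011) is to compare the two most different years head to head with prop.test:

# compare 2007 vs 2011 directly; counts taken from the summary table above
prop.test(x = c(13, 3), n = c(31, 32))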

The data I used in this post is available on this gist if you want to look into it and come up with your own analysis.

Written by Mark Needham

July 5th, 2015 at 8:38 am

Posted in R


R: Calculating the difference between ordered factor variables


In my continued exploration of Wimbledon data I wanted to work out whether a player had done as well as their seeding suggested they should.

I therefore wanted to work out the difference between the round they reached and the round they were expected to reach. A ’round’ in the dataset is an ordered factor variable.

These are all the possible values:

rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

And if we want to turn a couple of strings into values of this factor we would do it like this:

round = factor("Finals", levels = rounds, ordered = TRUE)
expected = factor("Winner", levels = rounds, ordered = TRUE)  
 
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
 
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner

In this case the difference between the actual round and expected round should be -1 – the player was expected to win the tournament but lost in the final. We can calculate that difference by calling the unclass function on each variable:

 
> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter"  "Round of 128"   "Round of 64"    "Round of 32"    "Round of 16"    "Quarter-Finals"
[7] "Semi-Finals"    "Finals"         "Winner"

That still seems to have some remnants of the factor variable so to get rid of that we can cast it to a numeric value:

> as.numeric(unclass(round) - unclass(expected))
[1] -1
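Incidentally, as a shortcut of my own rather than something from the post, as.numeric applied directly to an ordered factor returns the underlying level codes, so the unclass step can be skipped entirely:

as.numeric(round) - as.numeric(expected)
# [1] -1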

And that’s it! We can now go and apply this calculation to all seeds to see how they got on.

Written by Mark Needham

July 2nd, 2015 at 10:55 pm

Posted in R


R: write.csv – unimplemented type ‘list’ in ‘EncodeElement’


Every now and then I want to serialise an R data frame to a CSV file so that I can easily load it up again if my R environment crashes, without having to recalculate everything. Recently I ran into the following error while doing so:

> write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol,  : 
  unimplemented type 'list' in 'EncodeElement'

If we take a closer look at the data frame in question it looks ok:

> foo
  col1 col2
1    1    a
2    2    b
3    3    c

However, one of the columns contains a list in each cell and we need to find out which one it is. I’ve found the quickest way is to run the typeof function over each column:

> typeof(foo$col1)
[1] "double"
 
> typeof(foo$col2)
[1] "list"
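If the data frame had lots of columns, checking them one at a time would get tedious. A quicker way (just a sketch) is to apply typeof across the whole data frame in one go:

sapply(foo, typeof)
# named character vector: col1 = "double", col2 = "list"
 
names(foo)[sapply(foo, is.list)]
# [1] "col2"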

So ‘col2’ is the culprit, which isn’t surprising if you consider the way I created ‘foo’:

library(dplyr)
foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c"))

If we do have a list that we want to add to the data frame we need to convert it to a vector first so we don’t run into this type of problem:

foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c") %>% unlist())

And now we can write to the CSV file:

write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
$ cat /tmp/foo.csv
"col1","col2"
1,"a"
2,"b"
3,"c"

And that’s it!

Written by Mark Needham

June 30th, 2015 at 10:26 pm

Posted in R


R: Speeding up the Wimbledon scraping job


Over the past few days I’ve written a few blog posts about a Wimbledon data set I’ve been building, and after running the scripts a few times I noticed that they were taking much longer to run than I expected.

To recap, I started out with the following function which takes in a URI and returns a data frame containing a row for each match:

library(rvest)
library(dplyr)
library(stringr)
 
scrape_matches1 = function(uri) {
  matches = data.frame()
 
  s = html(uri)
  rows = s %>% html_nodes("div#scoresResultsContent tr")
  i = 0
  for(row in rows) {  
    players = row %>% html_nodes("td.day-table-name a")
    seedings = row %>% html_nodes("td.day-table-seed")
    score = row %>% html_node("td.day-table-score a")
    flags = row %>% html_nodes("td.day-table-flag img")
 
    if(!is.null(score)) {
      player1 = players[1] %>% html_text() %>% str_trim()
      seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag1 = flags[1] %>% html_attr("alt")
 
      player2 = players[2] %>% html_text() %>% str_trim()
      seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag2 = flags[2] %>% html_attr("alt")
 
      matches = rbind(data.frame(winner = player1, 
                                 winner_seeding = seeding1, 
                                 winner_flag = flag1,
                                 loser = player2, 
                                 loser_seeding = seeding2,
                                 loser_flag = flag2,
                                 score = score %>% html_text() %>% str_trim(),
                                 round = round), matches)      
    } else {
      round = row %>% html_node("th") %>% html_text()
    }
  } 
  return(matches)
}

Let’s run it to get an idea of the data that it returns:

matches1 = scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results")
 
> matches1 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals
2   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
3 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
6  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
7  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals

As I mentioned, it’s quite slow but I thought I’d wrap it in system.time so I could see exactly how long it was taking:

> system.time(scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results"))
   user  system elapsed 
 25.570   0.111  31.416

About 30 seconds! The first thing I tried was downloading the file separately and running the function against the local file:

> system.time(scrape_matches1("data/raw/2014.html"))
   user  system elapsed 
 25.662   0.123  25.863

Hmmm, that’s only saved us about 5 seconds, so the bottleneck must be somewhere else. Still, there’s no point making an HTTP request every time we run the script, so we’ll stick with the local file version.
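If you want to see where that time actually goes, base R’s profiler is one way in. This is a sketch, not something the original post used:

# profile the slow version to see which calls dominate the run time
Rprof("scrape.prof")
invisible(scrape_matches1("data/raw/2014.html"))
Rprof(NULL)
head(summaryRprof("scrape.prof")$by.self)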

While browsing rvest’s vignette I noticed a function called html_table which I was curious about. I decided to try and replace some of my code with a call to that:

library(zoo)  # needed for na.locf below
 
matches2 = html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round)
 
> matches2 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding           loser loser_seeding            score          round
1  Novak Djokovic            (1)   Roger Federer           (4) 677 64 764 57 64         Finals
2  Novak Djokovic            (1) Grigor Dimitrov          (11)    64 36 762 767    Semi-Finals
3   Roger Federer            (4)    Milos Raonic           (8)         64 64 64    Semi-Finals
4  Novak Djokovic            (1)     Marin Cilic          (26)  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)     Andy Murray           (3)        61 764 62 Quarter-Finals
6   Roger Federer            (4)   Stan Wawrinka           (5)     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)    Nick Kyrgios          (WC)    674 62 64 764 Quarter-Finals

I had to do some slightly clever stuff to get the ’round’ column into shape using zoo’s na.locf function which I wrote about previously.

Unfortunately I couldn’t work out how to extract the flag with this version – that value is hidden in the ‘alt’ attribute of an img tag, and presumably html_table just grabs the text value of each cell. This version is much quicker though!

system.time(html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round))
 
   user  system elapsed 
  0.545   0.002   0.548

What I realised from writing this version is that I need to match all the columns with one call to html_nodes rather than getting the row and then each column in a loop.

I rewrote the function to do that:

scrape_matches3 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  matches3 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim())
  )
  return(matches3)
}

Let’s run and time that to check we’re getting back the right results in a timely manner:

> matches3 %>% sample_n(10)
                   winner winner_seeding winner_flag               loser loser_seeding loser_flag         score
70           David Ferrer            (7)         ESP Pablo Carreno Busta                      ESP  60 673 61 61
128        Alex Kuznetsov           (26)         USA         Tim Smyczek           (3)        USA   46 63 63 63
220   Rogerio Dutra Silva                        BRA   Kristijan Mesaros                      CRO         62 63
83         Kevin Anderson           (20)         RSA        Aljaz Bedene          (LL)        GBR      63 75 62
73          Kei Nishikori           (10)         JPN   Kenny De Schepper                      FRA     64 765 75
56  Roberto Bautista Agut           (27)         ESP         Jan Hernych           (Q)        CZE   75 46 62 62
138            Ante Pavic                        CRO        Marc Gicquel          (29)        FRA  46 63 765 64
174             Tim Puetz                        GER     Ruben Bemelmans                      BEL         64 62
103        Lleyton Hewitt                        AUS   Michal Przysiezny                      POL 62 6714 61 64
35          Roger Federer            (4)         SUI       Gilles Muller           (Q)        LUX      63 75 63
 
> system.time(scrape_matches3("data/raw/2014.html"))
   user  system elapsed 
  0.815   0.006   0.827

It’s still quick – a bit slower than html_table, but we can deal with that. As you can see, I also had to add some logic to separate the values for the winners and losers – the players, seedings and flags come back as one big list. The odd entries represent the winner; the even entries the loser.

Annoyingly we’ve now lost the ‘round’ column because that appears as a table heading, so we can’t extract it the same way. I ended up cheating a bit to get it to work by working out how many matches each round should contain and generating a vector with that number of entries:

raw_rounds = s %>% html_nodes("th") %>% html_text()
 
> raw_rounds
 [1] "Finals"               "Semi-Finals"          "Quarter-Finals"       "Round of 16"          "Round of 32"         
 [6] "Round of 64"          "Round of 128"         "3rd Round Qualifying" "2nd Round Qualifying" "1st Round Qualifying"
 
rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
            sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
> rounds[1:10]
 [1] "Finals"         "Semi-Finals"    "Semi-Finals"    "Quarter-Finals" "Quarter-Finals" "Quarter-Finals" "Quarter-Finals"
 [8] "Round of 16"    "Round of 16"    "Round of 16"

Let’s put that code into the function and see if we end up with the same resulting data frame:

scrape_matches4 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  raw_rounds = s %>% html_nodes("th") %>% html_text()
  rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
              sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
  matches4 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim()),
    round          = rounds
  )
  return(matches4)
}
 
matches4 = scrape_matches4("data/raw/2014.html")
 
> matches4 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals
2  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
3   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
6   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals

We shouldn’t have added much to the time but let’s check:

> system.time(scrape_matches4("data/raw/2014.html"))
   user  system elapsed 
  0.816   0.004   0.824

Sweet. We’ve saved ourselves 29 seconds per page, as long as the number of rounds stays constant over the years. It has for the 10 years I’ve looked at, but I expect that if you go back further the draw sizes will have been different and our script would break.
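One cheap safety net (my own suggestion, not part of the original script) would be to assert inside scrape_matches4 that the generated rounds vector actually lines up with the number of matches scraped, so that a change in draw size fails loudly instead of silently mis-labelling rounds:

# hypothetical guard, added just before building the data frame
stopifnot(length(rounds) == length(scores))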

For now though this will do!

Written by Mark Needham

June 29th, 2015 at 5:36 am

Posted in R


R: dplyr – Update rows with earlier/previous rows values


Recently I had a data frame containing a column with mostly empty values:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA))
  col1 col2
1    1    a
2    2 <NA>
3    3 <NA>
4    4    b
5    5 <NA>

I wanted to fill in the NA values with the last non-NA value from that column, so I wanted the data frame to look like this:

  col1 col2
1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

I spent ages searching around before I came across the na.locf function in the zoo library which does the job:

library(zoo)
library(dplyr)
 
> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA)) %>% 
    do(na.locf(.))
  col1 col2
1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

This will fill in the missing values for every column, so if we had a third column with missing values it would populate those too:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    do(na.locf(.))
 
  col1 col2 col3
1    1    a    A
2    2    a    A
3    3    a    B
4    4    b    B
5    5    b    B

If we only want to populate ‘col2’ and leave ‘col3’ as it is we can apply the function specifically to that column:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    mutate(col2 = na.locf(col2))
  col1 col2 col3
1    1    a    A
2    2    a <NA>
3    3    a    B
4    4    b <NA>
5    5    b <NA>

It’s quite a neat function and certainly comes in handy when cleaning up data sets, which don’t tend to be as uniform as you’d hope!
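As an aside, newer versions of tidyr (released after this post was written) have a fill function which offers the same last-observation-carried-forward behaviour on a per-column basis:

library(tidyr)
 
data.frame(col1 = c(1,2,3,4,5),
           col2 = c("a", NA, NA, "b", NA),
           col3 = c("A", NA, "B", NA, NA)) %>%
  fill(col2)   # col3 is left untouched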

Written by Mark Needham

June 28th, 2015 at 10:30 pm

Posted in R


R: Command line – Error in GenericTranslator$new : could not find function “loadMethod”


I’ve been reading Text Processing with Ruby over the last week or so and one of the ideas the author describes is setting up your scripts so you can run them directly from the command line.

I wanted to do this with my Wimbledon R script and wrote the following script which uses the ‘Rscript’ executable so that R doesn’t launch in interactive mode:

wimbledon

#!/usr/bin/env Rscript
 
library(rvest)
library(dplyr)
library(stringr)
library(readr)
 
# stuff

Then I tried to run it:

$ time ./wimbledon
 
...
 
Error in GenericTranslator$new : could not find function "loadMethod"
Calls: write.csv ... html_extract_n -> <Anonymous> -> Map -> mapply -> <Anonymous> -> $
Execution halted
 
real	0m1.431s
user	0m1.127s
sys	0m0.078s

As the error suggests, the script fails when trying to write to a CSV file – it turns out that Rscript, unlike an interactive R session, doesn’t attach the ‘methods’ package by default, and something in that call chain needs it. Adding the following line to our script is all we need:

library(methods)

So we end up with this:

#!/usr/bin/env Rscript
 
library(methods)
library(rvest)
library(dplyr)
library(stringr)
library(readr)

And when we run that all is well!

Written by Mark Needham

June 27th, 2015 at 10:47 pm

Posted in R


R: dplyr – squashing multiple rows per group into one


I spent a bit of the day working on my Wimbledon data set, and the next thing I explored was all the people who have beaten Andy Murray in the tournament.

The following dplyr query gives us the names of those people and the year the match took place:

library(dplyr)
 
> main_matches %>% filter(loser == "Andy Murray") %>% select(winner, year)
 
            winner year
1  Grigor Dimitrov 2014
2    Roger Federer 2012
3     Rafael Nadal 2011
4     Rafael Nadal 2010
5     Andy Roddick 2009
6     Rafael Nadal 2008
7 Marcos Baghdatis 2006
8 David Nalbandian 2005

As you can see, Rafael Nadal shows up multiple times. I wanted to get one row per player and list all the years in a single column.

This was my initial attempt:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year))
Source: local data frame [6 x 2]
 
            winner years
1     Andy Roddick  2009
2 David Nalbandian  2005
3  Grigor Dimitrov  2014
4 Marcos Baghdatis  2006
5     Rafael Nadal  2011
6    Roger Federer  2012

Unfortunately it just gives you the last matching row per group, which isn’t quite what we want. I realised my mistake while trying to pass a vector into paste and noticing that a vector came back when I’d expected a string:

> paste(c(2008,2009,2010))
[1] "2008" "2009" "2010"

The missing argument was ‘collapse’ – something I’d come across when using plyr last year:

> paste(c(2008,2009,2010), collapse=", ")
[1] "2008, 2009, 2010"

Now, if we apply that to our original function:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year, collapse=", "))
Source: local data frame [6 x 2]
 
            winner            years
1     Andy Roddick             2009
2 David Nalbandian             2005
3  Grigor Dimitrov             2014
4 Marcos Baghdatis             2006
5     Rafael Nadal 2011, 2010, 2008
6    Roger Federer             2012

That’s exactly what we want. Let’s tidy that up a bit:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% arrange(year) %>%
     summarise(years  = paste(year, collapse =","), times = length(year))  %>%
     arrange(desc(times), years)
Source: local data frame [6 x 3]
 
            winner          years times
1     Rafael Nadal 2008,2010,2011     3
2 David Nalbandian           2005     1
3 Marcos Baghdatis           2006     1
4     Andy Roddick           2009     1
5    Roger Federer           2012     1
6  Grigor Dimitrov           2014     1
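A small aside of my own: base R’s toString is shorthand for paste with collapse = ", ", so the summarise step could also be written like this (sort keeps the years in ascending order):

main_matches %>% filter(loser == "Andy Murray") %>%
  group_by(winner) %>%
  summarise(years = toString(sort(year)), times = n()) %>%
  arrange(desc(times), years)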

Written by Mark Needham

June 27th, 2015 at 10:36 pm

Posted in R
