Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Bootstrap confidence intervals


I recently came across an interesting post on Julia Evans’ blog showing how to generate a bigger set of data points by sampling the small set of data points that we actually have using bootstrapping. Julia’s examples are all in Python so I thought it’d be a fun exercise to translate them into R.

We’re doing the bootstrapping to simulate the number of no-shows for a flight so we can work out how many seats we can overbook the plane by.

We start out with a small sample of no-shows and work off the assumption that it’s ok to kick someone off a flight 5% of the time. Let’s work out how many people that’d be for our initial sample:

> data = c(0, 1, 3, 2, 8, 2, 3, 4)
> quantile(data, 0.05)
  5% 
0.35

0.35 people! That’s not a particularly useful result, so we’re going to resample the initial data set 10,000 times, taking the 5%ile each time, and see if we come up with something better.

We’re going to use the sample function with replacement to generate our resamples:

> sample(data, replace = TRUE)
[1] 0 3 2 8 8 0 8 0
> sample(data, replace = TRUE)
[1] 2 2 4 3 4 4 2 2

Now let’s write a function to do that multiple times:

library(ggplot2)
 
bootstrap_5th_percentile = function(data, n_bootstraps) {
  return(sapply(1:n_bootstraps, 
                function(iteration) quantile(sample(data, replace = TRUE), 0.05)))
}
 
values = bootstrap_5th_percentile(data, 10000)
 
ggplot(aes(x = value), data = data.frame(value = values)) + geom_histogram(binwidth=0.25)


So this visualisation is telling us that we can oversell by 0-2 people but we don’t know an exact number.
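The title of this post mentions confidence intervals, so here is a minimal sketch of how we might get one out of the bootstrapped values by taking their 2.5% and 97.5% quantiles (the percentile method). The fixed seed and the choice of a 95% interval are assumptions added for illustration rather than anything taken from Julia’s post:

```r
set.seed(42)  # assumption: a fixed seed, only so the sketch is reproducible

data = c(0, 1, 3, 2, 8, 2, 3, 4)

# resample with replacement and take the 5th percentile of each resample
bootstrap_5th_percentile = function(data, n_bootstraps) {
  sapply(1:n_bootstraps,
         function(iteration) quantile(sample(data, replace = TRUE), 0.05))
}

values = bootstrap_5th_percentile(data, 10000)

# percentile-method interval: the middle 95% of the bootstrapped estimates
ci = quantile(values, c(0.025, 0.975))
print(ci)
```

The two numbers printed give a rough range for the 5%ile, which says the same thing as the histogram but in a form that is easier to quote.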

Let’s try the same exercise but with a bigger initial data set of 1,000 values rather than just 8. First we’ll generate a distribution (with a mean of 5 and standard deviation of 2) and visualise it:

library(dplyr)
 
df = data.frame(value = rnorm(1000,5, 2))
df = df %>% filter(value >= 0) %>% mutate(value = as.integer(round(value)))
ggplot(aes(x = value), data = df) + geom_histogram(binwidth=1)


Our distribution seems to have a lot more values around 4 & 5 whereas the Python version has a flatter distribution – I’m not sure why that is, so if you have any ideas let me know. In any case, let’s check the 5%ile for this data set:

> quantile(df$value, 0.05)
5% 
 2

Cool! Now at least we have an integer value rather than the 0.35 we got earlier. Finally let’s do some bootstrapping over our new distribution and see what 5%ile we come up with:

resampled = bootstrap_5th_percentile(df$value, 10000)
byValue = data.frame(value = resampled) %>% count(value)
 
> byValue
Source: local data frame [3 x 2]
 
  value    n
1   1.0    3
2   1.7    2
3   2.0 9995
 
ggplot(aes(x = value, y = n), data = byValue) + geom_bar(stat = "identity")


‘2’ is by far the most popular 5%ile here although it seems weighted more towards that value than with Julia’s Python version, which I imagine is because we seem to have sampled from a slightly different distribution.

Written by Mark Needham

July 19th, 2015 at 7:44 pm

Posted in R

R: Blog post frequency anomaly detection

I came across Twitter’s anomaly detection library last year but haven’t yet had a reason to take it for a test run. Having got my blog post frequency data into shape, I thought it’d be fun to run it through the algorithm.

I wanted to see if it would detect any periods of time when the number of posts differed significantly – I don’t really have an action I’m going to take based on the results, it’s curiosity more than anything else!

First we need to get the library installed. It’s not on CRAN so we need to use devtools to install it from the GitHub repository:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

The expected data format is two columns – one containing a time stamp and the other a count. e.g. using the ‘raw_data’ data frame that is in scope when you add the library:

> library(dplyr)
> raw_data %>% head()
            timestamp   count
1 1980-09-25 14:01:00 182.478
2 1980-09-25 14:02:00 176.231
3 1980-09-25 14:03:00 183.917
4 1980-09-25 14:04:00 177.798
5 1980-09-25 14:05:00 165.469
6 1980-09-25 14:06:00 181.878
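To make that shape concrete, here is a small hand-built data frame in the same two-column layout. The dates and counts are made up for illustration, and I’m assuming a POSIXct timestamp column is acceptable since that is what ‘raw_data’ appears to contain:

```r
# hypothetical weekly counts; these values are invented, not real blog data
weekly_posts = data.frame(
  timestamp = as.POSIXct(c("2015-06-01", "2015-06-08", "2015-06-15", "2015-06-22"),
                         tz = "UTC"),
  count = c(3, 0, 5, 2)
)

# first column is the time stamp, second column is the count
str(weekly_posts)
```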

In our case the timestamps will be the start date of a week and the count the number of posts in that week. But first let’s get some practice calling the anomaly function using the canned data:

res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot


From this visualisation we learn that we should expect both high and low outliers to be identified. Let’s give it a try with the blog post publication data.

We need to get the data into shape so we’ll start by getting a count of the number of blog posts by (week, year) pair:

> library(lubridate)
> df %>% sample_n(5)
                                                           title                date
1425                            Coding: Copy/Paste then refactor 2009-10-31 07:54:31
783  Neo4j 2.0.0-M06 -> 2.0.0-RC1: Working with path expressions 2013-11-23 10:30:41
960                                        R: Removing for loops 2015-04-18 23:53:20
966   R: dplyr - Error in (list: invalid subscript type 'double' 2015-04-27 22:34:43
343                     Parsing XML from the unix terminal/shell 2011-09-03 23:42:11
 
> byWeek = df %>% 
    mutate(year = year(date), week = week(date)) %>% 
    group_by(week, year) %>% summarise(n = n()) %>% 
    ungroup() %>% arrange(desc(n))
 
> byWeek %>% sample_n(5)
Source: local data frame [5 x 3]
 
  week year n
1   44 2009 6
2   37 2011 4
3   39 2012 3
4    7 2013 4
5    6 2010 6

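Great. The next step is to translate this data frame into one containing a date representing the start of that week and the number of posts, using the ‘calculate_start_of_week’ function from my post about finding the date for a given week/year: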

> data = byWeek %>% 
    mutate(start_of_week = calculate_start_of_week(week, year)) %>%
    filter(start_of_week > ymd("2008-07-01")) %>%
    select(start_of_week, n)
 
> data %>% sample_n(5)
Source: local data frame [5 x 2]
 
  start_of_week n
1    2010-09-10 4
2    2013-04-09 4
3    2010-04-30 6
4    2012-03-11 3
5    2014-12-03 3

We’re now ready to plug it into the anomaly detection function:

res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both', 
                         plot=TRUE)
res$plot


Interestingly I don’t seem to have any low end anomalies – there were a couple of really high frequency weeks when I first started writing and I think one of the other weeks contains a New Year’s Eve when I was particularly bored!

If we group by month instead only the very first month stands out as an outlier:

data = byMonth %>% 
  mutate(start_of_month = ymd(paste(year, month, 1, sep="-"))) %>%
  filter(start_of_month > ymd("2008-07-01")) %>%
  select(start_of_month, n)
res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both',       
                         #longterm = TRUE,
                         plot=TRUE)
res$plot


I’m not sure what else to do as far as anomaly detection goes but if you have any ideas please let me know!

Written by Mark Needham

July 17th, 2015 at 11:34 pm

Posted in R

R: I write more in the last week of the month, or do I?

I’ve been writing on this blog for almost 7 years and have always believed that I write more frequently towards the end of a month. Now that I’ve got all the data I thought it’d be interesting to test that belief.

I started with a data frame containing each post and its publication date and added an extra column which works out how many weeks from the end of the month that post was written:

> df %>% sample_n(5)
                                                               title                date
946  Python: Equivalent to flatMap for flattening an array of arrays 2015-03-23 00:45:00
175                                         Ruby: Hash default value 2010-10-16 14:02:37
375               Java/Scala: Runtime.exec hanging/in 'pipe_w' state 2011-11-20 20:20:08
1319                            Coding Dojo #18: Groovy Bowling Game 2009-06-26 08:15:23
381                   Continuous Delivery: Removing manual scenarios 2011-12-05 23:13:34
 
calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
tidy_df  = df %>% 
  mutate(year = year(date), 
         week = week(date),
         week_in_month = ceiling(day(date) / 7),
         max_week = max(week_in_month), 
         weeks_from_end = max_week - week_in_month,
         start_of_week = calculate_start_of_week(week, year))
 
> tidy_df %>% select(date, weeks_from_end, start_of_week) %>% sample_n(5)
 
                    date weeks_from_end start_of_week
1023 2008-08-08 21:16:02              3    2008-08-05
800  2014-01-31 06:51:06              0    2014-01-29
859  2014-08-14 10:24:52              3    2014-08-13
107  2010-07-10 22:49:52              3    2010-07-09
386  2011-12-20 23:57:51              2    2011-12-17

Next I want to get a count of how many posts were published in a given week. The following code does that transformation for us:

weeks_from_end_counts =  tidy_df %>%
  group_by(start_of_week, weeks_from_end) %>%
  summarise(count = n())
 
> weeks_from_end_counts
Source: local data frame [540 x 4]
Groups: start_of_week, weeks_from_end
 
   start_of_week weeks_from_end year count
1     2006-08-27              0 2006     1
2     2006-08-27              4 2006     3
3     2006-09-03              4 2006     1
4     2008-02-05              3 2008     2
5     2008-02-12              3 2008     2
6     2008-07-15              2 2008     1
7     2008-07-22              1 2008     1
8     2008-08-05              3 2008     8
9     2008-08-12              2 2008     5
10    2008-08-12              3 2008     9
..           ...            ...  ...   ...

We group by both ‘start_of_week’ and ‘weeks_from_end’ because we could have posts published in the same week but different month and we want to capture that difference. Now we can run a correlation on the data frame to see if there’s any relationship between ‘count’ and ‘weeks_from_end’:

> cor(weeks_from_end_counts %>% ungroup() %>% select(weeks_from_end, count))
               weeks_from_end       count
weeks_from_end     1.00000000 -0.08253569
count             -0.08253569  1.00000000

This suggests there’s a slight negative correlation between the two variables i.e. ‘count’ decreases as ‘weeks_from_end’ increases. Let’s plug the data frame into a linear model to see how good ‘weeks_from_end’ is as a predictor of ‘count’:

> fit = lm(count ~ weeks_from_end, weeks_from_end_counts)
 
> summary(fit)
 
Call:
lm(formula = count ~ weeks_from_end, data = weeks_from_end_counts)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-2.0000 -1.5758 -0.5758  1.1060  8.0000 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.00000    0.13764  21.795   <2e-16 ***
weeks_from_end -0.10605    0.05521  -1.921   0.0553 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.698 on 538 degrees of freedom
Multiple R-squared:  0.006812,	Adjusted R-squared:  0.004966 
F-statistic:  3.69 on 1 and 538 DF,  p-value: 0.05527

We see a similar result here. Each extra week away from the end of the month is associated with about 0.1 fewer posts, with a p value of 0.0553, so it’s on the borderline of being significant at the 5% level.

We also have a very low ‘R squared’ value which suggests the ‘weeks_from_end’ isn’t explaining much of the variation in the data which makes sense given that we didn’t see much of a correlation.

If we charged on and wanted to predict the number of posts likely to be published in a given week we could run the predict function like this:

> predict(fit, data.frame(weeks_from_end=c(1,2,3,4,5)))
       1        2        3        4        5 
2.893952 2.787905 2.681859 2.575812 2.469766

Obviously it’s a bit flawed since we could plug in any numeric value we want, even ones that don’t make any sense, and it’d still come back with a prediction:

> predict(fit, data.frame(weeks_from_end=c(30 ,-10)))
        1         2 
-0.181394  4.060462

I think we’d probably protect against that with a function wrapping our call to predict that doesn’t allow ‘weeks_from_end’ to be greater than 5 or less than 0.
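A minimal sketch of such a wrapper might look like the following. The ‘safe_predict’ name, its arguments and the toy data frame are all hypothetical, made up so that the example runs on its own rather than against the real blog data:

```r
# hypothetical toy data standing in for weeks_from_end_counts
toy = data.frame(weeks_from_end = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
                 count          = c(4, 3, 3, 2, 2, 3, 3, 2, 3, 1))
fit = lm(count ~ weeks_from_end, toy)

# refuse to predict for values outside the range we consider meaningful
safe_predict = function(model, weeks_from_end, min_weeks = 0, max_weeks = 5) {
  out_of_range = is.na(weeks_from_end) |
    weeks_from_end < min_weeks | weeks_from_end > max_weeks
  if (any(out_of_range)) {
    stop("weeks_from_end must be between ", min_weeks, " and ", max_weeks)
  }
  predict(model, data.frame(weeks_from_end = weeks_from_end))
}

safe_predict(fit, c(1, 2, 3))
# safe_predict(fit, 30) stops with an error instead of returning a number
```

Stopping with an error is only one option; clamping the value into range or returning NA would work too, depending on how the caller wants to handle bad input.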

So far it looks like my belief is incorrect! I’m a bit dubious about my calculation of ‘weeks_from_end’ though – it’s not completely capturing what I want since in some months the last week only contains a couple of days.

Next I’m going to explore whether it makes any difference if I calculate that value by counting the number of days back from the last day of the month rather than using week number.
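As a rough sketch of what that days-based measure could look like, the function below works out the last day of each date’s month in base R and subtracts. The ‘days_from_month_end’ name is made up for illustration and this isn’t necessarily how I’ll end up doing it:

```r
# number of days between each date and the last day of its month
days_from_month_end = function(dates) {
  dates = as.Date(dates)
  first_of_month = as.Date(format(dates, "%Y-%m-01"))
  # the 1st plus 31 days always lands in the following month,
  # so truncating that date gives the 1st of the next month
  first_of_next_month = as.Date(format(first_of_month + 31, "%Y-%m-01"))
  last_of_month = first_of_next_month - 1
  as.integer(last_of_month - dates)
}

days_from_month_end(c("2014-01-31", "2011-12-20", "2012-02-10"))
# [1]  0 11 19
```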

Written by Mark Needham

July 12th, 2015 at 9:53 am

Posted in R

R: Filling in missing dates with 0s

I wanted to plot a chart showing the number of blog posts published by month and started with the following code which makes use of zoo’s ‘as.yearmon’ function to add the appropriate column and grouping:

> library(zoo)
> library(dplyr)
> df %>% sample_n(5)
                                                  title                date
888        R: Converting a named vector to a data frame 2014-10-31 23:47:26
144  Rails: Populating a dropdown list using 'form_for' 2010-08-31 01:22:14
615                    Onboarding: Sketch the landscape 2013-02-15 07:36:06
28                        Javascript: The 'new' keyword 2010-03-06 15:16:02
1290                Coding Dojo #16: Reading SUnit code 2009-05-28 23:23:19
 
> posts_by_date  = df %>% mutate(year_mon = as.Date(as.yearmon(date))) %>% count(year_mon)
> posts_by_date %>% head(5)
 
    year_mon  n
1 2006-08-01  1
2 2006-09-01  4
3 2008-02-01  4
4 2008-07-01  2
5 2008-08-01 38

I then plugged the new data frame into ggplot to get the chart:

> ggplot(aes(x = year_mon, y = n), data = posts_by_date) + geom_line()


The problem with this chart is that the line is drawn straight from one data point to the next, so it shows 4 posts per month for all the dates between September 2006 and February 2008 even though I didn’t write anything! It’s doing the same thing between February 2008 and July 2008 too.

We can fix that by filling in the gaps with 0s.

First we’ll create a vector containing every month in the data range contained by our data frame:

> all_dates = seq(as.Date(as.yearmon(min(df$date))), as.Date(as.yearmon(max(df$date))), by="month")
 
> all_dates
  [1] "2006-08-01" "2006-09-01" "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-02-01" "2007-03-01"
  [9] "2007-04-01" "2007-05-01" "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01" "2007-10-01" "2007-11-01"
 [17] "2007-12-01" "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01" "2008-06-01" "2008-07-01"
 [25] "2008-08-01" "2008-09-01" "2008-10-01" "2008-11-01" "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01"
 [33] "2009-04-01" "2009-05-01" "2009-06-01" "2009-07-01" "2009-08-01" "2009-09-01" "2009-10-01" "2009-11-01"
 [41] "2009-12-01" "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01" "2010-06-01" "2010-07-01"
 [49] "2010-08-01" "2010-09-01" "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
 [57] "2011-04-01" "2011-05-01" "2011-06-01" "2011-07-01" "2011-08-01" "2011-09-01" "2011-10-01" "2011-11-01"
 [65] "2011-12-01" "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01" "2012-06-01" "2012-07-01"
 [73] "2012-08-01" "2012-09-01" "2012-10-01" "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
 [81] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01" "2013-11-01"
 [89] "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01"
 [97] "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01"
[105] "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01"

Now we need to create a data frame containing those dates and merge it with the original:

posts_by_date_clean = merge(data.frame(date = all_dates),
                            posts_by_date,
                            by.x='date',
                            by.y='year_mon',
                            all.x=T,
                            all.y=T)
 
> posts_by_date_clean %>% head()
        date  n
1 2006-08-01  1
2 2006-09-01  4
3 2006-10-01 NA
4 2006-11-01 NA
5 2006-12-01 NA
6 2007-01-01 NA

We’ve still got some ‘NA’ values in there which won’t plot so well. Let’s set those to 0 and then try and plot our chart again:

> posts_by_date_clean$n[is.na(posts_by_date_clean$n)] = 0
> ggplot(aes(x = date, y = n), data = posts_by_date_clean) + geom_line()

Much better!
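Here are the steps above pulled together into one self-contained sketch using only base R. It uses just the first three rows of ‘posts_by_date’ shown earlier so that it runs without the full data set, and the ‘filled’ name is made up for the example:

```r
# the first three months from the posts_by_date output above
posts_by_date = data.frame(
  year_mon = as.Date(c("2006-08-01", "2006-09-01", "2008-02-01")),
  n        = c(1, 4, 4)
)

# every month between the first and last month we have data for
all_dates = seq(min(posts_by_date$year_mon), max(posts_by_date$year_mon), by = "month")

# left join onto the complete list of months, then turn the gaps into 0s
filled = merge(data.frame(date = all_dates), posts_by_date,
               by.x = "date", by.y = "year_mon", all.x = TRUE)
filled$n[is.na(filled$n)] = 0

head(filled)
```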

Written by Mark Needham

July 12th, 2015 at 8:30 am

Posted in R

R: Date for given week/year

As I mentioned in my last couple of blog posts I’ve been looking at the data behind this blog and I wanted to plot a chart showing the number of posts per week since the blog started.

I started out with a data frame with posts and publication date:

> library(dplyr)
> library(lubridate)
> df = read.csv("posts.csv")
> df$date = ymd_hms(df$date)
 
> df %>% sample_n(10)
                                                                                title                date
538                                    Nygard Big Data Model: The Investigation Stage 2012-10-10 00:00:36
341                                                            The read-only database 2011-08-29 23:32:26
1112                                  CSS in Internet Explorer - Some lessons learned 2008-10-31 15:24:51
143                                                       Coding: Mutating parameters 2010-08-26 07:47:23
433  Scala: Counting number of inversions (via merge sort) for an unsorted collection 2012-03-20 06:53:18
618                                    neo4j/cypher: SQL style GROUP BY functionality 2013-02-17 21:05:27
1111                                 Testing Hibernate mappings: Setting up test data 2008-10-30 13:24:14
462                                       neo4j: What question do you want to answer? 2012-05-05 13:20:41
1399                                       Book Club: Design Sense (Michael Feathers) 2009-09-29 14:42:29
494                                    Bash Shell: Reusing parts of previous commands 2012-07-05 23:42:35

The first step was to add a couple of columns representing the week and year for the publication date. The ‘lubridate’ library came in handy here:

byWeek = df %>% 
  mutate(year = year(date), week = week(date)) %>% 
  group_by(week, year) %>% summarise(n = n()) %>% 
  ungroup() %>% arrange(desc(n))
 
> byWeek
Source: local data frame [352 x 3]
 
   week year  n
1    33 2008 14
2    35 2008 11
3    53 2012 11
4     9 2013 10
5    12 2013  9
6    21 2009  9
7    22 2009  9
8    38 2013  9
9    40 2008  9
10   48 2012  9
..  ...  ... ..

The next step is to calculate the start date of each of those weeks so that we can plot the counts on a continuous date scale. I spent a while searching how to do this before realising that the ‘week’ function I used before can set the week for a given date as well. Let’s get to work:

calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
> calculate_start_of_week(c(1,2,3), c(2015,2014,2013))
[1] "2015-01-01 UTC" "2014-01-08 UTC" "2013-01-15 UTC"
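For comparison, here is a base R sketch of the same calculation that doesn’t need lubridate. It relies on lubridate’s ‘week’ counting complete seven day periods since January 1st, so week N starts (N - 1) * 7 days into the year. The ‘start_of_week_base’ name is made up for the example:

```r
# week 1 starts on January 1st, week 2 seven days later, and so on
start_of_week_base = function(week, year) {
  as.Date(paste(year, 1, 1, sep = "-")) + (week - 1) * 7
}

start_of_week_base(c(1, 2, 3), c(2015, 2014, 2013))
# [1] "2015-01-01" "2014-01-08" "2013-01-15"
```

This returns plain Date values rather than date-times, which is why there is no ‘UTC’ in the output.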

And now let’s transform our data frame and plot the counts:

ggplot(aes(x=start_of_week, y=n, group=1), 
       data = byWeek %>% mutate(start_of_week = calculate_start_of_week(week, year))) + 
  geom_line()


It’s a bit erratic as you can see. Some of this can be explained by the fact that I do in fact post in an erratic way while some of it is explained by the fact that some weeks only have a few days if they start on the 29th onwards.

Written by Mark Needham

July 10th, 2015 at 10:01 pm

Posted in R

R: dplyr – Error: cannot modify grouping variable

I’ve been doing some exploration of the posts made on this blog and I thought I’d start with answering a simple question – on which dates did I write the most posts?

I started with a data frame containing each post and the date it was published:

> library(dplyr)
> df %>% sample_n(5)
                                                title                date
1148 Taiichi Ohno's Workplace Management: Book Review 2008-12-08 14:14:48
158     Rails: Faking a delete method with 'form_for' 2010-09-20 18:52:15
331           Retrospectives: The 4 L's Retrospective 2011-07-25 21:00:30
1035       msbuild - Use OutputPath instead of OutDir 2008-08-14 18:54:03
1181                The danger of commenting out code 2009-01-17 06:02:33

To find the most popular days for blog posts we can write the following aggregation function:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% arrange(desc(n))
 
Source: local data frame [1,140 x 2]
 
          day n
1  2012-12-31 6
2  2014-05-31 6
3  2008-08-08 5
4  2013-01-27 5
5  2009-08-24 4
6  2012-06-24 4
7  2012-09-30 4
8  2012-10-27 4
9  2012-11-24 4
10 2013-02-28 4

So we can see a couple of days with 6 posts, a couple with 5 posts, a few more with 4 posts and then presumably loads of days with 1 post.

I thought it’d be cool if we could plot a histogram with the number of posts on the x axis and, on the y axis, the number of days on which that many posts were published e.g. for an x value of 6 (posts) we’d have a y value of 2 (occurrences).

My initial attempt was this:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% count(n)
Error: cannot modify grouping variable

Unfortunately that isn’t allowed. I tried ungrouping and then counting again:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% ungroup() %>% count(n)
Error: cannot modify grouping variable

Still no luck. I did a bit of googling around and came across a post which suggested using a combination of group_by + mutate or group_by + summarize.

I tried the mutate approach first:

> df %>% mutate(day = as.Date(date)) %>% 
+     group_by(day) %>% mutate(n = n()) %>% ungroup() %>% sample_n(5)
Source: local data frame [5 x 4]
 
                                    title                date        day n
1 QCon London 2009: DDD & BDD - Dan North 2009-03-13 15:28:04 2009-03-13 2
2        Onboarding: Sketch the landscape 2013-02-15 07:36:06 2013-02-15 1
3                           Ego Depletion 2013-06-04 23:16:29 2013-06-04 1
4                 Clean Code: Book Review 2008-09-15 09:52:33 2008-09-15 1
5            Dreyfus Model: More thoughts 2009-08-10 10:36:51 2009-08-10 1

That keeps around the ‘title’ which is a bit annoying. We can get rid of it using a distinct on ‘day’ if we want and if we also implement the second part of the function we end up with the following:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% mutate(n = n()) %>% distinct(day) %>% ungroup() %>% 
    group_by(n) %>%
    mutate(c = n()) %>%
    distinct(n)  
 
Source: local data frame [6 x 5]
Groups: n
 
                                                title                date        day n   c
1       Functional C#: Writing a 'partition' function 2010-02-01 23:34:02 2010-02-01 1 852
2                            Willed vs Forced designs 2010-02-08 22:48:05 2010-02-08 2 235
3                            TDD: Testing collections 2010-07-28 06:05:25 2010-07-28 3  41
4  Creating a Samba share between Ubuntu and Mac OS X 2012-06-24 00:40:35 2012-06-24 4   8
5            Gamification and Software: Some thoughts 2012-12-31 10:57:19 2012-12-31 6   2
6 Python/numpy: Selecting specific column in 2D array 2013-01-27 02:10:10 2013-01-27 5   2

Annoyingly we’ve still got the ‘title’, ‘date’ and ‘day’ columns hanging around which we’d need to get rid of with a call to ‘select’. The code also feels quite icky, especially the use of distinct in a couple of places.

In fact we can simplify the code if we use summarize instead of mutate:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
 
 
Source: local data frame [6 x 2]
 
  n   c
1 1 852
2 2 235
3 3  41
4 4   8
5 5   2
6 6   2

We’ve also got rid of the extra columns into the bargain, which is great! And now we can plot our histogram:

> library(ggplot2)
> post_frequencies = df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
> ggplot(aes(x = n, y = c), data = post_frequencies) + geom_bar(stat = "identity")


In this case we don’t actually need to do the second grouping to create the bar chart since ggplot will do it for us if we feed it the following data:

> ggplot(aes(x = n), 
         data = df %>% mutate(day = as.Date(date)) %>% group_by(day) %>% summarize(n = n()) %>% ungroup()) +
    geom_bar(binwidth = 1) +
    scale_x_continuous(limits=c(1, 6))

Still, it’s good to know how!
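For comparison, the same ‘count of counts’ can be sketched in base R by nesting two calls to ‘table’. The dates here are a hypothetical handful picked so that the example runs on its own:

```r
# hypothetical publication dates: 3 posts on one day, 2 on another, 1 on a third
dates = as.Date(c("2012-12-31", "2012-12-31", "2014-05-31",
                  "2008-08-08", "2008-08-08", "2008-08-08"))

posts_per_day  = table(dates)          # how many posts on each day
days_per_count = table(posts_per_day)  # how many days had 1, 2, 3... posts

days_per_count
```

The inner ‘table’ plays the role of the first group_by + summarize and the outer one the second.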

Written by Mark Needham

July 9th, 2015 at 5:55 am

Posted in R

R: Wimbledon – How do the seeds get on?

Continuing on with the Wimbledon data set I’ve been playing with I wanted to do some exploration on how the seeded players have fared over the years.

Taking the last 10 years’ worth of data, there have always been 32 seeds, and with the following function we can feed in a seeding and get back the round that seed would be expected to reach:

expected_round = function(seeding) {  
  if(seeding == 1) {
    return("Winner")
  } else if(seeding == 2) {
    return("Finals") 
  } else if(seeding <= 4) {
    return("Semi-Finals")
  } else if(seeding <= 8) {
    return("Quarter-Finals")
  } else if(seeding <= 16) {
    return("Round of 16")
  } else {
    return("Round of 32")
  }
}
 
> expected_round(1)
[1] "Winner"
 
> expected_round(4)
[1] "Semi-Finals"
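The function above only handles one seeding at a time because of the ‘if’ statements. A vectorised variant could be sketched with ‘cut’, as below. The ‘expected_round_vec’ name is made up, and unlike the original it returns NA rather than "Round of 32" for a seeding above 32:

```r
# map seedings onto rounds using the same boundaries as expected_round
expected_round_vec = function(seeding) {
  as.character(cut(seeding,
                   breaks = c(0, 1, 2, 4, 8, 16, 32),
                   labels = c("Winner", "Finals", "Semi-Finals",
                              "Quarter-Finals", "Round of 16", "Round of 32")))
}

expected_round_vec(c(1, 4, 7, 32))
# [1] "Winner"         "Semi-Finals"    "Quarter-Finals" "Round of 32"
```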

We can then have a look at each of the Wimbledon tournaments and work out how far each seed actually got.

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
 
seeds = function(matches_to_consider) {
  winners =  matches_to_consider %>% filter(!is.na(winner_seeding)) %>% 
    select(name = winner, seeding =  winner_seeding) %>% distinct()
  losers = matches_to_consider %>% filter( !is.na(loser_seeding)) %>% 
    select(name = loser, seeding =  loser_seeding) %>% distinct()
 
  return(rbind(winners, losers) %>% distinct() %>% mutate(name = as.character(name)))
}

Let’s have a look at how the seeds got on last year:

matches_to_consider = main_matches %>% filter(year == 2014)
 
result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
result$round = factor(result$round, levels = rounds, ordered = TRUE)
result$expected = factor(result$expected, levels = rounds, ordered = TRUE) 
 
> result %>% head(10)
Source: local data frame [10 x 4]
 
             name seeding       expected          round
1  Novak Djokovic       1         Winner         Winner
2    Rafael Nadal       2         Finals    Round of 16
3     Andy Murray       3    Semi-Finals Quarter-Finals
4   Roger Federer       4    Semi-Finals         Finals
5   Stan Wawrinka       5 Quarter-Finals Quarter-Finals
6   Tomas Berdych       6 Quarter-Finals    Round of 32
7    David Ferrer       7 Quarter-Finals    Round of 64
8    Milos Raonic       8 Quarter-Finals    Semi-Finals
9      John Isner       9    Round of 16    Round of 32
10  Kei Nishikori      10    Round of 16    Round of 16

We’ll wrap all of that code into the following function:

expectations = function(y, matches) {
  matches_to_consider = matches %>% filter(year == y)  
 
  result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
  result$round = factor(result$round, levels = rounds, ordered = TRUE)
  result$expected = factor(result$expected, levels = rounds, ordered = TRUE)  
 
  return(result)
}

Next, instead of showing the round names it’d be cool to come up with a numerical value indicating how well the player did:

  • -1 would mean they lost in the round before their seeding suggested e.g. seed 2 loses in Semi Final
  • 2 would mean they got 2 rounds further than they should have e.g. Seed 7 reaches the Final

The unclass function comes to our rescue here:

# expectations plot
years = 2005:2014
exp = data.frame()
for(y in years) {
  differences = (expectations(y, main_matches)  %>% 
                   mutate(expected_n = unclass(expected), 
                          round_n = unclass(round), 
                          difference = round_n - expected_n))$difference %>% as.numeric()    
  exp = rbind(exp, data.frame(year = rep(y, length(differences)), difference = differences)) 
}
 
> exp %>% sample_n(10)
Source: local data frame [10 x 6]
 
              name seeding expected_n round_n difference year
1    Tomas Berdych       6          6       5         -1 2011
2    Tomas Berdych       7          6       6          0 2013
3     Rafael Nadal       2          8       5         -3 2014
4    Fabio Fognini      16          5       4         -1 2014
5  Robin Soderling      13          5       5          0 2009
6    Jurgen Melzer      16          5       5          0 2010
7  Nicolas Almagro      19          4       2         -2 2010
8    Stan Wawrinka      14          5       3         -2 2011
9     David Ferrer       7          6       5         -1 2011
10 Mikhail Youzhny      14          5       5          0 2007

We can then group by the ‘difference’ column to see how seeds are getting on as a whole:

> exp %>% count(difference)
Source: local data frame [9 x 2]
 
  difference  n
1         -5  2
2         -4  7
3         -3 24
4         -2 70
5         -1 66
6          0 85
7          1 43
8          2 17
9          3  4
 
library(ggplot2)
ggplot(aes(x = difference, y = n), data = exp %>% count(difference)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(limits=c(min(potential), max(potential) + 1))

So from this visualisation we can see that the most common outcome for a seed is that they reach the round they were expected to reach. There are still a decent number of seeds who do 1 or 2 rounds worse than expected as well though.

Antonios suggested doing some analysis of how the seeds fared on a year by year basis – we’ll start by looking at what % of them exactly achieved their seeding:

exp$correct_pred = 0
exp$correct_pred[exp$difference==0] = 1
 
exp %>% group_by(year) %>% 
  summarise(MeanDiff = mean(difference),
            PrcCorrect = mean(correct_pred),
            N=n())
 
Source: local data frame [10 x 4]
 
   year   MeanDiff PrcCorrect  N
1  2005 -0.6562500  0.2187500 32
2  2006 -0.8125000  0.2812500 32
3  2007 -0.4838710  0.4193548 31
4  2008 -0.9677419  0.2580645 31
5  2009 -0.3750000  0.2500000 32
6  2010 -0.7187500  0.4375000 32
7  2011 -0.7187500  0.0937500 32
8  2012 -0.7500000  0.2812500 32
9  2013 -0.9375000  0.2500000 32
10 2014 -0.7187500  0.1875000 32

Some years are better than others – we can use a chi-squared test to see whether there are any significant differences between the years:

tbl = table(exp$year, exp$correct_pred)
tbl
 
> chisq.test(tbl)
 
	Pearson's Chi-squared test
 
data:  tbl
X-squared = 14.9146, df = 9, p-value = 0.09331

This tests whether the proportion of correct predictions differs between the years; with a p-value of 0.093 there’s no significant evidence that it does. We can also compare each year against each of the others:

> pairwise.prop.test(tbl)
 
	Pairwise comparisons using Pairwise comparison of proportions 
 
data:  tbl 
 
     2005 2006 2007 2008 2009 2010 2011 2012 2013
2006 1.00 -    -    -    -    -    -    -    -   
2007 1.00 1.00 -    -    -    -    -    -    -   
2008 1.00 1.00 1.00 -    -    -    -    -    -   
2009 1.00 1.00 1.00 1.00 -    -    -    -    -   
2010 1.00 1.00 1.00 1.00 1.00 -    -    -    -   
2011 1.00 1.00 0.33 1.00 1.00 0.21 -    -    -   
2012 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -    -   
2013 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -   
2014 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 
P value adjustment method: holm


2007/2011 and 2010/2011 show the biggest differences but they’re still not significant. Since we have so few data items in each bucket there has to be a really massive difference for it to be significant.
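
If you don’t have the gist to hand, the table can be rebuilt from the yearly summary above, since PrcCorrect × N gives the number of seeds who exactly matched their seeding each year. The sketch below does that by hand (the correct and n vectors are my own transcription of that output) and also runs an unadjusted test for 2010 vs 2011, to show how much of the 0.21 comes from the Holm adjustment across 45 pairwise comparisons:

```r
years   = 2005:2014
# PrcCorrect * N from the summary above, e.g. 0.21875 * 32 = 7
correct = c(7, 9, 13, 8, 8, 14, 3, 9, 8, 6)
n       = c(32, 32, 31, 31, 32, 32, 32, 32, 32, 32)

# Two-column table: seeds that missed / matched their seeding, by year
tbl = cbind(incorrect = n - correct, correct = correct)
rownames(tbl) = years

overall = chisq.test(tbl)
print(overall)

# Unadjusted comparison of 2010 (14/32) against 2011 (3/32)
raw = prop.test(c(14, 3), c(32, 32))
print(raw$p.value)

# Holm multiplies the smallest of the 45 pairwise p-values by 45
print(min(1, raw$p.value * choose(10, 2)))
```

The unadjusted p-value is small on its own, but once it’s multiplied up for 45 comparisons it should land close to the 0.21 in the table above.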

The data I used in this post is available on this gist if you want to look into it and come up with your own analysis.

Written by Mark Needham

July 5th, 2015 at 8:38 am

Posted in R

R: Calculating the difference between ordered factor variables

without comments

In my continued exploration of Wimbledon data I wanted to work out whether a player had done as well as their seeding suggested they should.

I therefore wanted to work out the difference between the round they reached and the round they were expected to reach. A ‘round’ in the dataset is an ordered factor variable.

These are all the possible values:

rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

And if we want to factorise a couple of strings into this factor we would do it like this:

round = factor("Finals", levels = rounds, ordered = TRUE)
expected = factor("Winner", levels = rounds, ordered = TRUE)  
 
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
 
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner

In this case the difference between the actual round and expected round should be -1 – the player was expected to win the tournament but lost in the final. We can calculate that difference by calling the unclass function on each variable:

 
> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter"  "Round of 128"   "Round of 64"    "Round of 32"    "Round of 16"    "Quarter-Finals"
[7] "Semi-Finals"    "Finals"         "Winner"

That still seems to have some remnants of the factor variable so to get rid of that we can cast it to a numeric value:

> as.numeric(unclass(round) - unclass(expected))
[1] -1

And that’s it! We can now go and apply this calculation to all seeds to see how they got on.
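
As a rough sketch of what applying it to several players might look like, the same subtraction works on whole columns. The seeds data frame below is made up for illustration, as is the to_round helper; as.integer on a factor returns the level positions directly, so there are no leftover attributes to strip:

```r
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16",
           "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical helper: turn strings into the ordered factor used above
to_round = function(x) factor(x, levels = rounds, ordered = TRUE)

# Made-up rows, not taken from the Wimbledon data set
seeds = data.frame(player   = c("Player A", "Player B", "Player C"),
                   round    = c("Finals", "Round of 16", "Winner"),
                   expected = c("Winner", "Round of 16", "Semi-Finals"),
                   stringsAsFactors = FALSE)

# Level position of the round reached minus level position of the expected round
seeds$difference = as.integer(to_round(seeds$round)) - as.integer(to_round(seeds$expected))
print(seeds)
```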

Written by Mark Needham

July 2nd, 2015 at 10:55 pm

Posted in R

R: write.csv – unimplemented type ‘list’ in ‘EncodeElement’

without comments

Every now and then I want to serialise an R data frame to a CSV file so that, if my R environment crashes, I can load it up again without having to recalculate everything. Recently I ran into the following error when doing so:

> write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol,  : 
  unimplemented type 'list' in 'EncodeElement'

If we take a closer look at the data frame in question it looks ok:

> foo
  col1 col2
1    1    a
2    2    b
3    3    c

However, one of the columns contains a list in each cell and we need to find out which one it is. I’ve found the quickest way is to run the typeof function over each column:

> typeof(foo$col1)
[1] "double"
 
> typeof(foo$col2)
[1] "list"
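
With more than a couple of columns that gets tedious, so here’s a sketch that checks every column in one go with sapply. It builds foo in base R (my assumption about an equivalent way to end up with a list column) so that it runs on its own:

```r
# Build a data frame with a list column without needing dplyr
foo = data.frame(col1 = c(1, 2, 3))
foo$col2 = list("a", "b", "c")

# typeof for every column at once
col_types = sapply(foo, typeof)
print(col_types)

# Names of the columns that write.csv will choke on
list_cols = names(col_types)[col_types == "list"]
print(list_cols)
```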

So ‘col2’ is the problem one, which isn’t surprising if you consider the way I created ‘foo’:

library(dplyr)
foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c"))

If we do have a list that we want to add to the data frame we need to convert it to a vector first so we don’t run into this type of problem:

foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c") %>% unlist())

And now we can write to the CSV file:

write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
$ cat /tmp/foo.csv
"col1","col2"
1,"a"
2,"b"
3,"c"

And that’s it!

Written by Mark Needham

June 30th, 2015 at 10:26 pm

Posted in R

R: Speeding up the Wimbledon scraping job

without comments

Over the past few days I’ve written a few blog posts about a Wimbledon data set I’ve been building and after running the scripts a few times I noticed that it was taking much longer to run than I expected.

To recap, I started out with the following function which takes in a URI and returns a data frame containing a row for each match:

library(rvest)
library(dplyr)
library(stringr)  # for str_trim
 
scrape_matches1 = function(uri) {
  matches = data.frame()
 
  s = html(uri)
  rows = s %>% html_nodes("div#scoresResultsContent tr")
  i = 0
  for(row in rows) {  
    players = row %>% html_nodes("td.day-table-name a")
    seedings = row %>% html_nodes("td.day-table-seed")
    score = row %>% html_node("td.day-table-score a")
    flags = row %>% html_nodes("td.day-table-flag img")
 
    if(!is.null(score)) {
      player1 = players[1] %>% html_text() %>% str_trim()
      seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag1 = flags[1] %>% html_attr("alt")
 
      player2 = players[2] %>% html_text() %>% str_trim()
      seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag2 = flags[2] %>% html_attr("alt")
 
      matches = rbind(data.frame(winner = player1, 
                                 winner_seeding = seeding1, 
                                 winner_flag = flag1,
                                 loser = player2, 
                                 loser_seeding = seeding2,
                                 loser_flag = flag2,
                                 score = score %>% html_text() %>% str_trim(),
                                 round = round), matches)      
    } else {
      round = row %>% html_node("th") %>% html_text()
    }
  } 
  return(matches)
}

Let’s run it to get an idea of the data that it returns:

matches1 = scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results")
 
> matches1 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals
2   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
3 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
6  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
7  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals

As I mentioned, it’s quite slow but I thought I’d wrap it in system.time so I could see exactly how long it was taking:

> system.time(scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results"))
   user  system elapsed 
 25.570   0.111  31.416

About 30 seconds! The first thing I tried was downloading the file separately and running the function against the local file:

> system.time(scrape_matches1("data/raw/2014.html"))
   user  system elapsed 
 25.662   0.123  25.863

Hmmm, that’s only saved us 5 seconds so the bottleneck must be somewhere else. Still there’s no point making an HTTP request every time we run the script so we’ll stick with the local file version.

While browsing rvest’s vignette I noticed a function called html_table which I was curious about. I decided to try and replace some of my code with a call to that:

library(zoo)  # for na.locf
 
matches2 = html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round)
 
> matches2 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding           loser loser_seeding            score          round
1  Novak Djokovic            (1)   Roger Federer           (4) 677 64 764 57 64         Finals
2  Novak Djokovic            (1) Grigor Dimitrov          (11)    64 36 762 767    Semi-Finals
3   Roger Federer            (4)    Milos Raonic           (8)         64 64 64    Semi-Finals
4  Novak Djokovic            (1)     Marin Cilic          (26)  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)     Andy Murray           (3)        61 764 62 Quarter-Finals
6   Roger Federer            (4)   Stan Wawrinka           (5)     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)    Nick Kyrgios          (WC)    674 62 64 764 Quarter-Finals

I had to do some slightly clever stuff to get the ‘round’ column into shape using zoo’s na.locf function, which I wrote about previously.

Unfortunately I couldn’t work out how to extract the flag with this version – that value is hidden in the ‘alt’ attribute of an img and presumably html_table is just grabbing the text value of each cell. This version is much quicker though!

system.time(html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round))
 
   user  system elapsed 
  0.545   0.002   0.548

What I realised from writing this version is that I need to match all the columns with one call to html_nodes rather than getting the row and then each column in a loop.

I rewrote the function to do that:

scrape_matches3 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  matches3 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim())
  )
  return(matches3)
}

Let’s run and time that to check we’re getting back the right results in a timely manner:

matches3 = scrape_matches3("data/raw/2014.html")
 
> matches3 %>% sample_n(10)
                   winner winner_seeding winner_flag               loser loser_seeding loser_flag         score
70           David Ferrer            (7)         ESP Pablo Carreno Busta                      ESP  60 673 61 61
128        Alex Kuznetsov           (26)         USA         Tim Smyczek           (3)        USA   46 63 63 63
220   Rogerio Dutra Silva                        BRA   Kristijan Mesaros                      CRO         62 63
83         Kevin Anderson           (20)         RSA        Aljaz Bedene          (LL)        GBR      63 75 62
73          Kei Nishikori           (10)         JPN   Kenny De Schepper                      FRA     64 765 75
56  Roberto Bautista Agut           (27)         ESP         Jan Hernych           (Q)        CZE   75 46 62 62
138            Ante Pavic                        CRO        Marc Gicquel          (29)        FRA  46 63 765 64
174             Tim Puetz                        GER     Ruben Bemelmans                      BEL         64 62
103        Lleyton Hewitt                        AUS   Michal Przysiezny                      POL 62 6714 61 64
35          Roger Federer            (4)         SUI       Gilles Muller           (Q)        LUX      63 75 63
 
> system.time(scrape_matches3("data/raw/2014.html"))
   user  system elapsed 
  0.815   0.006   0.827

It’s still quick – a bit slower than html_table but we can deal with that. As you can see, I also had to add some logic to separate the values for the winners and losers – the players, seeds and flags each come back as one big list. The odd entries represent the winner; the even entries the loser.
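
The seq(1, length(x), 2) and seq(2, length(x), 2) calls are what do that split. A shorter way to express the same idea, sketched here on a plain character vector rather than the scraped nodes, is to let R recycle a logical index:

```r
# Winner/loser names interleaved the way html_nodes returns them
# (the final and one semi-final from the 2014 results shown earlier)
players = c("Novak Djokovic", "Roger Federer",
            "Roger Federer", "Milos Raonic")

# c(TRUE, FALSE) is recycled along the vector, keeping positions 1, 3, 5, ...
winners = players[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
losers  = players[c(FALSE, TRUE)]

print(data.frame(winner = winners, loser = losers))
```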

Annoyingly we’ve now lost the ‘round’ column because that appears as a table heading so we can’t extract it the same way. I ended up cheating a bit to get it to work by working out how many matches each round should contain and generating a vector with that number of entries:

raw_rounds = s %>% html_nodes("th") %>% html_text()
 
> raw_rounds
 [1] "Finals"               "Semi-Finals"          "Quarter-Finals"       "Round of 16"          "Round of 32"         
 [6] "Round of 64"          "Round of 128"         "3rd Round Qualifying" "2nd Round Qualifying" "1st Round Qualifying"
 
rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
            sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
> rounds[1:10]
 [1] "Finals"         "Semi-Finals"    "Semi-Finals"    "Quarter-Finals" "Quarter-Finals" "Quarter-Finals" "Quarter-Finals"
 [8] "Round of 16"    "Round of 16"    "Round of 16"

Let’s put that code into the function and see if we end up with the same resulting data frame:

scrape_matches4 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  raw_rounds = s %>% html_nodes("th") %>% html_text()
  rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
              sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
  matches4 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim()),
    round          = rounds
  )
  return(matches4)
}
 
matches4 = scrape_matches4("data/raw/2014.html")
 
> matches4 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals
2  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
3   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
6   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals

We shouldn’t have added much to the time but let’s check:

> system.time(scrape_matches4("data/raw/2014.html"))
   user  system elapsed 
  0.816   0.004   0.824

Sweet. We’ve gone from around 26 seconds per page for the local file to under a second, as long as the number of rounds stayed constant over the years. For the 10 years that I’ve looked at it has, but I expect if you go back further the draw sizes will have been different and our script would break.
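
One way to make that breakage loud rather than silent is to check the generated vector against the number of scores actually scraped. This is only a sketch: build_rounds is a hypothetical helper, and raw_rounds is typed in from the output above instead of being scraped:

```r
raw_rounds = c("Finals", "Semi-Finals", "Quarter-Finals", "Round of 16", "Round of 32",
               "Round of 64", "Round of 128", "3rd Round Qualifying",
               "2nd Round Qualifying", "1st Round Qualifying")

# Matches per round: 1, 2, 4, ... 64 for the main draw, then 16, 32, 64 for qualifying
matches_per_round = c(2^(0:6), 2^(4:6))

# Hypothetical helper: build the round labels and fail if the page doesn't fit
build_rounds = function(raw_rounds, n_matches) {
  stopifnot(length(raw_rounds) == length(matches_per_round))
  rounds = rep(raw_rounds, times = matches_per_round)
  stopifnot(length(rounds) == n_matches)
  rounds
}

# In scrape_matches4, n_matches would be length(scores)
rounds = build_rounds(raw_rounds, n_matches = sum(matches_per_round))
print(rounds[1:10])
```

A page with a different draw size would then stop with an error instead of quietly attaching the wrong round to each match.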

For now though this will do!

Written by Mark Needham

June 29th, 2015 at 5:36 am

Posted in R
