## R: I write more in the last week of the month, or do I?

I’ve been writing on this blog for almost 7 years and have always believed that I write more frequently towards the end of a month. Now that I’ve got all the data I thought it’d be interesting to test that belief.

I started with a data frame containing each post and its publication date and added an extra column which works out how many weeks from the end of the month that post was written:

> df %>% sample_n(5) title date 946 Python: Equivalent to flatMap for flattening an array of arrays 2015-03-23 00:45:00 175 Ruby: Hash default value 2010-10-16 14:02:37 375 Java/Scala: Runtime.exec hanging/in 'pipe_w' state 2011-11-20 20:20:08 1319 Coding Dojo #18: Groovy Bowling Game 2009-06-26 08:15:23 381 Continuous Delivery: Removing manual scenarios 2011-12-05 23:13:34 calculate_start_of_week = function(week, year) { date <- ymd(paste(year, 1, 1, sep="-")) week(date) = week return(date) } tidy_df = df %>% mutate(year = year(date), week = week(date), week_in_month = ceiling(day(date) / 7), max_week = max(week_in_month), weeks_from_end = max_week - week_in_month, start_of_week = calculate_start_of_week(week, year)) > tidy_df %>% select(date, weeks_from_end, start_of_week) %>% sample_n(5) date weeks_from_end start_of_week 1023 2008-08-08 21:16:02 3 2008-08-05 800 2014-01-31 06:51:06 0 2014-01-29 859 2014-08-14 10:24:52 3 2014-08-13 107 2010-07-10 22:49:52 3 2010-07-09 386 2011-12-20 23:57:51 2 2011-12-17 |

Next I want to get a count of how many posts were published in a given week. The following code does that transformation for us:

weeks_from_end_counts = tidy_df %>% group_by(start_of_week, weeks_from_end) %>% summarise(count = n()) > weeks_from_end_counts Source: local data frame [540 x 4] Groups: start_of_week, weeks_from_end start_of_week weeks_from_end year count 1 2006-08-27 0 2006 1 2 2006-08-27 4 2006 3 3 2006-09-03 4 2006 1 4 2008-02-05 3 2008 2 5 2008-02-12 3 2008 2 6 2008-07-15 2 2008 1 7 2008-07-22 1 2008 1 8 2008-08-05 3 2008 8 9 2008-08-12 2 2008 5 10 2008-08-12 3 2008 9 .. ... ... ... ... |

We group by both ‘start_of_week’ and ‘weeks_from_end’ because we could have posts published in the same week but different month and we want to capture that difference. Now we can run a correlation on the data frame to see if there’s any relationship between ‘count’ and ‘weeks_from_end’:

> cor(weeks_from_end_counts %>% ungroup() %>% select(weeks_from_end, count)) weeks_from_end count weeks_from_end 1.00000000 -0.08253569 count -0.08253569 1.00000000 |

This suggests there’s a slight negative correlation between the two variables i.e. ‘count’ decreases as ‘weeks_from_end’ increases. Let’s plug the data frame into a linear model to see how good ‘weeks_from_end’ is as a predictor of ‘count’:

> fit = lm(count ~ weeks_from_end, weeks_from_end_counts) > summary(fit) Call: lm(formula = count ~ weeks_from_end, data = weeks_from_end_counts) Residuals: Min 1Q Median 3Q Max -2.0000 -1.5758 -0.5758 1.1060 8.0000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.00000 0.13764 21.795 <2e-16 *** weeks_from_end -0.10605 0.05521 -1.921 0.0553 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.698 on 538 degrees of freedom Multiple R-squared: 0.006812, Adjusted R-squared: 0.004966 F-statistic: 3.69 on 1 and 538 DF, p-value: 0.05527 |

We see a similar result here. The effect of ‘weeks_from_end’ is worth 0.1 posts per week with a p value of 0.0553 so it’s on the border line of being significant.

We also have a very low ‘R squared’ value which suggests the ‘weeks_from_end’ isn’t explaining much of the variation in the data which makes sense given that we didn’t see much of a correlation.

If we charged on and wanted to predict the number of posts likely to be published in a given week we could run the predict function like this:

> predict(fit, data.frame(weeks_from_end=c(1,2,3,4,5))) 1 2 3 4 5 2.893952 2.787905 2.681859 2.575812 2.469766 |

Obviously it’s a bit flawed since we could plug in any numeric value we want, even ones that don’t make any sense, and it’d still come back with a prediction:

> predict(fit, data.frame(weeks_from_end=c(30 ,-10))) 1 2 -0.181394 4.060462 |

I think we’d probably protect against that with a function wrapping our call to predict that doesn’t allow ‘weeks_from_end’ to be greater than 5 or less than 0.

So far it looks like my belief is incorrect! I’m a bit dubious about my calculation of ‘weeks_from_end’ though – it’s not completely capturing what I want since in some months the last week only contains a couple of days.

Next I’m going to explore whether it makes any difference if I calculate that value by counting the number of days back from the last day of the month rather than using week number.