Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Sentiment analysis of morning pages

without comments

A couple of months ago I came across a cool blog post by Julia Silge where she runs a sentiment analysis algorithm over her tweet stream to see how her tweet sentiment has varied over time.

I wanted to give it a try but couldn’t figure out how to get a dump of my tweets, so I decided to try it out on the text from my morning pages writing, which I’ve been experimenting with for a few months.

Here’s an explanation of morning pages if you haven’t come across it before:

Morning Pages are three pages of longhand, stream of consciousness writing, done first thing in the morning.

There is no wrong way to do Morning Pages – they are not high art.

They are not even “writing.” They are about anything and everything that crosses your mind – and they are for your eyes only.

Morning Pages provoke, clarify, comfort, cajole, prioritize and synchronize the day at hand. Do not over-think Morning Pages: just put three pages of anything on the page…and then do three more pages tomorrow.

Most of my writing is complete gibberish but I thought it’d be fun to see how my mood changes over time and see if it identifies any peaks or troughs in sentiment that I could then look into further.

I’ve got one file per day so we’ll start by building a data frame containing the text, one row per day:

library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
 
root="/path/to/files"
files = list.files(root)
 
df = data.frame(file = files, stringsAsFactors=FALSE)
df$fullPath = paste(root, df$file, sep = "/")
df$text = sapply(df$fullPath, get_text_as_string)

We end up with a data frame with 3 fields:

> names(df)
 
[1] "file"     "fullPath" "text"

Next we’ll run the sentiment analysis function – syuzhet#get_nrc_sentiment – over the data frame and get a score for each type of sentiment for each entry:

get_nrc_sentiment(df$text) %>% head()
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     7           14       5    7   8       6        6    12       14       27
2    11           12       2   13   9      10        4    11       22       24
3     6           12       3    8   7       7        5    13       16       21
4     5           17       4    7  10       6        7    13       16       37
5     4           13       3    7   7       9        5    14       16       25
6     7           11       5    7   8       8        6    15       16       26

Now we’ll merge these columns into our original data frame:

df = cbind(df, get_nrc_sentiment(df$text))
df$date = ymd(sapply(df$file, function(file) unlist(strsplit(file, "[.]"))[1]))
df %>% select(-text, -fullPath, -file) %>% head()
 
  anger anticipation disgust fear joy sadness surprise trust negative positive       date
1     7           14       5    7   8       6        6    12       14       27 2016-01-02
2    11           12       2   13   9      10        4    11       22       24 2016-01-03
3     6           12       3    8   7       7        5    13       16       21 2016-01-04
4     5           17       4    7  10       6        7    13       16       37 2016-01-05
5     4           13       3    7   7       9        5    14       16       25 2016-01-06
6     7           11       5    7   8       8        6    15       16       26 2016-01-07

Finally we can build some ‘sentiment over time’ charts like Julia has in her post:

posnegtime <- df %>% 
  group_by(date = cut(date, breaks="1 week")) %>%
  summarise(negative = mean(negative), positive = mean(positive)) %>% 
  melt
 
names(posnegtime) <- c("date", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])
 
ggplot(data = posnegtime, aes(x = as.Date(date), y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Over Time")

[Chart: average positive and negative sentiment per week]

So overall it seems like my writing displays more positive sentiment than negative, which is nice to know. The chart shows weekly averages and there isn’t a single week where there’s more negative sentiment than positive.

I thought it’d be fun to drill into the highest negative and positive days to see what was going on there:

> df %>% filter(negative == max(negative)) %>% select(date)
 
        date
1 2016-03-19
 
> df %>% filter(positive == max(positive)) %>% select(date)
 
        date
1 2016-01-05
2 2016-06-20

On the 19th March I was really frustrated because my boiler had broken down and I had to buy a new one – I’d completely forgotten how annoyed I was, so thanks sentiment analysis for reminding me!

I couldn’t find anything particularly positive on the 5th January or 20th June. The 5th January was the day after my birthday so perhaps I was happy about that but I couldn’t see any particular evidence that was the case.
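
For reference, here’s a quick way to pull up the raw text for a given day so you can eyeball what the scores are reacting to – a small sketch of my own using the data frame built above, not code from the original analysis:

# My addition: inspect the raw text for a single day
df %>% filter(date == ymd("2016-01-05")) %>% pull(text)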

Playing around with the get_nrc_sentiment function, it does seem to identify positive sentiment where I wouldn’t say there is any. For example, here are a couple of sentences from my writing today:

> get_nrc_sentiment("There was one section that I didn't quite understand so will have another go at reading that.")
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   0       0        0     0        0        1
> get_nrc_sentiment("Bit of a delay in starting my writing for the day...for some reason was feeling wheezy again.")
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     2            1       2    2   1       2        1     1        2        2

I don’t think there’s any positive sentiment in either of those sentences, but the function claims three bits of positive sentiment between them! It would be interesting to see whether I fare any better with Stanford’s sentiment extraction tool, which you can use via syuzhet but which requires a bit of setup first.

I’ll give that a try next, but in terms of getting an overview of my mood I thought I might get a better picture by looking at the difference between positive and negative sentiment rather than the absolute values.

The following code does the trick:

difftime <- df %>% 
  group_by(date = cut(date, breaks="1 week")) %>%
  summarise(diff = mean(positive) - mean(negative))
 
ggplot(data = difftime, aes(x = as.Date(date), y = diff)) +
  geom_line(size = 2.5, alpha = 0.7) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) +
  ylab("Average sentiment difference score") + 
  ggtitle("Sentiment Over Time")
[Chart: average difference between positive and negative sentiment per week]

This one identifies peak happiness in mid January/February. We can find the peak day for this measure as well:

> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(date)
 
        date
1 2016-02-25

Or if we want to see the individual scores:

> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(-text, -file, -fullPath)
 
  anger anticipation disgust fear joy sadness surprise trust negative positive       date diff
1     0           11       2    3   7       1        6     6        3       31 2016-02-25   28

After reading through the entry for this day I’m wondering if the individual pieces of sentiment might be more interesting than the positive/negative score.

On the 25th February I was:

  • quite excited about reading a distributed systems book I’d just bought (I know?!)
  • thinking about how to apply the tag clustering technique to meetup topics
  • preparing my submission to PyData London and thinking about what was gonna go in it
  • thinking about the soak testing we were about to start doing on our project

Each of those is a type of anticipation so it makes sense that this day scores highly. I looked through some other days which specifically rank highly for anticipation and couldn’t figure out what I was anticipating so even this is a bit hit and miss!
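
For what it’s worth, here’s a quick way to surface those high-anticipation days – my own sketch against the data frame from earlier rather than code from the original analysis:

# My addition: the days with the highest anticipation scores
df %>% arrange(desc(anticipation)) %>% select(date, anticipation) %>% head(5)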

I have a few avenues to explore further but if you have any other ideas for what I can try next let me know in the comments.

Written by Mark Needham

July 9th, 2016 at 6:36 am

Posted in R


R: substr – Getting a vector of positions

with 2 comments

I recently found myself writing an R script to extract parts of a string based on a beginning and end index which is reasonably easy using the substr function:

> substr("mark loves graphs", 0, 4)
[1] "mark"

But what if we have a vector of start and end positions?

> substr("mark loves graphs", c(0, 6), c(4, 10))
[1] "mark"

Hmmm, that didn’t work as I expected – substr only uses the first start and stop values. It turns out we actually need the vectorised substring function instead, which wasn’t initially obvious to me from reading the documentation:

> substring("mark loves graphs", c(0, 6, 12), c(4, 10, 17))
[1] "mark"   "loves"  "graphs"

Easy when you know how!
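
As a small extension – my addition, not something from the original post – we can compute the start and end positions rather than hard-coding them, and substring happily accepts the resulting vectors:

# My addition: derive the word boundaries with gregexpr instead of hard-coding them
text = "mark loves graphs"
spaces = unlist(gregexpr(" ", text))   # positions of the spaces: 5, 11
starts = c(1, spaces + 1)              # 1, 6, 12
ends   = c(spaces - 1, nchar(text))    # 4, 10, 17
substring(text, starts, ends)
# [1] "mark"   "loves"  "graphs"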

Written by Mark Needham

April 18th, 2016 at 7:49 pm

Posted in R


R: tm – Unique words/terms per document

without comments

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document, which isn’t how it works out of the box.

For example let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. On the last line we output a TermDocumentMatrix which shows how frequently each term shows up across the corpus. I had to tweak the default minimum word length of 3 so that short terms like ‘am’, ‘i’ and ‘is’ weren’t dropped.

But we’ve actually got some duplicate terms in each of our documents so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!
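
An alternative worth knowing about – a sketch on my part rather than something from the original post – is to skip the transformation entirely and ask TermDocumentMatrix for binary weighting, which also caps each term at one count per document:

# My addition: binary weighting counts each term at most once per document
corpus2 = VCorpus(VectorSource(text))
tdm2 = as.matrix(TermDocumentMatrix(corpus2,
         control = list(wordLengths = c(1, Inf), weighting = weightBin)))
rowSums(tdm2)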

Written by Mark Needham

April 11th, 2016 at 5:40 am

Posted in R


R: Error in approxfun(x.values.1, y.values.1, method = “constant”, f = 1, : zero non-NA points

without comments

I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.

Our dataset has a class imbalance i.e. most people RSVP ‘no’ to events, which can lead to a misleading accuracy score: a model that predicts ‘no’ every time would appear to be highly accurate.

Source: local data frame [2 x 2]
 
  attended     n
     (dbl) (int)
1        0  1541
2        1    53

I up-sampled the data using caret’s upSample function to address this:

attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))

I then trained a logistic regression model, but when I tried to plot the ROC curve I ran into trouble:

p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
 
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1,  : 
  zero non-NA points

I don’t have any NA values in my data frame so this message was a bit confusing to start with. As usual Stack Overflow came to the rescue with the suggestion that I was probably missing either positive or negative values for the dependent variable, i.e. ‘attended’.

A quick count on the test data frame using dplyr confirmed my mistake:

> test %>% count(attended)
Source: local data frame [1 x 2]
 
  attended     n
     (dbl) (int)
1        1   582

I’ll have to randomly sort the data frame and then reassign my training and test data frames to work around it.
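
A rough sketch of that workaround – my addition, assuming the up-sampled data frame from above and an 80/20 split:

# My addition: shuffle the rows, then re-split so both classes appear in train and test
set.seed(42)
shuffled = upSampledDf[sample(nrow(upSampledDf)), ]
trainSize = floor(0.8 * nrow(shuffled))
train = shuffled[1:trainSize, ]
test  = shuffled[(trainSize + 1):nrow(shuffled), ]
table(test$attended)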

Written by Mark Needham

December 27th, 2015 at 12:24 pm

Posted in R


R: data.table – Finding the maximum row

with 2 comments

In my continued playing around with the R data.table package I wanted to find the maximum row based on one of the columns, grouped by another column, and then return the whole row.

We’ll use the following data table to illustrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                    price = c(10000, 12500, 18000, 245000, 512000, 1000000),
                    date = c("Day 1", "Day 7", "Day 10", "Day 3", "Day 5", "Day 12"))
 
> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT[, diff := price - lag.price]
 
> blogDT
         name   price   date lag.price   diff
1: Property 1   10000  Day 1        NA     NA
2: Property 1   12500  Day 7     10000   2500
3: Property 1   18000 Day 10     12500   5500
4: Property 2  245000  Day 3        NA     NA
5: Property 2  512000  Day 5    245000 267000
6: Property 2 1000000 Day 12    512000 488000

I wanted to find the biggest difference in ‘price’ and ‘lag.price’ grouped by the ‘name’ column.

If we just want to get the max ‘diff’ grouped by ‘name’ it’s quite easy:

> blogDT[!is.na(diff), .(max = max(diff)), keyby = name]
         name    max
1: Property 1   5500
2: Property 2 488000

However now we’ve lost the information about which ‘date’ that was on and what the ‘price’ and ‘lag.price’ were which is a bit annoying.

If we only want to keep the rows which have the highest ‘diff’ grouped by ‘name’, one way to go about this is to add a ‘max’ column to each row and then filter appropriately. e.g.

> maxDT = blogDT[!is.na(diff)][, max := max(diff), by = name]
 
> maxDT
         name   price   date lag.price   diff    max
1: Property 1   12500  Day 7     10000   2500   5500
2: Property 1   18000 Day 10     12500   5500   5500
3: Property 2  512000  Day 5    245000 267000 488000
4: Property 2 1000000 Day 12    512000 488000 488000
 
> maxDT[diff == max]
         name   price   date lag.price   diff    max
1: Property 1   18000 Day 10     12500   5500   5500
2: Property 2 1000000 Day 12    512000 488000 488000

I’ve only been playing with data.table for a few days so if there’s a better way do let me know in the comments.
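
For what it’s worth, a more compact route – a sketch on my part rather than something from the original post – is to let data.table pick the row itself using .SD and which.max:

# My addition: keep the whole max-diff row per name in one step
blogDT[!is.na(diff), .SD[which.max(diff)], by = name]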

Written by Mark Needham

October 2nd, 2015 at 6:42 pm

Posted in R


R: data.table – Comparing adjacent rows

with 2 comments


As part of my exploration of the Land Registry price paid data set I wanted to calculate the difference between consecutive sales of the same property.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row unless it’s the first sale in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))
 
> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000

Next let’s calculate the difference between the two prices:

> blogDT[, diff := price - lag.price]
> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000

Finally let’s order the data table by the biggest price gains:

> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA
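
Incidentally, newer versions of data.table ship a shift() function which expresses the lag more directly – a sketch of my own which, as far as I can tell, is equivalent to the lag line above:

# My addition: shift() defaults to a lag of 1, filling the first row of each group with NA
blogDT[, lag.price := shift(price), by = name]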

Written by Mark Needham

September 27th, 2015 at 10:02 pm

Posted in R


R: Querying a 20 million line CSV file – data.table vs data frame

with 8 comments

As I mentioned in a couple of blog posts already, I’ve been exploring the Land Registry price paid data set and although I’d initially been using SparkR I was curious how easy it would be to explore the data set using plain R.

I thought I’d start out by loading the data into a data frame and running the same queries using dplyr.

I’d come across Hadley Wickham’s readr library before but hadn’t used it, and since I needed to load a 20 million line CSV file this seemed the perfect time to give it a try.

The goal of readr is to provide a fast and friendly way to read tabular data into R.

Let’s get started:

> library(readr)
 
> system.time(read_csv("pp-complete.csv", col_names = FALSE))
   user  system elapsed 
127.367  21.957 159.963 
 
> df = read_csv("pp-complete.csv", col_names = FALSE)

So it took a little over 2 minutes to process the CSV file into a data frame. Let’s take a quick look at its contents:

> head(df)
Source: local data frame [6 x 16]
 
                                      X1     X2     X3       X4    X5    X6    X7    X8    X9
                                   (chr)  (int) (date)    (chr) (chr) (chr) (chr) (chr) (chr)
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000   <NA>  UB5 4PJ     T     N     F   106      
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500   <NA> TA19 9DD     D     N     F    58      
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000   <NA>   W4 1DZ     F     N     L    58      
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000   <NA> NE61 2BH     D     N     F    17      
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995   <NA> PE33 0RU     D     N     F     4      
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000   <NA> NR35 2SF     T     N     F     5      
Variables not shown: X10 (chr), X11 (chr), X12 (chr), X13 (chr), X14 (chr), X15 (chr), address (chr)

Now let’s query the data frame to see which postcode has the highest average sale price. We’ll need to group by the ‘X4’ column before applying some aggregate functions:

> library(dplyr)
 
> system.time(df %>% 
                group_by(X4) %>% 
                summarise(total = sum(as.numeric(X2)), count = n(), ave = total / count) %>%
                arrange(desc(ave)))
   user  system elapsed 
122.557   1.135 124.211 
 
Source: local data frame [1,164,396 x 4]
 
         X4     total count      ave
      (chr)     (dbl) (int)    (dbl)
1   SW7 1DW  39000000     1 39000000
2  SW1W 0NH  32477000     1 32477000
3   W1K 7PX  27000000     1 27000000
4  SW1Y 6HD  24750000     1 24750000
5   SW6 1BA  18000000     1 18000000
6  SW1X 7EE 101505645     6 16917608
7    N6 4LA  16850000     1 16850000
8  EC4N 5AE  16500000     1 16500000
9    W8 7EA  82075000     6 13679167
10  W1K 1DP  13500000     1 13500000

What if, instead of the average price by post code, we want to find the most expensive property ever sold?

> system.time(df %>% group_by(X4) %>% summarise(max = max(X2)) %>% arrange(desc(max)))
 
   user  system elapsed 
 35.438   0.478  36.026 
 
Source: local data frame [1,164,396 x 2]
 
         X4      max
      (chr)    (int)
1  SW10 9SU 54959000
2   SW7 1QJ 50000000
3  SW1X 8HG 46013365
4   SW7 1DW 39000000
5  SW1W 0NH 32477000
6  SW1X 7LJ 29350000
7    W8 7EA 27900000
8   SW3 3SR 27750000
9   W1K 7PX 27000000
10 SW1X 7EE 25533000
..      ...      ...

Interestingly that one was much quicker than the first one even though it seems like we only did slightly less work.

At this point I mentioned my experiment to Ashok who suggested I give data.table a try to see if that fared any better. I’d not used it before but was able to get it up and running reasonably quickly:

> library(data.table)
 
> system.time(fread("pp-complete.csv", header = FALSE))
Read 20075122 rows and 15 (of 15) columns from 3.221 GB file in 00:01:05
   user  system elapsed 
 59.324   5.798  68.956 
 
> dt = fread("pp-complete.csv", header = FALSE)
 
> head(dt)
                                       V1     V2               V3       V4 V5 V6 V7  V8 V9
1: {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N  F 106   
2: {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N  F  58   
3: {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N  L  58   
4: {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N  F  17   
5: {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N  F   4   
6: {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N  F   5   
               V10         V11         V12                          V13            V14 V15
1:    READING ROAD    NORTHOLT    NORTHOLT                       EALING GREATER LONDON   A
2:    ADAMS MEADOW   ILMINSTER   ILMINSTER               SOUTH SOMERSET       SOMERSET   A
3:   WHELLOCK ROAD                  LONDON                       EALING GREATER LONDON   A
4:        WESTGATE     MORPETH     MORPETH               CASTLE MORPETH NORTHUMBERLAND   A
5:   MASON GARDENS  WEST WINCH KING'S LYNN KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6: WILD FLOWER WAY DITCHINGHAM      BUNGAY                SOUTH NORFOLK        NORFOLK   A

So we’ve already gained one minute in the parsing time which is pretty nice. Let’s try and find the postcode with the highest average price:

> dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10]
Error in sum(V2) : invalid 'type' (character) of argument

Hmmm, seems like we need to make column ‘V2’ numeric. Let’s do that:

> dt = dt[, V2:= as.numeric(V2)]
 
> dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10]
   user  system elapsed 
  5.108   0.670   6.183 
 
          V4       V1
 1:  SW7 1DW 39000000
 2: SW1W 0NH 32477000
 3:  W1K 7PX 27000000
 4: SW1Y 6HD 24750000
 5:  SW6 1BA 18000000
 6: SW1X 7EE 16917608
 7:   N6 4LA 16850000
 8: EC4N 5AE 16500000
 9:   W8 7EA 13679167
10:  W1K 1DP 13500000

That’s quite a bit faster than our data frame version – ~5 seconds compared to ~2 minutes. We have lost the total sales and number of sales columns but I expect that’s just because my data.table foo is weak and we could keep them if we wanted.
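
If we did want to keep those columns, something along these lines should work – a sketch on my part rather than something I’ve timed against the full data set:

# My addition: keep the total and count, then derive the average
dt[, .(total = sum(V2), count = .N), by = V4][, ave := total / count][order(-ave)][1:10]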

But a good start in terms of execution time. Now let’s try the maximum sale price by post code query:

> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed 
  3.684   0.358   4.132 
 
          V4       V1
 1: SW10 9SU 54959000
 2:  SW7 1QJ 50000000
 3: SW1X 8HG 46013365
 4:  SW7 1DW 39000000
 5: SW1W 0NH 32477000
 6: SW1X 7LJ 29350000
 7:   W8 7EA 27900000
 8:  SW3 3SR 27750000
 9:  W1K 7PX 27000000
10: SW1X 7EE 25533000

We’ve got the same results as before and this time it took ~4 seconds compared to ~35 seconds.

We can actually do even better if we set the postcode column as a key:

> setkey(dt, V4)
 
> system.time(dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10])
   user  system elapsed 
  1.500   0.047   1.548 
 
> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed 
  0.578   0.026   0.604

And that’s as far as I’ve got with my experiment. If there’s anything else I can do to make either of the versions quicker do let me know in the comments.

Oh and for a bit of commentary on what we can learn from the queries…Knightsbridge is a seriously expensive area to live in!

Written by Mark Needham

September 25th, 2015 at 6:28 am

Posted in R


R: Bootstrap confidence intervals

without comments

I recently came across an interesting post on Julia Evans’ blog showing how to generate a bigger set of data points by sampling the small set of data points that we actually have using bootstrapping. Julia’s examples are all in Python so I thought it’d be a fun exercise to translate them into R.

We’re doing the bootstrapping to simulate the number of no-shows for a flight so we can work out how many seats we can overbook the plane by.

We start out with a small sample of no-shows and work off the assumption that it’s ok to kick someone off a flight 5% of the time. Let’s work out how many people that’d be for our initial sample:

> data = c(0, 1, 3, 2, 8, 2, 3, 4)
> quantile(data, 0.05)
  5% 
0.35

0.35 people! That’s not a particularly useful result so we’re going to resample the initial data set 10,000 times, taking the 5%ile each time and see if we come up with something better:

We’re going to use the sample function with replacement to generate our resamples:

> sample(data, replace = TRUE)
[1] 0 3 2 8 8 0 8 0
> sample(data, replace = TRUE)
[1] 2 2 4 3 4 4 2 2

Now let’s write a function to do that multiple times:

library(ggplot2)
 
bootstrap_5th_percentile = function(data, n_bootstraps) {
  return(sapply(1:n_bootstraps, 
                function(iteration) quantile(sample(data, replace = TRUE), 0.05)))
}
 
values = bootstrap_5th_percentile(data, 10000)
 
ggplot(aes(x = value), data = data.frame(value = values)) + geom_histogram(binwidth=0.25)

[Histogram: bootstrapped 5th percentile values from the initial sample]

So this visualisation is telling us that we can oversell by 0-2 people but we don’t know an exact number.

Let’s try the same exercise but with a bigger initial data set of 1,000 values rather than just 8. First we’ll generate a distribution (with a mean of 5 and standard deviation of 2) and visualise it:

library(dplyr)
 
df = data.frame(value = rnorm(1000,5, 2))
df = df %>% filter(value >= 0) %>% mutate(value = as.integer(round(value)))
ggplot(aes(x = value), data = df) + geom_histogram(binwidth=1)

[Histogram: the simulated no-show distribution of 1,000 values]

Our distribution seems to have a lot more values around 4 & 5 whereas the Python version has a flatter distribution – I’m not sure why that is so if you have any ideas let me know. In any case, let’s check the 5%ile for this data set:

> quantile(df$value, 0.05)
5% 
 2

Cool! Now at least we have an integer value rather than the 0.35 we got earlier. Finally let’s do some bootstrapping over our new distribution and see what 5%ile we come up with:

resampled = bootstrap_5th_percentile(df$value, 10000)
byValue = data.frame(value = resampled) %>% count(value)
 
> byValue
Source: local data frame [3 x 2]
 
  value    n
1   1.0    3
2   1.7    2
3   2.0 9995
 
ggplot(aes(x = value, y = n), data = byValue) + geom_bar(stat = "identity")

[Bar chart: counts of the bootstrapped 5th percentile values]

‘2’ is by far the most popular 5%ile here although it seems weighted more towards that value than with Julia’s Python version, which I imagine is because we seem to have sampled from a slightly different distribution.
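
As a final touch – my addition, given the title of the post – we can turn the resampled values into an explicit confidence interval rather than eyeballing the histogram:

# My addition: a 95% interval for the bootstrapped 5th percentile
quantile(resampled, c(0.025, 0.975))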

Written by Mark Needham

July 19th, 2015 at 7:44 pm

Posted in R


R: Blog post frequency anomaly detection

without comments

I came across Twitter’s anomaly detection library last year but hadn’t had a reason to take it for a test run, so having got my blog post frequency data into shape I thought it’d be fun to run it through the algorithm.

I wanted to see if it would detect any periods of time when the number of posts differed significantly – I don’t really have an action I’m going to take based on the results, it’s curiosity more than anything else!

First we need to get the library installed. It’s not on CRAN so we need to use devtools to install it from the github repository:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

The expected data format is two columns – one containing a time stamp and the other a count – e.g. using the ‘raw_data’ data frame that comes into scope when you load the library:

> library(dplyr)
> raw_data %>% head()
            timestamp   count
1 1980-09-25 14:01:00 182.478
2 1980-09-25 14:02:00 176.231
3 1980-09-25 14:03:00 183.917
4 1980-09-25 14:04:00 177.798
5 1980-09-25 14:05:00 165.469
6 1980-09-25 14:06:00 181.878

In our case the timestamps will be the start date of a week and the count the number of posts in that week. But first let’s get some practice calling the anomaly function using the canned data:

res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot

[Chart: anomalies detected in the raw_data example time series]

From this visualisation we learn that we should expect both high and low outliers to be identified. Let’s give it a try with the blog post publication data.

We need to get the data into shape so we’ll start by getting a count of the number of blog posts by (week, year) pair:

> df %>% sample_n(5)
                                                           title                date
1425                            Coding: Copy/Paste then refactor 2009-10-31 07:54:31
783  Neo4j 2.0.0-M06 -> 2.0.0-RC1: Working with path expressions 2013-11-23 10:30:41
960                                        R: Removing for loops 2015-04-18 23:53:20
966   R: dplyr - Error in (list: invalid subscript type 'double' 2015-04-27 22:34:43
343                     Parsing XML from the unix terminal/shell 2011-09-03 23:42:11
 
> byWeek = df %>% 
    mutate(year = year(date), week = week(date)) %>% 
    group_by(week, year) %>% summarise(n = n()) %>% 
    ungroup() %>% arrange(desc(n))
 
> byWeek %>% sample_n(5)
Source: local data frame [5 x 3]
 
  week year n
1   44 2009 6
2   37 2011 4
3   39 2012 3
4    7 2013 4
5    6 2010 6

Great. The next step is to translate this data frame into one containing a date representing the start of that week and the number of posts:

> data = byWeek %>% 
    mutate(start_of_week = calculate_start_of_week(week, year)) %>%
    filter(start_of_week > ymd("2008-07-01")) %>%
    select(start_of_week, n)
 
> data %>% sample_n(5)
Source: local data frame [5 x 2]
 
  start_of_week n
1    2010-09-10 4
2    2013-04-09 4
3    2010-04-30 6
4    2012-03-11 3
5    2014-12-03 3

We’re now ready to plug it into the anomaly detection function:

res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both', 
                         plot=TRUE)
res$plot

[Chart: anomalies detected in weekly blog post counts]

Interestingly I don’t seem to have any low end anomalies – there were a couple of really high frequency weeks when I first started writing and I think one of the other weeks contains a New Year’s Eve when I was particularly bored!
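
Before grouping by month, note that the next snippet relies on a byMonth data frame that isn’t defined in the post; presumably it’s a month-level twin of byWeek, something like this sketch:

# My assumption: posts per (month, year), built the same way as byWeek
byMonth = df %>% 
  mutate(year = year(date), month = month(date)) %>% 
  group_by(month, year) %>% 
  summarise(n = n()) %>% 
  ungroup()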

If we group by month instead only the very first month stands out as an outlier:

data = byMonth %>% 
  mutate(start_of_month = ymd(paste(year, month, 1, sep="-"))) %>%
  filter(start_of_month > ymd("2008-07-01")) %>%
  select(start_of_month, n)
res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both',       
                         #longterm = TRUE,
                         plot=TRUE)
res$plot

[Chart: anomalies detected in monthly blog post counts]

I’m not sure what else to do as far as anomaly detection goes but if you have any ideas please let me know!

Written by Mark Needham

July 17th, 2015 at 11:34 pm

Posted in R


R: I write more in the last week of the month, or do I?

without comments

I’ve been writing on this blog for almost 7 years and have always believed that I write more frequently towards the end of a month. Now that I’ve got all the data I thought it’d be interesting to test that belief.

I started with a data frame containing each post and its publication date and added an extra column which works out how many weeks from the end of the month that post was written:

> df %>% sample_n(5)
                                                               title                date
946  Python: Equivalent to flatMap for flattening an array of arrays 2015-03-23 00:45:00
175                                         Ruby: Hash default value 2010-10-16 14:02:37
375               Java/Scala: Runtime.exec hanging/in 'pipe_w' state 2011-11-20 20:20:08
1319                            Coding Dojo #18: Groovy Bowling Game 2009-06-26 08:15:23
381                   Continuous Delivery: Removing manual scenarios 2011-12-05 23:13:34
 
calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
tidy_df  = df %>% 
  mutate(year = year(date), 
         week = week(date),
         week_in_month = ceiling(day(date) / 7),
         max_week = max(week_in_month), 
         weeks_from_end = max_week - week_in_month,
         start_of_week = calculate_start_of_week(week, year))
 
> tidy_df %>% select(date, weeks_from_end, start_of_week) %>% sample_n(5)
 
                    date weeks_from_end start_of_week
1023 2008-08-08 21:16:02              3    2008-08-05
800  2014-01-31 06:51:06              0    2014-01-29
859  2014-08-14 10:24:52              3    2014-08-13
107  2010-07-10 22:49:52              3    2010-07-09
386  2011-12-20 23:57:51              2    2011-12-17

Next I want to get a count of how many posts were published in a given week. The following code does that transformation for us:

weeks_from_end_counts =  tidy_df %>%
  group_by(start_of_week, weeks_from_end) %>%
  summarise(count = n())
 
> weeks_from_end_counts
Source: local data frame [540 x 4]
Groups: start_of_week, weeks_from_end
 
   start_of_week weeks_from_end year count
1     2006-08-27              0 2006     1
2     2006-08-27              4 2006     3
3     2006-09-03              4 2006     1
4     2008-02-05              3 2008     2
5     2008-02-12              3 2008     2
6     2008-07-15              2 2008     1
7     2008-07-22              1 2008     1
8     2008-08-05              3 2008     8
9     2008-08-12              2 2008     5
10    2008-08-12              3 2008     9
..           ...            ...  ...   ...

We group by both ‘start_of_week’ and ‘weeks_from_end’ because we could have posts published in the same week but different month and we want to capture that difference. Now we can run a correlation on the data frame to see if there’s any relationship between ‘count’ and ‘weeks_from_end’:

> cor(weeks_from_end_counts %>% ungroup() %>% select(weeks_from_end, count))
               weeks_from_end       count
weeks_from_end     1.00000000 -0.08253569
count             -0.08253569  1.00000000

This suggests there’s a slight negative correlation between the two variables i.e. ‘count’ decreases as ‘weeks_from_end’ increases. Let’s plug the data frame into a linear model to see how good ‘weeks_from_end’ is as a predictor of ‘count’:

> fit = lm(count ~ weeks_from_end, weeks_from_end_counts)
 
> summary(fit)
 
Call:
lm(formula = count ~ weeks_from_end, data = weeks_from_end_counts)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-2.0000 -1.5758 -0.5758  1.1060  8.0000 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.00000    0.13764  21.795   <2e-16 ***
weeks_from_end -0.10605    0.05521  -1.921   0.0553 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.698 on 538 degrees of freedom
Multiple R-squared:  0.006812,	Adjusted R-squared:  0.004966 
F-statistic:  3.69 on 1 and 538 DF,  p-value: 0.05527

We see a similar result here. Each additional week from the end of the month is worth about -0.1 posts – i.e. the predicted count drops slightly – with a p value of 0.0553, so it’s on the borderline of being significant.

We also have a very low ‘R squared’ value which suggests the ‘weeks_from_end’ isn’t explaining much of the variation in the data which makes sense given that we didn’t see much of a correlation.

If we charged on and wanted to predict the number of posts likely to be published in a given week we could run the predict function like this:

> predict(fit, data.frame(weeks_from_end=c(1,2,3,4,5)))
       1        2        3        4        5 
2.893952 2.787905 2.681859 2.575812 2.469766

Obviously it’s a bit flawed since we could plug in any numeric value we want, even ones that don’t make any sense, and it’d still come back with a prediction:

> predict(fit, data.frame(weeks_from_end=c(30 ,-10)))
        1         2 
-0.181394  4.060462

I think we’d probably protect against that with a function wrapping our call to predict that doesn’t allow ‘weeks_from_end’ to be greater than 5 or less than 0.
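
A minimal sketch of that guard – my addition:

# My addition: refuse predictions outside the observed 0-5 range
predict_posts = function(fit, weeks_from_end) {
  stopifnot(all(weeks_from_end >= 0 & weeks_from_end <= 5))
  predict(fit, data.frame(weeks_from_end = weeks_from_end))
}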

So far it looks like my belief is incorrect! I’m a bit dubious about my calculation of ‘weeks_from_end’ though – it’s not completely capturing what I want since in some months the last week only contains a couple of days.

Next I’m going to explore whether it makes any difference if I calculate that value by counting the number of days back from the last day of the month rather than using week number.
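
For reference, that alternative calculation might look something like this – a sketch assuming lubridate’s days_in_month:

# My addition: days between the post date and the last day of its month
tidy_df = tidy_df %>% 
  mutate(days_from_end = as.integer(days_in_month(date) - day(date)))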

Written by Mark Needham

July 12th, 2015 at 9:53 am

Posted in R
