Mark Needham

Thoughts on Software Development

R: data.table – Comparing adjacent rows

with 2 comments

As part of my exploration of the Land Registry price paid data set I wanted to compare the difference between consecutive sales of properties.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row unless it’s the first sale in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))
 
> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000

Next let’s calculate the difference between the two prices:

> blogDT[, diff := price - lag.price]
> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000

Finally let’s order the data table by the biggest price gains:

> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA
Be Sociable, Share!

Written by Mark Needham

September 27th, 2015 at 10:02 pm

Posted in R

Tagged with ,

  • Arun Srinivasan

    Nice! Some comments, again :-).

    1. There’s a new function since v1.9.6 `shift()` take takes `type = “lag”` (default) and `type = “lead”`. With that we can do:

    blogDT[, diff := price – shift(price), by=name] # type ‘lag’ is default

    directly.

    2. While `x[order()]` works fine and internally optimised to use data.table’s fast ordering, it’s not efficient in terms of memory. So if you’d like it to be both fast and memory efficient, use `setorder()` instead (which works on data.frames as well).

    setorder(blogDT, -diff, na.last=TRUE)

    See http://stackoverflow.com/a/29331287/559784 for benchmarks on data large enough to highlight the advantages.

    Thanks for trying data.table.

  • Cool, thanks. Will keep those in mind in future 🙂