# Mark Needham

Thoughts on Software Development

## R: data.table – Comparing adjacent rows

with 2 comments

As part of my exploration of the Land Registry price paid data set I wanted to compare the difference between consecutive sales of properties.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row unless it’s the first sale in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

```> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), price = c(10000, 12500, 18000, 245000, 512000, 1000000))   > blogDT name price 1: Property 1 10000 2: Property 1 12500 3: Property 1 18000 4: Property 2 245000 5: Property 2 512000 6: Property 2 1000000```

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

```> blogDT[, lag.price := c(NA, price[-.N]), by = name]   > blogDT name price lag.price 1: Property 1 10000 NA 2: Property 1 12500 10000 3: Property 1 18000 12500 4: Property 2 245000 NA 5: Property 2 512000 245000 6: Property 2 1000000 512000```

Next let’s calculate the difference between the two prices:

```> blogDT[, diff := price - lag.price] > blogDT name price lag.price diff 1: Property 1 10000 NA NA 2: Property 1 12500 10000 2500 3: Property 1 18000 12500 5500 4: Property 2 245000 NA NA 5: Property 2 512000 245000 267000 6: Property 2 1000000 512000 488000```

Finally let’s order the data table by the biggest price gains:

```> blogDT[order(-diff)] name price lag.price diff 1: Property 2 1000000 512000 488000 2: Property 2 512000 245000 267000 3: Property 1 18000 12500 5500 4: Property 1 12500 10000 2500 5: Property 1 10000 NA NA 6: Property 2 245000 NA NA```
Be Sociable, Share!

Written by Mark Needham

September 27th, 2015 at 10:02 pm

Posted in R

Tagged with ,

• Arun Srinivasan

Nice! Some comments, again :-).

1. There’s a new function since v1.9.6 `shift()` take takes `type = “lag”` (default) and `type = “lead”`. With that we can do:

blogDT[, diff := price – shift(price), by=name] # type ‘lag’ is default

directly.

2. While `x[order()]` works fine and internally optimised to use data.table’s fast ordering, it’s not efficient in terms of memory. So if you’d like it to be both fast and memory efficient, use `setorder()` instead (which works on data.frames as well).

setorder(blogDT, -diff, na.last=TRUE)

See http://stackoverflow.com/a/29331287/559784 for benchmarks on data large enough to highlight the advantages.

Thanks for trying data.table.

• Cool, thanks. Will keep those in mind in future ðŸ™‚