Mark Needham

Thoughts on Software Development

R: data.table – Finding the maximum row

with 2 comments

In my continued playing around with the R data.table package I wanted to find the maximum row based on one of the columns, grouped by another column, and then return back the whole row.

We’ll use the following data table to illustrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                    price = c(10000, 12500, 18000, 245000, 512000, 1000000),
                    date = c("Day 1", "Day 7", "Day 10", "Day 3", "Day 5", "Day 12"))
 
> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT[, diff := price - lag.price]
 
> blogDT
         name   price   date lag.price   diff
1: Property 1   10000  Day 1        NA     NA
2: Property 1   12500  Day 7     10000   2500
3: Property 1   18000 Day 10     12500   5500
4: Property 2  245000  Day 3        NA     NA
5: Property 2  512000  Day 5    245000 267000
6: Property 2 1000000 Day 12    512000 488000

I wanted to find the biggest difference in ‘price’ and ‘lag.price’ grouped by the ‘name’ column.

If we just want to get the max ‘diff’ grouped by ‘name’ it’s quite easy:

> blogDT[!is.na(diff), .(max = max(diff)), keyby = name]
         name    max
1: Property 1   5500
2: Property 2 488000

However now we’ve lost the information about which ‘date’ that was on and what the ‘price’ and ‘lag.price’ were which is a bit annoying.

If we only want to keep the rows which have the highest ‘diff’ grouped by ‘name’, one way to go about this is to add a ‘max.diff’ column to each row and then filter appropriately. e.g.

> maxDT = blogDT[!is.na(diff)][, max := max(diff), by = name]
 
> maxDT
         name   price   date lag.price   diff    max
1: Property 1   12500  Day 7     10000   2500   5500
2: Property 1   18000 Day 10     12500   5500   5500
3: Property 2  512000  Day 5    245000 267000 488000
4: Property 2 1000000 Day 12    512000 488000 488000
 
> maxDT[diff == max]
         name   price   date lag.price   diff    max
1: Property 1   18000 Day 10     12500   5500   5500
2: Property 2 1000000 Day 12    512000 488000 488000

I’ve only been playing with data.table for a few days so if there’s a better way do let me know in the comments.

Be Sociable, Share!

Written by Mark Needham

October 2nd, 2015 at 6:42 pm

Posted in R

Tagged with , ,

  • blogDT[, .SD[which.max(diff)], by=name]
    .SD is explained in ?data.table and for more examples there are currently 1,021 results on Stack Overflow for the search string “[data.table] .SD”.

  • @mattdowle:disqus Oh nice, I have come across .SD but didn’t realise you could use it for this type of filtering. Neat.