25 Sep 2015

R: Querying a 20 million line CSV file - data.table vs data frame

As I mentioned in a couple of blog posts already, I’ve been exploring the Land Registry price paid data set and although I’ve initially been using SparkR I was curious how easy it would be to explore the data set using plain R.

I thought I’d start out by loading the data into a data frame and run the same queries using deployer.

I’ve come across Hadley Wickham’s readr library before but hadn’t used it and since I needed to load a 20 million line CSV file this seemed the perfect time to give it a try.

The goal of readr is to provide a fast and friendly way to read tabular data into R.

Let’s' get started:

> library(readr)

> system.time(read_csv("pp-complete.csv", col_names = FALSE))
   user  system elapsed
127.367  21.957 159.963

> df = read_csv("pp-complete.csv", col_names = FALSE)

So it took a little over 2 minutes to process the CSV file into a data frame. Let’s take a quick look at its contents:

> head(df)
Source: local data frame [6 x 16]

                                      X1     X2     X3       X4    X5    X6    X7    X8    X9
                                   (chr)  (int) (date)    (chr) (chr) (chr) (chr) (chr) (chr)
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000   <NA>  UB5 4PJ     T     N     F   106
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500   <NA> TA19 9DD     D     N     F    58
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000   <NA>   W4 1DZ     F     N     L    58
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000   <NA> NE61 2BH     D     N     F    17
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995   <NA> PE33 0RU     D     N     F     4
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000   <NA> NR35 2SF     T     N     F     5
Variables not shown: X10 (chr), X11 (chr), X12 (chr), X13 (chr), X14 (chr), X15 (chr), address (chr)

Now let’s query the data frame to see which postcode has the highest average sale price. We’ll need to group by the 'X4' column before applying some aggregate functions:

> library(dplyr)

> system.time(df %>%
                group_by(X4) %>%
                summarise(total = sum(as.numeric(X2)), count = n(), ave = total / count) %>%
                arrange(desc(ave)))
   user  system elapsed
122.557   1.135 124.211

Source: local data frame [1,164,396 x 4]

         X4     total count      ave
      (chr)     (dbl) (int)    (dbl)
1   SW7 1DW  39000000     1 39000000
2  SW1W 0NH  32477000     1 32477000
3   W1K 7PX  27000000     1 27000000
4  SW1Y 6HD  24750000     1 24750000
5   SW6 1BA  18000000     1 18000000
6  SW1X 7EE 101505645     6 16917608
7    N6 4LA  16850000     1 16850000
8  EC4N 5AE  16500000     1 16500000
9    W8 7EA  82075000     6 13679167
10  W1K 1DP  13500000     1 13500000

What about if instead of the average price by post code we want to find the most expensive property ever sold instead?

> system.time(df %>% group_by(X4) %>% summarise(max = max(X2)) %>% arrange(desc(max)))

   user  system elapsed
 35.438   0.478  36.026

Source: local data frame [1,164,396 x 2]

         X4      max
      (chr)    (int)
1  SW10 9SU 54959000
2   SW7 1QJ 50000000
3  SW1X 8HG 46013365
4   SW7 1DW 39000000
5  SW1W 0NH 32477000
6  SW1X 7LJ 29350000
7    W8 7EA 27900000
8   SW3 3SR 27750000
9   W1K 7PX 27000000
10 SW1X 7EE 25533000
..      ...      ...

Interestingly that one was much quicker than the first one even though it seems like we only did slightly less work.

At this point I mentioned my experiment to Ashok who suggested I give data.table a try to see if that fared any better. I’d not used it before but was able to get it up and running reasonably quickly:

> library(data.table)

> system.time(fread("pp-complete.csv", header = FALSE))
Read 20075122 rows and 15 (of 15) columns from 3.221 GB file in 00:01:05
   user  system elapsed
 59.324   5.798  68.956

> dt = fread("pp-complete.csv", header = FALSE)

> head(dt)
                                       V1     V2               V3       V4 V5 V6 V7  V8 V9
1: {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N  F 106
2: {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N  F  58
3: {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N  L  58
4: {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N  F  17
5: {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N  F   4
6: {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N  F   5
               V10         V11         V12                          V13            V14 V15
1:    READING ROAD    NORTHOLT    NORTHOLT                       EALING GREATER LONDON   A
2:    ADAMS MEADOW   ILMINSTER   ILMINSTER               SOUTH SOMERSET       SOMERSET   A
3:   WHELLOCK ROAD                  LONDON                       EALING GREATER LONDON   A
4:        WESTGATE     MORPETH     MORPETH               CASTLE MORPETH NORTHUMBERLAND   A
5:   MASON GARDENS  WEST WINCH KING'S LYNN KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6: WILD FLOWER WAY DITCHINGHAM      BUNGAY                SOUTH NORFOLK        NORFOLK   A

So we’ve already gained one minute in the parsing time which is pretty nice. Let’s try and find the postcode with the highest average price:

> dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10]
Error in sum(V2) : invalid 'type' (character) of argument

Hmmm, seems like we need to make column 'V2' numeric. Let’s do that:

> dt = dt[, V2:= as.numeric(V2)]

> dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10]
   user  system elapsed
  5.108   0.670   6.183

          V4       V1
 1:  SW7 1DW 39000000
 2: SW1W 0NH 32477000
 3:  W1K 7PX 27000000
 4: SW1Y 6HD 24750000
 5:  SW6 1BA 18000000
 6: SW1X 7EE 16917608
 7:   N6 4LA 16850000
 8: EC4N 5AE 16500000
 9:   W8 7EA 13679167
10:  W1K 1DP 13500000

That’s quite a bit faster than our data frame version - ~5 seconds compared to ~2 minutes. We have lost the total sales and number of sales columns but I expect that’s just because my data.table foo is weak and we could keep them if we wanted.

But a good start in terms of execution time. Now let’s try the maximum sale price by post code query:

> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed
  3.684   0.358   4.132

          V4       V1
 1: SW10 9SU 54959000
 2:  SW7 1QJ 50000000
 3: SW1X 8HG 46013365
 4:  SW7 1DW 39000000
 5: SW1W 0NH 32477000
 6: SW1X 7LJ 29350000
 7:   W8 7EA 27900000
 8:  SW3 3SR 27750000
 9:  W1K 7PX 27000000
10: SW1X 7EE 25533000

We’ve got the same results as before and this time it took ~4 seconds compared to ~35 seconds.

We can actually do even better if we set the postcode column as a key:

> setkey(dt, V4)

> system.time(dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10])
   user  system elapsed
  1.500   0.047   1.548

> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed
  0.578   0.026   0.604

And that’s as far as I’ve got with my experiment. If there’s anything else I can do to make either of the versions quicker do let me know in the comments.

Oh and for a bit of commentary on what we can learn from the queries...Knightsbridge is a seriously expensive area to live!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.