Mark Needham

Thoughts on Software Development

R: Querying a 20 million line CSV file – data.table vs data frame

with 8 comments

As I mentioned in a couple of blog posts already, I’ve been exploring the Land Registry price paid data set, and although I initially used SparkR I was curious how easy it would be to explore the data set using plain R.

I thought I’d start out by loading the data into a data frame and running the same queries using dplyr.

I’d come across Hadley Wickham’s readr library before but hadn’t used it, and since I needed to load a 20 million line CSV file this seemed the perfect time to give it a try.

The goal of readr is to provide a fast and friendly way to read tabular data into R.

Let’s get started:

> library(readr)
 
> system.time(read_csv("pp-complete.csv", col_names = FALSE))
   user  system elapsed 
127.367  21.957 159.963 
 
> df = read_csv("pp-complete.csv", col_names = FALSE)

So it took a little over 2½ minutes to process the CSV file into a data frame. Let’s take a quick look at its contents:

> head(df)
Source: local data frame [6 x 16]
 
                                      X1     X2     X3       X4    X5    X6    X7    X8    X9
                                   (chr)  (int) (date)    (chr) (chr) (chr) (chr) (chr) (chr)
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000   <NA>  UB5 4PJ     T     N     F   106      
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500   <NA> TA19 9DD     D     N     F    58      
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000   <NA>   W4 1DZ     F     N     L    58      
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000   <NA> NE61 2BH     D     N     F    17      
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995   <NA> PE33 0RU     D     N     F     4      
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000   <NA> NR35 2SF     T     N     F     5      
Variables not shown: X10 (chr), X11 (chr), X12 (chr), X13 (chr), X14 (chr), X15 (chr), address (chr)

Now let’s query the data frame to see which postcode has the highest average sale price. We’ll need to group by the ‘X4’ column before applying some aggregate functions:

> library(dplyr)
 
> system.time(df %>% 
                group_by(X4) %>% 
                summarise(total = sum(as.numeric(X2)), count = n(), ave = total / count) %>%
                arrange(desc(ave)))
   user  system elapsed 
122.557   1.135 124.211 
 
Source: local data frame [1,164,396 x 4]
 
         X4     total count      ave
      (chr)     (dbl) (int)    (dbl)
1   SW7 1DW  39000000     1 39000000
2  SW1W 0NH  32477000     1 32477000
3   W1K 7PX  27000000     1 27000000
4  SW1Y 6HD  24750000     1 24750000
5   SW6 1BA  18000000     1 18000000
6  SW1X 7EE 101505645     6 16917608
7    N6 4LA  16850000     1 16850000
8  EC4N 5AE  16500000     1 16500000
9    W8 7EA  82075000     6 13679167
10  W1K 1DP  13500000     1 13500000

What if, instead of the average price by postcode, we want to find the most expensive property ever sold?

> system.time(df %>% group_by(X4) %>% summarise(max = max(X2)) %>% arrange(desc(max)))
 
   user  system elapsed 
 35.438   0.478  36.026 
 
Source: local data frame [1,164,396 x 2]
 
         X4      max
      (chr)    (int)
1  SW10 9SU 54959000
2   SW7 1QJ 50000000
3  SW1X 8HG 46013365
4   SW7 1DW 39000000
5  SW1W 0NH 32477000
6  SW1X 7LJ 29350000
7    W8 7EA 27900000
8   SW3 3SR 27750000
9   W1K 7PX 27000000
10 SW1X 7EE 25533000
..      ...      ...

Interestingly, that one was much quicker than the first query even though it seems like we did only slightly less work. My guess is that the first query pays for coercing X2 to numeric on every row and for computing three aggregate columns, whereas here we only take a max over the integer column.

At this point I mentioned my experiment to Ashok, who suggested I give data.table a try to see if it fared any better. I’d not used it before but was able to get it up and running reasonably quickly:

> library(data.table)
 
> system.time(fread("pp-complete.csv", header = FALSE))
Read 20075122 rows and 15 (of 15) columns from 3.221 GB file in 00:01:05
   user  system elapsed 
 59.324   5.798  68.956 
 
> dt = fread("pp-complete.csv", header = FALSE)
 
> head(dt)
                                       V1     V2               V3       V4 V5 V6 V7  V8 V9
1: {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N  F 106   
2: {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N  F  58   
3: {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N  L  58   
4: {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N  F  17   
5: {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N  F   4   
6: {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N  F   5   
               V10         V11         V12                          V13            V14 V15
1:    READING ROAD    NORTHOLT    NORTHOLT                       EALING GREATER LONDON   A
2:    ADAMS MEADOW   ILMINSTER   ILMINSTER               SOUTH SOMERSET       SOMERSET   A
3:   WHELLOCK ROAD                  LONDON                       EALING GREATER LONDON   A
4:        WESTGATE     MORPETH     MORPETH               CASTLE MORPETH NORTHUMBERLAND   A
5:   MASON GARDENS  WEST WINCH KING'S LYNN KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6: WILD FLOWER WAY DITCHINGHAM      BUNGAY                SOUTH NORFOLK        NORFOLK   A

So we’ve already gained about a minute and a half in parsing time, which is pretty nice. Let’s try to find the postcode with the highest average price:

> dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10]
Error in sum(V2) : invalid 'type' (character) of argument

Hmmm, seems like we need to make column ‘V2’ numeric. Let’s do that:

> dt = dt[, V2:= as.numeric(V2)]
 
> system.time(dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10])
   user  system elapsed 
  5.108   0.670   6.183 
 
          V4       V1
 1:  SW7 1DW 39000000
 2: SW1W 0NH 32477000
 3:  W1K 7PX 27000000
 4: SW1Y 6HD 24750000
 5:  SW6 1BA 18000000
 6: SW1X 7EE 16917608
 7:   N6 4LA 16850000
 8: EC4N 5AE 16500000
 9:   W8 7EA 13679167
10:  W1K 1DP 13500000

That’s quite a bit faster than our data frame version – ~5 seconds compared to ~2 minutes. We have lost the total sales and number of sales columns, but I expect that’s just because my data.table foo is weak and we could keep them if we wanted.
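For example, a sketch along those lines, using data.table’s .N symbol for the per-group row count (I haven’t verified this against the full data set), might look like this:

> # untested sketch: keep total and count alongside the average
> dt[, .(total = sum(V2), count = .N), by=V4][, ave := total / count][order(-ave)][1:10]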

But it’s a good start in terms of execution time. Now let’s try the maximum sale price by postcode query:

> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed 
  3.684   0.358   4.132 
 
          V4       V1
 1: SW10 9SU 54959000
 2:  SW7 1QJ 50000000
 3: SW1X 8HG 46013365
 4:  SW7 1DW 39000000
 5: SW1W 0NH 32477000
 6: SW1X 7LJ 29350000
 7:   W8 7EA 27900000
 8:  SW3 3SR 27750000
 9:  W1K 7PX 27000000
10: SW1X 7EE 25533000

We’ve got the same results as before and this time it took ~4 seconds compared to ~35 seconds.

We can actually do even better if we set the postcode column as a key (setkey physically sorts the table by the key column, which speeds up grouping on it):

> setkey(dt, V4)
 
> system.time(dt[,list(length(V2), sum(V2)), by=V4][, V2 / V1, by=V4][order(-V1)][1:10])
   user  system elapsed 
  1.500   0.047   1.548 
 
> system.time(dt[,list(max(V2)), by=V4][order(-V1)][1:10])
   user  system elapsed 
  0.578   0.026   0.604

And that’s as far as I’ve got with my experiment. If there’s anything else I can do to make either of the versions quicker, do let me know in the comments.

Oh, and for a bit of commentary on what we can learn from the queries… Knightsbridge is a seriously expensive area to live in!

Written by Mark Needham

September 25th, 2015 at 6:28 am

Posted in R

  • Arun Srinivasan

    Nice post. Here are some observations and comments:

    1. Did you notice that the entire `X3` column is NA? Any ideas why?

    2. It’s useful to mention the version of the package you use. I’m not sure if you’re using v1.9.4 of data.table or the recently released v1.9.6. I’m also not sure why `V2` is read as character… it would be nice to know the version.

    3. In the `dplyr` code, you convert `V2` to numeric for each group. That’s a bit inefficient. I’d suggest moving it up and converting to numeric just once (as you’ve done in the data.table code). In essence, the dplyr code should be:

    df %>% mutate(V2 = as.numeric(V2)) %>% group_by(V4) %>% summarise(total = sum(V2), count = n(), ave = total / count) %>% arrange(desc(ave))

    4. In `data.table`, the equivalent of `n()` is `.N` (special symbol). Also you’d need to use `keyby=V4` instead of `by=V4` for the first aggregation to get the result *identical* to dplyr’s, as ‘summarise’ always returns a sorted result.

    In addition, you don’t need to group by “V4” again to get the average. Also, you’d have to name the columns similarly to get the results identical. In essence, the data.table code would be:

    dt[, .(total = sum(V2), count = .N), keyby=V4][, ave := total / count][order(-ave)]

    With this, I get `TRUE` for `identical(ans1, ans2)`.

    And the corresponding run times are: 88s (dplyr) vs 4.5s.

    You’d have to `keyby` on the 2nd task to get the results identical again. With that I get 37s (dplyr) vs 4.5s.
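    A sketch of that keyby version of the second query (same column names as above; I haven’t timed this variant) would be:

    dt[, .(max = max(V2)), keyby=V4][order(-max)][1:10]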

    These are all minor points though. Thanks for the comparison.

  • Antonios K.

    The data.table process is very fast already, but I’ve made some of my dplyr processes, with 10+ million rows, faster just by using factor variables instead of character. If you use the postcodes as a factor variable you might be able to create a faster “group by”. It will also be faster if you need to arrange rows by postcode (even if you don’t need that here). It would be interesting to try and compare using the dplyr syntax with a data.table structured dataset in the background.
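    A minimal sketch of that idea against the post’s data frame (hypothetical, and untested on the full file) would be:

    df = df %>% mutate(X4 = as.factor(X4))   # convert postcode to factor once
    system.time(df %>% group_by(X4) %>% summarise(max = max(X2)) %>% arrange(desc(max)))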

  • @arun_sriniv:disqus thanks for the tips! I’m giving them a try and will write up what I find from trying your suggestions.

  • Arun Srinivasan, some more observations:

    > 1. Did you notice that the entire `X3` column is NA? Any ideas why?

    Yeah, that’s the date column. It seemed to get parsed correctly by fread but not by read_csv. Not sure why that is!

    > 2. It’s useful to mention the version of the package you use.

    Here’s my session info:

    [1] lubridate_1.3.3 SparkR_1.5.0 data.table_1.9.6 dplyr_0.4.3 readr_0.1.1

    > 3. In `dplyr` code, you convert `V2` to numeric type for each group. It’s a bit inefficient.

    Cool thanks for the tip. I’ll refactor that bit.

    > In essence, the data.table code would be:

    > dt[, .(total = sum(V2), count = .N), keyby=V4][, ave := total / count][order(-ave)]

    I was trying to work out what the equivalent of n() was but my searching foo was weak 🙂 It’s sometimes quite tricky working out what to search for when trying to solve problems using data.table and dplyr.

  • Arun Srinivasan

    Mark, thanks for the followup. I’m quite surprised by the error message you got:

    Error in sum(V2) : invalid 'type' (character) of argument

    It reads in as integer type, but the problem is that the values are large enough to overflow when summed. So we’d need to convert to numeric (double) type, but why ‘character’?! Are you still able to reproduce this error?
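    To illustrate the overflow (R integers are 32-bit, so any sum past .Machine$integer.max comes back as NA with a warning):

    sum(c(2000000000L, 2000000000L))            # NA, with an integer overflow warning
    sum(as.numeric(c(2000000000, 2000000000)))  # 4e+09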

    I agree it’s not always easy to know what to search for. The CRAN version of data.table (v1.9.6) already has four new vignettes (improperly named at the moment on CRAN, but that’s fixed already) – https://cran.r-project.org/web/packages/data.table/index.html – which explain some of the special symbols.

    The other vignettes planned are here: https://github.com/Rdatatable/data.table/issues/944

  • @arun_sriniv:disqus good question! I’ll see if I can reproduce that again. And thanks for the link to the vignettes – this one is great 🙂 https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html

  • Arun Srinivasan

    Yes please. And thanks for the kind words :-).

  • I tried again and I get a slightly different error this time:

    > dt[,sum(V2), by=V4][order(desc(V1))]
    Error in gsum(V2) :
      Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)