R: dplyr - Update rows with earlier/previous rows values
Recently I had a data frame which contained a column which had mostly empty values:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA))
col1 col2
1 1 a
2 2 <NA>
3 3 <NA>
4 4 b
5 5 <NA>
I wanted to fill in the NA values with the last non NA value from that column. So I want the data frame to look like this:
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
I spent ages searching around before I came across the na.locf function in the zoo library which does the job:
library(zoo)
library(dplyr)
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA)) %>%
do(na.locf(.))
col1 col2
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
This will fill in the missing values for every column, so if we had a third column with missing values it would populate those too:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>%
do(na.locf(.))
col1 col2 col3
1 1 a A
2 2 a A
3 3 a B
4 4 b B
5 5 b B
If we only want to populate 'col2' and leave 'col3' as it is we can apply the function specifically to that column:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>%
mutate(col2 = na.locf(col2))
col1 col2 col3
1 1 a A
2 2 a <NA>
3 3 a B
4 4 b <NA>
5 5 b <NA>
It’s quite a neat function and certainly comes in helpful when cleaning up data sets which don’t tend to be as uniform as you’d hope!
About the author
I'm currently working on real-time user-facing analytics with Apache Pinot at StarTree. I previously worked on graph analytics at Neo4j, where I also I co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.