I wanted to create a post here with rather selfish intentions, honestly. This is a list of tips and code snippets that I reference often, either from my own code or from Stack Overflow or my own code based on Stack Overflow posts. I’m planning to potentially add to this post as time goes on. It is my “Cookbook” so-to-speak.
Warning: This is not particularly intended for beginner data.table users
If you do happen to be a beginner, I highly suggest taking a look at their training on datacamp.com, which is really the best way to get a feel for the syntax and brings you into ‘advanced’ data.table. There is also a handy cheat sheet, although I wish the cheat sheet was an RMarkdown document! Even after completing this, I do not feel like an expert, which is why I often find myself on Stack Overflow.
Tips and Tricks List:
Ordering data by multiple columns using frank()
and setorder()
In my work I often find myself with a collection of records where a single record can have multiple values. Sometimes you want to rank your whole dataset or sometimes you might want to rank by group. First tip is if you do need the rank values for your entire table, the frank()
function provided with the data.table
package will do the job. It is pretty fast, even when ranking by groups over a million rows. In the exa
An example that I might face can be seen in the below snippet of code. Even if I increased the sample to 100M rows, I am able to run the below code on my bottom of the line MacBook Pro.
library(data.table)
set.seed(1234)
vals <- sample(1000000, replace = T)
dt1 <- data.table(A = rep(LETTERS[1:10], 1000000),
B = vals)
dt1
## A B
## 1: A 761680
## 2: B 630678
## 3: C 304108
## 4: D 932745
## 5: E 295846
## ---
## 9999996: F 895094
## 9999997: G 367164
## 9999998: H 493172
## 9999999: I 454539
## 10000000: J 50808
system.time(dt1[, B_rank := frank(B), A])
## user system elapsed
## 0.664 0.099 0.857
# Increase to 10M
set.seed(1234)
vals <- sample(10000000, replace = T)
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
B = vals)
dt2
## A B
## 1: A 1810256
## 2: B 6595564
## 3: C 7224201
## 4: D 8684454
## 5: E 9407364
## ---
## 99999996: F 1262113
## 99999997: G 9788484
## 99999998: H 1672373
## 99999999: I 1824994
## 100000000: J 3949637
system.time(dt2[, B_rank := frank(B), A])
## user system elapsed
## 9.123 2.502 15.627
# A B B_rank
# 1: A 1137035 1139316
# 2: B 6222995 6218296
# 3: C 6092748 6094866
# 4: D 6233795 6227806
With that said, I often time find myself not needing a column to rank things on but rather just need to order by or extract rows with the highest one or two values of a group in a column. An example might be that I have a lot of duplication in my dataset, but there is one column that might have a value that may determine the ‘best’ selection of the duplicates. That’s where I find setorder()
to be a better choice for me.
set.seed(1234)
vals <- sample(1000000, replace = T)
dt1 <- data.table(A = rep(LETTERS[1:10], 1000000),
B = vals)
system.time(setorder(dt1, A, B))
## user system elapsed
## 0.814 0.044 0.973
# Increase to 10M
set.seed(1234)
vals <- sample(10000000, replace = T)
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
B = vals)
system.time(setorder(dt2, A, B))
## user system elapsed
## 11.397 1.305 17.241
I just used simple benchmarking here but you can see particularly for the bigger dataset that the ordering is much faster than creating the column with rank values. From here you can add your own primitive rank or you could extract the top row(s) by group.
# You can add a primitive version of a rank, and altogether the operation would be faster than frank()
# that said, it does not handle ties in any particularly good way.
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
B = vals)
system.time(setorder(dt2, A, B)[, B_rank := seq(1:.N), A])
## user system elapsed
## 12.322 1.348 15.630
dt2
# Extract top 1 row
dt2 <- data.table(A = rep(LETTERS[1:10], 10000000),
B = vals)
dt2[dt2[, .I[1], A]$V1]
## A B
## 1: A 1810256
## 2: B 6595564
## 3: C 7224201
## 4: D 8684454
## 5: E 9407364
## 6: F 6417511
## 7: G 4539471
## 8: H 871806
## 9: I 249732
## 10: J 9759272
system.time(dt2[dt2[, .I[1], A]$V1])
## user system elapsed
## 2.643 0.830 3.705
# OR
system.time(dt2[dt2[, head(.I, 1), A]$V1]) #better if picking more than 1 (in my opinion!)
## user system elapsed
## 2.904 1.773 7.370
For a quick explanation: the dt2[, .I[1], A]$V1
expression which is inside the dt2[]
i argument is returning the first row number for each group in column A. dt2[dt2[, head(.I, 1), A]$V1]
is a good option, PARTICULARLY if you are not sure if choosing the top 5 rows per group and you are not sure if each group has 5 values. BTW, the .I
notation I will go over in a tip further down.
Fill NAs in a data.table using set()
There are times where you need to fill the NAs in a data.table with a certain value, like a 0 or an empty string. I often refer to this SO answer .
f_dowle3 = function(DT) {
# either of the following for loops
# by name :
for (j in names(DT))
set(DT,which(is.na(DT[[j]])),j,0)
# or by number (slightly faster than by name) :
for (j in seq_len(ncol(DT)))
set(DT,which(is.na(DT[[j]])),j,0)
}
This is also a general reminder that set()
is a great function in data.table, especially in cases where the :=
notation is tricky for some reason or when the data.table is particularly large.
Writing functions/procedures containing data.tables – Sometimes modifying the input data.table is OK, but be sure you know exactly what you’re doing!
There are times in my work where I find myself applying some series of transformations on multiple data.table objects (you might wonder why I have multiple data.tables with similar information, but that is usually a side effect of me generally getting datasets ready for Shiny dashboards). Sometimes, I can put the data.tables through a for loop, sometimes I might write functions that operate on the original data.table – where the “output” is basically a modified version of the input. I’m still not sure which is the best, but to me it has been relieving to know that it’s not out of question to have some side effect on the input data
What the heck is .I
?
Coming soon
What the heck is CJ()
?
Coming soon