I am a self-confessed convert to the tidyverse way of doing data import, tidying and analysis in R, making use of packages such as stringr, dplyr and purrr every day. However, I have recently been running into significant performance problems with larger data sets, particularly when using dplyr to aggregate across many groups.
I have been aware of the data.table package, and the performance benefits it can provide, for some time, but until a recent project I hadn't felt the need to use it. The project was with a major university in the UK, where we were trying to understand student engagement with the university's Virtual Learning Environment (VLE). I was bringing together and analysing multiple data sets of millions of rows each, where each row represented a single access to the VLE by a student.
Working with dplyr to group and aggregate this data, I found that performance was prohibitively slow, so I implemented a data.table solution and was seriously impressed with the speed-up I achieved. In this post I'll present a few examples to show the increased performance, even on relatively simple tasks such as counts.
The data set I used for these examples contained 3 full years of VLE data, which is 18.6 million rows of 17 columns (2.4 GB in memory).
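The original analysis code isn't reproduced here, but for anyone who wants to try a similar comparison, a synthetic data set of roughly the same shape can be generated as below. This is only a sketch: the column names student_id, access_date and academic_year are hypothetical, and only the columns used in the later examples are simulated.

```r
library(data.table)

# Rough stand-in for the real VLE logs: ~18.6 million rows, one row per
# access, with a student identifier, an access date and a derived year.
set.seed(1)
n_rows <- 18.6e6
dates  <- seq(as.Date("2015-01-01"), by = "day", length.out = 987)

vle_df <- data.frame(
  student_id  = sample(sprintf("S%05d", 1:46000), n_rows, replace = TRUE),
  access_date = sample(dates, n_rows, replace = TRUE)
)
vle_df$academic_year <- format(vle_df$access_date, "%Y")  # 3 distinct years

# data.table copy for the data.table version of each example
vle_dt <- as.data.table(vle_df)
```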
Example 1 - group by year and student number and count to calculate the number of VLE uses per student per year
Here there were 3 possible values for Year but over 46,000 unique student numbers. Using data.table we found our answer in around half a second, whereas dplyr took almost 4 seconds. This speed-up may not seem that important, but the next example shows where really significant performance gains are possible.
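The post doesn't include the original code, but the two approaches look roughly like this (a sketch using the hypothetical columns above; `.N` is data.table's row count within each group):

```r
library(dplyr)
library(data.table)

# dplyr: group by year and student number, then count rows per group
uses_per_year_dplyr <- vle_df %>%
  group_by(academic_year, student_id) %>%
  summarise(n_uses = n(), .groups = "drop")

# data.table: the same aggregation expressed with .N and by =
uses_per_year_dt <- vle_dt[, .(n_uses = .N), by = .(academic_year, student_id)]
```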
Example 2 - group by date and student number and count to calculate the number of VLE uses per student per day
In this case the increased performance is particularly significant: an operation that took over 1 minute with dplyr completed in less than 2 seconds with data.table. The number of groups is large in both dimensions here, with 987 distinct days of data and, as before, over 46,000 unique student numbers.
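Again as a sketch, the daily counts differ from Example 1 only in the grouping columns, and wrapping each call in system.time() gives a rough timing comparison:

```r
library(dplyr)
library(data.table)

# dplyr: uses per student per day
system.time(
  uses_per_day_dplyr <- vle_df %>%
    group_by(access_date, student_id) %>%
    summarise(n_uses = n(), .groups = "drop")
)

# data.table: the same grouping and count
system.time(
  uses_per_day_dt <- vle_dt[, .(n_uses = .N), by = .(access_date, student_id)]
)
```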
I highly recommend the data.table package and am confident it will improve your productivity, particularly with large data sets that split into many large groups. I have also heard that joins in data.table can be much quicker than in dplyr, but I have not investigated this yet.