Reading Data Files Quickly in R
Overview
Base R provides read.csv() for reading CSV data, but for practical analysis its performance is so poor that it is not worth using. As an alternative, read_csv() from the readr package is highly recommended. read_csv() is written in C++ and can read CSV files at very high speed.
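As a minimal sketch of how the two functions line up in everyday use, the snippet below writes a tiny CSV to a temporary file and reads it with both. The file contents and column names (id, name, score) are made up for illustration and are not from the original data. One difference to keep in mind is that read.csv() returns a plain data.frame, while read_csv() returns a tibble and prints a message about the column types it guessed.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
# The columns here are hypothetical, chosen only for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a,0.5", "2,b,1.5", "3,c,2.5"), path)

# Base R reader: returns a data.frame.
base_df <- read.csv(path, header = TRUE)

# readr reader: returns a tibble (which is also a data.frame).
readr_df <- read_csv(path)

# Both readers should agree on the shape and the column names.
dim(base_df)
dim(readr_df)
names(readr_df)
```

In most scripts the switch amounts to changing the function name, plus renaming header to col_names when the file has no header row.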
Code
The following measures the time taken by read.csv() and read_csv():
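link <- "../attachment/cfile1.uf@992EC4385D6B6D21123DAE.csv"
# Time base R's reader (the file has no header row)
system.time({
  x <- read.csv(link, header = FALSE)
})
library(readr)
# Time readr's reader on the same file, also treating the first row as data
system.time({
  y <- read_csv(link, col_names = FALSE)
})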
Even including the time to download, a speed difference of more than tenfold can be observed. The file used in the example, dummy.csv, is only about 10,000 lines and 1.1MB in size, but it is actually a greatly reduced version of an 80MB original dataset. It was cut down because read.csv() was so slow on the original that there was no way to tell whether it was still running or had crashed midway. The original data could probably be read eventually given enough time, but the patience for that was lacking, so the file was reduced drastically.
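Since dummy.csv itself may not be at hand, the comparison can be approximated with synthetic data. The sketch below is only an assumption-laden stand-in: it generates a hypothetical 10,000-row, 20-column file of random numbers (not the original dataset), so the timings it prints will differ from the ones discussed above and will vary by machine and readr version.

```r
library(readr)

# Hypothetical stand-in for dummy.csv: 10,000 rows of random numbers, no header row.
set.seed(1)
n_rows <- 10000
n_cols <- 20
dummy <- as.data.frame(matrix(round(rnorm(n_rows * n_cols), 4), nrow = n_rows))
path <- tempfile(fileext = ".csv")
write.table(dummy, path, sep = ",", row.names = FALSE, col.names = FALSE)

# Time each reader on the same file; "elapsed" is wall-clock time in seconds.
t_base <- system.time(x <- read.csv(path, header = FALSE))[["elapsed"]]
t_readr <- system.time(y <- read_csv(path, col_names = FALSE))[["elapsed"]]

# Sanity check: both readers should return the same number of rows and columns.
stopifnot(identical(dim(x), dim(y)))

# Print the two timings side by side.
c(read.csv = t_base, read_csv = t_readr)
```

On a file this small the gap may be modest. Increasing n_rows is one way to see how the two readers scale toward something closer to the 80MB original.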