
Practical limits of R data frame


R is well suited to large data sets, but you may have to change your way of working somewhat from what the introductory textbooks teach. I did a post on Big Data for R that crunches a 30 GB data set, which you may find useful for inspiration.

The usual starting points are the CRAN High-Performance Computing Task View and the R-SIG HPC mailing list.

The main limit you have to work around is a historic cap on vector length of 2^31 - 1 elements, which would not be so bad if R did not also store matrices as vectors. (The limit is there for compatibility with some BLAS libraries.)
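For concreteness, here is a small sketch of that ceiling (the numbers are illustrative; nothing large is actually allocated):

    .Machine$integer.max                  # 2147483647, i.e. 2^31 - 1
    2^31 - 1 == .Machine$integer.max      # TRUE

    # A matrix is just a vector with a dim attribute, so nrow * ncol is capped too
    m <- matrix(1:6, nrow = 2)
    attributes(m)$dim                     # 2 3
    length(m)                             # 6 -- the length of the underlying vector

    # A 50,000 x 50,000 matrix already exceeds the ceiling
    50000 * 50000 > .Machine$integer.max  # TRUE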

We regularly analyse telco call data records and multi-million-customer marketing databases using R, so we would be happy to talk further if you are interested.


The physical limits arise from the use of 32-bit indexes on vectors, so vectors are limited to 2^31 - 1 elements. Matrices are vectors with dimensions, so the product of nrow(mat) and ncol(mat) must stay within 2^31 - 1. Data frames and lists are generic vectors, so each component can hold up to 2^31 - 1 entries, which for data frames means you can have that many rows and that many columns. For lists, you can have 2^31 - 1 components, each with 2^31 - 1 elements. This is drawn from a recent posting by Duncan Murdoch in reply to a question on R-Help.
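As a small illustration of that structure (a toy example; the limits themselves are of course not exercised here), a data.frame is a list of equal-length column vectors, and a list is itself a generic vector:

    df <- data.frame(x = 1:3, y = c("a", "b", "c"))
    is.list(df)     # TRUE: a data.frame is a list of columns
    length(df)      # 2 columns, each a vector capped at 2^31 - 1 elements
    nrow(df)        # 3 rows, i.e. the length of each column vector

    lst <- list(a = 1:5, b = letters)
    length(lst)     # components count against the same 2^31 - 1 ceiling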

All of that has to fit in RAM with standard R, so memory may be the more pressing limit, but the High-Performance Computing Task View that others have mentioned contains details of packages that can circumvent the in-memory issues.


1) The R Data Import/Export manual should be the first port of call for questions about importing data: there are many options, and what will work for you could be very specific.

http://cran.r-project.org/doc/manuals/R-data.html

read.table in particular shows greatly improved performance if you use the options it provides, particularly colClasses, comment.char, and nrows; otherwise this information has to be inferred from the data itself, which can be costly.
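A hedged sketch of the kind of call this means; the file name, separator, and column types here are assumptions for illustration only:

    dat <- read.table(
      "big_file.txt",            # hypothetical file
      header       = TRUE,
      sep          = ",",
      colClasses   = c("integer", "numeric", "character"),  # skip per-column type guessing
      comment.char = "",         # disable comment scanning
      nrows        = 1000000     # a rough upper bound on rows helps allocation
    )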

2) There is a specific limit on the length (total number of elements) of any vector, matrix, array, column in a data.frame, or list. It is due to a 32-bit index used under the hood and applies to both 32-bit and 64-bit R. The number is 2^31 - 1. That is also the maximum number of rows for a data.frame, but it is so large that you are far more likely to run out of memory with even a single vector before you start collecting several of them.

See help("Memory-limits") and help("Memory") for details.

A single vector of that length would take many gigabytes of memory (it depends on the type and storage mode of the vector: roughly 17 GB for a numeric vector), so it is unlikely to be the binding limit unless you are really pushing things. If you really do need to push past the available system memory (64-bit is mandatory here), then the standard database techniques discussed in the Import/Export manual, or memory-mapped file options such as the ff package, are worth considering. The CRAN High Performance Computing Task View is a good resource for this end of things.
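To put a rough number on that, and to show what the memory-mapped route looks like, here is a minimal sketch; the ff call assumes the ff package is installed and uses a made-up file name:

    # Back-of-the-envelope memory for a maximal-length numeric (8-byte) vector
    (2^31 - 1) * 8 / 1024^3       # about 16 GiB (~17 GB in decimal units)

    # Keeping the data on disk instead, memory-mapped via the ff package
    library(ff)
    big <- read.csv.ffdf(file = "big_file.csv", header = TRUE)  # "big_file.csv" is hypothetical
    dim(big)                      # behaves much like an ordinary data.frame
    class(big)                    # "ffdf"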

Finally, if you have stacks of RAM (16 GB or more) and need 64-bit indexing, it may arrive in a future release of R. http://www.mail-archive.com/r-help@r-project.org/msg92035.html

Also, Ross Ihaka discusses some of the historical decisions and future directions for an R-like language in papers and talks here: http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks