
Practical limits of R data frame


R is well suited to large data sets, but you may have to change your way of working somewhat from what the introductory textbooks teach. I did a post on Big Data for R that crunches a 30 GB data set, which you may find useful for inspiration.

The usual starting points are the CRAN High-Performance Computing Task View and the R-SIG HPC mailing list.

The main limit you have to work around is a historic cap on vector length of 2^31 - 1 elements, which would not be so bad if R did not also store matrices as vectors. (The limit is there for compatibility with some BLAS libraries.)
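For concreteness, here is a small sketch of that ceiling (the numbers are illustrative; nothing large is actually allocated):

    .Machine$integer.max                  # 2147483647, i.e. 2^31 - 1
    2^31 - 1 == .Machine$integer.max      # TRUE

    # A matrix is just a vector with a dim attribute, so nrow * ncol is capped too
    m <- matrix(1:6, nrow = 2)
    attributes(m)$dim                     # 2 3
    length(m)                             # 6 -- the length of the underlying vector

    # A 50,000 x 50,000 matrix already exceeds the ceiling
    50000 * 50000 > .Machine$integer.max  # TRUE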

We regularly analyse telco call data records and multi-million-customer marketing databases using R, so we would be happy to talk further if you are interested.


The physical limits arise from the use of 32-bit indexes on vectors, so vectors are limited to 2^31 - 1 elements. Matrices are vectors with dimensions, so the product of nrow(mat) and ncol(mat) must stay within 2^31 - 1. Data frames and lists are generic vectors, so each component can hold up to 2^31 - 1 entries, which for data frames means you can have that many rows and that many columns. For lists, you can have 2^31 - 1 components, each with 2^31 - 1 elements. This is drawn from a recent posting by Duncan Murdoch in reply to a question on R-Help.
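As a small illustration of that structure (a toy example; the limits themselves are of course not exercised here), a data.frame is a list of equal-length column vectors, and a list is itself a generic vector:

    df <- data.frame(x = 1:3, y = c("a", "b", "c"))
    is.list(df)     # TRUE: a data.frame is a list of columns
    length(df)      # 2 columns, each a vector capped at 2^31 - 1 elements
    nrow(df)        # 3 rows, i.e. the length of each column vector

    lst <- list(a = 1:5, b = letters)
    length(lst)     # components count against the same 2^31 - 1 ceiling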

All of that has to fit in RAM with standard R, so memory may be the more pressing limit, but the High-Performance Computing Task View that others have mentioned contains details of packages that can circumvent the in-memory issues.


1) The R Data Import/Export manual should be the first port of call for questions about importing data: there are many options, and what will work for you could be very specific.

http://cran.r-project.org/doc/manuals/R-data.html

read.table in particular shows greatly improved performance if you use the options it provides, particularly colClasses, comment.char, and nrows; otherwise this information has to be inferred from the data itself, which can be costly.
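A hedged sketch of the kind of call this means; the file name, separator, and column types here are assumptions for illustration only:

    dat <- read.table(
      "big_file.txt",            # hypothetical file
      header       = TRUE,
      sep          = ",",
      colClasses   = c("integer", "numeric", "character"),  # skip per-column type guessing
      comment.char = "",         # disable comment scanning
      nrows        = 1000000     # a rough upper bound on rows helps allocation
    )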

2) There is a specific limit on the length (total number of elements) of any vector, matrix, array, column in a data.frame, or list. It is due to a 32-bit index used under the hood and applies to both 32-bit and 64-bit R. The number is 2^31 - 1. That is also the maximum number of rows for a data.frame, but it is so large that you are far more likely to run out of memory with even a single vector before you start collecting several of them.

See help("Memory-limits") and help("Memory") for details.

A single vector of that length would take many gigabytes of memory (it depends on the type and storage mode of the vector: roughly 17 GB for a numeric vector), so it is unlikely to be the binding limit unless you are really pushing things. If you really do need to push past the available system memory (64-bit is mandatory here), then the standard database techniques discussed in the Import/Export manual, or memory-mapped file options such as the ff package, are worth considering. The CRAN High Performance Computing Task View is a good resource for this end of things.
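To put a rough number on that, and to show what the memory-mapped route looks like, here is a minimal sketch; the ff call assumes the ff package is installed and uses a made-up file name:

    # Back-of-the-envelope memory for a maximal-length numeric (8-byte) vector
    (2^31 - 1) * 8 / 1024^3       # about 16 GiB (~17 GB in decimal units)

    # Keeping the data on disk instead, memory-mapped via the ff package
    library(ff)
    big <- read.csv.ffdf(file = "big_file.csv", header = TRUE)  # "big_file.csv" is hypothetical
    dim(big)                      # behaves much like an ordinary data.frame
    class(big)                    # "ffdf"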

Finally, if you have stacks of RAM (16 GB or more) and need 64-bit indexing, it may arrive in a future release of R. http://www.mail-archive.com/r-help@r-project.org/msg92035.html

Also, Ross Ihaka discusses some of the historical decisions and future directions for an R-like language in papers and talks here: http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks