Recommended package for very large dataset processing and machine learning in R [closed]

r machine-learning signal-processing bigdata

Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics, and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler Oehlschlägel and colleagues' excellent slide presentations on the ff website.

Also, consider storing data in a database and reading in smaller batches for analysis. There are likely any number of approaches to consider. To get started, consdier looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.

And do investigate the other packages on the high-performance computing task view and mentioned in the other answers. The packages I mention above are simply the ones I've happened to have more experience with.

r machine-learning signal-processing bigdata

I think the amount of data you can process is more limited by ones programming skills than anything else. Although a lot of standard functionality is focused on in memory analysis, cutting your data into chunks already helps a lot. Ofcourse, this takes more time to program than picking up standard R code, but often times it is quite possible.

Cutting up data can for exale be done using read.table or readBin which support only reading a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages which deliver out of the box out of memory functionality. You could also put your data in a database. For spatial raster data, the excellent raster package provides out of memory analysis.

r machine-learning signal-processing bigdata

For machine learning tasks I can recommend using biglm package, used to do "Regression for data too large to fit in memory". For using R with really big data, one can use Hadoop as a backend and then use package rmr to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.

CodeHunter

Recommended package for very large dataset processing and machine learning in R [closed]

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last