
Comparing R to Matlab for Data Mining


For the past three years or so, I have used R daily, and the largest portion of that daily use is spent on Machine Learning/Data Mining problems.

I was an exclusive MATLAB user while in University; at the time I thought it was an excellent toolset and platform. I am sure it is today as well.

The Neural Network Toolbox, the Optimization Toolbox, Statistics Toolbox, and Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/Data Mining work, yet they are all separate from the base MATLAB environment--in other words, they have to be purchased separately.

My Top 5 list for Learning ML/Data Mining in R:

  • The arules family of Packages

This refers to a couple of things: first, a group of R Packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the Project Homepage. Second, all of these Packages are based on a data-mining technique known as Market Basket Analysis, also known as Association Rules. In many respects, this family of algorithms is the essence of data mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R Package in the set mentioned above is called arules; on the CRAN Package page for arules, you will find links to a couple of excellent secondary sources (vignettes, in R's lexicon) on the arules Package and on the Association Rules technique in general.
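To make this concrete, here is a minimal sketch of mining Association Rules with arules, using the Groceries transaction data set that ships with the Package (the support and confidence thresholds are arbitrary illustrative values):

    library(arules)
    data("Groceries")                 # example transaction database shipped with arules
    # exhaustively search for rules above the given support/confidence thresholds
    rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
    # show the three rules with the highest lift (i.e., the strongest associations)
    inspect(head(sort(rules, by = "lift"), 3))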

  • The Elements of Statistical Learning (ESL)

The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked just above), all data sets used in ESL are available for free download. (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics--e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering); numerous flavors of regression; CART; Bayesian techniques; as well as model aggregation techniques (Boosting, Bagging) and model tuning (regularization). Finally, get the R Package that accompanies the book from CRAN (which will save you the trouble of having to download and enter the data sets).
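The companion Package is ElemStatLearn; here is a minimal sketch of pulling one of the book's data sets from it (assuming the Package is still installable from CRAN or its archives):

    # install.packages("ElemStatLearn")  # companion Package bundling the ESL data sets
    library(ElemStatLearn)
    data(prostate)                       # prostate data used in the book's regression chapters
    str(prostate)                        # inspect the variables before modeling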

  • CRAN Task View: Machine Learning

The 3,500+ Packages available for R are divided up by domain into about 30 Package families or 'Task Views'. Machine Learning is one of these families. The Machine Learning Task View contains about 50 Packages. Some of these Packages are flagged as core Packages of that view, including e1071 (a sprawling ML Package that includes working code for quite a few of the usual ML categories).
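The ctv Package can install an entire Task View in one step, and e1071's svm() is a quick way to see one of those Packages in action; a minimal sketch on the built-in iris data:

    # install every Package in the Machine Learning Task View at once
    install.packages("ctv")
    library(ctv)
    install.views("MachineLearning")

    # e1071 in action: fit an SVM classifier and check it against the training labels
    library(e1071)
    model <- svm(Species ~ ., data = iris)
    table(predicted = predict(model, iris), actual = iris$Species)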

With particular focus on the posts tagged with Predictive Analytics

A thorough study of the code would, by itself, be an excellent introduction to ML in R.

And one final resource that I think is excellent, but that didn't make the top 5:

posted at the blog A Beautiful WWW


Both Matlab and R are good if you are doing matrix-heavy operations, because they can use highly optimized low-level code (BLAS libraries and such) for them.

However, there is more to data mining than just crunching matrices. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to, say, plain machine learning).

And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or Matlab to take an O(n^2) algorithm down to O(n log n) runtime. First, it goes completely against the way R and Matlab are designed (use bulk math operations wherever possible); second, it will kill your performance. Interpreted R code, for example, seems to run at around 50% of the speed of C code (try R's built-in k-means vs. flexclust's k-means), and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, and advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab and benchmark it against the native one.
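Here is a minimal sketch of that last experiment, benchmarking a naive triple-loop matrix multiply against R's BLAS-backed %*% (the matrix size is kept small so the interpreted version finishes in reasonable time):

    # naive O(n^3) matrix multiply in pure interpreted R
    naive_matmul <- function(A, B) {
      C <- matrix(0, nrow(A), ncol(B))
      for (i in seq_len(nrow(A)))
        for (j in seq_len(ncol(B)))
          for (p in seq_len(ncol(A)))
            C[i, j] <- C[i, j] + A[i, p] * B[p, j]
      C
    }

    n <- 200
    A <- matrix(rnorm(n * n), n)
    B <- matrix(rnorm(n * n), n)

    system.time(naive_matmul(A, B))   # interpreted loops
    system.time(A %*% B)              # native BLAS multiply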

Don't get me wrong. There is a lot of stuff where R and Matlab are just elegant and excellent for prototyping. You can solve a lot of problems in just 10 lines of code and get decent performance out of them. Writing the same thing by hand would take hundreds of lines and probably run 10x slower. But sometimes you can optimize by an order of complexity, which for large data sets does beat the optimized matrix operations of R and Matlab.
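A sketch of that trade-off for nearest-neighbor search, assuming the FNN package: the brute-force version materializes the full n x n distance matrix (O(n^2) time and memory), while FNN's kd-tree search runs in roughly O(n log n):

    x <- matrix(rnorm(5000 * 2), ncol = 2)

    # brute force: full n x n distance matrix -- an elegant one-liner, but O(n^2)
    system.time({
      d <- as.matrix(dist(x))
      diag(d) <- Inf                  # exclude each point as its own neighbor
      nn_brute <- apply(d, 1, which.min)
    })

    # tree-based search via the FNN package -- roughly O(n log n)
    library(FNN)
    system.time(nn_tree <- get.knn(x, k = 1)$nn.index[, 1])

    all.equal(as.integer(nn_brute), nn_tree)   # same neighbors, very different scaling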

If you want to scale up to "Hadoop size" in the long run, you will have to think about data layout and organization, too, unless all you need is a linear scan over the data. But then again, you could just be sampling, too!