Numpy for R user?


R's strength, when looking for an environment to do machine learning and statistics, is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits cannot yet be a replacement for CRAN.

Regarding memory usage, R uses a pass-by-value paradigm while Python uses pass-by-reference. Pass-by-value can lead to more "intuitive" code, while pass-by-reference can help optimize memory usage. NumPy also lets you take "views" on arrays (a kind of subarray for which no copy is made), as the sketch below illustrates.
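For example, a minimal sketch of that view behavior (np.shares_memory is used here only to confirm that no copy was made):

    import numpy as np

    a = np.arange(12).reshape(3, 4)
    v = a[:, 1:3]                  # basic slicing returns a view, not a copy
    v[0, 0] = 99                   # writing through the view also changes 'a'
    print(a[0, 1])                 # -> 99
    print(np.shares_memory(a, v))  # -> True: the two objects share one buffer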

Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.
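A rough way to see this for yourself with timeit; the array size and repeat counts here are arbitrary, and the exact numbers will vary by machine:

    import timeit

    stmt = "s = 0\nfor x in xs: s += x"   # element-by-element access in pure Python

    t_list = timeit.timeit(stmt, setup="xs = list(range(10**5))", number=100)
    t_arr = timeit.timeit(stmt, setup="import numpy as np; xs = np.arange(10**5)", number=100)
    t_vec = timeit.timeit("xs.sum()", setup="import numpy as np; xs = np.arange(10**5)", number=100)

    # a plain list beats an ndarray for per-element access, but the
    # vectorized sum beats both by a wide margin
    print(t_list, t_arr, t_vec)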

If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
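For example, a minimal sketch using h5py, one of the Python HDF5 bindings (PyTables is another); the file name and array shapes are placeholders:

    import numpy as np
    import h5py

    # write a large array to disk once...
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("X", data=np.random.rand(10000, 100))

    # ...then pull back only the slice you need, without loading the whole array
    with h5py.File("data.h5", "r") as f:
        first_rows = f["X"][:10]   # reads just 10 rows from disk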

I am not sure you should ditch one for the other but rpy2 can help you explore your options about a possible transition (arrays can be shuttled between R and Numpy without a copy being made).
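As an illustration, a minimal rpy2 sketch; the exact conversion API has shifted across rpy2 versions, so treat this as a sketch rather than the definitive incantation:

    import numpy as np
    import rpy2.robjects as ro
    from rpy2.robjects import numpy2ri
    from rpy2.robjects.conversion import localconverter

    arr = np.arange(12.0).reshape(3, 4)

    # inside this block, NumPy arrays are automatically exposed to R
    with localconverter(ro.default_converter + numpy2ri.converter):
        r_colmeans = ro.r["colMeans"](arr)   # call an R function on a NumPy array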


I use NumPy daily and R nearly so.

For heavy number crunching, I prefer NumPy to R by a large margin (including R packages like 'Matrix'). I find the syntax cleaner, the function set larger, and computation quicker (although I don't find R slow by any means). NumPy's broadcasting functionality, for instance, has no analog in R that I am aware of.
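For example, broadcasting lets a length-3 vector combine with a 4x3 array without writing a loop or explicitly replicating the vector (a toy sketch):

    import numpy as NP

    X = NP.arange(12.0).reshape(4, 3)   # 4 observations x 3 features
    col_means = X.mean(axis=0)          # shape (3,)

    # broadcasting stretches the (3,) vector across all 4 rows automatically
    centered = X - col_means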

For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:

    import numpy as NP

    data = NP.loadtxt(data1, delimiter=",")   # 'data' is a NumPy array; 'data1' is the csv file path
    data -= NP.mean(data, axis=0)             # mean-center each column
    data /= NP.max(data, axis=0)              # re-scale each column

Also, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose, etc.). NumPy gets this, and lets you create these hybrid structures easily (no operator overloading or subclassing required).
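A small sketch of what I mean; the same ndarray objects support element-wise arithmetic and linear algebra side by side:

    import numpy as NP

    A = NP.array([[1., 2.], [3., 4.]])
    B = NP.array([[5., 6.], [7., 8.]])

    elementwise = A * B          # element-wise multiplication
    matmul = NP.dot(A, B)        # matrix multiplication on the same objects
    At = A.T                     # transpose
    x = NP.linalg.solve(A, NP.array([1., 0.]))   # linear algebra, no subclassing needed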

You won't be disappointed by NumPy/SciPy, more likely you'll be amazed.

So, a few recommendations, both in general and in particular, given the facts in your question:

  • install both NumPy and SciPy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).

  • install the repository versions, particularly w/r/t NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other, of course, but both are the best in their respective class among Python libraries. You can get all three via easy_install, which I assume you already have.

  • NumPy/SciPy have several modules specifically directed to Machine Learning/Statistics, including the Clustering package and the Statistics package (see the sketch after this list).

  • As well as packages directed to general computation which make coding ML algorithms a lot faster, in particular Optimization and Linear Algebra (also illustrated in the sketch below).

  • There are also the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor) and learn (a set of ML/Statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
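To make the Clustering and Optimization points concrete, here is a short sketch using those two SciPy packages; the data is random and the objective function is a toy quadratic:

    import numpy as NP
    from scipy.cluster.vq import kmeans, vq, whiten
    from scipy.optimize import fmin

    # clustering: group 100 random 3-D points into 4 clusters
    obs = whiten(NP.random.rand(100, 3))    # scale each feature to unit variance
    centroids, distortion = kmeans(obs, 4)
    labels, _ = vq(obs, centroids)          # assign each point to its nearest centroid

    # optimization: minimize a toy quadratic, as you might a loss function
    w_min = fmin(lambda w: (w[0] - 3.)**2 + (w[1] + 1.)**2, x0=[0., 0.])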


I can't comment on R, but here are a couple of links on Numpy/Scipy and ML:

And a book (I've only looked at some of its code): Marsland, Machine Learning: An Algorithmic Perspective (uses NumPy), 2009, 406 pp., ISBN 1420067184.

If you could collect a few notes on your experience going up the NumPy/SciPy learning curve, that might be useful to others.