How to organize large R programs?


The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

A package gives you

  • a quasi-automatic way to organize your code by topic
  • strong encouragement to write a help file, which makes you think about the interface
  • a lot of sanity checks via R CMD check
  • a chance to add regression tests
  • a means of using namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package -- even if you do not plan to publish it, since you can write internal packages for internal repositories.
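As a rough illustration of how little is needed to get started, here is a minimal sketch that uses package.skeleton() from the utils package to generate a package directory from one function. The names demo_env, add_one and demopkg are made up for this example, and the skeleton is written to a temporary directory.

```r
# Hypothetical function we want to ship in a package.
demo_env <- new.env()
demo_env$add_one <- function(x) x + 1

# Generate the package skeleton in a temporary directory.
# force = TRUE lets the example be re-run in the same session.
pkg_parent <- tempdir()
utils::package.skeleton(
  name        = "demopkg",
  list        = "add_one",
  environment = demo_env,
  path        = pkg_parent,
  force       = TRUE
)

# Inspect what was created (DESCRIPTION, NAMESPACE, R/ and man/ among others).
pkg_dir <- file.path(pkg_parent, "demopkg")
print(list.files(pkg_dir))
```

The generated help files under man/ are only templates, so they would need to be filled in before R CMD check is likely to pass.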

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use defaults in Emacs' ESS mode.

Update 2008-Aug-13: David Smith just blogged about the Google R Style Guide.


I like putting different functionality in their own files.

But I don't like R's package system. It's rather hard to use.

I prefer a lightweight alternative: place a file's functions inside an environment (what every other language calls a "namespace") and attach it. For example, I made a 'util' group of functions like so:

    util = new.env()
    util$bgrep = function [...]
    util$timeit = function [...]
    while("util" %in% search())
      detach("util")
    attach(util)

This is all in a file util.R. When you source it, you get the environment 'util' so you can call util$bgrep() and such; but furthermore, the attach() call makes it so just bgrep() and such work directly. If you didn't put all those functions in their own environment, they'd pollute the interpreter's top-level namespace (the one that ls() shows).
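For a version that can be pasted and run, here is a minimal sketch of the same pattern. The environment name strutil and the two functions in it are hypothetical stand-ins, not the original util functions.

```r
# Contents of a hypothetical file strutil.R
strutil <- new.env()

# Hypothetical helpers, defined inside the environment rather than globally.
strutil$shout <- function(x) paste0(toupper(x), "!")
strutil$squish <- function(x) gsub("\\s+", " ", trimws(x))

# Detach any copy left over from a previous source() of this file,
# then attach the fresh one.
while ("strutil" %in% search()) {
  detach("strutil")
}
attach(strutil)

# Both call styles now work.
qualified   <- strutil$shout("hello")
unqualified <- shout("hello")
```

One thing to keep in mind: attach() copies the environment's contents onto the search path, so functions added to strutil after attaching are not visible unqualified until the file is sourced again. That is what the detach loop is for.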

I was trying to simulate Python's system, where every file is a module. That would be better to have, but this seems OK.


This might sound a little obvious especially if you're a programmer, but here's how I think about logical and physical units of code.

I don't know if this is your case, but when I'm working in R, I rarely start out with a large complex program in mind. I usually start in one script and separate code into logically separable units, often using functions. Data manipulation and visualization code get placed in their own functions, etc. And such functions are grouped together in one section of the file (data manipulation at the top, then visualization, etc). Ultimately you want to think about how to make it easier for you to maintain your script and lower the defect rate.
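To make the layout concrete, here is a small sketch of a single script grouped into sections, using the built-in mtcars data. The function names and the output file name are invented for the illustration.

```r
# ---- Data manipulation ----

# Mean miles per gallon for each cylinder count.
mean_mpg_by_cyl <- function(cars) {
  aggregate(mpg ~ cyl, data = cars, FUN = mean)
}

# ---- Visualization ----

# Draw the summary as a bar plot into a PDF file and return its path.
plot_mean_mpg <- function(summary_df, path) {
  pdf(path)
  on.exit(dev.off())
  barplot(summary_df$mpg,
          names.arg = summary_df$cyl,
          xlab = "Cylinders",
          ylab = "Mean mpg")
  invisible(path)
}

# ---- Main ----

summary_df <- mean_mpg_by_cyl(mtcars)
plot_path  <- plot_mean_mpg(summary_df, file.path(tempdir(), "mean_mpg.pdf"))
```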

How fine- or coarse-grained you make your functions will vary, and there are various rules of thumb: e.g. 15 lines of code, or "a function should be responsible for doing one task which is identified by its name", etc. Your mileage will vary. Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine-grained when it involves passing data frames or similar structures around. But this may be overcompensation for some silly performance mistakes I made when I first started out with R.
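A short sketch of what the lack of call-by-reference means in practice. A function that modifies a data frame argument works on its own copy, so the caller has to take the return value. Note that R copies lazily (copy-on-modify), so merely passing a data frame does not duplicate it. Environments are the exception and behave like references. All names below are invented for the example.

```r
# Modifying an argument does not affect the caller's object.
add_flag <- function(df) {
  df$flag <- TRUE   # the copy happens here, on modification
  df
}

original <- data.frame(x = 1:3)
modified <- add_flag(original)

print(names(original))   # still only the original column
print(names(modified))   # has the extra column

# Environments are not copied, so changes made inside a function persist.
counter <- new.env()
counter$n <- 0

bump <- function(env) {
  env$n <- env$n + 1
  invisible(env)
}

bump(counter)
bump(counter)
print(counter$n)
```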

When to extract logical units into their own physical units (like source files and bigger groupings like packages)? I have two cases. First, if the file gets too large and scrolling around among logically unrelated units is an annoyance. Second, if I have functions that can be reused by other programs. I usually start out by placing some grouped unit, say data manipulation functions, into a separate file. I can then source this file from any other script.
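As a sketch of that last step, the snippet below writes a hypothetical helper file and then sources it from a "main" script. The file name data_helpers.R and the function scale01 are made up. Sourcing into a dedicated environment via the local argument is optional, but it keeps the helpers out of the global workspace.

```r
# Create a stand-in for a separate file of data manipulation functions.
helper_path <- file.path(tempdir(), "data_helpers.R")
writeLines(c(
  "# Rescale a numeric vector to the range [0, 1].",
  "scale01 <- function(x) (x - min(x)) / (max(x) - min(x))"
), helper_path)

# In the main script: load the helpers into their own environment.
helpers <- new.env()
source(helper_path, local = helpers)

scaled <- helpers$scale01(c(2, 4, 6))
print(scaled)
```

A plain source(helper_path) works too, and defines scale01 directly in the calling workspace.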

If you're going to deploy your functions, then you need to start thinking about packages. I don't deploy R code in production or for re-use by others for various reasons (briefly: org culture prefers other languages, concerns about performance, GPL, etc). Also, I tend to constantly refine and add to my collections of sourced files, and I'd rather not deal with packages when I make a change. So you should check out the other package-related answers, like Dirk's, for more details on this front.

Finally, I think your question isn't necessarily particular to R. I would really recommend reading Code Complete by Steve McConnell which contains a lot of wisdom about such issues and coding practices at large.