
What does the capital letter "I" in R linear regression formula mean?


I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.

For example:

y ~ x + x^2

would, to R, mean "give me:

  1. x = the main effect of x, and
  2. x^2 = the main effect and the second order interaction of x",

not the intended x plus x-squared:

> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
           y           x
1 -1.4355144 -1.85374045
2  0.3620872 -0.07794607
3 -1.7590868  0.96856634
4 -0.3245440  0.18492596
5 -0.6515630 -1.37994358

This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
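One way to check how the formula code reads ^ is to ask terms() for the expanded term labels. The following is a minimal sketch: term_labels is a hypothetical helper, and a, b, x and y are placeholder names. No data is needed, because terms() works on the formula alone.

```r
# Hypothetical helper: list the terms R extracts from a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

# With two variables, ^2 means "main effects plus all two-way interactions".
crossed <- term_labels(y ~ (a + b)^2)      # "a" "b" "a:b"

# With a single variable there is nothing to cross, so x^2 collapses to x,
# which duplicates the x term already in the formula.
collapsed <- term_labels(y ~ x + x^2)      # "x"

# Wrapping the power in I() keeps it as a separate arithmetic term.
protected <- term_labels(y ~ x + I(x^2))   # "x" "I(x^2)"
```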

To get the usual operator, you need to use I() to isolate the call from the formula code:

> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
            y          x       I(x^2)
1 -0.02881534  1.0865514 1.180593....
2  0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458  0.8344739 0.696346....
5  0.65522764 -0.9676520 0.936350....

(That last column is correct; it just prints oddly because it is of class AsIs.)
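If you want to confirm that the AsIs column really holds x squared, a quick check along these lines should do. This is only a sketch: the seed and the object names d, mf and sq are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
d  <- data.frame(x = rnorm(5), y = rnorm(5))
mf <- model.frame(y ~ x + I(x^2), data = d)

sq <- mf[["I(x^2)"]]
class(sq)                          # "AsIs"

# Stripping the class shows the values are just x squared.
all.equal(as.numeric(sq), d$x^2)   # TRUE
```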

In your example, - used in a formula indicates removal of a term from the model, whereas you wanted - to have its usual binary-operator meaning of subtraction:

> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5),  :
  variable lengths differ (found for 'mean(x)')

This fails because mean(x) is a length 1 vector, and model.frame() quite rightly tells you that this doesn't match the length of the other variables. A way round this is I():

> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
           y I(x - mean(x))
1  1.1727063   1.142200....
2 -1.4798270   -0.66914....
3 -0.4303878   -0.28716....
4 -1.0516386   0.542774....
5  1.5225863   -0.72865....

Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
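To make the two meanings of - concrete, here is a small sketch. It uses the built-in mtcars data with mpg and wt purely as stand-ins, and z is a placeholder name. The first part shows - dropping a term; the second shows that inside I() it is plain subtraction, so the predictor is centred on its mean.

```r
# In a formula, `-` removes a term: z is added and then taken out again.
removed <- attr(terms(y ~ x + z - z), "term.labels")   # "x"

# Inside I(), `-` is ordinary subtraction, so wt is centred on its mean.
fit_raw      <- lm(mpg ~ wt, data = mtcars)
fit_centered <- lm(mpg ~ I(wt - mean(wt)), data = mtcars)

# Centring leaves the slope unchanged and moves the intercept to mean(mpg).
slope_raw          <- unname(coef(fit_raw)[2])
slope_centered     <- unname(coef(fit_centered)[2])
intercept_centered <- unname(coef(fit_centered)[1])
```

Comparing slope_raw with slope_centered, and intercept_centered with mean(mtcars$mpg), is an easy sanity check that the subtraction happened as arithmetic.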

Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).


From the docs:

Function I has two main uses.

  • In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.

To address this point:

df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)

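Note that since R 4.0.0 data.frame() no longer converts character vectors to factors by default, so on a recent R the two str() calls may differ only in the AsIs class. The sketch below passes stringsAsFactors = TRUE explicitly to reproduce the behaviour the docs describe, and also illustrates the matrix case. The object names are arbitrary.

```r
# stringsAsFactors = TRUE is passed explicitly; it has not been the
# default since R 4.0.0.
protected   <- data.frame(s = I("dog"), stringsAsFactors = TRUE)
unprotected <- data.frame(s = "dog",    stringsAsFactors = TRUE)

class(protected$s)     # "AsIs": still a character vector underneath
class(unprotected$s)   # "factor"

# A matrix wrapped in I() stays a single column; unwrapped, it is split
# into one data frame column per matrix column.
m <- matrix(1:4, nrow = 2)
ncol(data.frame(m = I(m)))   # 1
ncol(data.frame(m = m))      # 2
```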
  • In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.

To address this point:

lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)

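The second line creates a new predictor that is the literal sum of disp and drat, so the model estimates a single slope for that sum instead of separate coefficients for disp and drat.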
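A quick way to see the difference between the two calls is to compare their coefficients. In this sketch, total is a hypothetical column name introduced only to show that I(disp + drat) behaves like a precomputed sum.

```r
fit_two <- lm(mpg ~ disp + drat, data = mtcars)      # separate slopes
fit_sum <- lm(mpg ~ I(disp + drat), data = mtcars)   # one slope for the sum

names(coef(fit_two))   # "(Intercept)" "disp" "drat"
names(coef(fit_sum))   # "(Intercept)" "I(disp + drat)"

# Same fit as adding the sum as an explicit column first.
mt2     <- transform(mtcars, total = disp + drat)
fit_col <- lm(mpg ~ total, data = mt2)
all.equal(unname(coef(fit_sum)), unname(coef(fit_col)))   # TRUE
```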