What does the capital letter "I" in R linear regression formula mean?
I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me: x = the main effect of x, and x^2 = the main effect and the second order interaction of x", not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
           y           x
1 -1.4355144 -1.85374045
2  0.3620872 -0.07794607
3 -1.7590868  0.96856634
4 -0.3245440  0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
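One way to check this collapse directly is to look at the term labels R derives from a formula. The following is a minimal sketch using terms(); the second formula, with the variables a and b, is an illustrative assumption and not part of the question.

```r
# Term labels R derives from each formula; no data is needed for this.
labs_plain   <- attr(terms(y ~ x + x^2), "term.labels")
labs_crossed <- attr(terms(y ~ (a + b)^2), "term.labels")

print(labs_plain)    # x^2 contributes nothing beyond the main effect of x
print(labs_crossed)  # main effects of a and b plus their interaction
```

With something to cross, as in (a + b)^2, the ^ operator does produce an interaction term; with a single variable it has nothing to add.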
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
            y          x       I(x^2)
1 -0.02881534  1.0865514 1.180593....
2  0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458  0.8344739 0.696346....
5  0.65522764 -0.9676520 0.936350....
(that last column is correct, it just looks odd because it is of class AsIs.)
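As a rough illustration of how this plays out in an actual fit, the sketch below simulates a quadratic relationship and fits it with lm(). The seed, sample size, and true coefficients are arbitrary assumptions chosen only for the demonstration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + 3 * d$x^2 + rnorm(50, sd = 0.1)  # assumed true model

fit <- lm(y ~ x + I(x^2), data = d)

coef_names <- names(coef(fit))                    # one coefficient per term
sq_class   <- class(model.frame(fit)[["I(x^2)"]]) # class of the squared column

print(coef_names)
print(sq_class)
```

The squared term gets its own coefficient, and the corresponding column in the model frame carries the AsIs class mentioned above.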
In your example, - when used in a formula would indicate removal of a term from the model, where you wanted - to have its usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
  variable lengths differ (found for 'mean(x)')
This fails because mean(x) is a length 1 vector and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
           y I(x - mean(x))
1  1.1727063   1.142200....
2 -1.4798270   -0.66914....
3 -0.4303878   -0.28716....
4 -1.0516386   0.542774....
5  1.5225863   -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
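To make the two meanings of - concrete, here is a small sketch. The formula with a and b is a made-up example, and the seed and data are arbitrary assumptions.

```r
# In a formula, "-" removes a term: a is added and then dropped again.
labs_removed <- attr(terms(y ~ a + b - a), "term.labels")
print(labs_removed)

# Inside I(), "-" is ordinary subtraction, so x is centred on its mean.
set.seed(1)  # arbitrary seed
d  <- data.frame(x = rnorm(5), y = rnorm(5))
mf <- model.frame(y ~ I(x - mean(x)), data = d)

centred <- as.numeric(mf[[2]])  # drop the AsIs class for plain arithmetic
print(mean(centred))            # numerically zero, up to rounding
```

The protected column matches what d$x - mean(d$x) gives outside of a formula.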
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
From the docs:
Function I has two main uses.
- In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
To address this point:
df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")
str(df1)
str(df2)
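Note that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so on a current R installation df2 holds a plain character column rather than a factor. The sketch below sets stringsAsFactors = TRUE explicitly (an addition to the original example) so that the protective effect of I() shows up regardless of R version.

```r
# Ask for factor conversion explicitly; I() protects the first column from it.
df1 <- data.frame(stringi = I("dog"), stringsAsFactors = TRUE)
df2 <- data.frame(stringi = "dog", stringsAsFactors = TRUE)

class_protected   <- class(df1$stringi)  # stays character data, tagged AsIs
class_unprotected <- class(df2$stringi)  # converted to a factor

print(class_protected)
print(class_unprotected)
```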
- In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
To address this point:
lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)
The second line creates a new predictor that is the literal sum of disp and drat, so the model estimates a single slope for that sum instead of one slope per variable.
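A quick way to confirm the difference between the two calls is to compare their coefficient names, and to check the I() fit against one where the sum is computed beforehand. This is a sketch only; the column name disp_plus_drat is a hypothetical name introduced for the comparison.

```r
fit_two <- lm(mpg ~ disp + drat, data = mtcars)     # two separate predictors
fit_sum <- lm(mpg ~ I(disp + drat), data = mtcars)  # one predictor: the sum

names_two <- names(coef(fit_two))
names_sum <- names(coef(fit_sum))
print(names_two)
print(names_sum)

# Same model with the sum precomputed (disp_plus_drat is a hypothetical name).
mt <- transform(mtcars, disp_plus_drat = disp + drat)
fit_manual <- lm(mpg ~ disp_plus_drat, data = mt)

same_fit <- isTRUE(all.equal(unname(coef(fit_sum)), unname(coef(fit_manual))))
print(same_fit)
```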