What exactly is a connection in R?


Connections were introduced in R 1.2.0 and described by Brian Ripley in the first issue of R News (now called The R Journal), January 2001 (pages 16-17), as an abstracted interface to I/O streams such as a file, URL, socket, or pipe. In 2013, Simon Urbanek added a Connections.h C API, which enables R packages such as curl to implement custom connection types.

One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readBin, writeBin, readLines and writeLines functions. This lets you process data in chunks rather than all at once, for example when dealing with large data or network connections:

    # Read the first 30 lines, 10 lines at a time
    con <- url("http://jeroen.github.io/data/diamonds.json")
    open(con, "r")
    data1 <- readLines(con, n = 10)
    data2 <- readLines(con, n = 10)
    data3 <- readLines(con, n = 10)
    close(con)

Same for writing, e.g. to a file:

    tmp <- file(tempfile())
    open(tmp, "w")
    writeLines("A line", tmp)
    writeLines("Another line", tmp)
    close(tmp)

Open the connection as rb or wb to read/write binary data (called raw vectors in R):

    # Read the first 3000 bytes, 1000 bytes at a time
    con <- url("http://jeroen.github.io/data/diamonds.json")
    open(con, "rb")
    data1 <- readBin(con, raw(), n = 1000)
    data2 <- readBin(con, raw(), n = 1000)
    data3 <- readBin(con, raw(), n = 1000)
    close(con)
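The "wb" mode mentioned above works the same way for writing. Here is a minimal sketch of a binary round trip, assuming a local temporary file instead of a URL so that it runs offline (the variable names are only illustrative):

```r
# Write raw bytes in two chunks, then read them back in chunks.
path <- tempfile()
payload <- as.raw(1:20)

out <- file(path, open = "wb")
writeBin(payload[1:10], out)
writeBin(payload[11:20], out)
close(out)

inp <- file(path, open = "rb")
first  <- readBin(inp, raw(), n = 10)
second <- readBin(inp, raw(), n = 10)
rest   <- readBin(inp, raw(), n = 10)  # end of file: zero-length raw vector
close(inp)

roundtrip_ok <- identical(c(first, second), payload)
```

A zero-length result from readBin is a convenient signal that the stream is exhausted.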

The pipe() connection is used to run a system command and pipe text to its stdin or from its stdout, as you would do with the | operator in a shell. For example (let's stick with the curl examples), you can run the curl command-line program and pipe its output to R:

    con <- pipe("curl -H 'Accept: application/json' https://jeroen.github.io/data/diamonds.json")
    open(con, "r")
    data1 <- readLines(con, n = 10)
    data2 <- readLines(con, n = 10)
    data3 <- readLines(con, n = 10)
    close(con)

Some aspects of connections are a bit confusing: to incrementally read/write data you need to explicitly open() and close() the connection. However, readLines and writeLines automatically open and close (but not destroy!) an unopened connection. As a result, the example below will read the first 10 lines over and over again, which is not very useful:

    con <- url("http://jeroen.github.io/data/diamonds.json")
    data1 <- readLines(con, n = 10)
    data2 <- readLines(con, n = 10)
    data3 <- readLines(con, n = 10)
    identical(data1, data2)
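The same behaviour can be reproduced offline. The sketch below assumes a local temporary file with 30 numbered lines and compares an unopened connection with an explicitly opened one:

```r
path <- tempfile()
writeLines(sprintf("line %d", 1:30), path)

# Unopened connection: each readLines() call opens it, reads, and closes it
# again, so every call starts from the top of the file.
lazy <- file(path)
a1 <- readLines(lazy, n = 10)
a2 <- readLines(lazy, n = 10)
close(lazy)

# Explicitly opened connection: the read position is kept between calls.
con <- file(path, open = "r")
b1 <- readLines(con, n = 10)
b2 <- readLines(con, n = 10)
close(con)

same_when_unopened <- identical(a1, a2)
advances_when_open <- identical(b2, sprintf("line %d", 11:20))
```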

Another gotcha is that the C API can both close and destroy a connection, but R only exposes a function called close() which actually means destroy. After calling close() on a connection it is destroyed and completely useless.
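A small sketch of what "destroyed" means in practice, assuming a local temporary file: the R variable survives close(), but any further use of it signals an error rather than reopening the connection.

```r
path <- tempfile()
writeLines(c("a", "b", "c"), path)

con <- file(path, open = "r")
first <- readLines(con, n = 1)
close(con)

# `con` still exists as an R object, but the underlying connection is gone.
result <- tryCatch(
  readLines(con, n = 1),
  error = function(e) "error"
)
```

To read the file again you would have to create a fresh connection with file(path).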

To stream-process data from a connection you want to use a pattern like this:

    stream <- function(){
      con <- url("http://jeroen.github.io/data/diamonds.json")
      open(con, "r")
      on.exit(close(con))
      while(length(txt <- readLines(con, n = 10))){
        some_callback(txt)
      }
    }
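Here is a self-contained variant of this pattern that can be run without network access. The helper name stream_lines, its arguments, and the temporary file are assumptions made for illustration, not part of any package:

```r
# Hypothetical helper: calls `callback` on successive chunks of lines.
stream_lines <- function(con, callback, chunk_size = 10) {
  open(con, "r")
  on.exit(close(con))  # runs even if the callback fails
  while (length(txt <- readLines(con, n = chunk_size)) > 0) {
    callback(txt)
  }
  invisible(NULL)
}

path <- tempfile()
writeLines(sprintf("row %d", 1:25), path)

# Record how many lines each chunk contained.
chunk_sizes <- integer()
stream_lines(file(path), function(txt) {
  chunk_sizes <<- c(chunk_sizes, length(txt))
})
chunk_sizes
```

The loop stops when readLines returns a zero-length vector, and on.exit ensures the connection is closed on every exit path.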

The jsonlite package relies heavily on connections to import/export ndjson data:

    library(jsonlite)
    library(curl)
    diamonds <- stream_in(curl("https://jeroen.github.io/data/diamonds.json"))

The streaming (a fixed number of lines at a time, set by the pagesize argument, which defaults to 500 in recent jsonlite versions) makes it fast and memory efficient:

    library(nycflights13)
    stream_out(flights, file(tmp <- tempfile()))
    flights2 <- stream_in(file(tmp))
    all.equal(flights2, as.data.frame(flights))
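To process the data page by page instead of collecting everything into one data frame, stream_in also accepts a handler function. This is a rough sketch, assuming jsonlite is installed and using the built-in mtcars data with a small pagesize purely for illustration:

```r
library(jsonlite)

path <- tempfile()
stream_out(mtcars, file(path), verbose = FALSE)

# With a handler, each page of records is passed to the callback
# as it is read from the connection.
page_rows <- integer()
stream_in(file(path), handler = function(df) {
  page_rows <<- c(page_rows, nrow(df))
}, pagesize = 10, verbose = FALSE)
page_rows
```

Only one page needs to be held in memory at a time, which is what makes this approach suitable for files larger than RAM.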

Finally, one nice feature of connections is that the garbage collector will automatically close them if you forget to do so, albeit with an annoying warning:

    con <- file(system.file("DESCRIPTION"), open = "r")
    rm(con)
    gc()