What exactly is a connection in R?
Connections were introduced in R 1.2.0 and described by Brian Ripley in the first issue of R News (now called The R Journal) of January 2001 (pages 16-17) as an abstracted interface to IO streams such as a file, url, socket, or pipe. In 2013, Simon Urbanek added a Connections.h C API which enables R packages to implement custom connection types, such as the curl package.
One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readBin, writeBin, readLines and writeLines functions. This allows for streaming data processing, for example when dealing with large data or network connections:
# Read the first 30 lines, 10 lines at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
Same for writing, e.g. to a file:
tmp <- file(tempfile())
open(tmp, "w")
writeLines("A line", tmp)
writeLines("Another line", tmp)
close(tmp)
Open the connection as "rb" or "wb" to read/write binary data (called raw vectors in R):
# Read the first 3000 bytes, 1000 bytes at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "rb")
data1 <- readBin(con, raw(), n = 1000)
data2 <- readBin(con, raw(), n = 1000)
data3 <- readBin(con, raw(), n = 1000)
close(con)
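The same chunked pattern works without network access. The sketch below (the temp file and byte counts are my own illustration, not from the post) writes 250 bytes with writeBin and reads them back in chunks of 100, so the last chunk comes back short:

```r
# Write 250 bytes to a temp file, then read them back in chunks
tmp <- tempfile()
con <- file(tmp, open = "wb")
writeBin(as.raw(1:250), con)
close(con)

con <- file(tmp, open = "rb")
chunk1 <- readBin(con, raw(), n = 100)  # 100 bytes
chunk2 <- readBin(con, raw(), n = 100)  # 100 bytes
chunk3 <- readBin(con, raw(), n = 100)  # only 50 bytes left
close(con)
```

Note that readBin simply returns fewer bytes than requested once the stream runs dry, which is how a reader loop knows it is done.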
The pipe() connection runs a system command, letting you pipe text to its stdin or from its stdout as you would with the | operator in a shell. For example (let's stick with the curl examples), you can run the curl command line program and pipe the output to R:
con <- pipe("curl -H 'Accept: application/json' https://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
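A minimal pipe() sketch that does not need curl installed (assuming an echo command is available on your system, which is my assumption, not something the post states):

```r
# Run a shell command and capture its stdout as lines of text
out <- readLines(pipe("echo hello"))
```

Here readLines auto-opens the pipe connection, waits for the command to produce output, and returns it as a character vector.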
Some aspects of connections are a bit confusing: to incrementally read/write data you need to explicitly open() and close() the connection. However, readLines and writeLines automatically open and close (but not destroy!) an unopened connection. As a result, the example below will read the first 10 lines over and over again, which is not very useful:
con <- url("http://jeroen.github.io/data/diamonds.json")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
identical(data1, data2)
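The same gotcha can be reproduced offline. In this sketch (the temp file with 30 numbered lines is my own setup, not from the post), each readLines call on the unopened connection opens it, reads from the start, and closes it again:

```r
# An unopened connection is auto-opened and auto-closed on each read
tmp <- tempfile()
writeLines(as.character(1:30), tmp)
con <- file(tmp)              # created, but not opened
a <- readLines(con, n = 10)   # opens, reads lines 1-10, closes
b <- readLines(con, n = 10)   # opens again: same lines 1-10
identical(a, b)               # TRUE: no progress was made
```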
Another gotcha is that the C API can both close and destroy a connection, but R only exposes a function called close() which actually means destroy. After calling close() on a connection it is destroyed and completely useless.
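A short sketch of this (the try() wrapper is my own way of demonstrating the error): once close() has been called, any further use of the connection object fails:

```r
con <- file(tempfile())
open(con, "w")
close(con)  # this destroys the connection, not just closes it
res <- try(writeLines("x", con), silent = TRUE)
inherits(res, "try-error")  # TRUE: "invalid connection"
```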
To stream-process data from a connection you want to use a pattern like this:
stream <- function(){
  con <- url("http://jeroen.github.io/data/diamonds.json")
  open(con, "r")
  on.exit(close(con))
  while(length(txt <- readLines(con, n = 10))){
    some_callback(txt)
  }
}
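The same pattern can be exercised offline. In this sketch, stream_file and the line-counting callback are hypothetical names of my own; the loop ends because readLines returns a zero-length vector at end of stream:

```r
# Stream a local file in chunks of 10 lines, invoking a callback per chunk
stream_file <- function(path, callback) {
  con <- file(path)
  open(con, "r")
  on.exit(close(con))
  while (length(txt <- readLines(con, n = 10))) {
    callback(txt)
  }
}

tmp <- tempfile()
writeLines(as.character(1:25), tmp)
total <- 0
stream_file(tmp, function(txt) total <<- total + length(txt))
total  # 25: chunks of 10, 10, and 5 lines
```

The on.exit(close(con)) guarantees the connection is destroyed even if the callback throws an error.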
The jsonlite package relies heavily on connections to import/export ndjson data:
library(jsonlite)
library(curl)
diamonds <- stream_in(curl("https://jeroen.github.io/data/diamonds.json"))
The streaming (by default 1000 lines at a time) makes it fast and memory efficient:
library(nycflights13)
stream_out(flights, file(tmp <- tempfile()))
flights2 <- stream_in(file(tmp))
all.equal(flights2, as.data.frame(flights))
Finally, one nice feature of connections is that the garbage collector will automatically close them if you forget to do so, albeit with an annoying warning:
con <- file(system.file("DESCRIPTION"), open = "r")
rm(con)
gc()