How to efficiently read the first character from each line of a text file?
If you allow/have access to Unix command-line tools, you can use
scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE)
Obviously less portable but probably very fast.
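If you want that speed when cut is available but still need portability, one option is a guarded wrapper. This is a minimal sketch (first_chars() is a hypothetical helper; Sys.which() just checks whether cut is on the PATH):

first_chars <- function(path) {
    if (nzchar(Sys.which("cut"))) {
        # fast path: delegate to the external cut tool
        scan(pipe(paste("cut -c 1", shQuote(path))), what = "", quiet = TRUE)
    } else {
        # portable fallback: pure R
        substring(readLines(path), 1, 1)
    }
}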
Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:
             expr         min          lq        mean      median          uq
     RC readLines   14.797830   17.083849   19.261917   18.103020   20.007341
      RS read.fwf  125.113935  133.259220  148.122596  138.024203  150.528754
 BB scan pipe cut    6.277267    7.027964    7.686314    7.337207    8.004137
      RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
          RS scan   13.927765   14.752597   16.634288   15.274470   16.992124
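The contents of bigtest.txt aren't reproduced here; if you want to rerun the timings yourself, a hypothetical stand-in file of roughly similar shape can be generated like this (the OP's actual file may differ in size and content):

# Hypothetical test data: 100,000 lines of random lowercase text
set.seed(1)
writeLines(replicate(100000L,
    paste(sample(letters, 20L, replace = TRUE), collapse = "")),
    "bigtest.txt")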
data.table::fread()
seems to beat all of the solutions proposed so far, and has the great virtue of running comparably fast on both Windows and *NIX machines:
library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)
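As a quick sanity check (assuming bigtest.txt exists), the fread route should agree with a plain readLines approach:

library(data.table)
a <- substring(fread("bigtest.txt", sep = "\n", header = FALSE)[[1]], 1, 1)
b <- substring(readLines("bigtest.txt"), 1, 1)
identical(a, b)  # should be TRUE for a simple one-column file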
Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):
Unit: milliseconds
             expr         min          lq        mean      median          uq        max neval
     RC readLines   15.830318   16.617075   18.294723   17.116666   18.959381   27.54451   100
        JOB fread    5.532777    6.013432    7.225067    6.292191    7.727054   12.79815   100
      RS read.fwf  111.099578  113.803053  118.844635  116.501270  123.987873  141.14975   100
 BB scan pipe cut    6.583634    8.290366    9.925221   10.115399   11.013237   15.63060   100
      RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091   100
And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):
Unit: milliseconds
             expr         min          lq       mean      median          uq        max neval cld
     RC readLines   26.653266   27.493167   33.13860   28.057552   33.208309   61.72567   100  b
        JOB fread    4.964205    5.343063    6.71591    5.538246    6.027024   13.54647   100 a
      RS read.fwf  213.951792  217.749833  229.31050  220.793649  237.400166  287.03953   100   c
 BB scan pipe cut  180.963117  263.469528  278.04720  276.138088  280.227259  387.87889   100    d
      RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773   100     e
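For completeness, here is a sketch of the kind of microbenchmark harness that could produce tables like those above. The fread and pipe/cut expressions come straight from the answers; the readLines and read.fwf variants are plausible reconstructions, and the readChar variant is omitted because its code isn't shown in this thread:

library(microbenchmark)
library(data.table)

microbenchmark(
    "RC readLines"     = substring(readLines("bigtest.txt"), 1, 1),
    "JOB fread"        = substring(fread("bigtest.txt", sep = "\n",
                                         header = FALSE)[[1]], 1, 1),
    "RS read.fwf"      = read.fwf("bigtest.txt", widths = 1)[[1]],
    "BB scan pipe cut" = scan(pipe("cut -c 1 bigtest.txt"),
                              what = "", quiet = TRUE),
    times = 100
)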
Figure out the file size, read the file in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to the final form:
f0 <- function() {
    sz <- file.info("bigtest.txt")$size
    what <- charToRaw("\n")
    # slurp the whole file as a single raw vector
    x <- readBin("bigtest.txt", raw(), sz)
    # offsets of the newlines; each new line starts one byte later
    idx <- which(x == what)
    # keep byte 1 plus the byte after every newline except the final one
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple=TRUE)
}
The data.table solution (I think the fastest so far; note that header=FALSE is needed so the first line is included as part of the data!):
library(data.table)
f1 <- function() substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)
And in comparison:
> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
 expr      min       lq     mean    median        uq       max neval
 f0() 5.144873 5.515219 5.571327  5.547899  5.623171  5.897335   100
 f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261   100
Still wasteful, since the entire file is read into memory before mostly being discarded.
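A chunked variant keeps peak memory bounded by the buffer size instead of the file size. This is a minimal sketch (f2() is a hypothetical name; the boundary handling assumes '\n' line endings):

f2 <- function(path, chunk = 1024L * 1024L) {
    con <- file(path, "rb")
    on.exit(close(con))
    nl <- charToRaw("\n")
    out <- raw()
    at_line_start <- TRUE                    # TRUE at file start and after a '\n'
    repeat {
        x <- readBin(con, raw(), chunk)
        if (length(x) == 0L) break
        idx <- which(x == nl)
        starts <- idx[idx < length(x)] + 1L  # byte following each newline
        keep <- x[starts]
        if (at_line_start) keep <- c(x[1L], keep)
        out <- c(out, keep)
        at_line_start <- x[length(x)] == nl  # newline at a chunk boundary
    }
    rawToChar(out, multiple = TRUE)
}

On the same file this should return the same vector as f0(), while never holding more than one chunk of the file in memory at a time.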