How to efficiently read the first character from each line of a text file?


If you allow/have access to Unix command-line tools you can use

scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE) 

Obviously less portable but probably very fast.
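
To soften the portability concern, one option is to check whether `cut` is on the PATH and fall back to plain R otherwise. This is only a sketch: `first_chars` is a hypothetical helper name, not something from the answers here. It also assumes every line starts with a non-blank character, because `scan(what="")` splits on whitespace and skips blank lines, so the two branches can disagree on files with empty lines or leading spaces.

```r
# Hypothetical wrapper (an illustration, not part of the original answer):
# use the external `cut` tool when it is available, otherwise stay in base R.
first_chars <- function(path) {
  if (nzchar(Sys.which("cut"))) {
    # shQuote() protects paths that contain spaces or shell metacharacters
    scan(pipe(paste("cut -c 1", shQuote(path))), what = "", quiet = TRUE)
  } else {
    # portable fallback: read all lines, keep the first character of each
    substr(readLines(path), 1L, 1L)
  }
}
```

Calling `first_chars("test.txt")` should then return the same character vector on systems with or without `cut`, under the assumption stated above.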

Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:

            expr         min          lq        mean      median          uq
    RC readLines   14.797830   17.083849   19.261917   18.103020   20.007341
     RS read.fwf  125.113935  133.259220  148.122596  138.024203  150.528754
BB scan pipe cut    6.277267    7.027964    7.686314    7.337207    8.004137
     RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
         RS scan   13.927765   14.752597   16.634288   15.274470   16.992124


data.table::fread() seems to beat all of the solutions so far proposed, and has the great virtue of running comparably fast on both Windows and *NIX machines:

library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)

Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):

Unit: milliseconds
            expr         min          lq        mean      median          uq        max neval
    RC readLines   15.830318   16.617075   18.294723   17.116666   18.959381   27.54451   100
       JOB fread    5.532777    6.013432    7.225067    6.292191    7.727054   12.79815   100
     RS read.fwf  111.099578  113.803053  118.844635  116.501270  123.987873  141.14975   100
BB scan pipe cut    6.583634    8.290366    9.925221   10.115399   11.013237   15.63060   100
     RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091   100

And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):

Unit: milliseconds
            expr         min          lq       mean      median          uq        max neval   cld
    RC readLines   26.653266   27.493167   33.13860   28.057552   33.208309   61.72567   100  b
       JOB fread    4.964205    5.343063    6.71591    5.538246    6.027024   13.54647   100 a
     RS read.fwf  213.951792  217.749833  229.31050  220.793649  237.400166  287.03953   100   c
BB scan pipe cut  180.963117  263.469528  278.04720  276.138088  280.227259  387.87889   100    d
     RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773   100     e


Figure out the file size, read the file in as a single binary blob, find the offsets of the characters of interest (skip the last '\n' at the end of the file, since no line follows it), and coerce the selected bytes to their final form as a character vector:

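f0 <- function() {
    sz <- file.info("bigtest.txt")$size
    what <- charToRaw("\n")
    # read the whole file as one raw vector
    x <- readBin("bigtest.txt", raw(), sz)
    # positions of every newline byte
    idx <- which(x == what)
    # first byte of the file, plus the byte after each newline except the last
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple=TRUE)
}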

The data.table solution was, I think, the fastest so far. Note that header=FALSE is needed so that the first line is included as part of the data:

library(data.table)
f1 <- function()
    substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)

and in comparison

> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
 expr      min       lq     mean    median        uq       max neval
 f0() 5.144873 5.515219 5.571327  5.547899  5.623171  5.897335   100
 f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261   100

Still wasteful, since the entire file is read into memory before most of it is discarded.
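
If memory rather than speed is the concern, one possibility is to read the file in batches of lines through an open connection, so that only one batch of full lines is held at a time. The following is a minimal sketch that has not been benchmarked against the timings above and is probably slower. `first_chars_chunked` and `chunk_lines` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper (not from the original answers): process the file in
# batches of `chunk_lines` lines, keeping only the first character of each.
first_chars_chunked <- function(path, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # always release the connection
  out <- list()
  repeat {
    # an open connection remembers its position, so each call continues
    # where the previous one stopped
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    out[[length(out) + 1L]] <- substr(lines, 1L, 1L)
  }
  if (length(out) == 0L) return(character(0))  # empty file
  unlist(out, use.names = FALSE)
}
```

The result vector itself is still built in memory, but the full text of the lines is discarded batch by batch. A larger `chunk_lines` trades memory for fewer calls to readLines().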