
R reading a huge csv


Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
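If you're unsure which build you're running, a quick check from within R might look like the sketch below (note that memory.limit() is Windows-only and no longer has any effect in R >= 4.2):

R.version$arch              # "x86_64" indicates 64-bit R
.Machine$sizeof.pointer     # 8 on a 64-bit build, 4 on 32-bit
memory.limit()              # Windows only: current cap in MB
memory.limit(size = 16000)  # Windows only: raise the cap towards your 16 GB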

In addition, you can read in the file in chunks:

file_in    <- file("in.csv", "r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n = chunk_size)
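That reads only the first chunk; to walk the whole file you repeat the call until it returns nothing. A minimal sketch, assuming each chunk can be processed independently (process_chunk() is a hypothetical placeholder for your own logic):

file_in    <- file("in.csv", "r")
chunk_size <- 100000
header     <- readLines(file_in, n = 1)      # set the header row aside
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                  # end of file
  chunk <- read.csv(text = c(header, x))     # parse this chunk only
  # process_chunk(chunk)                     # hypothetical per-chunk work
}
close(file_in)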

You can use data.table's fread() to read and manipulate large files more efficiently:

require(data.table)
dt <- fread("in.csv", header = TRUE)
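If only part of the file is needed, fread's select and colClasses arguments can reduce memory use considerably; the column names and types below are just placeholders for your own:

require(data.table)
dt <- fread("in.csv",
            select     = c("COL1", "COL3"),                         # read only these columns
            colClasses = list(character = "COL1", numeric = "COL3"))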

If needed, you can fall back to disk-backed storage with the ff package:

library("ff")x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE,                    first.rows=10000, next.rows=50000, colClasses=NA)


You might want to consider leveraging some on-disk processing and not have that entire object in R's memory. One option would be to store the data in a proper database and then have R access it. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Enter the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db  <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
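A quick, optional sanity check is to list the tables dplyr can see in that source; the table name assumes the import.sql above:

src_tbls(my_db)   # should include "tableName"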

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source, and the remote table will otherwise behave much like a local one. The big exception is that you can't query the number of rows (hence the ?? in the dimensions below).

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
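If a summary like this is small enough to fit in RAM, dplyr's collect() will execute the query and bring the result back as an ordinary local data frame for further work:

local_summary <- my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3)) %>%
  collect()    # runs the SQL and returns an in-memory data frame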


Loading the whole file may simply not be possible on your computer: in some cases a data.table takes up more space in memory than its .csv counterpart does on disk.

library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 29 MB file
DT <- fread("test.csv")
object.size(DT)
# 40001072 bytes (40 MB)

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
# 4000001072 bytes (4.00 GB)

There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor (R object size relative to .csv file size) when reading files. However, this varies with the data; a quick way to measure the factor for your own file is sketched after the list below. For example, using

  • x = sample(1:10000000,10000000,replace = T) gives a factor of roughly 2x (R:csv).

  • x = sample(c("foofoofoo","barbarbar"),10000000,replace = T) gives a factor of 0.5x (R:csv).
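A minimal sketch for checking the ratio on your own file (assuming it has already been written to test.csv as above):

library(data.table)
DT <- fread("test.csv")
as.numeric(object.size(DT)) / file.size("test.csv")   # R-to-csv size factor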

Based on the max, your 9 GB file would take a potential 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints rather than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to use an out-of-memory tool such as a SQL database.
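To make "partition your analysis" concrete, here is a rough sketch of computing a grouped mean without ever holding the full table in memory, keeping only per-chunk partial sums; the column names COL2/COL3 and the chunk size are hypothetical:

library(data.table)
con        <- file("in.csv", "r")
header     <- readLines(con, n = 1)                # keep the header line
chunk_size <- 1e6
partials   <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- fread(text = c(header, lines))          # parse this chunk only
  partials[[length(partials) + 1]] <-
    chunk[, .(s = sum(COL3), n = .N), by = COL2]   # partial sums per group
}
close(con)
result <- rbindlist(partials)[, .(meanVal = sum(s) / sum(n)), by = COL2]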