Reading a huge CSV in R
Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
In addition, you can read in the file in chunks:
file_in <- file("in.csv", "r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n = chunk_size)
# call readLines() again for the next chunk; close(file_in) when finished
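Putting those pieces together, a complete chunked-read loop might look like the following sketch (the file name and the processing step are placeholders):

```r
file_in <- file("in.csv", "r")
chunk_size <- 100000                    # choose the best size for you
header <- readLines(file_in, n = 1)     # keep the header line out of the chunks
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break             # no lines left
  # process the chunk here, e.g. parse it with read.csv(textConnection(x))
}
close(file_in)
```

Each pass through the loop holds only `chunk_size` lines in memory, which is the point of this approach.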
You can use data.table to handle reading and manipulating large files more efficiently:
library(data.table)
DT <- fread("in.csv", header = TRUE)
If needed, you can leverage disk storage with ff:
library("ff")
x <- read.csv.ffdf(file = "file.csv", header = TRUE, VERBOSE = TRUE,
                   first.rows = 10000, next.rows = 50000, colClasses = NA)
You might want to consider leveraging some on-disk processing rather than holding the entire object in R's memory. One option is to store the data in a proper database and have R access that. dplyr can deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.
Install SQLite
https://www.sqlite.org/download.html
Enter the data into a new SQLite database
- Save the following in a new file named import.sql
CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName
Yes, you'll need to specify the column names yourself (I believe), but you can also specify their types here if you wish. This won't work if you have commas anywhere in your names/data, of course.
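If you do want typed columns, the CREATE TABLE line in import.sql might look like this instead (the names and types here are just placeholders matching the example above):

```sql
CREATE TABLE tableName (COL1 INTEGER, COL2 TEXT, COL3 REAL, COL4 TEXT);
```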
- Import the data into the SQLite database via the command line
sqlite3.exe BIGDATA.sqlite3 < import.sql
Point dplyr to the SQLite database
As we're using SQLite, all of the dependencies are handled by dplyr already.
library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
Do your exploratory analysis
dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception will be that you can't query the number of rows.
my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))
#> Source:   query [?? x 2]
#> Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>
#>    COL2  meanVal
#>   <chr>    <dbl>
#> 1  1979 15.26476
#> 2  1980 16.09677
#> 3  1981 15.83936
#> 4  1982 14.47380
#> 5  1983 15.36479
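When you do need the full result in local memory (for example, to count its rows), dplyr's collect() executes the query and returns an ordinary in-memory table; a sketch assuming the my_tbl object from above:

```r
library(dplyr)

local_res <- my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3)) %>%
  collect()        # runs the SQL and pulls the result into R

nrow(local_res)    # row counts work again once the data is local
```

The idea is to push the heavy aggregation down to SQLite and only collect() the (much smaller) summary.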
This may not be possible on your computer. In certain cases, a data.table takes up more space than its .csv counterpart.
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 29 MB file
DT <- fread("test.csv")
object.size(DT)
> 40001072 bytes # 40 MB
Two orders of magnitude larger:
DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes # 4.00 GB
There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor when reading files. However, this varies based on the data. For example, using
x = sample(1:10000000, 10000000, replace = TRUE)
gives a factor of roughly 2x (R:csv), while
x = sample(c("foofoofoo", "barbarbar"), 10000000, replace = TRUE)
gives a factor of 0.5x (R:csv).
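You can estimate the ratio for your own data on a small sample before committing to the full read; a base-R sketch (the sample column here is just an illustration):

```r
# build a small sample resembling your data, write it out, and compare sizes
df <- data.frame(x = sample(1:2, 100000, replace = TRUE))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

ratio <- as.numeric(object.size(df)) / file.size(csv_path)
ratio  # in-memory bytes per on-disk byte; multiply your CSV size by this
```

Multiplying your actual file size by this ratio gives a rough lower bound on the RAM the full object would need.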
Based on the max, your 9 GB file could take 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to use an on-disk tool like SQL.