
R reading a huge csv


Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
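If you're unsure which build you're running, a quick check from within R might look like the sketch below (note that memory.limit() is Windows-only and no longer has any effect in R >= 4.2):

R.version$arch              # "x86_64" indicates 64-bit R
.Machine$sizeof.pointer     # 8 on a 64-bit build, 4 on 32-bit
memory.limit()              # Windows only: current cap in MB
memory.limit(size = 16000)  # Windows only: raise the cap towards your 16 GB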

In addition, you can read in the file in chunks:

file_in    <- file("in.csv", "r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n = chunk_size)
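That reads only the first chunk; to walk the whole file you repeat the call until it returns nothing. A minimal sketch, assuming each chunk can be processed independently (process_chunk() is a hypothetical placeholder for your own logic):

file_in    <- file("in.csv", "r")
chunk_size <- 100000
header     <- readLines(file_in, n = 1)      # set the header row aside
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                  # end of file
  chunk <- read.csv(text = c(header, x))     # parse this chunk only
  # process_chunk(chunk)                     # hypothetical per-chunk work
}
close(file_in)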

You can use data.table's fread() to read and manipulate large files more efficiently:

require(data.table)
dt <- fread("in.csv", header = TRUE)
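If only part of the file is needed, fread's select and colClasses arguments can reduce memory use considerably; the column names and types below are just placeholders for your own:

require(data.table)
dt <- fread("in.csv",
            select     = c("COL1", "COL3"),                         # read only these columns
            colClasses = list(character = "COL1", numeric = "COL3"))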

If needed, you can fall back to disk-backed storage with the ff package:

library("ff")x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE,                    first.rows=10000, next.rows=50000, colClasses=NA)


You might want to consider leveraging some on-disk processing and not have that entire object in R's memory. One option would be to store the data in a proper database and then have R access it. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Enter the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db  <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
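A quick, optional sanity check is to list the tables dplyr can see in that source; the table name assumes the import.sql above:

src_tbls(my_db)   # should include "tableName"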

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source, and the remote table will otherwise behave much like a local one. The big exception is that you can't query the number of rows (hence the ?? in the dimensions below).

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
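If a summary like this is small enough to fit in RAM, dplyr's collect() will execute the query and bring the result back as an ordinary local data frame for further work:

local_summary <- my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3)) %>%
  collect()    # runs the SQL and returns an in-memory data frame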


Loading the whole file may simply not be possible on your computer: in some cases a data.table takes up more space in memory than its .csv counterpart does on disk.

library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 29 MB file
DT <- fread("test.csv")
object.size(DT)
# 40001072 bytes (40 MB)

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
# 4000001072 bytes (4.00 GB)

There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor (R object size relative to .csv file size) when reading files. However, this varies with the data; a quick way to measure the factor for your own file is sketched after the list below. For example, using

  • x = sample(1:10000000,10000000,replace = T) gives a factor of roughly 2x (R:csv).

  • x = sample(c("foofoofoo","barbarbar"),10000000,replace = T) gives a factor of 0.5x (R:csv).
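A minimal sketch for checking the ratio on your own file (assuming it has already been written to test.csv as above):

library(data.table)
DT <- fread("test.csv")
as.numeric(object.size(DT)) / file.size("test.csv")   # R-to-csv size factor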

Based on the max, your 9 GB file would take a potential 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints rather than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to use an out-of-memory tool such as a SQL database.
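To make "partition your analysis" concrete, here is a rough sketch of computing a grouped mean without ever holding the full table in memory, keeping only per-chunk partial sums; the column names COL2/COL3 and the chunk size are hypothetical:

library(data.table)
con        <- file("in.csv", "r")
header     <- readLines(con, n = 1)                # keep the header line
chunk_size <- 1e6
partials   <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- fread(text = c(header, lines))          # parse this chunk only
  partials[[length(partials) + 1]] <-
    chunk[, .(s = sum(COL3), n = .N), by = COL2]   # partial sums per group
}
close(con)
result <- rbindlist(partials)[, .(meanVal = sum(s) / sum(n)), by = COL2]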