
Trimming a huge (3.5 GB) csv file to read into R


My try with readLines. This piece of code creates a CSV containing only the selected years.

file_in  <- file("in.csv", "r")
file_out <- file("out.csv", "a")

x <- readLines(file_in, n = 1)
writeLines(x, file_out)   # copy the header line

B <- 300000               # chunk size; adjust to how much fits in memory
while (length(x)) {
    ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
    if (length(ind)) writeLines(x[ind], file_out)
    x <- readLines(file_in, n = B)
}
close(file_in)
close(file_out)


I'm not an expert at this, but you might consider trying MapReduce, which basically means taking a "divide and conquer" approach. R has several options for this, including the two below; a rough plain-R sketch of the chunked idea follows the list.

  1. mapReduce (pure R)
  2. RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files
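
To illustrate the divide-and-conquer idea without committing to either package's API, here is a rough base-R sketch that filters the file chunk by chunk (the "map" step) and concatenates the matches (the "reduce" step); the file name and the year pattern are carried over from the question:

con <- file("in.csv", "r")
header <- readLines(con, n = 1)
keep <- character(0)
repeat {
    chunk <- readLines(con, n = 100000)   # map: work on one chunk of lines
    if (length(chunk) == 0) break
    keep <- c(keep, grep("^[^;]*;[^;]*; 20(09|10)", chunk, value = TRUE))  # reduce: collect the matches
}
close(con)
writeLines(c(header, keep), "out.csv")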

Alternatively, R provides several packages for working with data sets that are too large to fit in memory (they keep the data on disk). You could probably load the whole dataset into a bigmemory object and do the reduction entirely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
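
As a rough sketch of that route (assuming the file is all numeric, since a big.matrix holds a single numeric type, and assuming a column named "Year" plus the ";" separator from the question):

library(bigmemory)

# File-backed matrix: the data stays on disk rather than in RAM.
big <- read.big.matrix("in.csv", sep = ";", header = TRUE, type = "double",
                       backingfile = "in.bin", descriptorfile = "in.desc")

# Do the reduction against the on-disk object, pulling only the rows you need.
rows <- which(big[, "Year"] %in% c(2009, 2010))
subset_df <- as.data.frame(big[rows, ])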


Is there a similar way to read in files a piece at a time in R?

Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read the data a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Alternatively, if you feel like firing up a high-memory instance on Amazon's EC2, you can get up to 64 GB of RAM; that should hold your file with plenty of room left over to manipulate the data.
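
A minimal sketch of the block-wise readChar() approach (the one-million-character chunk size is arbitrary, and note that a chunk may end in the middle of a line, so line-oriented filtering is usually easier with readLines(), as in the question's own code):

con <- file("in.csv", "rb")
total_chars <- 0
repeat {
    chunk <- readChar(con, nchars = 1e6)             # read up to one million characters
    if (length(chunk) == 0 || !nzchar(chunk)) break  # stop at end of file
    total_chars <- total_chars + nchar(chunk)        # replace with real per-chunk work
}
close(con)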

If you need more speed, then Shane's recommendation to use MapReduce is a very good one. However, if you go the route of a high-memory instance on EC2, you should look at the multicore package for using all the cores on the machine.
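
For example, here is a minimal sketch of per-chunk parallel filtering with mclapply() (multicore's functionality now also ships in the base parallel package; the pre-split chunk file names are hypothetical):

library(parallel)

chunk_files <- sprintf("in_%02d.csv", 1:8)   # hypothetical pre-split pieces of the big file

filter_chunk <- function(f) {
    x <- readLines(f)
    grep("^[^;]*;[^;]*; 20(09|10)", x, value = TRUE)   # keep only the 2009/2010 rows
}

# One worker per core (forking, so Unix-alikes only); results come back as a list.
kept <- mclapply(chunk_files, filter_chunk, mc.cores = detectCores())
writeLines(unlist(kept), "out.csv")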

If you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import directly into SQLite from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
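
A hedged sketch with sqldf's read.csv.sql(), which stages the file in a temporary on-disk SQLite database and returns only the rows the query selects (the column name "Year" and the ";" separator are assumptions based on the question):

library(sqldf)

# The staged table is referred to as "file" in the SQL statement.
wanted <- read.csv.sql("in.csv",
                       sql = "select * from file where Year in (2009, 2010)",
                       header = TRUE, sep = ";")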