Fastest way to load huge .dat into array


Looking at the source, it appears that numpy.loadtxt contains a lot of code to handle many different formats. If you have a well-defined input file, it is not too difficult to write your own function optimized for your particular file format. Something like this (untested):

import numpy as np

def load_big_file(fname):
    '''only works for a well-formed text file of space-separated doubles'''
    rows = []  # unknown number of lines, so collect rows in a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)  # convert list of row vectors to a 2-D array

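As a quick sanity check, usage would look something like this (the file name is just a placeholder):

data = load_big_file('data.dat')  # placeholder path to a space-separated text file
print(data.shape, data.dtype)     # rows/columns found in the file, dtype float64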
An alternative solution, if the number of rows and columns is known before, might be:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x

In this way, you don't have to allocate all the intermediate lists.
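If the dimensions are not known up front, a small two-pass sketch (my own helper, assuming a non-empty, rectangular file) can count them first and then reuse load_known_size:

def load_two_pass(fname):
    # first pass: count columns from the first line and rows from the rest
    with open(fname) as f:
        ncol = len(f.readline().split())
        nrow = 1 + sum(1 for _ in f)
    # second pass: fill the preallocated array
    return load_known_size(fname, nrow, ncol)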

EDIT: It seems that the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the fact that NumPy implicitly converts strings to floats when assigning to an array (I only discovered that just now), this might be faster still:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()  # NumPy converts the strings to floats on assignment
    return x

To get any further speedup, you would probably have to use code written in C or Cython. I would be interested to know how long these functions take to load your files.
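For reference, a minimal timing sketch (time_loader and the file name are my own placeholders, not from the question) could look like this:

import time

def time_loader(loader, *args):
    # rough wall-clock timing of a single call; repeat a few times for stable numbers
    t0 = time.perf_counter()
    result = loader(*args)
    print(f'{loader.__name__}: {time.perf_counter() - t0:.2f} s, shape {result.shape}')
    return result

# e.g. time_loader(load_big_file, 'data.dat')
#      time_loader(load_known_size, 'data.dat', nrow, ncol)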