Fastest way to load huge .dat into array


Looking at the source, it appears that numpy.loadtxt contains a lot of code to handle many different formats. If you have a well-defined input file, it is not too difficult to write your own function optimized for your particular file format. Something like this (untested):

import numpy as np

def load_big_file(fname):
    '''only works for a well-formed text file of space-separated doubles'''
    rows = []  # unknown number of lines, so collect rows in a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)  # convert list of row vectors to a 2-D array

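As a quick sanity check, usage would look something like this (the file name is just a placeholder):

data = load_big_file('data.dat')  # placeholder path to a space-separated text file
print(data.shape, data.dtype)     # rows/columns found in the file, dtype float64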
An alternative solution, if the number of rows and columns is known before, might be:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x

In this way, you don't have to allocate all the intermediate lists.
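If the dimensions are not known up front, a small two-pass sketch (my own helper, assuming a non-empty, rectangular file) can count them first and then reuse load_known_size:

def load_two_pass(fname):
    # first pass: count columns from the first line and rows from the rest
    with open(fname) as f:
        ncol = len(f.readline().split())
        nrow = 1 + sum(1 for _ in f)
    # second pass: fill the preallocated array
    return load_known_size(fname, nrow, ncol)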

EDIT: It seems that the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the fact that NumPy implicitly converts strings to floats when assigning to an array (I only discovered that just now), this might be faster still:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()  # NumPy converts the strings to floats on assignment
    return x

To get any further speedup, you would probably have to use code written in C or Cython. I would be interested to know how long these functions take to load your files.
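For reference, a minimal timing sketch (time_loader and the file name are my own placeholders, not from the question) could look like this:

import time

def time_loader(loader, *args):
    # rough wall-clock timing of a single call; repeat a few times for stable numbers
    t0 = time.perf_counter()
    result = loader(*args)
    print(f'{loader.__name__}: {time.perf_counter() - t0:.2f} s, shape {result.shape}')
    return result

# e.g. time_loader(load_big_file, 'data.dat')
#      time_loader(load_known_size, 'data.dat', nrow, ncol)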