Quickest ways to read large files with a varying number of columns in Python


After a few thousand rows, this is doing tons of extra work:

    data = data + cline

Just data.extend(cline). (Or .append(), if you want to know which numbers appeared together on a line.)
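
A minimal sketch of the difference, using cline for the tokens parsed from one line as in the question:

    data = []
    with open("file.txt") as f:
        for line in f:
            cline = line.split()
            # data = data + cline  # builds a brand-new list every iteration: quadratic overall
            data.extend(cline)     # appends in place: amortized O(1) per element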

Consider storing doubles instead of text:

    data.extend([float(c) for c in line.split()])


numpy.loadtxt would have been perfect here, but it doesn't apply because the number of columns changes from row to row.
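
For comparison, if every row had the same number of columns it would simply be (a sketch, assuming whitespace-delimited data):

    import numpy as np

    data = np.loadtxt("file.txt")  # raises an error as soon as a row has a different column count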

Since you want a flat list, you can speed it up a bit by using a list comprehension:

    from numpy import *

    with open("file.txt") as f:
        data = array([float(x) for l in f for x in l.split()])

(Now I'm pretty sure it will be much faster, considering the mistake that JH pointed out in his answer: data = data + line creates a new list each time, which is quadratic complexity. You avoid that with the list comprehension.)


Pandas is much better/faster at handling ragged columns than numpy is, and should be faster than a vanilla python implementation with a loop.

Use read_csv, followed by stack, and then access the values attribute to return a numpy array.

    import pandas as pd

    max_per_row = 10  # set this to the max possible number of elements in a row
    # buf is the file path or file-like object holding the data
    vals = pd.read_csv(buf, header=None, names=range(max_per_row),
                       delim_whitespace=True).stack().values
    print(vals)

    array([  3. ,   2.5,   1.1,  30.2,  11.5,   5. ,   6.2,  12.2,  70.2,
            14.7,   3.2,   1.1])
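
This works because names=range(max_per_row) pads the shorter rows with NaN, and stack() drops those NaNs before flattening. A self-contained sketch with made-up data (not the question's numbers), just to show the shape of the call:

    import io
    import pandas as pd

    # Hypothetical ragged whitespace-delimited input, purely for illustration.
    buf = io.StringIO("1.0 2.0 3.0\n4.0 5.0\n6.0\n")

    max_per_row = 10  # upper bound on elements per row
    vals = pd.read_csv(buf, header=None, names=range(max_per_row),
                       delim_whitespace=True).stack().values
    print(vals)  # [1. 2. 3. 4. 5. 6.]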