NumPy reading file with filtering lines on the fly
I can think of two approaches that provide some of the functionality you are asking for:
To read a file in chunks, in strides of n lines, etc.:
You can pass a generator to numpy.genfromtxt as well as to numpy.loadtxt. This way you can load a large dataset from a text file memory-efficiently while retaining all the convenient parsing features of the two functions.

To read data only from lines that match a criterion that can be expressed as a regex:
You can use numpy.fromregex with a regular expression to precisely define which tokens from a given line in the input file should be loaded. Lines not matching the pattern will be ignored.
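As a minimal sketch of the first option, here is a generator that keeps every n-th line and is handed straight to numpy.loadtxt. The helper name every_nth_line and the in-memory sample are my own assumptions for illustration, not part of any real dataset.

```python
import io

import numpy as np


def every_nth_line(f, n):
    """Yield every n-th line of an open text file, starting with the first."""
    for i, line in enumerate(f):
        if i % n == 0:
            yield line


# Hypothetical in-memory stand-in for a large text file
sample = io.StringIO("1 2\n3 4\n5 6\n7 8\n9 10\n")

# loadtxt pulls lines from the generator one at a time
strided = np.loadtxt(every_nth_line(sample, 2))
print(strided)
```

Only the lines the generator yields ever reach the parser, so the skipped lines cost no array memory.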
To illustrate the two approaches, I'm going to use an example from my research context.
I often need to load files with the following structure:
6
generated by VMD
CM 5.420501 3.880814 6.988216
HM1 5.645992 2.839786 7.044024
HM2 5.707437 4.336298 7.926170
HM3 4.279596 4.059821 7.029471
OD1 3.587806 6.069084 8.018103
OD2 4.504519 4.977242 9.709150
6
generated by VMD
CM 5.421396 3.878586 6.989128
HM1 5.639769 2.841884 7.045364
HM2 5.707584 4.343513 7.928119
HM3 4.277448 4.057222 7.022429
OD1 3.588119 6.069086 8.017814
These files can be huge (GBs) and I'm only interested in the numerical data. All data blocks have the same size (6 in this example) and they are always separated by two lines, so the stride of the blocks is 8.
Using the first approach:
First I'm going to define a generator that filters out the undesired lines:
def filter_lines(f, stride):
    for i, line in enumerate(f):
        # Skip the two header lines at the start of every block
        if i % stride and (i - 1) % stride:
            yield line
Then I open the file, create a filter_lines generator (here I need to know the stride), and pass that generator to genfromtxt:
import numpy as np

with open(fname) as f:
    data = np.genfromtxt(filter_lines(f, 8), dtype='f', usecols=(1, 2, 3))
This works like a breeze. Note that I'm able to use usecols to get rid of the first column of the data. In the same way, you could use all the other features of genfromtxt: detecting the types, varying types from column to column, missing values, converters, etc.
In this example data.shape was (204000, 3) while the original file consisted of 272000 lines.
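To check which lines the stride filter keeps, here is a small sketch on a made-up miniature file with two blocks of two header lines plus six data lines each. The placeholder rows ("X 0.0 0.0 0.0") are my own invention. It also reproduces the line-count arithmetic above: 6 of every 8 lines survive.

```python
import io


def filter_lines(f, stride):
    for i, line in enumerate(f):
        # i % stride == 0 is the count line, (i - 1) % stride == 0 the comment line
        if i % stride and (i - 1) % stride:
            yield line


# Hypothetical miniature file: 2 blocks, each with 2 header lines and 6 data lines
block = "6\ngenerated by VMD\n" + "".join("X 0.0 0.0 0.0\n" for _ in range(6))
kept = list(filter_lines(io.StringIO(block * 2), 8))

print(len(kept))        # number of surviving lines
print(272000 // 8 * 6)  # expected row count for the 272000-line file
```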
Here the generator is used to filter homogeneously strided lines, but one can likewise imagine it filtering out inhomogeneous blocks of lines based on (simple) criteria.
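A sketch of such a criterion-based filter, assuming data lines are the only ones with exactly four whitespace-separated fields. The helper name data_lines, the n_fields parameter, and the sample with blocks of different sizes are hypothetical.

```python
import io

import numpy as np


def data_lines(f, n_fields=4):
    """Yield only lines that split into exactly n_fields tokens."""
    for line in f:
        if len(line.split()) == n_fields:
            yield line


# Hypothetical sample with blocks of different sizes (2 and 3 data lines)
sample = io.StringIO(
    "2\n"
    "generated by VMD\n"
    "CM 5.420501 3.880814 6.988216\n"
    "HM1 5.645992 2.839786 7.044024\n"
    "3\n"
    "generated by VMD\n"
    "CM 5.421396 3.878586 6.989128\n"
    "HM1 5.639769 2.841884 7.045364\n"
    "HM2 5.707584 4.343513 7.928119\n"
)

coords = np.genfromtxt(data_lines(sample), dtype='f', usecols=(1, 2, 3))
print(coords.shape)
```

Because the filter looks at each line's content rather than its position, no stride has to be known in advance.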
Using the second approach:
Here's the regexp I'm going to use:
regexp = r'\s+\w+' + r'\s+([-.0-9]+)' * 3 + r'\s*\n'
Groups (i.e. whatever is inside parentheses) define the tokens to be extracted from a given line.
Next, fromregex does the job and ignores lines not matching the pattern:
data = np.fromregex(fname, regexp, dtype='f')
The result is exactly the same as in the first approach.
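A self-contained sketch of the regex approach on an in-memory sample. There are two assumptions of mine. First, the pattern starts with `\s+`, so the sample gives every line leading whitespace. Second, NumPy's documentation for fromregex states that dtype must be a structured datatype, so I use a structured dtype with hypothetical field names x, y, z and then stack the fields into a plain (N, 3) array.

```python
import io

import numpy as np

regexp = r'\s+\w+' + r'\s+([-.0-9]+)' * 3 + r'\s*\n'

# Hypothetical sample; each line starts with a space (assumed file layout)
sample = io.StringIO(
    " 2\n"
    " generated by VMD\n"
    " CM 5.420501 3.880814 6.988216\n"
    " HM1 5.645992 2.839786 7.044024\n"
)

# Field names x, y, z are illustrative, one field per regex group
xyz = [('x', 'f4'), ('y', 'f4'), ('z', 'f4')]
data = np.fromregex(sample, regexp, dtype=xyz)

# Stack the named fields into an ordinary two-dimensional float array
coords = np.column_stack([data['x'], data['y'], data['z']])
print(coords.shape)
```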
If you pass a list of types (the format specification), wrap the conversion in a try block, and use yield to turn a genfromtxt-like reader into a generator, you should be able to replicate textscan().
def genfromtext(fname, formatTypes):
    with open(fname, 'r') as file:
        for line in file:
            try:
                cells = line.split(',')
                # Convert each cell with the type given for its column
                r = [convert(cell) for convert, cell in zip(formatTypes, cells)]
            except ValueError:
                continue  # Fail silently: skip lines with a cell that cannot be converted
            yield r
Edit: I forgot the except block. It runs okay now and you can use genfromtext as a generator like so (using a random CSV log I have sitting around):
>>> a = genfromtext('log.txt', [str, str, str, int])
>>> next(a)
['10.10.9.45', ' 2013/01/17 16:29:26', '00:00:36', 0]
>>> next(a)
['10.10.9.45', ' 2013/01/17 16:22:20', '00:08:14', 0]
>>> next(a)
['10.10.9.45', ' 2013/01/17 16:31:05', '00:00:11', 3]
I should probably note that I'm using zip to pair up the comma-split line with formatTypes. zip yields tuples of corresponding items from the two lists (stopping when one of them runs out of items), so we can iterate over both together and avoid a loop that depends on len(line) or something like that.
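A tiny sketch of that zip behaviour, with made-up cell values: the line has four cells but only three types are given, so the extra cell is dropped.

```python
format_types = [str, str, int]

# Hypothetical line with one more cell than there are types
cells = "10.10.9.45,00:00:36,3,extra".split(',')

# zip stops at the shorter of the two sequences
row = [convert(cell) for convert, cell in zip(format_types, cells)]
print(row)
```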
This is an attempt to demonstrate my comment to the OP.
import numpy as np

def fread(name, cond):
    # Yield the whitespace-split fields of every line accepted by cond
    with open(name) as file:
        for line in file:
            if cond(line):
                yield line.split()

def a_genfromtxt_cond(fname, cond=(lambda line: True)):
    """Seems to work without need to convert to float."""
    return np.array(list(fread(fname, cond)), dtype=np.float64)

def b_genfromtxt_cond(fname, cond=(lambda line: True)):
    # Truncate every value to an integer
    r = [[int(float(i)) for i in l] for l in fread(fname, cond)]
    return np.array(r, dtype=int)

a = a_genfromtxt_cond("tar.data")
print(a)
aa = b_genfromtxt_cond("tar.data")
print(aa)
Output
[[ 1. 2.3 4.5]
 [ 4.7 9.2 6.7]
 [ 4.7 1.8 4.3]]
[[1 2 4]
 [4 9 6]
 [4 1 4]]
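The default condition accepts every line. Here is a sketch of passing a real condition instead. The temporary file, its contents, and the rule "skip lines starting with #" are assumptions of mine for illustration.

```python
import os
import tempfile

import numpy as np


def fread(name, cond):
    with open(name) as file:
        for line in file:
            if cond(line):
                yield line.split()


# Hypothetical sample file: one comment line followed by three data rows
with tempfile.NamedTemporaryFile('w', suffix='.data', delete=False) as tmp:
    tmp.write("# x y z\n1 2.3 4.5\n4.7 9.2 6.7\n4.7 1.8 4.3\n")
    path = tmp.name

try:
    # Hypothetical condition: keep only lines that do not start with '#'
    rows = list(fread(path, lambda line: not line.startswith('#')))
    filtered = np.array(rows, dtype=np.float64)
finally:
    os.remove(path)

print(filtered)
```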