numpy.genfromtxt: ValueError: Line # (got n columns instead of m)
Looks like you've already read the genfromtxt docs about missing values. Do they say anything about the use of delimiters?
I think it can handle missing values with lines like
    'one, 1, 234.4, , ,'
    'two, 3, , 4, 5'
but when the delimiter is the default 'white-space' it can't. One of the first steps after reading a line is
    strings = line.split(delimiter)
and it objects if len(strings) doesn't match the initial target. Apparently it does not try to guess that you want to pad the line with n-len(strings) missing values.
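To make the difference concrete, here is a small sketch using plain str.split. It only approximates what genfromtxt does internally; the comma line is the example above, the whitespace line is one row of the question's data, and the variable names are mine.

```python
# With an explicit delimiter, an empty field still produces a (blank) string,
# so the number of fields stays the same.
comma_line = "two, 3, , 4, 5"
comma_fields = comma_line.split(",")
print(len(comma_fields), comma_fields)

# With the default delimiter (None), any run of whitespace counts as a single
# separator, so blank fields vanish and the line just comes up short.
space_line = "17        O         O         O2"
space_fields = space_line.split(None)
print(len(space_fields), space_fields)
```

The first split yields five strings (one of them blank), while the second yields only four, with nothing to say which of the six columns are missing.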
Options that come to mind:
- Try Pandas; it may make more effort to guess your intentions.
- Write your own reader. Pandas is compiled; genfromtxt is plain numpy Python. It reads the file line by line, splits and converts fields, and appends the list to a master list. It converts that list of lists into an array at the end. Your own reader should be just as efficient.
- Preprocess your file to add the missing values or change the delimiter. genfromtxt accepts anything that feeds it lines, so it works with a list of strings, a file reader that yields modified lines, etc. This may be simplest.

    def foo(astr):
        strs = astr.split()
        if len(strs) < 6:
            strs.extend([b' '] * (6 - len(strs)))
        return b','.join(strs)
Simulating with a list of strings (in Py3):
In [139]: txt=b"""14 HO2 O3 OH O2 O2
     ...: 15 HO2 HO2 H2O2 O2
     ...: 16 H2O2 OH HO2 H2O
     ...: 17 O O O2
     ...: 18 O O2 O3
     ...: 19 O O3 O2 O2""".splitlines()

In [140]: [foo(l) for l in txt]
Out[140]:
[b'14,HO2,O3,OH,O2,O2',
 b'15,HO2,HO2,H2O2,O2, ',
 b'16,H2O2,OH,HO2,H2O, ',
 b'17,O,O,O2, , ',
 b'18,O,O2,O3, , ',
 b'19,O,O3,O2,O2, ']

In [141]: np.genfromtxt([foo(l) for l in txt], dtype=None, delimiter=',')
Out[141]:
array([(14, b'HO2', b'O3', b'OH', b'O2', b'O2'),
       (15, b'HO2', b'HO2', b'H2O2', b'O2', b''),
       (16, b'H2O2', b'OH', b'HO2', b'H2O', b''),
       (17, b'O', b'O', b'O2', b' ', b''),
       (18, b'O', b'O2', b'O3', b' ', b''),
       (19, b'O', b'O3', b'O2', b'O2', b'')],
      dtype=[('f0', '<i4'), ('f1', 'S4'), ('f2', 'S3'), ('f3', 'S4'), ('f4', 'S3'), ('f5', 'S2')])
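For the "write your own reader" option, a minimal sketch might look like the following. It assumes whitespace-separated fields where only trailing values are ever missing (a blank field in the middle of a row cannot be detected this way). The names `read_ragged`, `nfields` and `fill` are made up for illustration, and everything is kept as strings rather than converted per column.

```python
import numpy as np

def read_ragged(lines, nfields=6, fill=""):
    """Hypothetical reader: split each line, pad short rows, build one array."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) > nfields:
            raise ValueError(f"expected at most {nfields} fields, got {len(fields)}")
        # Pad on the right, i.e. assume the missing values are the trailing ones.
        fields.extend([fill] * (nfields - len(fields)))
        rows.append(fields)
    # Convert the list of lists into a 2-D string array at the end.
    return np.array(rows)

sample = """14 HO2 O3 OH O2 O2
15 HO2 HO2 H2O2 O2
16 H2O2 OH HO2 H2O
17 O O O2
18 O O2 O3
19 O O3 O2 O2""".splitlines()

table = read_ragged(sample)
ids = table[:, 0].astype(int)  # convert a single column after the fact
print(table.shape, ids)
```

Since the function only iterates over its argument, an open file object should work in place of the list of strings.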
It looks like your data is nicely aligned in fields of exactly 10 characters. If that is always the case, you can tell genfromtxt the field widths to use by specifying the sequence of field widths in the delimiter argument.
Here's an example.
First, your data file:
In [20]: !cat reaction.dat
14        HO2       O3        OH        O2        O2
15        HO2       HO2       H2O2      O2
16        H2O2      OH        HO2       H2O
17        O         O         O2
18        O         O2        O3
19        O         O3        O2        O2
For convenience, I'll define the number of fields and the field width here. (In general, it is not necessary that all the fields have the same width.)
In [21]: numfields = 6

In [22]: fieldwidth = 10
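As a side note on the unequal-width case, here is a sketch with a made-up three-column layout. The widths (4, 8, 8), the rows and the `widths` name are invented for illustration and are not taken from reaction.dat. The text is built with ljust so the padding is exact.

```python
import io
import numpy as np

# Hypothetical layout: a 4-character id column, then two 8-character columns.
widths = (4, 8, 8)
rows = [("14", "HO2", "O3"), ("15", "HO2", ""), ("16", "H2O2", "OH")]

# Pad every field to its column width and join the rows into one text block.
text = "\n".join(
    "".join(field.ljust(w) for field, w in zip(row, widths)) for row in rows
)

# A sequence of integers as `delimiter` is read as a list of field widths.
data = np.genfromtxt(io.StringIO(text), dtype="U8", delimiter=widths)
print(data.shape)
print([s.strip() for s in data[2]])
```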
Tell genfromtxt that the data is in fixed width columns by passing in the argument delimiter=(10, 10, 10, 10, 10, 10):
In [23]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth, delimiter=(fieldwidth,)*numfields)
Here's the result. Note that "missing" fields are empty strings. Also note that non-empty fields include the white space, and the last non-empty field in each row includes the newline character:
In [24]: data
Out[24]:
array([[b'14        ', b'HO2       ', b'O3        ', b'OH        ', b'O2        ', b'O2\n'],
       [b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ', b'O2\n', b''],
       [b'16        ', b'H2O2      ', b'OH        ', b'HO2       ', b'H2O\n', b''],
       [b'17        ', b'O         ', b'O         ', b'O2\n', b'', b''],
       [b'18        ', b'O         ', b'O2        ', b'O3\n', b'', b''],
       [b'19        ', b'O         ', b'O3        ', b'O2        ', b'O2\n', b'']],
      dtype='|S10')

In [25]: data[1]
Out[25]:
array([b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ', b'O2\n', b''],
      dtype='|S10')
We could clean up the strings in a second step, or we can have genfromtxt do it by providing a converter for each field that simply strips the white space from the field:
In [26]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth, delimiter=(fieldwidth,)*numfields,
    ...:                   converters={k: lambda s: s.strip() for k in range(numfields)})

In [27]: data
Out[27]:
array([[b'14', b'HO2', b'O3', b'OH', b'O2', b'O2'],
       [b'15', b'HO2', b'HO2', b'H2O2', b'O2', b''],
       [b'16', b'H2O2', b'OH', b'HO2', b'H2O', b''],
       [b'17', b'O', b'O', b'O2', b'', b''],
       [b'18', b'O', b'O2', b'O3', b'', b''],
       [b'19', b'O', b'O3', b'O2', b'O2', b'']],
      dtype='|S10')

In [28]: data[:,0]
Out[28]:
array([b'14', b'15', b'16', b'17', b'18', b'19'],
      dtype='|S10')

In [29]: data[:,5]
Out[29]:
array([b'O2', b'', b'', b'', b'', b''],
      dtype='|S10')
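For comparison with the converters approach above, here is a sketch of two other ways to get rid of the padding: stripping in a second step with np.char.strip, and passing autostrip=True to genfromtxt. To keep it self-contained, the file is rebuilt in memory from the rows shown earlier and read through io.StringIO with a unicode dtype instead of 'S10'; those choices and the variable names are assumptions of this sketch, not part of the session above.

```python
import io
import numpy as np

fieldwidth, numfields = 10, 6
rows = [
    ("14", "HO2", "O3", "OH", "O2", "O2"),
    ("15", "HO2", "HO2", "H2O2", "O2"),
    ("16", "H2O2", "OH", "HO2", "H2O"),
    ("17", "O", "O", "O2"),
    ("18", "O", "O2", "O3"),
    ("19", "O", "O3", "O2", "O2"),
]
# Rebuild the sample as fixed-width text; short rows simply end early.
text = "\n".join(
    "".join(f.ljust(fieldwidth) for f in row).rstrip() for row in rows
)
widths = (fieldwidth,) * numfields
dtype = f"U{fieldwidth}"

# Alternative 1: read the padded fields as-is, then strip in a second step.
raw = np.genfromtxt(io.StringIO(text), dtype=dtype, delimiter=widths)
cleaned = np.char.strip(raw)

# Alternative 2: ask genfromtxt to strip each field while splitting.
stripped = np.genfromtxt(io.StringIO(text), dtype=dtype,
                         delimiter=widths, autostrip=True)

print(cleaned[:, 0])
print(bool((cleaned == stripped).all()))
```

Either way the result should match what the converters dict produces, so which one to use is mostly a matter of taste.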