
numpy.genfromtxt- ValueError- Line # (got n columns instead of m)


Looks like you've already read the genfromtxt docs about missing values. Do they say anything about the use of delimiters?

I think it can handle missing values with lines like

    'one, 1, 234.4, , ,'
    'two, 3, , 4, 5'

but when the delimiter is the default whitespace it can't. One of the first steps after reading a line is

    strings = line.split(delimiter)

and it objects if len(strings) doesn't match the column count it settled on initially. Apparently it does not try to guess that you want to pad the line with n-len(strings) missing values.
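To make the mismatch concrete, here is a minimal sketch of the splitting step in plain Python. It only imitates the idea of `line.split(delimiter)`; it is not genfromtxt's actual code. The helper name `split_fields`, the sample lines, and the expected count of 6 are assumptions for illustration.

```python
def split_fields(line, delimiter=None):
    """Split a line the way a simple reader might (illustration only)."""
    return line.split(delimiter)

comma_line = "two, 3, , 4, 5"
space_line = "17        O         O         O2"

# With an explicit delimiter, an empty field survives as a blank string.
comma_fields = split_fields(comma_line, ",")

# With delimiter=None, runs of whitespace collapse, so missing
# trailing fields leave no trace in the result.
space_fields = split_fields(space_line)

print(len(comma_fields), comma_fields)
print(len(space_fields), space_fields)

expected = 6  # hypothetical column count, taken from the longest sample row
print("fields missing from the whitespace line:", expected - len(space_fields))
```

The comma-separated line keeps its blank third field, while the whitespace-separated line simply comes back shorter, which is the situation the ValueError reports.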

Options that come to mind:

  • try Pandas; it may make more effort to guess your intentions

  • write your own reader. Pandas is compiled; genfromtxt is written in plain Python. It reads the file line by line, splits and converts fields, and appends each list to a master list. It converts that list of lists into an array at the end. Your own reader should be just as efficient.

  • preprocess your file to add the missing values or change the delimiter. genfromtxt accepts anything that feeds it lines. So it works with a list of strings, a file reader that yields modified lines, etc. This may be simplest.

    def foo(astr):
        strs = astr.split()
        if len(strs) < 6:
            strs.extend([b' '] * (6 - len(strs)))
        return b','.join(strs)

Simulating with a list of strings (in Py3):

    In [139]: txt=b"""14        HO2       O3        OH        O2        O2
         ...: 15        HO2       HO2       H2O2      O2
         ...: 16        H2O2      OH        HO2       H2O
         ...: 17        O         O         O2
         ...: 18        O         O2        O3
         ...: 19        O         O3        O2        O2""".splitlines()

    In [140]: [foo(l) for l in txt]
    Out[140]:
    [b'14,HO2,O3,OH,O2,O2',
     b'15,HO2,HO2,H2O2,O2, ',
     b'16,H2O2,OH,HO2,H2O, ',
     b'17,O,O,O2, , ',
     b'18,O,O2,O3, , ',
     b'19,O,O3,O2,O2, ']

    In [141]: np.genfromtxt([foo(l) for l in txt], dtype=None, delimiter=',')
    Out[141]:
    array([(14, b'HO2', b'O3', b'OH', b'O2', b'O2'),
           (15, b'HO2', b'HO2', b'H2O2', b'O2', b''),
           (16, b'H2O2', b'OH', b'HO2', b'H2O', b''),
           (17, b'O', b'O', b'O2', b' ', b''),
           (18, b'O', b'O2', b'O3', b' ', b''),
           (19, b'O', b'O3', b'O2', b'O2', b'')],
           dtype=[('f0', '<i4'), ('f1', 'S4'), ('f2', 'S3'), ('f3', 'S4'), ('f4', 'S3'), ('f5', 'S2')])
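The same idea can be written as a generator, so the padded lines are produced on the fly rather than collected in a list first. This is only a sketch under a few assumptions: the name `padded_lines`, the `ncols=6` default, and the empty-string fill are illustrative choices, and `io.StringIO` stands in for a real open file. It uses text (`str`) lines and `dtype=str` rather than the bytes and `dtype=None` shown above.

```python
import io
import numpy as np

def padded_lines(lines, ncols=6, fill=""):
    """Yield comma-joined lines, padding short rows out to ncols fields.

    Hypothetical helper: ncols and fill are assumptions about the data.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fields.extend([fill] * (ncols - len(fields)))
        yield ",".join(fields)

# Stand-in for an open file; any iterable of text lines would do.
sample = io.StringIO(
    "14        HO2       O3        OH        O2        O2\n"
    "15        HO2       HO2       H2O2      O2\n"
    "16        H2O2      OH        HO2       H2O\n"
    "17        O         O         O2\n"
    "18        O         O2        O3\n"
    "19        O         O3        O2        O2\n"
)

# genfromtxt accepts a generator of lines as its first argument.
data = np.genfromtxt(padded_lines(sample), dtype=str, delimiter=",")
print(data.shape)
print(data)
```

With a real file, the open file handle would be passed to `padded_lines` in place of the `StringIO` object.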


It looks like your data is nicely aligned in fields of exactly 10 characters. If that is always the case, you can tell genfromtxt the field widths to use by specifying the sequence of field widths in the delimiter argument.

Here's an example.

First, your data file:

    In [20]: !cat reaction.dat
    14        HO2       O3        OH        O2        O2
    15        HO2       HO2       H2O2      O2
    16        H2O2      OH        HO2       H2O
    17        O         O         O2
    18        O         O2        O3
    19        O         O3        O2        O2

For convenience, I'll define the number of fields and the field width here. (In general, it is not necessary that all the fields have the same width.)

    In [21]: numfields = 6

    In [22]: fieldwidth = 10

Tell genfromtxt that the data is in fixed width columns by passing in the argument delimiter=(10, 10, 10, 10, 10, 10):

    In [23]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth, delimiter=(fieldwidth,)*numfields)

Here's the result. Note that "missing" fields are empty strings. Also note that non-empty fields include the white space, and the last non-empty field in each row includes the newline character:

    In [24]: data
    Out[24]:
    array([[b'14        ', b'HO2       ', b'O3        ', b'OH        ',
            b'O2        ', b'O2\n'],
           [b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ',
            b'O2\n', b''],
           [b'16        ', b'H2O2      ', b'OH        ', b'HO2       ',
            b'H2O\n', b''],
           [b'17        ', b'O         ', b'O         ', b'O2\n', b'', b''],
           [b'18        ', b'O         ', b'O2        ', b'O3\n', b'', b''],
           [b'19        ', b'O         ', b'O3        ', b'O2        ',
            b'O2\n', b'']],
           dtype='|S10')

    In [25]: data[1]
    Out[25]:
    array([b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ', b'O2\n',
           b''],
           dtype='|S10')

We could clean up the strings in a second step, or we can have genfromtxt do it by providing a converter for each field that simply strips the white space from the field:

    In [26]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth,
        ...:                   delimiter=(fieldwidth,)*numfields,
        ...:                   converters={k: lambda s: s.strip() for k in range(numfields)})

    In [27]: data
    Out[27]:
    array([[b'14', b'HO2', b'O3', b'OH', b'O2', b'O2'],
           [b'15', b'HO2', b'HO2', b'H2O2', b'O2', b''],
           [b'16', b'H2O2', b'OH', b'HO2', b'H2O', b''],
           [b'17', b'O', b'O', b'O2', b'', b''],
           [b'18', b'O', b'O2', b'O3', b'', b''],
           [b'19', b'O', b'O3', b'O2', b'O2', b'']],
           dtype='|S10')

    In [28]: data[:,0]
    Out[28]:
    array([b'14', b'15', b'16', b'17', b'18', b'19'],
           dtype='|S10')

    In [29]: data[:,5]
    Out[29]:
    array([b'O2', b'', b'', b'', b'', b''],
           dtype='|S10')
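As a possible alternative to per-field converters, genfromtxt also has an `autostrip` argument, and the "second step" cleanup can be done with `np.char.strip`. The sketch below tries both. It assumes the same six 10-character fields, reads the data as text (`dtype=str`) rather than bytes, and uses `io.StringIO` in place of `reaction.dat`; the variable names are illustrative.

```python
import io
import numpy as np

raw = (
    "14        HO2       O3        OH        O2        O2\n"
    "15        HO2       HO2       H2O2      O2\n"
    "16        H2O2      OH        HO2       H2O\n"
    "17        O         O         O2\n"
    "18        O         O2        O3\n"
    "19        O         O3        O2        O2\n"
)
widths = (10,) * 6  # assumed layout: six fields, each 10 characters wide

# Option 1: let genfromtxt strip each field while reading.
stripped = np.genfromtxt(io.StringIO(raw), dtype=str,
                         delimiter=widths, autostrip=True)

# Option 2: read the padded fields as they are, then strip in a second step.
padded = np.genfromtxt(io.StringIO(raw), dtype=str, delimiter=widths)
cleaned = np.char.strip(padded)

print(stripped)
print(np.array_equal(stripped, cleaned))
```

Note that the widths are passed as a sequence. That is what lets short rows come back with empty trailing fields instead of raising the column-count error.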