
NumPy dtype issues in genfromtxt(), reads string in as bytestring


In Python 2.7:

array([('ZINC00043096', 'C.3', 'C1', -0.154, 'methyl'),
       ('ZINC00043096', 'C.3', 'C2', 0.0638, 'methylene'),
       ('ZINC00043096', 'C.3', 'C4', 0.0669, 'methylene'),
       ('ZINC00090377', 'C.3', 'C7', 0.207, 'methylene')],
      dtype=[('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'), ('f3', '<f8'), ('f4', 'S9')])

In Python 3:

array([(b'ZINC00043096', b'C.3', b'C1', -0.154, b'methyl'),
       (b'ZINC00043096', b'C.3', b'C2', 0.0638, b'methylene'),
       (b'ZINC00043096', b'C.3', b'C4', 0.0669, b'methylene'),
       (b'ZINC00090377', b'C.3', b'C7', 0.207, b'methylene')],
      dtype=[('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'), ('f3', '<f8'), ('f4', 'S9')])

The 'regular' strings in Python3 are unicode. But your text file has byte strings. all_data is the same in both cases (136 bytes), but Python3's way of displaying a byte string is b'C.3', not just 'C.3'.

What kinds of operations do you plan on doing with these strings? 'ZIN' in all_data['f0'][1] works with the 2.7 version, but in 3 you have to use b'ZIN' in all_data['f0'][1].
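As a rough illustration of the bytes-versus-unicode point, here is a minimal sketch. The `ids` array is a made-up stand-in for the `f0` column of `all_data`, and it assumes the data is plain ASCII.

```python
import numpy as np

# Hypothetical stand-in for all_data['f0']: fixed-width byte strings.
ids = np.array([b'ZINC00043096', b'ZINC00090377'], dtype='S12')

item = ids[1]                        # a bytes scalar in Python 3
has_prefix_bytes = b'ZIN' in item    # a bytes pattern works on a bytes value

# Decode a single element to str, after which a str pattern works.
text = item.decode('ascii')
has_prefix_text = 'ZIN' in text

# Or convert the whole column to a unicode dtype (assumes ASCII content).
ids_u = ids.astype('U12')
print(has_prefix_bytes, has_prefix_text, ids_u.dtype)
```

Decoding once after loading is usually less error-prone than remembering the `b` prefix in every comparison.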

The question "Variable/unknown length string/unicode dtype in numpy" reminds me that you can specify a unicode string type in the dtype. However, this becomes more complicated if you don't know the lengths of the strings beforehand.

alttype = np.dtype([('f0', 'U12'), ('f1', 'U3'), ('f2', 'U2'), ('f3', '<f8'), ('f4', 'U9')])
all_data_u = np.genfromtxt(csv_file, dtype=alttype, delimiter=',')

producing

array([('ZINC00043096', 'C.3', 'C1', -0.154, 'methyl'),
       ('ZINC00043096', 'C.3', 'C2', 0.0638, 'methylene'),
       ('ZINC00043096', 'C.3', 'C4', 0.0669, 'methylene'),
       ('ZINC00090377', 'C.3', 'C7', 0.207, 'methylene')],
      dtype=[('f0', '<U12'), ('f1', '<U3'), ('f2', '<U2'), ('f3', '<f8'), ('f4', '<U9')])

In Python2.7 all_data_u displays as

(u'ZINC00043096', u'C.3', u'C1', -0.154, u'methyl')

all_data_u is 448 bytes, because numpy allocates 4 bytes for each unicode character. A U4 item, for example, is 16 bytes long.
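The byte counts quoted here can be checked from the dtypes alone. This is a small sketch; `n_rows = 4` simply matches the four records shown in this thread.

```python
import numpy as np

# The two record layouts discussed above.
bytes_dtype = np.dtype([('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'),
                        ('f3', '<f8'), ('f4', 'S9')])
unicode_dtype = np.dtype([('f0', 'U12'), ('f1', 'U3'), ('f2', 'U2'),
                          ('f3', '<f8'), ('f4', 'U9')])

n_rows = 4  # number of records in the example arrays

# itemsize is the packed size of one record in bytes.
bytes_total = bytes_dtype.itemsize * n_rows      # (12+3+2+8+9) * 4
unicode_total = unicode_dtype.itemsize * n_rows  # (48+12+8+8+36) * 4
u4_size = np.dtype('U4').itemsize                # 4 characters * 4 bytes

print(bytes_total, unicode_total, u4_size)
```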


NumPy 1.14 added an encoding argument to the text IO functions: https://docs.scipy.org/doc/numpy/release.html#encoding-argument-for-text-io-functions
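A minimal sketch of what the `encoding` argument looks like in use, assuming NumPy 1.14 or later. The in-memory `csv_text` is a made-up two-row sample standing in for the real file.

```python
import io
import numpy as np

# Hypothetical sample standing in for the CSV file from the question.
csv_text = (
    "ZINC00043096,C.3,C1,-0.154,methyl\n"
    "ZINC00043096,C.3,C2,0.0638,methylene\n"
)

# With an explicit encoding, string columns come back as unicode ('U')
# rather than byte strings ('S'); dtype=None lets numpy infer each column.
all_data = np.genfromtxt(io.StringIO(csv_text), dtype=None,
                         delimiter=',', encoding='utf-8')

print(all_data.dtype)
print(all_data['f4'])
```

With this approach the string widths are inferred from the data, so there is no need to know them beforehand.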


In Python 3.6,

all_data = np.genfromtxt('csv_file.csv', delimiter=',', dtype='unicode')

works just fine.


You can also read every column as a fixed-width byte string:

np.genfromtxt(csv_file, dtype='|S12', delimiter=',')

Or you could select the columns that you know are strings using the usecols parameter:

np.genfromtxt(csv_file, dtype=None, delimiter=',', usecols=(0, 1, 2, 4))
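One possible variation on the `usecols` idea is sketched below: it reads only the string columns and asks for a unicode dtype directly. The `csv_text` sample, the `'U12'` width, and the explicit `encoding` (NumPy 1.14 or later) are assumptions for illustration.

```python
import io
import numpy as np

# Hypothetical sample standing in for the CSV file from the question.
csv_text = (
    "ZINC00043096,C.3,C1,-0.154,methyl\n"
    "ZINC00090377,C.3,C7,0.207,methylene\n"
)

# Skip the float column (index 3) and read the rest as unicode strings.
# 'U12' is wide enough for the longest value in this sample.
str_cols = np.genfromtxt(io.StringIO(csv_text), dtype='U12', delimiter=',',
                         usecols=(0, 1, 2, 4), encoding='utf-8')

print(str_cols.shape)
print(str_cols[1])
```

Because a single non-structured dtype is used, the result is an ordinary 2-D array of strings rather than a structured array with named fields.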