numpy.loadtxt: how to ignore comma delimiters that appear inside quotes?


One way you could do it with a single numpy function call would be to use np.fromregex, which allows you to use Python's regular expression syntax to parse the contents of your text file in any arbitrary way. For example:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', object)

gives you:

array([['10', 'Apple, Banana', '20'],
       ['30', 'Orange, Watermelon', '40']], dtype=object)

To unpack that regular expression a bit, '(\d+)' will match one or more digits and '"(.+)"' will match one or more of any character inside double quotes. np.fromregex tries to match this expression within every line in your .csv file, and the parts that are inside the brackets are taken as the individual elements in each row of the output array.
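To see what the pattern captures on a single line, you can run it through plain `re` outside of NumPy (the sample line here is made up for illustration):

```python
import re

# each parenthesised group becomes one element of the tuples findall returns
pat = r'(\d+),"(.+)",(\d+)'
row = re.findall(pat, '10,"Apple, Banana",20')
print(row)  # [('10', 'Apple, Banana', '20')]
```

The comma inside the quotes is captured by the `(.+)` group rather than acting as a delimiter, which is the whole trick.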

If you want a record array as your output with different 'fields' for the three 'columns' in your .csv file, you could specify separate dtypes for each set of brackets in the regex:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')

gives you:

array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)],
      dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])


This issue has been discussed before. There isn't a parameter in loadtxt (or genfromtxt) that does what you want; in other words, those readers are not quote sensitive. The Python csv module is quote aware, and so is the pandas reader.
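For instance, csv.reader splits on commas but keeps quoted fields intact (Python 3's io.StringIO here; the answers below use Python 2's StringIO module):

```python
import csv
import io

# a quoted field containing a comma survives the split
sample = io.StringIO('10,"Apple, Banana",20')
row = next(csv.reader(sample))
print(row)  # ['10', 'Apple, Banana', '20']
```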

But processing the lines before passing them to loadtxt is quite acceptable. All the function needs is an iterable - something that can feed it lines one at a time. That can be a file, a list of lines, or a generator.

A simple processor would just replace the commas within quotes with some other character. Or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.

Using numpy.genfromtxt to read a csv file with strings containing commas

For example:

txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60"""

def foo(astr):
    # replace , outside quotes with ;
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x])

txt1 = [foo(astr) for astr in txt.splitlines()]
txtgen = (foo(astr) for astr in txt.splitlines())  # or as a generator
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']

np.genfromtxt(txtgen, delimiter=';', dtype=None)

produces:

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)],
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use with my sample txt I have to use a string buffer:

s = StringIO.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')

Its action distills down to:

pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)

It reads the whole file (f.read()) and does a findall which should produce a list like:

[('10', 'Apple, Banana', '20'), ('30', 'Pear, Orange', '40'), ('50', 'Peach, Mango', '60')]

A list of tuples is exactly what a structured array requires.
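That conversion can be reproduced directly: NumPy casts the string fields to the structured dtype on construction (Python 3 here, so a 'U' string dtype instead of 'S'):

```python
import numpy as np

# findall-style output: a list of tuples of strings
rows = [('10', 'Apple, Banana', '20'), ('30', 'Pear, Orange', '40')]

# numpy parses the '10'-style strings into i4 fields during construction
arr = np.array(rows, dtype='i4,U20,i4')
print(arr['f0'])  # [10 30]
```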

No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.


Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general purpose quote reader. The join is required because reader produces a list of lists, while genfromtxt wants an iterable of strings (it does its own 'split').

from csv import reader

s = StringIO.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)

producing

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)],
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

Or, following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:

np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
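A self-contained version of that one-liner (Python 3, with io.StringIO in place of the Python 2 StringIO module and a 'U' string dtype; note the buffer has to be fresh, since an earlier read would have exhausted it):

```python
import csv
import io

import numpy as np

txt = '10,"Apple, Banana",20\n30,"Pear, Orange",40\n50,"Peach, Mango",60'

# csv.reader handles the quoting; np.array builds the structured array
arr = np.array([tuple(x) for x in csv.reader(io.StringIO(txt))],
               dtype='i4,U20,i4')
print(arr['f1'])  # ['Apple, Banana' 'Pear, Orange' 'Peach, Mango']
```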


I solved this with the code below.

def transformCommas(line):
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'
        else:
            out += c
    return out

f = open("data/raw_data_all.csv", "rb")
replaced = (transformCommas(line) for line in f)
rawData = numpy.loadtxt(replaced, delimiter=',', skiprows=0, dtype=str)

Data:

1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00
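Putting the pieces together on the sample rows - a sketch in Python 3, feeding the data from an io.StringIO instead of the file (note that the commas inside quotes come out as periods, which is what this approach intends):

```python
import io

import numpy as np

def transformCommas(line):
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'  # "5,237" becomes "5.237"
        else:
            out += c
    return out

data = ('1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00\n'
        '1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00\n')

replaced = (transformCommas(line) for line in io.StringIO(data))
rawData = np.loadtxt(replaced, delimiter=',', dtype=str)
print(rawData.shape)  # (2, 11)
```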