How do you dynamically identify unknown delimiters in a data file?

How about trying the Sniffer class in Python's standard csv module: http://docs.python.org/library/csv.html#csv.Sniffer

    import csv

    sniffer = csv.Sniffer()
    dialect = sniffer.sniff('quarter, dime, nickel, penny')
    print(dialect.delimiter)
    # returns ','
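If you already know which separators can show up, sniff() also accepts a delimiters argument that restricts the guess to those characters, and it raises csv.Error when it cannot decide. Here is a minimal sketch of wrapping that up. The helper name sniff_delimiter and the default candidate string are my own assumptions, not part of the csv module.

```python
import csv


def sniff_delimiter(sample, candidates=",;|\t "):
    """Return the delimiter csv.Sniffer guesses for `sample`, or None.

    `candidates` is a hypothetical default; pass whatever separators
    your own files are expected to use.
    """
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        # Sniffer could not settle on any of the candidate characters.
        return None
    return dialect.delimiter


if __name__ == "__main__":
    # A few lines of the file usually make a better sample than one line.
    print(repr(sniff_delimiter("a|b|c\nd|e|f\ng|h|i\n")))
```

Returning None instead of raising lets the caller fall back to another strategy, such as a regex split.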

python parsing csv text-files textinput

If you're using Python, I'd suggest just calling re.split on each line with a character class containing all the separators you expect:

    >>> import re
    >>> l = "big long list of space separated words"
    >>> re.split(r'[ ,|;"]+', l)
    ['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.
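To make that caveat concrete, here is a small sketch using the same character class. The helper name split_fields and the sample lines are made up for illustration. It also drops the empty strings that re.split produces when a line starts or ends with a separator.

```python
import re

# Same separator set as in the answer above: space, comma, pipe,
# semicolon and double quote, with runs collapsed into one split.
SEPARATORS = re.compile(r'[ ,|;"]+')


def split_fields(line):
    """Split a line on any run of the expected separators."""
    return [field for field in SEPARATORS.split(line.strip()) if field]


if __name__ == "__main__":
    print(split_fields("Smith | Steve | M"))   # pipe with padding
    print(split_fields("Smith, Steve, M"))     # comma with padding
    # The caveat: a space inside a value is treated as a separator too,
    # so "San Francisco" comes back as two fields.
    print(split_fields("San Francisco|CA"))
```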

If you must identify the separator, your best bet is to count the occurrences of every candidate separator character except the space. If there are almost no occurrences, the separator is probably a space; otherwise, it's the candidate with the highest count.

Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.
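The counting idea could look roughly like this. It is only a sketch: the name guess_delimiter, the candidate set, and the reading of "almost no occurrences" as fewer than one hit per line on average are all my assumptions.

```python
from collections import Counter

# Hypothetical candidate set; space is deliberately left out because
# it is the fallback when nothing else shows up often enough.
CANDIDATES = ",|;\t"


def guess_delimiter(lines, candidates=CANDIDATES, min_per_line=1):
    """Guess the separator by counting candidate characters.

    Falls back to a space when the most common candidate appears
    fewer than `min_per_line` times per line on average.
    """
    counts = Counter()
    line_count = 0
    for line in lines:
        line_count += 1
        counts.update(ch for ch in line if ch in candidates)
    if not counts:
        return " "
    delimiter, hits = counts.most_common(1)[0]
    if hits < min_per_line * line_count:
        return " "
    return delimiter


if __name__ == "__main__":
    print(repr(guess_delimiter(["a|b|c", "d|e|f"])))
    print(repr(guess_delimiter(["a b c", "d e, f"])))
```

As noted above, this is a heuristic and can still guess wrong when the data itself is full of another candidate character.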


I ended up going with the regex, because of the problem of spaces. Here's my finished code, in case anyone's interested, or could use anything else in it. On a tangential note, it would be neat to find a way to dynamically identify column order, but I realize that's a bit more tricky. In the meantime, I'm falling back on old tricks to sort that out.

    import glob
    import os
    import re

    # This loop lives inside a method of my parser class (hence self).
    for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
        # Couldn't quite figure out a way to make this a single block
        # (rather than three separate if/elifs). But you can see the split is
        # generalized already, so if anyone can come up with a better way,
        # I'm all ears!! :)
        for row in open(infile, 'r').readlines():
            if infile.find('comma') > -1:
                datefmt = "%m/%d/%Y"
                last, first, gender, color, dobraw = \
                        [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
            elif infile.find('space') > -1:
                datefmt = "%m-%d-%Y"
                last, first, unused, gender, dobraw, color = \
                        [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
            elif infile.find('pipe') > -1:
                datefmt = "%m-%d-%Y"
                last, first, unused, gender, color, dobraw = \
                        [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                # There is also a way to do this with csv.Sniffer, but the
                # spaces around the pipe delimiter also confuse sniffer, so
                # I couldn't use it.
            else:
                raise ValueError(infile + " is not an acceptable input file.")
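One possible way to collapse the three if/elif branches is to move the per-file differences (date format and column order) into a lookup table. This is only a sketch: LAYOUTS, layout_for and parse_row are hypothetical names, the column orders are copied from the code above, and the sample row is made up.

```python
import re

SPLIT_RE = re.compile(r'[ ,|;"\t]+')

# Filename keyword -> (date format, column order), taken from the
# three branches in the code above.
LAYOUTS = {
    "comma": ("%m/%d/%Y", ("last", "first", "gender", "color", "dobraw")),
    "space": ("%m-%d-%Y", ("last", "first", "unused", "gender", "dobraw", "color")),
    "pipe": ("%m-%d-%Y", ("last", "first", "unused", "gender", "color", "dobraw")),
}


def layout_for(filename):
    """Return (datefmt, columns) for the first keyword found in filename."""
    for keyword, layout in LAYOUTS.items():
        if keyword in filename:
            return layout
    raise ValueError(filename + " is not an acceptable input file.")


def parse_row(row, columns):
    """Split one row and label its values with the given column names."""
    values = [v for v in SPLIT_RE.split(row.strip()) if v]
    if len(values) != len(columns):
        raise ValueError("expected %d fields, got %d" % (len(columns), len(values)))
    return dict(zip(columns, values))


if __name__ == "__main__":
    datefmt, columns = layout_for("input_pipe.txt")
    print(parse_row("Smith | Steve | D | M | Red | 3-3-1985\n", columns))
```

Supporting a new file type then means adding one entry to the table rather than another branch. It does not solve the harder problem of detecting column order automatically.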
