
Data Type Recognition/Guessing of CSV data in python


You may be interested in messytables, a Python library that does exactly this kind of type guessing on CSV and XLS files for you: https://github.com/okfn/messytables

It happily scales to very large files, to streaming data off the internet, etc.

There is also an even simpler wrapper library, dataconverters, that includes a command-line tool: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164


After putting some thought into it, this is how I would design the algorithm myself:

  • For performance reasons, take a sample of each column (say, 1%)
  • Run a regex match on each cell in the sample to check its data type
  • Choose the appropriate data type for the column based on the frequency distribution
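
The three steps above could be sketched roughly like this. This is only a minimal illustration, not a tested implementation: the names `PATTERNS`, `guess_cell_type` and `guess_column_type`, the 1% default and the 0.9 threshold are all assumptions of mine, and only int and float are recognised.

```python
import random
import re
from collections import Counter

# Hypothetical patterns; extend with dates, booleans, etc. as needed.
PATTERNS = [
    (int, re.compile(r'^-?\d+$')),
    (float, re.compile(r'^-?\d+\.\d+$')),
]


def guess_cell_type(cell):
    """Return the first type whose pattern matches the cell, else str."""
    for cell_type, pattern in PATTERNS:
        if pattern.match(cell):
            return cell_type
    return str


def guess_column_type(cells, sample_fraction=0.01, threshold=0.9, seed=None):
    """Guess a column's type from a random sample of its cells."""
    cells = list(cells)
    if not cells:
        return str
    # Step 1: sample the column (always at least one cell).
    sample_size = min(len(cells), max(1, int(len(cells) * sample_fraction)))
    sample = random.Random(seed).sample(cells, sample_size)
    # Step 2: classify every sampled cell with the regexes.
    counts = Counter(guess_cell_type(cell) for cell in sample)
    # Step 3: pick the most frequent type if it is frequent enough,
    # otherwise fall back to str.
    best_type, best_count = counts.most_common(1)[0]
    return best_type if best_count / sample_size >= threshold else str
```

With `threshold=0.9`, a column that is 95% integers and 5% placeholders such as 'N/A' would still be guessed as int; with `threshold=1.0` the same column falls back to str.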

The two questions that arise:

  • What's a sufficient sample size? For small data sets? For large data sets?
  • What's a high enough threshold for selecting a data type based on the frequency distribution?
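
I don't have a definitive answer to the sample-size question, but one way to make the trade-off explicit is to bound the sample from below and above. The sketch below is purely illustrative: `sample_size` is a hypothetical helper, and the values 1%, 100 and 10,000 are placeholder assumptions, not recommendations.

```python
def sample_size(n_rows, fraction=0.01, floor=100, cap=10_000):
    """Rows to sample: a fraction of the data, bounded below and above."""
    if n_rows <= floor:
        # Small data set: sampling saves nothing, so read every row.
        return n_rows
    # Large data set: take the fraction, but never fewer than `floor`
    # rows and never more than `cap` rows.
    return min(cap, max(floor, int(n_rows * fraction)))


for n in (50, 5_000, 2_000_000, 50_000_000):
    print(n, sample_size(n))
```

Small data sets are read in full, while very large ones stop growing the sample once the cap is reached.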


You could try a pre-parse using a regex. For example:

    import re

    # Escape the dot so it matches a literal decimal point, not any character.
    pattern = re.compile(r'^-?\d+\.\d+$')

    data = '123.42'
    print(pattern.match(data))   # ----> match object

    data2 = 'NOT123.42GONNA31.4HAPPEN'
    print(pattern.match(data2))  # ----> None

This way you could build a dictionary of regexes and try each of them until you find a match:

    import re

    # arr1 is the list of cell strings to classify.
    myregex = {
        int: r'^-?\d+$',
        float: r'^-?\d+\.\d+$',
        # ....
    }
    for key, reg in myregex.items():
        to_del = []
        for index, data in enumerate(arr1):
            if re.match(reg, data):
                d = key(data)  # You will need to insert data differently depending on function
                ...  # ---> do something
                to_del.append(data)  # ---> delete this from arr1 when you can

Don't forget the '^' at the beginning and the '$' at the end; otherwise the regex could match only part of the string and still return a match object.
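
To see why the anchors matter, here is a small check. The variable names are just for illustration; `re.fullmatch` (available in Python 3.4+) is an alternative I'm adding here that requires the whole string to match without writing the anchors yourself.

```python
import re

number = r'-?\d+\.\d+'  # deliberately written without anchors

# Without '$', re.match only requires the pattern at the start of the string.
prefix_match = re.match(number, '123.42abc')
# With '$', the trailing text is rejected.
anchored_match = re.match(number + r'$', '123.42abc')
# Without '^', re.search finds the number anywhere in the string.
inner_match = re.search(number, 'NOT123.42GONNA')
# re.fullmatch requires the entire string to match the pattern.
full_bad = re.fullmatch(number, '123.42abc')
full_ok = re.fullmatch(number, '123.42')

print(bool(prefix_match))    # True: partial match
print(bool(anchored_match))  # False
print(bool(inner_match))     # True: partial match
print(bool(full_bad))        # False
print(bool(full_ok))         # True
```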

Hope this helps :)