Data Type Recognition/Guessing of CSV data in python
You may be interested in messytables, a Python library that does exactly this kind of type guessing on CSV and XLS files for you.
It scales happily to very large files, to streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command-line tool, named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
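To get a feel for it, here is a minimal sketch along the lines of messytables' documented usage (the file name `example.csv` is just a placeholder):

```python
from messytables import CSVTableSet, type_guess

# Load the CSV as a table set and take its first (only) table
table_set = CSVTableSet(open('example.csv', 'rb'))
row_set = table_set.tables[0]

# Guess one type per column from a sample of rows
types = type_guess(row_set.sample, strict=True)
print(types)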
After putting some thought into it, this is how I would design the algorithm myself:
- For performance reasons, take a sample of each column (say, 1%)
- run a regex match against each cell in the sample to detect its data type
- choose the appropriate data type for the column based on the frequency distribution (see the sketch after the questions below)
Two questions arise:
- What's a sufficient sample size? For small data sets? For large data sets?
- What's a high enough threshold for selecting a data type based on the frequency distribution?
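Here is a minimal sketch of that design, assuming each column arrives as a list of strings; the 1% sample rate, the minimum sample size, and the 80% threshold are illustrative guesses, not tested answers to the questions above:

```python
import random
import re
from collections import Counter

# Ordered from most to least specific; first match wins per cell.
PATTERNS = [
    ('int',   re.compile(r'^-?\d+$')),
    ('float', re.compile(r'^-?\d+\.\d+$')),
    ('str',   re.compile(r'.*')),  # fallback: everything is at least a string
]

def guess_column_type(column, sample_rate=0.01, min_sample=100, threshold=0.8):
    """Guess the type of one column from a random sample of its cells."""
    size = max(min_sample, int(len(column) * sample_rate))
    sample = random.sample(column, min(size, len(column)))
    counts = Counter(
        next(name for name, rx in PATTERNS if rx.match(cell))
        for cell in sample
    )
    best, freq = counts.most_common(1)[0]
    # Only commit to a type if it clearly dominates the sample.
    return best if freq / len(sample) >= threshold else 'str'
```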
You could try a pre-parse using regexes. For example:
```python
import re

pattern = re.compile(r'^-?\d+\.\d+$')

data = '123.42'
print(pattern.match(data))   # ----> match object

data2 = 'NOT123.42GONNA31.4HAPPEN'
print(pattern.match(data2))  # ----> None
```
This way you could build a dictionary mapping types to regexes and try each of them until one matches:
```python
myregex = {int: r'^-?\d+$', float: r'^-?\d+\.\d+$'}  # ... add more types here

for key, reg in myregex.items():
    to_del = []
    for index, data in enumerate(arr1):  # arr1: your list of raw string cells
        if re.match(reg, data):
            d = key(data)  # you will need to insert data differently depending on the type
            # ---> do something with d
            to_del.append(data)  # ---> delete these from arr1 when you can
```
Don't forget the '^' at the beginning and the '$' at the end; otherwise the regex could match just part of the string and still return a match object.
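A quick illustration of why the anchors matter (re.match already pins the pattern to the start of the string, so in practice it is the missing '$' that bites):

```python
import re

print(re.match(r'-?\d+\.\d+', '123.42abc'))    # matches '123.42' -- a false positive
print(re.match(r'^-?\d+\.\d+$', '123.42abc'))  # None: the whole string must be numeric
```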
Hope this helps :)