Data Type Recognition/Guessing of CSV data in python
You may be interested in messytables, a Python library that does exactly this kind of type guessing on CSV and XLS files for you.
It scales happily to very large files, to streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command-line tool, named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
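To get a feel for it, here is a minimal sketch along the lines of messytables' documented usage (the file name `example.csv` is just a placeholder):

```python
from messytables import CSVTableSet, type_guess

# Load the CSV as a table set and take its first (only) table
table_set = CSVTableSet(open('example.csv', 'rb'))
row_set = table_set.tables[0]

# Guess one type per column from a sample of rows
types = type_guess(row_set.sample, strict=True)
print(types)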
After putting some thought into it, this is how I would design the algorithm myself:
- For performance reasons, take a sample of each column (say, 1%)
- run a regex match against each cell in the sample to detect its data type
- choose the appropriate data type for the column based on the frequency distribution (see the sketch after the questions below)
Two questions arise:
- What's a sufficient sample size? For small data sets? For large data sets?
- What's a high enough threshold for selecting a data type based on the frequency distribution?
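Here is a minimal sketch of that design, assuming each column arrives as a list of strings; the 1% sample rate, the minimum sample size, and the 80% threshold are illustrative guesses, not tested answers to the questions above:

```python
import random
import re
from collections import Counter

# Ordered from most to least specific; first match wins per cell.
PATTERNS = [
    ('int',   re.compile(r'^-?\d+$')),
    ('float', re.compile(r'^-?\d+\.\d+$')),
    ('str',   re.compile(r'.*')),  # fallback: everything is at least a string
]

def guess_column_type(column, sample_rate=0.01, min_sample=100, threshold=0.8):
    """Guess the type of one column from a random sample of its cells."""
    size = max(min_sample, int(len(column) * sample_rate))
    sample = random.sample(column, min(size, len(column)))
    counts = Counter(
        next(name for name, rx in PATTERNS if rx.match(cell))
        for cell in sample
    )
    best, freq = counts.most_common(1)[0]
    # Only commit to a type if it clearly dominates the sample.
    return best if freq / len(sample) >= threshold else 'str'
```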
You could try a pre-parse using regexes. For example:
```python
import re

pattern = re.compile(r'^-?\d+\.\d+$')

data = '123.42'
print(pattern.match(data))   # ----> match object

data2 = 'NOT123.42GONNA31.4HAPPEN'
print(pattern.match(data2))  # ----> None
```
This way you could build a dictionary mapping types to regexes and try each of them until one matches:
```python
myregex = {int: r'^-?\d+$', float: r'^-?\d+\.\d+$'}  # ... add more types here

for key, reg in myregex.items():
    to_del = []
    for index, data in enumerate(arr1):  # arr1: your list of raw string cells
        if re.match(reg, data):
            d = key(data)  # you will need to insert data differently depending on the type
            # ---> do something with d
            to_del.append(data)  # ---> delete these from arr1 when you can
```
Don't forget the '^' at the beginning and the '$' at the end; otherwise the regex could match just part of the string and still return a match object.
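A quick illustration of why the anchors matter (re.match already pins the pattern to the start of the string, so in practice it is the missing '$' that bites):

```python
import re

print(re.match(r'-?\d+\.\d+', '123.42abc'))    # matches '123.42' -- a false positive
print(re.match(r'^-?\d+\.\d+$', '123.42abc'))  # None: the whole string must be numeric
```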
Hope this helps :)