Read data from CSV file and transform from string to correct data-type, including a list-of-integer column
As the docs explain, the CSV reader doesn't perform automatic data conversion. There is the QUOTE_NONNUMERIC format option, but that only converts all unquoted fields into floats. This behaviour is very similar to that of other csv readers.
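To illustrate that limitation, here is a minimal sketch of what `QUOTE_NONNUMERIC` actually does: every unquoted field becomes a `float`, even values you would want as `int` or `bool`:

```python
import csv
import io

# With QUOTE_NONNUMERIC, the reader converts every *unquoted* field to float,
# so the source data's quoting must already distinguish strings from numbers.
data = '"foo",1,2.5\n"bar",7,9.8\n'
reader = csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONNUMERIC)
for row in reader:
    print(row)
# → ['foo', 1.0, 2.5]
# → ['bar', 7.0, 9.8]
```

Note that the integer columns come back as `1.0` and `7.0`, not `1` and `7`, which is why this option alone can't solve the question.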
I don't believe Python's csv module would be of any help for this case at all. As others have already pointed out, literal_eval() is a far better choice.
The following does work and converts:
- strings
- int
- floats
- lists
- dictionaries
You may also use it for booleans and NoneType, although these have to be formatted accordingly for literal_eval() to parse them. LibreOffice Calc displays booleans in all capital letters (TRUE/FALSE), whereas in Python booleans are capitalized (True/False). Also, you would have to replace empty strings with None (without quotes).
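As a sketch of those formatting constraints (the `normalise()` helper below is illustrative, not part of the importer): literal_eval() accepts Python-style literals only, so spreadsheet-style booleans and empty cells must be fixed up first:

```python
import ast

# literal_eval() handles Python-style literals, including nested ones:
print(ast.literal_eval('42'))           # → 42
print(ast.literal_eval('3.14'))         # → 3.14
print(ast.literal_eval('[1, 2, 3]'))    # → [1, 2, 3]
print(ast.literal_eval("{'a': 1}"))     # → {'a': 1}

# Spreadsheet-style booleans and empty cells need normalising first,
# since 'TRUE', 'FALSE' and '' are not valid Python literals:
def normalise(cell):
    fixes = {'TRUE': 'True', 'FALSE': 'False', '': 'None'}
    return ast.literal_eval(fixes.get(cell, cell))

print(normalise('TRUE'))   # → True
print(normalise(''))       # → None
```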
I'm writing an importer for MongoDB that does all this. The following is part of the code I've written so far.
[NOTE: My csv uses tab as the field delimiter. You may want to add some exception handling too.]
```python
import ast
from itertools import islice

def getFieldnames(csvFile):
    """
    Read the first row and store values in a tuple
    """
    with open(csvFile) as csvfile:
        firstRow = csvfile.readlines(1)
    fieldnames = tuple(firstRow[0].strip('\n').split("\t"))
    return fieldnames

def writeCursor(csvFile, fieldnames):
    """
    Convert csv rows into an array of dictionaries
    All data types are automatically checked and converted
    """
    cursor = []  # Placeholder for the dictionaries/documents
    with open(csvFile) as csvFile:
        for row in islice(csvFile, 1, None):
            values = list(row.strip('\n').split("\t"))
            for i, value in enumerate(values):
                nValue = ast.literal_eval(value)
                values[i] = nValue
            cursor.append(dict(zip(fieldnames, values)))
    return cursor
```
You have to map your rows:
```python
import csv
import StringIO

data = """True,foo,1,2.3,baz
False,bar,7,9.8,qux"""
reader = csv.reader(StringIO.StringIO(data), delimiter=",")
parsed = (({'True': True}.get(row[0], False),
           row[1],
           int(row[2]),
           float(row[3]),
           row[4]) for row in reader)
for row in parsed:
    print row
```
results in
```
(True, 'foo', 1, 2.3, 'baz')
(False, 'bar', 7, 9.8, 'qux')
```
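For anyone on Python 3, the same row-mapping idea can be sketched with `io.StringIO` in place of `StringIO.StringIO` and `print()` as a function (this modernised version is mine, not part of the original answer):

```python
import csv
import io

data = "True,foo,1,2.3,baz\nFalse,bar,7,9.8,qux"
reader = csv.reader(io.StringIO(data), delimiter=",")

# Map each column through an explicit per-position converter.
parsed = ((row[0] == 'True', row[1], int(row[2]), float(row[3]), row[4])
          for row in reader)
for row in parsed:
    print(row)
# → (True, 'foo', 1, 2.3, 'baz')
# → (False, 'bar', 7, 9.8, 'qux')
```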
I know this is a fairly old question, tagged python-2.5, but here's an answer that works with Python 3.6+, which might be of interest to folks using more up-to-date versions of the language.
It leverages the built-in typing.NamedTuple
class which was added in Python 3.5. What may not be evident from the documentation is that the "type" of each field can be a function.
The example usage code also uses so-called f-string literals which weren't added until Python 3.6, but their use isn't required to do the core data-type transformations.
```python
#!/usr/bin/env python3.6
import ast
import csv
from typing import NamedTuple

class Record(NamedTuple):
    """ Define the fields and their types in a record. """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.

    @classmethod
    def _transform(cls: 'Record', dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding Record
            field type.
        """
        return {name: cls.__annotations__[name](value)
                    for name, value in dict_.items()}

filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = Record._transform(row)
        print(f'row {i}: {row}')
```
Output:
```
row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
```
Generalizing this by creating a base class with just the generic classmethod in it is not simple because of the way typing.NamedTuple
is implemented.
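A quick sketch of that issue (the class names here are illustrative): annotations declared on a subclass of a NamedTuple class do not become namedtuple fields, so the field definitions and a shared generic classmethod can't be split across a base and a derived class:

```python
from typing import NamedTuple

class Base(NamedTuple):
    """ Would-be generic base holding the shared transform method. """
    @classmethod
    def _transform(cls, dict_: dict) -> dict:
        return {name: cls.__annotations__[name](value)
                for name, value in dict_.items()}

class Derived(Base):
    # These annotations do NOT become namedtuple fields,
    # because Derived is just a plain subclass of the generated tuple class.
    Price: float

print(Derived._fields)  # → ()
```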
To avoid that issue, in Python 3.7+, a dataclasses.dataclass
could be used instead because they do not have the inheritance issue — so creating a generic base class that can be reused is simple:
```python
#!/usr/bin/env python3.7
import ast
import csv
from dataclasses import dataclass, fields
from typing import Type, TypeVar

T = TypeVar('T', bound='GenericRecord')

class GenericRecord:
    """ Generic base class for transforming dataclasses. """
    @classmethod
    def _transform(cls: Type[T], dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding type. """
        return {field.name: field.type(dict_[field.name])
                    for field in fields(cls)}

@dataclass
class CSV_Record(GenericRecord):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.

filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = CSV_Record._transform(row)
        print(f'row {i}: {row}')
```
In one sense it's not really very important which one you use, because an instance of the class is never created; using one is just a clean way of specifying and holding a definition of the field names and their types in a record data structure.
A TypedDict was added to the typing module in Python 3.8. It can also be used to provide the typing information, but it must be used in a slightly different manner, since it doesn't actually define a new type the way NamedTuple and dataclasses do; it therefore requires a standalone transforming function:
```python
#!/usr/bin/env python3.8
import ast
import csv
from typing import TypedDict

def transform(dict_, typed_dict) -> dict:
    """ Convert values in given dictionary to corresponding types in TypedDict. """
    fields = typed_dict.__annotations__
    return {name: fields[name](value) for name, value in dict_.items()}

class CSV_Record_Types(TypedDict):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.

filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file), 1):
        row = transform(row, CSV_Record_Types)
        print(f'row {i}: {row}')
```