How do I read CSV data into a record array in NumPy?



You can use NumPy's genfromtxt() function to do so, by setting the delimiter keyword argument to a comma.

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

More information on the function can be found in its documentation.
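As a minimal sketch of the call above, the following reads a CSV that has a header row. The CSV text and the column names x, y, z are made up for illustration, and io.StringIO stands in for a file on disk. Passing names=True asks genfromtxt to take field names from the first row, so the result is a structured array whose columns can be accessed by name.

```python
import io
import numpy as np

# Hypothetical CSV content with a header row (not from the original post);
# io.StringIO stands in for a real file such as 'my_file.csv'.
csv_text = "x,y,z\n1.0,2,3\n4,5.5,6\n"

# names=True reads field names from the first line; with the default dtype
# every field is parsed as a float.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.dtype.names)  # ('x', 'y', 'z')
print(data["y"])         # the y column: 2.0 and 5.5
```

With a real file, the first argument would simply be the file path instead of the StringIO object.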


I would recommend the read_csv function from the pandas library:

import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame, which offers many useful data manipulation functions that are not directly available with NumPy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table...
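If a record array is still needed after loading with pandas, one possible route is DataFrame.to_records. The sketch below is not part of the original answer: the CSV text and the column names a, b, c are assumptions chosen for illustration, and io.StringIO stands in for the file.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for 'myfile.csv'.
csv_text = "1.0,2,3\n4,5.5,6\n"

# names=... assigns illustrative column labels, since the data has no header row.
df = pd.read_csv(io.StringIO(csv_text), sep=",", header=None, names=["a", "b", "c"])

# to_records converts the DataFrame to a NumPy record array;
# index=False leaves the row index out of the result.
rec = df.to_records(index=False)

print(rec.dtype.names)  # ('a', 'b', 'c')
print(rec.b)            # fields are available as attributes
```

Each column keeps its own dtype in the resulting record array, so the integer column stays an integer rather than being upcast to float as in df.values.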


I would also recommend genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

Given an input file, myfile.csv:

1.0, 2, 3
4, 5.5, 6

import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')

gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
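To illustrate the mixed-type case, here is a small sketch with a string column. The CSV text is invented for the example, io.StringIO stands in for the file, and encoding="utf-8" is passed so that strings come back as str rather than bytes. Strictly speaking, the result of genfromtxt with dtype=None is a structured ndarray; if attribute-style access (rec.f0) is wanted, viewing it as np.recarray is one way to get it.

```python
import io
import numpy as np

# Hypothetical CSV content mixing strings, floats and integers.
csv_text = "alice,1.5,3\nbob,2.5,4\n"

# dtype=None lets genfromtxt infer a type per column; fields get the
# default names f0, f1, f2.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None, encoding="utf-8")
print(arr.dtype.names)  # ('f0', 'f1', 'f2')

# A structured array supports arr['f0']; a recarray view also allows rec.f0.
rec = arr.view(np.recarray)
print(rec.f0)
print(rec.f1)
```

The view shares memory with the original array, so no data is copied.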


I timed the

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and less on the interpreter than NumPy's genfromtxt does. I suspect the pandas method would have similar interpreter overhead.
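For anyone who wants to repeat this comparison on their own machine, a rough timing sketch follows. It is not the original benchmark: the data is a small synthetic all-numeric table (2000 rows by 10 columns, an arbitrary choice), io.StringIO replaces the file, and the function names are made up. Results will vary with the NumPy version, the data types, and the file size, so the numbers should be read as indicative only.

```python
import csv
import io
import timeit
import numpy as np

# Small synthetic all-numeric CSV (hypothetical data, far smaller than the
# 4.6-million-row file described above).
rng = np.random.default_rng(0)
rows, cols = 2000, 10
csv_text = "\n".join(
    ",".join(f"{v:.6f}" for v in row) for row in rng.random((rows, cols))
)

def load_genfromtxt():
    # NumPy path: genfromtxt parses the text directly.
    return np.genfromtxt(io.StringIO(csv_text), delimiter=",")

def load_csv_reader():
    # csv path: collect rows as lists of strings, then convert in one step.
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    return np.asarray(list(reader), dtype=float)

a = load_genfromtxt()
b = load_csv_reader()

# Time a few repetitions of each loader; no particular outcome is assumed.
print("genfromtxt:", timeit.timeit(load_genfromtxt, number=3))
print("csv.reader:", timeit.timeit(load_csv_reader, number=3))
```

Both loaders should produce the same array here, so the comparison is only about parsing time.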