How do I fill two (or more) numpy arrays from a single iterable of tuples?


Perhaps build a single, structured array using np.fromiter:

import numpy as np

def gendata():
    # You, of course, have a different gendata...
    for i in range(N):
        yield (np.random.random(), str(i))

N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')

Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:

arr.sort(order=['f0','f1'])

Finding the row by the value in the first column can be done with searchsorted in O(log N) time:

# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')

You've asked many important questions in the comments; let me attempt to answer them here:

  • The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16, which have been added since that book was written), but the basics are all explained there.

    Perhaps a more thorough discussion can be found in the online documentation, which is a good supplement to the examples you mentioned here.

  • Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are default column names. Since I defined the dtype as '<f8,|S20' I failed to provide column names, so NumPy named the first column 'f0', and the second 'f1'. If we had used

    dtype=[('fval','<f8'), ('text','|S20')]

    then the structured array arr would have column names 'fval' and 'text'.

  • Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.) A sketch of this two-pass approach is shown after this list.
  • NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or somehow be redesigned. NumPy is simply not built this way.
  • NumPy does have an object dtype which allows you to place a pointer (4 bytes on 32-bit systems, 8 bytes on 64-bit) to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction... A workaround is sketched after this list.
  • Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if too small, it will try to resize the array. If the original block of memory can be extended you are in luck. But if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which will slow down the performance significantly. (The final example after this list shows count in use.)
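
For what it's worth, here is a minimal sketch of that two-pass approach, assuming gendata is a generator function that can simply be called a second time to restart the iteration:

import numpy as np

def gendata():
    # You, of course, have a different gendata...
    for i in range(100):
        yield (np.random.random(), str(i))

# First pass: find the longest string, so the dtype can be sized exactly
maxlen = max(len(s) for _, s in gendata())
dtype = [('fval', '<f8'), ('text', '|S%d' % maxlen)]

# Second pass: build the structured array with the discovered dtype
arr = np.fromiter(gendata(), dtype=dtype)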

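If you truly need arrays of dtype object, one workaround (just a sketch, using the gendata from above; note it buffers everything in a Python list first, giving up fromiter's memory savings) is to pre-allocate with np.empty and fill it element by element:

import numpy as np

items = list(gendata())                   # materialize the iterable
arr = np.empty(len(items), dtype=object)  # one pointer-sized slot per row
for i, item in enumerate(items):
    arr[i] = item                         # each cell holds a Python tuple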

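And to illustrate the count point: if you happen to know the number of rows in advance, passing it lets NumPy allocate the result in one shot:

arr = np.fromiter(gendata(), dtype='<f8,|S20', count=100)  # pre-allocates exactly 100 rows
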
Here is a way to build N separate arrays out of a generator of N-tuples:

import numpy as np
import itertools as IT

def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in range(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=False)
            arr[size:] = col
        size = newsize
    return result

x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]

# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')

The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.

I used a small default chunksize because I was testing this code on small data. You, of course, will want either to change the default chunksize or to pass a larger value for the chunksize parameter.
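
For example (65536 here is just an illustrative value, not a recommendation):

x, y = fromiter(gendata(), '<f8,|S20', chunksize=65536)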