How do I fill two (or more) numpy arrays from a single iterable of tuples?
Perhaps build a single, structured array using np.fromiter:

    import numpy as np

    def gendata():
        # You, of course, have a different gendata...
        for i in range(N):
            yield (np.random.random(), str(i))

    N = 100
    arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers, will take O(N log N) time:

    arr.sort(order=['f0', 'f1'])
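To see the tie-breaking in action, here is a small standalone sketch (with made-up data, separate from the arr built above): rows that share the same 'f0' value end up ordered by 'f1'.

```python
import numpy as np

# Toy structured array with a tie in the first field
arr = np.array([(0.5, b'z'), (0.1, b'b'), (0.5, b'a')],
               dtype='<f8,|S20')

# Sort by 'f0'; rows with equal 'f0' are ordered by 'f1'
arr.sort(order=['f0', 'f1'])
print(arr)
# the 0.1 row comes first; the two 0.5 rows are ordered b'a', then b'z'
```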
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:

    # Some pseudo-random value in arr['f0']
    val = arr['f0'][10]
    print(arr[10])
    # (0.049875262239617246, '46')

    idx = arr['f0'].searchsorted(val)
    print(arr[idx])
    # (0.049875262239617246, '46')
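If the value might occur more than once, searchsorted's side parameter brackets all the matching rows, still in O(log N) time. A minimal sketch with a hypothetical sorted array:

```python
import numpy as np

# A small structured array, already sorted on its first field
arr = np.array([(0.1, b'a'), (0.3, b'b'), (0.3, b'c'), (0.7, b'd')],
               dtype=[('f0', '<f8'), ('f1', '|S20')])

# side='left' and side='right' give the half-open range of rows
# whose 'f0' equals the value
lo = arr['f0'].searchsorted(0.3, side='left')
hi = arr['f0'].searchsorted(0.3, side='right')
print(arr[lo:hi])  # the two rows with f0 == 0.3
```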
You've asked many important questions in the comments; let me attempt to answer them here:
- The basic dtypes are explained in the numpy book. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there. Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
- Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are default column names. Since I defined the dtype as '<f8,|S20' and did not provide column names, NumPy named the first column 'f0' and the second 'f1'. If we had used dtype=[('fval','<f8'), ('text','|S20')] then the structured array arr would have column names 'fval' and 'text'.
- Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.)
- NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or would somehow have to be redesigned. NumPy is simply not built this way.
- NumPy does have an object dtype which allows you to place a pointer to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
- Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if it is too small, it will try to resize the array. If the original block of memory can be extended you are in luck. But if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which will slow down the performance significantly.
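The named-field and count points above can be sketched together (the field names fval and text and the toy generator are just for illustration):

```python
import numpy as np

def gendata(n):
    # Toy generator standing in for real data
    for i in range(n):
        yield (float(i) * 0.5, str(i))

# Named fields instead of the default 'f0'/'f1'
dt = np.dtype([('fval', '<f8'), ('text', '|S20')])

# Passing count lets NumPy pre-allocate exactly enough memory
arr = np.fromiter(gendata(5), dtype=dt, count=5)

print(arr['fval'])    # values accessed by name, not 'f0'
print(arr['text'][3])
```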
Here is a way to build N separate arrays out of a generator of N-tuples:
    import numpy as np
    import itertools as IT

    def gendata():
        # You, of course, have a different gendata...
        N = 100
        for i in range(N):
            yield (np.random.random(), str(i))

    def fromiter(iterable, dtype, chunksize=7):
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        result = [chunk[name].copy() for name in chunk.dtype.names]
        size = len(chunk)
        while True:
            chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
            N = len(chunk)
            if N == 0:
                break
            newsize = size + N
            for arr, name in zip(result, chunk.dtype.names):
                col = chunk[name]
                arr.resize(newsize, refcheck=False)
                arr[size:] = col
            size = newsize
        return result

    x, y = fromiter(gendata(), '<f8,|S20')
    order = np.argsort(x)
    x = x[order]
    y = y[order]

    # Some pseudo-random value in x
    N = 10
    val = x[N]
    print(x[N], y[N])
    # (0.049875262239617246, '46')
    idx = x.searchsorted(val)
    print(x[idx], y[idx])
    # (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
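A minimal sketch of what that resize call does, on a toy array: when the array is enlarged in place, the old data is preserved and the new slots are zero-filled, ready to be overwritten.

```python
import numpy as np

a = np.arange(5)

# Grow the array in place; refcheck=False skips the reference check,
# as in the fromiter above. New entries are filled with zeros.
a.resize(8, refcheck=False)
print(a)  # [0 1 2 3 4 0 0 0]

# Overwrite the new slots, as fromiter does with each chunk
a[5:] = [10, 11, 12]
print(a)  # [ 0  1  2  3  4 10 11 12]
```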
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize argument with a larger value.