
Convert Python sequence to NumPy array, filling missing values


You can use itertools.zip_longest:

    import itertools
    import numpy as np

    v = [[1], [1, 2]]
    np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
    # Out: array([[1, 0],
    #             [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
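If you want this as a reusable helper that runs on both versions, here's a minimal sketch (the name pad_ragged and the fillvalue parameter are my own additions for illustration):

    import itertools
    import numpy as np

    # Pick whichever name this Python version provides.
    try:
        zip_longest = itertools.zip_longest    # Python 3
    except AttributeError:
        zip_longest = itertools.izip_longest   # Python 2

    def pad_ragged(v, fillvalue=0):
        # zip_longest pads the shorter rows; .T restores the row-major layout.
        return np.array(list(zip_longest(*v, fillvalue=fillvalue))).T

    print(pad_ragged([[1], [1, 2]]))
    # [[1 0]
    #  [1 2]]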


Here's an almost* vectorized approach based on boolean indexing that I have used in several other posts:

    def boolean_indexing(v):
        # Row lengths of the ragged input.
        lens = np.array([len(item) for item in v])
        # True wherever a row actually has a value at that column.
        mask = lens[:, None] > np.arange(lens.max())
        # Zero-filled output; scatter the flattened input into the valid slots.
        out = np.zeros(mask.shape, dtype=int)
        out[mask] = np.concatenate(v)
        return out

Sample run

    In [27]: v
    Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

    In [28]: out
    Out[28]:
    array([[1, 0, 0, 0, 0],
           [1, 2, 0, 0, 0],
           [3, 6, 7, 8, 9],
           [4, 0, 0, 0, 0]])

*Please note that this is termed almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. Since that part is not computationally demanding, it should have minimal effect on the total runtime.
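To make the masking step concrete, here is a small sketch of what mask looks like for a tiny input (values chosen purely for illustration):

    import numpy as np

    v = [[1], [1, 2]]
    lens = np.array([len(item) for item in v])    # array([1, 2])
    # Broadcast row lengths against column indices 0..max(len)-1.
    mask = lens[:, None] > np.arange(lens.max())
    print(mask)
    # [[ True False]
    #  [ True  True]]

Each True marks a slot that receives a value from np.concatenate(v); everything else keeps the zero fill.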

Runtime test

In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, on a relatively large dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

    In [44]: v = [[1], [1,2,4,8,4], [6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

    In [45]: v = v*1000

    In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 9.82 ms per loop

    In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    100 loops, best of 3: 5.11 ms per loop

    In [48]: %timeit boolean_indexing(v)
    100 loops, best of 3: 6.88 ms per loop

Case #2 : Smaller size variation

    In [49]: v = [[1], [1,2,4,8,4], [6,7,3,6,7,8]]

    In [50]: v = v*1000

    In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 3.12 ms per loop

    In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1000 loops, best of 3: 1.55 ms per loop

    In [53]: %timeit boolean_indexing(v)
    100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

    In [139]: # Setup inputs
         ...: N = 10000 # Number of elems in list
         ...: maxn = 100 # Max. size of a list element
         ...: lens = np.random.randint(0,maxn,(N))
         ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
         ...:

    In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    1 loops, best of 3: 292 ms per loop

    In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1 loops, best of 3: 264 ms per loop

    In [142]: %timeit boolean_indexing(v)
    10 loops, best of 3: 95.7 ms per loop

To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; the choice would have to be made on a case-by-case basis.
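For reference, here is a rough sketch of how one might reproduce such timings outside IPython with the standard timeit module (absolute numbers will of course differ by machine; on Python 3 the itertools function is zip_longest):

    import itertools
    import timeit

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000

    def boolean_indexing(v):
        lens = np.array([len(item) for item in v])
        mask = lens[:, None] > np.arange(lens.max())
        out = np.zeros(mask.shape, dtype=int)
        out[mask] = np.concatenate(v)
        return out

    candidates = {
        'pandas': lambda: pd.DataFrame(v).fillna(0).values.astype(np.int32),
        'itertools': lambda: np.array(list(itertools.zip_longest(*v, fillvalue=0))).T,
        'boolean_indexing': lambda: boolean_indexing(v),
    }

    for name, fn in candidates.items():
        # Best of 3 repeats of 100 calls each, matching the %timeit style above.
        print(name, min(timeit.repeat(fn, repeat=3, number=100)))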


Pandas and its DataFrames deal beautifully with missing data.

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2]]
    print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
    # [[1 0]
    #  [1 2]]
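If 0 is not a safe sentinel for your data, the same pattern works with any fill value; a minimal sketch (the -1 is just for illustration):

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2]]
    # Shorter rows are padded with NaN, so the frame is float until we cast.
    print(pd.DataFrame(v).fillna(-1).values.astype(np.int32))
    # [[ 1 -1]
    #  [ 1  2]]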