
Convert Python sequence to NumPy array, filling missing values


You can use itertools.zip_longest:

    import itertools
    import numpy as np

    v = [[1], [1, 2]]
    np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
    # Out: array([[1, 0],
    #             [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
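If you want this as a reusable helper that runs on both versions, here's a minimal sketch (the name pad_ragged and the fillvalue parameter are my own additions for illustration):

    import itertools
    import numpy as np

    # Pick whichever name this Python version provides.
    try:
        zip_longest = itertools.zip_longest    # Python 3
    except AttributeError:
        zip_longest = itertools.izip_longest   # Python 2

    def pad_ragged(v, fillvalue=0):
        # zip_longest pads the shorter rows; .T restores the row-major layout.
        return np.array(list(zip_longest(*v, fillvalue=fillvalue))).T

    print(pad_ragged([[1], [1, 2]]))
    # [[1 0]
    #  [1 2]]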


Here's an almost* vectorized approach based on boolean indexing that I have used in several other posts:

    def boolean_indexing(v):
        # Row lengths of the ragged input.
        lens = np.array([len(item) for item in v])
        # True wherever a row actually has a value at that column.
        mask = lens[:, None] > np.arange(lens.max())
        # Zero-filled output; scatter the flattened input into the valid slots.
        out = np.zeros(mask.shape, dtype=int)
        out[mask] = np.concatenate(v)
        return out

Sample run

    In [27]: v
    Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

    In [28]: out
    Out[28]:
    array([[1, 0, 0, 0, 0],
           [1, 2, 0, 0, 0],
           [3, 6, 7, 8, 9],
           [4, 0, 0, 0, 0]])

*Please note that this is termed almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. Since that part is not computationally demanding, it should have minimal effect on the total runtime.
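To make the masking step concrete, here is a small sketch of what mask looks like for a tiny input (values chosen purely for illustration):

    import numpy as np

    v = [[1], [1, 2]]
    lens = np.array([len(item) for item in v])    # array([1, 2])
    # Broadcast row lengths against column indices 0..max(len)-1.
    mask = lens[:, None] > np.arange(lens.max())
    print(mask)
    # [[ True False]
    #  [ True  True]]

Each True marks a slot that receives a value from np.concatenate(v); everything else keeps the zero fill.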

Runtime test

In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, on a relatively large dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

    In [44]: v = [[1], [1,2,4,8,4], [6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

    In [45]: v = v*1000

    In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 9.82 ms per loop

    In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    100 loops, best of 3: 5.11 ms per loop

    In [48]: %timeit boolean_indexing(v)
    100 loops, best of 3: 6.88 ms per loop

Case #2 : Smaller size variation

    In [49]: v = [[1], [1,2,4,8,4], [6,7,3,6,7,8]]

    In [50]: v = v*1000

    In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 3.12 ms per loop

    In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1000 loops, best of 3: 1.55 ms per loop

    In [53]: %timeit boolean_indexing(v)
    100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

    In [139]: # Setup inputs
         ...: N = 10000 # Number of elems in list
         ...: maxn = 100 # Max. size of a list element
         ...: lens = np.random.randint(0,maxn,(N))
         ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
         ...:

    In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    1 loops, best of 3: 292 ms per loop

    In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1 loops, best of 3: 264 ms per loop

    In [142]: %timeit boolean_indexing(v)
    10 loops, best of 3: 95.7 ms per loop

To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; the choice would have to be made on a case-by-case basis.
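For reference, here is a rough sketch of how one might reproduce such timings outside IPython with the standard timeit module (absolute numbers will of course differ by machine; on Python 3 the itertools function is zip_longest):

    import itertools
    import timeit

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000

    def boolean_indexing(v):
        lens = np.array([len(item) for item in v])
        mask = lens[:, None] > np.arange(lens.max())
        out = np.zeros(mask.shape, dtype=int)
        out[mask] = np.concatenate(v)
        return out

    candidates = {
        'pandas': lambda: pd.DataFrame(v).fillna(0).values.astype(np.int32),
        'itertools': lambda: np.array(list(itertools.zip_longest(*v, fillvalue=0))).T,
        'boolean_indexing': lambda: boolean_indexing(v),
    }

    for name, fn in candidates.items():
        # Best of 3 repeats of 100 calls each, matching the %timeit style above.
        print(name, min(timeit.repeat(fn, repeat=3, number=100)))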


Pandas and its DataFrames deal beautifully with missing data.

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2]]
    print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
    # [[1 0]
    #  [1 2]]
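If 0 is not a safe sentinel for your data, the same pattern works with any fill value; a minimal sketch (the -1 is just for illustration):

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2]]
    # Shorter rows are padded with NaN, so the frame is float until we cast.
    print(pd.DataFrame(v).fillna(-1).values.astype(np.int32))
    # [[ 1 -1]
    #  [ 1  2]]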