Convert Python sequence to NumPy array, filling missing values

Question

The implicit conversion of a Python sequence of variable-length lists into a NumPy array cause the array to be of type object.

v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:

np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.

What is the most efficient way to get a dense NumPy array of type int32, by filling the “missing” values with a given placeholder?

From my sample sequence v, I would like to get something like this, if 0 is the placeholder

array([[1, 0], [1, 2]], dtype=int32)

Asked By: Marco Ancona

||

Source

Answer 1

Pandas and its DataFrame-s deal beautifully with missing data.

import numpy as np
import pandas as pd

v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))

# array([[1, 0],
#        [1, 2]], dtype=int32)

Answered By: Alicia Garcia-Raboso

Answer 2

You can use itertools.zip_longest:

import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out: 
array([[1, 0],
       [1, 2]])

Note: For Python 2, it is itertools.izip_longest.

Answered By: ayhan

Answer 3

max_len = max(len(sub_list) for sub_list in v)

result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])

>>> result
array([[1, 0],
       [1, 2]])

>>> type(result)
numpy.ndarray

Answered By: Alexander

Answer 4

Here’s an almost* vectorized boolean-indexing based approach that I have used in several other posts –

def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape,dtype=int)
    out[mask] = np.concatenate(v)
    return out

Sample run

In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

In [28]: out
Out[28]: 
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])

*Please note that this coined as almost vectorized because the only looping performed here is at the start, where we are getting the lengths of the list elements. But that part not being so computationally demanding should have minimal effect on the total runtime.

Runtime test

In this section I am timing DataFrame-based solution by @Alberto Garcia-Raboso, itertools-based solution by @ayhan as they seem to scale well and the boolean-indexing based one from this post for a relatively larger dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

In [45]: v = v*1000

In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop

In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop

In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation

In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

In [50]: v = v*1000

In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop

In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop

In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

In [139]: # Setup inputs
     ...: N = 10000 # Number of elems in list
     ...: maxn = 100 # Max. size of a list element
     ...: lens = np.random.randint(0,maxn,(N))
     ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
     ...: 

In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop

In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop

In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop

To me, it seems ~~itertools.izip_longest is doing pretty well!~~ there’s no clear winner, but would have to be taken on a case-by-case basis!

Answered By: Divakar

Answer 5

Here is a general way:

>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = np.argmax(v)
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)

Answered By: Mazdak

Answer 6

you can try to convert pandas dataframe first, after that convert it to numpy array

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

df = pd.DataFrame(ll)
print(df)
#    0  1    2    3
# 0  1  2  3.0  NaN
# 1  4  5  NaN  NaN
# 2  6  7  8.0  9.0

npl = df.to_numpy()
print(npl)

# [[ 1.  2.  3. nan]
#  [ 4.  5. nan nan]
#  [ 6.  7.  8.  9.]]

Answered By: bitbang

Answer 7

I was having a numpy broadcast error with Alexander’s answer so I added a small variation with numpy.pad:

pad = len(max(X, key=len))

result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])

Answered By: tmsss

Answer 8

If you want to extend the same logic to deeper levels (list of lists of lists,..) you can use tensorflow ragged tensors and convert to tensors/arrays. For example:

import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)

This creates an array padded with 0.
or a deeper example:

w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)

Answered By: Fmess

Convert Python sequence to NumPy array, filling missing values

Question:

Answers: