How to convert a Series of arrays into a single matrix in pandas/numpy?
Question:
I somehow got a pandas.Series which contains a bunch of arrays in it, like the s in the code below.
import pandas as pd

data = [[1,2,3],[2,3,4],[3,4,5],[2,3,4],[3,4,5],[2,3,4],
        [3,4,5],[2,3,4],[3,4,5],[2,3,4],[3,4,5]]
s = pd.Series(data = data)
s.shape # output ---> (11L,)
# try to convert s to matrix
# (as_matrix() was removed in pandas 1.0; s.to_numpy() gives the same 1-D object array)
sm = s.as_matrix()
# but...
sm.shape # output ---> (11L,)
How can I convert s into a matrix with shape (11,3)? Thanks!
Answers:
If, for some reason, you have found yourself with that abomination of a Series, getting it back into the sort of matrix or array you want is straightforward:
In [16]: s
Out[16]:
0     [1, 2, 3]
1     [2, 3, 4]
2     [3, 4, 5]
3     [2, 3, 4]
4     [3, 4, 5]
5     [2, 3, 4]
6     [3, 4, 5]
7     [2, 3, 4]
8     [3, 4, 5]
9     [2, 3, 4]
10    [3, 4, 5]
dtype: object
In [17]: sm = np.array(s.tolist())
In [18]: sm
Out[18]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5]])
In [19]: sm.shape
Out[19]: (11, 3)
But unless it’s something you can’t change, having that Series makes little sense to begin with.
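For anyone who wants to paste and run this outside an IPython session, here is a minimal self-contained sketch of the same round trip. The shortened four-row data and the column names "a", "b", "c" are my own assumptions for illustration; the second half shows one way to avoid the list-per-cell Series entirely by storing the rows in a DataFrame.

```python
import numpy as np
import pandas as pd

# Shortened version of the data from the question (4 rows instead of 11).
data = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [2, 3, 4]]
s = pd.Series(data)                # object-dtype Series, one Python list per row

# Round-trip through a plain list of lists, then let NumPy build the 2-D array.
sm = np.array(s.tolist())
print(sm.shape)                    # (4, 3)

# Alternative layout: one numeric column per position (column names are hypothetical).
df = pd.DataFrame(data, columns=["a", "b", "c"])
print(df.to_numpy().shape)         # (4, 3)
```

With the DataFrame layout the values are stored as real numeric columns, so no conversion step is needed later.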
Another way is to extract the values of your series and use numpy.stack on them.
np.stack(s.values)
PS. I’ve run into similar situations often.
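A small runnable sketch of the np.stack approach, assuming each element of the Series is a 1-D NumPy array of the same length (the sample data here is made up for illustration). It also shows the axis argument and what happens when the rows are not all the same length.

```python
import numpy as np
import pandas as pd

# Each element is already a NumPy array of length 3.
s = pd.Series([np.array([1, 2, 3]), np.array([2, 3, 4]),
               np.array([3, 4, 5]), np.array([2, 3, 4])])

sm = np.stack(s.values)            # rows stacked along a new first axis
print(sm.shape)                    # (4, 3)

sm_t = np.stack(s.values, axis=1)  # stack along axis 1 instead
print(sm_t.shape)                  # (3, 4)

# Rows of unequal length cannot be stacked into a rectangular matrix.
ragged = pd.Series([np.array([1, 2, 3]), np.array([4, 5])])
try:
    np.stack(ragged.values)
    stack_failed = False
except ValueError as exc:
    stack_failed = True
    print("np.stack failed:", exc)
```

So np.stack is only an option when every row has the same shape; otherwise you need to pad or filter the rows first.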
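For pandas >= 0.24, you can also use np.stack(s.to_numpy()) or np.concatenate(s.to_numpy()), depending on your requirement. Note that with 1-D rows like the ones in the question, np.stack gives the (11, 3) matrix, while np.concatenate joins the rows end to end into a flat array of 33 elements.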
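To make the difference between the two calls concrete, here is a minimal sketch on a shortened copy of the question's data (four rows, my own choice). np.stack adds a new axis, whereas np.concatenate joins along the existing one.

```python
import numpy as np
import pandas as pd

data = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [2, 3, 4]]
s = pd.Series(data)
values = s.to_numpy()              # 1-D object array holding 4 lists

stacked = np.stack(values)         # new leading axis: one row per element
joined = np.concatenate(values)    # rows joined end to end along axis 0

print(stacked.shape)               # (4, 3)
print(joined.shape)                # (12,)
print(joined.reshape(len(s), -1).shape)  # (4, 3), only valid if all rows have equal length
```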
I tested the above methods on 5793 vectors of 100 dimensions each. The old method, converting to a list first, is the fastest.
%time print(np.stack(df.features.values).shape)
%time print(np.stack(df.features.to_numpy()).shape)
%time print(np.array(df.features.tolist()).shape)
%time print(np.array(list(df.features)).shape)
Result
(5793, 100)
CPU times: user 11.7 ms, sys: 3.42 ms, total: 15.1 ms
Wall time: 22.7 ms
(5793, 100)
CPU times: user 11.1 ms, sys: 137 µs, total: 11.3 ms
Wall time: 11.9 ms
(5793, 100)
CPU times: user 5.96 ms, sys: 0 ns, total: 5.96 ms
Wall time: 6.91 ms
(5793, 100)
CPU times: user 5.74 ms, sys: 0 ns, total: 5.74 ms
Wall time: 6.43 ms
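If you want to repeat the comparison on your own machine, the following is a rough sketch using timeit instead of %time. The random data and the name features are stand-ins I made up for df.features (same shape, 5793 rows of 100 values); timings depend heavily on the pandas/NumPy versions and on whether the elements are lists or arrays, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for df.features: 5793 rows, each a 100-element array.
rng = np.random.default_rng(0)
vectors = rng.random((5793, 100))
features = pd.Series(list(vectors))

candidates = {
    "np.stack(values)": lambda: np.stack(features.values),
    "np.stack(to_numpy())": lambda: np.stack(features.to_numpy()),
    "np.array(tolist())": lambda: np.array(features.tolist()),
    "np.array(list(...))": lambda: np.array(list(features)),
}

shapes = {}
best_ms = {}
for name, fn in candidates.items():
    shapes[name] = fn().shape                      # sanity check: same result shape
    runs = timeit.repeat(fn, number=5, repeat=3)   # 3 batches of 5 calls each
    best_ms[name] = min(runs) / 5 * 1000           # best per-call time in ms
    print(f"{name:<22} shape={shapes[name]}  best={best_ms[name]:.2f} ms")
```

Taking the best of several repeats reduces noise compared with a single %time run.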