Convert a data set to a matrix with tuples

Question:

I need to convert a part of my data to make it compatible with this solution: https://stackoverflow.com/a/64854873

The data is a pandas.core.frame.DataFrame with:

   result  data_1  data_2
1   1.523       4    1223
3    1.33      84    1534

Some rows may have been removed, so the index can have gaps (here 1, 3, …).

Each entry should pair a tuple of the data values with the result. In the linked solution the type was scipy.sparse._coo.coo_matrix, which prints like:

  (4, 1223) 1.523
  (84, 1534) 1.33

Simply calling scipy.sparse.coo_matrix(df.values) seems to mix up the data:

  (0, 0)    1.523
  (0, 1)    1.53
  (0, 24)   1.92
  : :
  (2, 151)  123.0
  (2, 142)  834.0

How can I generate a compatible matrix?

Asked By: Ximi

||

Answers:

You can select the data columns with filter and then apply tuple with axis=1, which builds a tuple from each row’s values. I’m assigning the result as a new column because it isn’t clear from the question whether the expected output is an array or a dataframe, but you should be able to get to the final form you need from here.

>>> df.assign(data=df.filter(like='data').apply(tuple, axis=1))

   result  data_1  data_2        data
1   1.523       4    1223   (4, 1223)
3   1.330      84    1534  (84, 1534)

Answered By: ThePyGuy

Try this:

df['tuple'] = list(zip(df.data_1, df.data_2))
result = df[['tuple', 'result']].to_numpy()
print(result)

Result:

[[(4, 1223) 1.523]
 [(84, 1534) 1.33]]

Source:
How to form tuple column from two columns in Pandas
Convert pandas dataframe to NumPy array

Answered By: zalevskiaa

You can recreate the sparse matrix (not just copy its display) with:

In [87]: import numpy as np
    ...: from scipy import sparse

Three arrays that can be derived from the columns of the dataframe:

In [88]: data = np.array([1.523, 1.33])
In [89]: row = np.array([4, 84])
In [90]: col = np.array([1223, 1534])

The actual matrix:

In [91]: M = sparse.coo_matrix((data,(row, col)))

The repr display:

In [92]: M
Out[92]: 
<85x1535 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

and its str display:

In [93]: print(M)
  (4, 1223)     1.523
  (84, 1534)    1.33
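
The arrays in this session were typed in by hand. As a minimal sketch, assuming the dataframe from the question (rebuilt here as df, with result holding the values and data_1, data_2 holding the row and column coordinates), the same matrix could be built from the columns directly:

```python
import pandas as pd
from scipy import sparse

# Sample frame copied from the question; the index has gaps (1, 3).
df = pd.DataFrame(
    {"result": [1.523, 1.33], "data_1": [4, 84], "data_2": [1223, 1534]},
    index=[1, 3],
)

# Assumption: "result" holds the stored values, while "data_1" and
# "data_2" hold the row and column coordinates.
data = df["result"].to_numpy()
row = df["data_1"].to_numpy()
col = df["data_2"].to_numpy()

# The (data, (row, col)) form places data[k] at position (row[k], col[k]).
M = sparse.coo_matrix((data, (row, col)))
print(M)
```

The dataframe index (1, 3) plays no part here; only the column values end up in the matrix.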

The shape of M is inferred from the maximum values in row and col; in practice you might want to pass a larger shape explicitly.
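
A short sketch of passing the shape explicitly. The dimensions (100, 2000) are hypothetical, chosen only for illustration; the real ones depend on what the downstream solution expects.

```python
import numpy as np
from scipy import sparse

data = np.array([1.523, 1.33])
row = np.array([4, 84])
col = np.array([1223, 1534])

# Hypothetical dimensions; each must be greater than the largest
# index used along that axis.
M_big = sparse.coo_matrix((data, (row, col)), shape=(100, 2000))
print(M_big.shape)
```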

M.toarray() creates a numpy array from this, but with that shape it will be too large to display.

I’m not sure how the dataframe was derived from such a matrix.
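
One possible way, offered only as a guess, is to read the coordinates and values back from the row, col and data attributes of a COO matrix. The column names below are taken from the question; df_back is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy import sparse

M = sparse.coo_matrix(
    (np.array([1.523, 1.33]), (np.array([4, 84]), np.array([1223, 1534])))
)

# A COO matrix exposes its coordinates and values as .row, .col and .data.
df_back = pd.DataFrame({"result": M.data, "data_1": M.row, "data_2": M.col})
print(df_back)
```

This gives a default 0, 1, … index, so the gaps in the question's index (1, 3) would have to come from a later filtering step.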

Answered By: hpaulj