Pass the values from a dataset with indexes and values to a sparse NumPy array
Question:
I want to make a sparse NumPy array using the indexes and values stored in a pandas DataFrame.
The DataFrame has the columns ‘userIndex’, ‘movieIndex’ and ‘rating’, and a million rows.
For example:
| | movieIndex | userIndex | rating |
|---|---|---|---|
| 0 | 0 | 4 | 2.5 |
| 1 | 2 | 2 | 3.0 |
| 2 | 1 | 1 | 4.0 |
| 3 | 2 | 0 | 4.0 |
| 4 | 4 | 2 | 3.0 |
Would be transformed to a numpy array like this:
[[0 0 0 0 2.5],
[0 4.0 0 0 0],
[4.0 0 3.0 0 0],
[0 0 0 0 0],
[0 0 3.0 0 0]]
So, first I’m making a np.zeros array with the correct size:
Y = np.zeros([nm,nu])
And for now, I’m passing the information as:
for i in range(len(ratings)):
    Y[int(ratings.iloc[i].movieIndex), int(ratings.iloc[i].userIndex)] = ratings.iloc[i].rating
This works, and it is O(n) in the number of rows, so it’s not really bad, but it takes 3 minutes to run.
I know it’s not a good idea to loop over a DataFrame with "for", and that I should use vectorized operations instead, but I can’t find a way to make this work. Any ideas?
Answers:
Assigning all the values at once, using the two index columns as index arrays, may be faster:
Y[ratings["movieIndex"].values, ratings["userIndex"].values] = ratings["rating"].values
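For reference, here is a minimal runnable sketch of that one-line assignment on the sample data from the question. It makes a few assumptions that are not in the original post: the matrix size (nm, nu) is derived from the largest index in each column, and the index columns are cast to integers first, because the loop in the question calls int() on them, which suggests they might be stored as floats.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
ratings = pd.DataFrame({
    "movieIndex": [0, 2, 1, 2, 4],
    "userIndex": [4, 2, 1, 0, 2],
    "rating": [2.5, 3.0, 4.0, 4.0, 3.0],
})

# Assumption: the matrix dimensions come from the largest index seen.
# In the question, nm and nu are already known.
nm = int(ratings["movieIndex"].max()) + 1
nu = int(ratings["userIndex"].max()) + 1

# Cast the index columns to integers; NumPy rejects float index arrays.
rows = ratings["movieIndex"].to_numpy().astype(np.intp)
cols = ratings["userIndex"].to_numpy().astype(np.intp)

# One vectorized assignment replaces the row-by-row Python loop.
Y = np.zeros((nm, nu))
Y[rows, cols] = ratings["rating"].to_numpy()

print(Y)
```

Note that Y is still a dense array with nm * nu entries; the indexing only removes the Python-level loop, it does not change how the result is stored.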