What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question:

Given a sparse matrix listing, what’s the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

A= 
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

The sparse representation is:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it’s straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])

dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That’s fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished?

Asked By: zbinsd


Answers:

You should check out scipy.sparse. You can apply operations to those sparse matrices just as you would to a normal matrix.
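
For example, here is a minimal sketch that builds the question's matrix from its coordinate listing with scipy.sparse and takes sparse dot products:

import numpy as np
from scipy import sparse

# build A from the question's (row, col) listing; every listed entry is 1
rows = [0, 0, 1, 1, 1, 2, 2, 2]
cols = [1, 4, 2, 3, 4, 0, 1, 3]
A = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 5))

# the usual linear-algebra operations work, and the Gram matrix stays sparse
print(A.dot(A.T).toarray())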

Answered By: cheeyos

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).

See below for a discussion of how to optimize for sparsity.

import numpy as np

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = np.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = np.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if a row never occurs, set its inverse magnitude to zero (instead of inf)
inv_square_mag[np.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = np.sqrt(inv_square_mag)
    
# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let’s call this dimension the ‘item’ dimension.

If this is the case, list your ‘items’ in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you’ll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
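
In case it helps, here is a minimal sketch of that sparse variant, assuming A is a scipy.sparse matrix with 'items' in rows:

import numpy as np

def cosine_from_sparse(A):
    # base similarity matrix (all dot products), densified as described above
    similarity = A.dot(A.T).toarray()

    # squared magnitude of each preference vector sits on the diagonal
    square_mag = np.diag(similarity)

    # inverse squared magnitude, with empty rows set to zero instead of inf
    with np.errstate(divide='ignore'):
        inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0

    inv_mag = np.sqrt(inv_square_mag)

    # scale rows, then columns, by the inverse magnitudes
    cosine = similarity * inv_mag
    return cosine.T * inv_mag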

Answered By: Waylon Flinn

You can do it this way:

    import numpy as np
    import scipy.sparse as sp

    # data, row, and col are the arrays of your COO-format sparse input
    temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
    temp1 = temp.tocsr()

    # cosine similarity: L2-normalize each row, then multiply by the transpose
    row_sums = (temp1.multiply(temp1)).sum(axis=1)
    row_sums_sqrt = np.array(np.sqrt(row_sums))[:, 0]
    row_indices, col_indices = temp1.nonzero()
    temp1.data /= row_sums_sqrt[row_indices]
    temp2 = temp1.transpose()
    temp3 = temp1 * temp2
Answered By: Vaali

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# it can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

similarities = cosine_similarity(A_sparse.transpose())
Answered By: Jeff

I took all these answers and wrote a script to 1. validate each of the results (see the assertion below) and 2. see which is fastest.
Code and results are below:

# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print "Input data shape:", Asp.shape

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)

        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)

        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if a row never occurs, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A * A.T
        # note: for a binary matrix, the row sums equal the squared magnitudes
        square_mag = np.array(A.sum(axis=1))
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity.multiply(inv_mag).toarray()
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print "Method 1"
%timeit calc_sim(A, method=1)
print "Method 2"
%timeit calc_sim(A, method=2)
print "Method 3"
%timeit calc_sim(A, method=3)
print "Method 4"
%timeit calc_sim(A, method=4)
print "Method 5"
%timeit calc_sim(Asp, method=5)
print "Method 6"
%timeit calc_sim(A, method=6)

Results:

Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop
Answered By: zbinsd

I have tried some of the methods above. However, the experiment by @zbinsd has its limitations: the sparsity of the matrix used in the experiment is extremely low, while real-world sparsity is usually over 90%.
In my case, the sparse matrix has shape (7000, 25000) and a sparsity of 97%. Method 4 is extremely slow and I couldn't tolerate waiting for the results, while method 6 finishes in 10 s. Amazingly, the method below finishes in only 0.247 s.

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat
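
For example, applied to the question's matrix (note the function computes column-wise similarities, so pass the transpose for row-wise results):

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))

row_sims = cosine_similarities(A.T)  # sparse; same values as cosine_similarity(A)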


Answered By: tianshi miao

Building off of Vaali’s solution:

import numpy as np
from scipy.sparse import csr_matrix

def sparse_cosine_similarity(sparse_matrix):
    out = (sparse_matrix.copy() if isinstance(sparse_matrix, csr_matrix)
           else sparse_matrix.tocsr())
    # L2-normalize each row in place, then take the sparse Gram matrix
    squared = out.multiply(out)
    sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
    row_indices, col_indices = out.nonzero()
    out.data /= sqrt_sum_squared_rows[row_indices]
    return out.dot(out.T)

This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. I haven't tested it extensively, though, so caveat emptor. (Update: I feel confident in this solution now that I've tested and benchmarked it.)

Also, here is the sparse version of Waylon’s solution in case it helps anyone, not sure which solution is actually better.

def sparse_cosine_similarity_b(sparse_matrix):
    input_csr_matrix = sparse_matrix.tocsr()
    similarity = input_csr_matrix * input_csr_matrix.T
    square_mag = similarity.diagonal()
    inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    inv_mag = np.sqrt(inv_square_mag)
    return similarity.multiply(inv_mag).T.multiply(inv_mag)

Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity.

😀

Update:

Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt
and it looks like the first algorithm performs the best of the three most of the time.

Answered By: David

from math import sqrt

def norm(vector):
    return sqrt(sum(x * x for x in vector))

def cosine_similarity(vec_a, vec_b):
    norm_a = norm(vec_a)
    norm_b = norm(vec_b)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)

This method seems to be somewhat faster than using sklearn’s implementation if you pass in one pair of vectors at a time.
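
For example, on the first two rows of the question's matrix:

print(cosine_similarity([0, 1, 0, 0, 1], [0, 0, 1, 1, 1]))  # 0.408...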

Answered By: Daniel

I suggest running this in two steps:

1) Generate a mapping A that maps each column index to the set of objects (rows) with a non-zero entry in that column.

2) For each object i (row) with non-zero occurrences (columns) {k1, ..., kn}, calculate the cosine similarity just against the elements in the union set A[k1] ∪ A[k2] ∪ ... ∪ A[kn].

Assuming a big, highly sparse matrix, this gains a significant boost over brute force, as sketched below.
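
A rough sketch of this idea for a binary matrix, using a hypothetical dict-of-sets representation (row id -> set of non-zero column indices):

from collections import defaultdict
from math import sqrt

def candidate_cosines(rows):
    # step 1: inverted index, column index -> ids of rows that are non-zero there
    index = defaultdict(set)
    for i, cols in rows.items():
        for k in cols:
            index[k].add(i)

    # step 2: compare each row only against rows sharing at least one column
    sims = {}
    for i, cols in rows.items():
        candidates = set().union(*(index[k] for k in cols)) - {i}
        for j in candidates:
            if (j, i) in sims:
                continue  # cosine similarity is symmetric
            overlap = len(cols & rows[j])  # dot product of binary rows
            sims[(i, j)] = overlap / (sqrt(len(cols)) * sqrt(len(rows[j])))
    return sims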

Answered By: moshik

@Jeff's solution needs an update.

As of scikit-learn 1.1.2, you don't need to convert to scipy's sparse format before calling cosine_similarity.

All you need is cosine_similarity:

import numpy as np
import perfplot
from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity_sklearn_internal
from scipy import spatial
from scipy import sparse

target_dtype = "float16"


class prettyfloat(float):
    def __repr__(self):
        return "%.2f" % self


def cosine_similarity_sklearn(x):
    return cosine_similarity_sklearn_internal(x)


def cosine_similarity_sklearn_sparse(x):
    x_sparse = sparse.csr_matrix(x)

    return cosine_similarity_sklearn_internal(x_sparse)


def cosine_similarity_einsum(x, y=None):
    """
    Calculate the pairwise cosine similarities.
    If y is None, compare x with itself.
    """
    # cosine similarity in einsum notation, without astype
    normed_x = x / np.linalg.norm(x, axis=1)[:, None]
    normed_y = y / np.linalg.norm(y, axis=1)[:, None] if y is not None else normed_x
    return np.einsum("ik,jk->ij", normed_x, normed_y)


def cosine_similarity_scipy(x, y=None):
    """
    Cosine similarity via scipy. Note that spatial.distance.cosine only
    accepts 1-D vectors, so this kernel is left out of the benchmark below.
    """
    return 1 - spatial.distance.cosine(x, x)


def setup_n(n) -> np.ndarray:
    nd_arr = np.random.randn(int(2 ** n), 512).astype(target_dtype)
    return nd_arr


def equality_check(a, b):
    if type(a) != np.ndarray:
        a = a.todense()
    if type(b) != np.ndarray:
        b = b.todense()

    return np.isclose(a.astype(target_dtype), b.astype(target_dtype), atol=1e-3).all()


fig = perfplot.show(
    setup=setup_n,
    n_range=[k for k in range(1, 10)],
    kernels=[
        cosine_similarity_sklearn,
        cosine_similarity_sklearn_sparse,
        cosine_similarity_einsum,
        # cosine_similarity_scipy,
    ],
    labels=["sk-def", "sk+sparse", "einsum"],
    logx=False,
    logy=False,
    xlabel='2^n',
    equality_check=equality_check,
)

Using perfplot, the benchmark shows that calling sklearn's cosine_similarity directly on the dense array (the "sk-def" kernel above) is the best.

Tested with scikit-learn 1.1.2 and 1.1.3.

The results can differ between float64 and float16:

[Benchmark plots for float64 and float16 not reproduced here.]

Answered By: Arthur Kim