Row wise cosine similarity calculation in pandas

Question:

I have a dataframe that looks like this:

     api_spec_id  label  Paths_modified  Tags_modified  Endpoints_added
933        803.0  minor             8.0            3.0                6
934        803.0  patch             0.0            4.0                2
935        803.0  patch             3.0            1.0                0
938        803.0  patch            10.0            0.0                4
939        803.0  patch             3.0            5.0                1
940        803.0  patch             6.0            0.0                0
942        803.0  patch             0.0            6.0                2
946        803.0  patch             3.0            2.0                3
947        803.0  patch             0.0            0.0                1

I want to calculate the row-wise cosine similarity between every pair of consecutive rows. The dataframe is already sorted on api_spec_id and date.

The expected output should be something like this (the values are not exact):

     api_spec_id  label  Paths_modified  Tags_modified  Endpoints_added  Distance
933        803.0  minor             8.0            3.0                6       ...
934        803.0  patch             0.0            4.0                2   1.00234
935        803.0  patch             3.0            1.0                0
938        803.0  patch            10.0            0.0                4
939        803.0  patch             3.0            5.0                1
940        803.0  patch             6.0            0.0                0
942        803.0  patch             0.0            6.0                2
946        803.0  patch             3.0            2.0                3
947        803.0  patch             0.0            0.0                1

I tried looking at the solutions here on Stack Overflow, but the use cases all seem to be a bit different. I have many more features, around 32 in total, and I want to consider all of those feature columns (Paths_modified, Tags_modified and Endpoints_added in the df above are examples) and calculate the distance metric for each row.

This is what I could think of, but it does not fulfil the purpose:

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'commit_date', 'feature1', 'feature2', 'feature3'])

similarity_df = df.iloc[:, 2:].apply(lambda x: cosine_similarity([x], df.iloc[:, 2:])[0], axis=1)

Does anyone have suggestions on how I could proceed with this?

Edit: One constraint in my use case is that I cannot get rid of my other columns. I still need to retain at least api_spec_id so that I can map the distance back to the original dataframe.

Asked By: Brie MerryWeather

||

Answers:

This can be done without apply (faster):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'commit_date', 'feature1', 'feature2', 'feature3'])


# Calculate L2 norm of features in row
df["l2norm"] = np.linalg.norm(df.loc[:, "feature1":"feature3"], axis=1)

# Create shifted dataframe
df2 = df.shift(1, fill_value=0)


# Dot product of current with previous row
dot_product = (df.loc[:, "feature1":"feature3"] * df2.loc[:, "feature1":"feature3"]).sum(axis=1)

# L2 norm product of current and previous row
norm_product = df["l2norm"] * df2["l2norm"]

# Divide and print
print(dot_product / norm_product)
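
As a sanity check on the dot-product-over-norms formula used above, here is a minimal sketch that evaluates it by hand for the first two rows of the question's table. The helper name cosine is made up for illustration and is not part of the original answer.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1D vectors: (a . b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Feature values of rows 933 and 934 from the question
row_933 = [8.0, 3.0, 6.0]
row_934 = [0.0, 4.0, 2.0]

# Dot product is 0*8 + 4*3 + 2*6 = 24; the norms are sqrt(109) and sqrt(20)
result = cosine(row_933, row_934)
print(result)  # 24 / sqrt(109 * 20), roughly 0.514
```

Cosine similarity of non-negative feature vectors always lies between 0 and 1, so a value such as 1.00234 cannot come out of this formula.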
Answered By: Michael Butscher

Try this approach using cosine_similarity:

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

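# Assumes a default RangeIndex (0..n-1), because row.name is used as a position;
# call df = df.reset_index(drop=True) first if your index is different
df['Distance'] = (df.iloc[:, 2:]
                    .apply(lambda row: cosine_similarity([row],
                    [df.iloc[row.name - 1, 2:]])[0][0]
                    if row.name > 0 else None, axis=1))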
print(df)

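You can also compute the full similarity matrix once and pick out the consecutive pairs with a loop, but keep the size of your dataframe in mind, since the matrix has one entry for every pair of rows: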

similarity_df = cosine_similarity(df.iloc[:, 2:])
df['Distance'] = [None] + [similarity_df[i, i-1] for i in range(1, len(df))]
print(df)
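
To show where the consecutive pairs sit in a full pairwise matrix, here is a small NumPy-only sketch. It builds the matrix by hand instead of calling cosine_similarity, and it uses the first three feature rows from the question as sample data.

```python
import numpy as np

# Feature values of the first three rows in the question
X = np.array([[8.0, 3.0, 6.0],
              [0.0, 4.0, 2.0],
              [3.0, 1.0, 0.0]])

# Scale every row to unit length; X_unit @ X_unit.T is then the
# full pairwise cosine similarity matrix (n x n)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T

# The entries sim[i, i-1] form the first subdiagonal
consecutive = np.diag(sim, k=-1)

# The first row has no predecessor, so it gets NaN
distance = np.concatenate(([np.nan], consecutive))
print(distance)
```

Only n - 1 of the n * n entries are actually used, which is why this gets wasteful as the number of rows grows.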

Note: If the provided code does not produce your desired output, you might have to update your question to include the exact values of the Distance column.

Answered By: Jamiu S.

TL;DR: The goal is to compute the cosine similarity between each pair of consecutive rows in a pandas DataFrame that has numerous feature columns. The cosine_similarity function from sklearn.metrics.pairwise can be used for this. Each consecutive pair of rows is iterated over, their feature values are reshaped to 2D, and the cosine_similarity between them is calculated. The resulting similarity values are appended to a list and added to the DataFrame as a new column.

To get the cosine similarity between consecutive rows in your DataFrame, use the cosine_similarity function from the sklearn.metrics.pairwise module.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity for consecutive rows
similarity_list = []
for i in range(len(df) - 1):
    row1 = df.iloc[i, 2:].values.reshape(1, -1)
    row2 = df.iloc[i+1, 2:].values.reshape(1, -1)
    similarity = cosine_similarity(row1, row2)[0][0]
    similarity_list.append(similarity)

# Add similarity values to DataFrame
df['Distance'] = [np.nan] + similarity_list

The cosine_similarity function is used in this code to iteratively calculate the cosine similarity between each pair of subsequent rows in the DataFrame. These similarity values are then added to a list and added as a new column with the name Distance to the original DataFrame.

To ensure that the feature values have a 2D shape, which is necessary for the cosine similarity function, we reshape them using reshape(1, -1).
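
As a quick illustration of the reshape step (a sketch with sample values, NumPy only): scikit-learn's pairwise functions expect 2D input of shape (n_samples, n_features), so a single row with three features has to become a 1 x 3 array.

```python
import numpy as np

row = np.array([8.0, 3.0, 6.0])   # one row's feature values, shape (3,)
row_2d = row.reshape(1, -1)       # one sample with three features, shape (1, 3)

print(row.shape, row_2d.shape)    # (3,) (1, 3)
```

The -1 tells NumPy to infer the number of columns, so the same call works for any number of features.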

Answered By: Rishabh Anand

I was able to figure it out somehow. The loop is what I was looking for, since for some of the api_spec_ids the first row was not getting assigned NaN and a distance was being calculated instead, which is wrong.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature columns to use for cosine similarity calculation
cols_to_use = labels.loc[:, "Info_contact_name_changes":"Paths_modified"].columns

# New column for cosine similarity
labels['cosine_sim'] = np.nan

# Looping through each api_spec_id
for api_spec_id in labels['api_spec_id'].unique():
    # Get the rows for the current api_spec_id
    api_rows = labels[labels['api_spec_id'] == api_spec_id].sort_values(by='commit_date')

    # Set the cosine similarity of the first row to NaN, since there is no previous row to compare to
    labels.loc[api_rows.index[0], 'cosine_sim'] = np.nan
    
    # Calculate the cosine similarity for consecutive rows
    for i in range(1, len(api_rows)):
        # Get the previous and current row
        prev_row = api_rows.iloc[i-1][cols_to_use]
        curr_row = api_rows.iloc[i][cols_to_use]
        
        # Calculate the cosine similarity and store it in the 'cosine_sim' column
        cosine_sim = cosine_similarity([prev_row], [curr_row])[0][0]
        labels.loc[api_rows.index[i], 'cosine_sim'] = cosine_sim
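
If the loop becomes slow on a large frame, the same per-api_spec_id logic can probably be expressed without Python loops by shifting within each group. This is only a sketch: the function name consecutive_cosine, the feature names f1 to f3 and the sample values are made up, and it assumes the frame is already sorted by date within each api_spec_id and has a unique index.

```python
import numpy as np
import pandas as pd

def consecutive_cosine(df, group_col, feature_cols):
    """Cosine similarity of each row with the previous row of the same group."""
    cur = df[feature_cols].astype(float)
    # Previous row within the same group; NaN on the first row of every group
    prev = cur.groupby(df[group_col]).shift(1)
    # skipna=False keeps NaN where there is no previous row
    dot = (cur * prev).sum(axis=1, skipna=False)
    norms = (np.sqrt((cur ** 2).sum(axis=1))
             * np.sqrt((prev ** 2).sum(axis=1, skipna=False)))
    return dot / norms

# Made-up sample data with two api_spec_ids
labels = pd.DataFrame({
    "api_spec_id": [803, 803, 803, 804, 804],
    "f1": [8, 0, 3, 1, 2],
    "f2": [3, 4, 1, 0, 0],
    "f3": [6, 2, 0, 0, 0],
})
labels["cosine_sim"] = consecutive_cosine(labels, "api_spec_id", ["f1", "f2", "f3"])
print(labels)
```

Because the shift happens inside each group, the first row of every api_spec_id stays NaN and is never compared with the last row of the previous id. Rows whose features are all zero also come out as NaN, since their norm is zero.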

Answered By: Brie MerryWeather