Row wise cosine similarity calculation in pandas
Question:
I have a dataframe that looks like this:
     api_spec_id  label  Paths_modified  Tags_modified  Endpoints_added
933        803.0  minor             8.0            3.0                6
934        803.0  patch             0.0            4.0                2
935        803.0  patch             3.0            1.0                0
938        803.0  patch            10.0            0.0                4
939        803.0  patch             3.0            5.0                1
940        803.0  patch             6.0            0.0                0
942        803.0  patch             0.0            6.0                2
946        803.0  patch             3.0            2.0                3
947        803.0  patch             0.0            0.0                1
I want to calculate the row-wise cosine similarity between every pair of consecutive rows. The dataframe is already sorted on api_spec_id and date.
The expected output should be something like this (the values are not exact):
     api_spec_id  label  Paths_modified  Tags_modified  Endpoints_added  Distance
933        803.0  minor             8.0            3.0                6       ...
934        803.0  patch             0.0            4.0                2   1.00234
935        803.0  patch             3.0            1.0                0
938        803.0  patch            10.0            0.0                4
939        803.0  patch             3.0            5.0                1
940        803.0  patch             6.0            0.0                0
942        803.0  patch             0.0            6.0                2
946        803.0  patch             3.0            2.0                3
947        803.0  patch             0.0            0.0                1
I looked at the existing solutions on Stack Overflow, but the use case seems to be a bit different in each of them. I have many more features, around 32 in total, and I want to use all of those feature columns (Paths_modified, Tags_modified and Endpoints_added in the df above are examples) to calculate the distance metric for each row.
This is what I could come up with, but it does not do what I need:
df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'commit_date', 'feature1', 'feature2', 'feature3'])
similarity_df = df.iloc[:, 2:].apply(lambda x: cosine_similarity([x], df.iloc[:, 2:])[0], axis=1)
Does anyone have suggestions on how I could proceed with this?
Edit: One constraint in my use case is that I cannot drop my other columns. I still need to keep at least api_spec_id so that the distance can be mapped back to the original dataframe.
Answers:
This can be done without apply (faster):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'commit_date', 'feature1', 'feature2', 'feature3'])
# Calculate L2 norm of features in row
df["l2norm"] = np.linalg.norm(df.loc[:, "feature1":"feature3"], axis=1)
# Create shifted dataframe
df2 = df.shift(1, fill_value=0)
# Dot product of current with previous row
dot_product = (df.loc[:, "feature1":"feature3"] * df2.loc[:, "feature1":"feature3"]).sum(axis=1)
# L2 norm product of current and previous row
norm_product = df["l2norm"] * df2["l2norm"]
# Divide and print
print(dot_product / norm_product)
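As a rough check of the same shift-based idea on the question's own columns, here is a minimal sketch using the first four rows from the question. The names feature_cols, features and previous are only illustrative, and shifting just the feature columns (instead of the whole dataframe with fill_value=0) is my own variation so that the first row comes out as NaN explicitly.

```python
import numpy as np
import pandas as pd

# First four rows copied from the question; the index values are the row labels shown there.
df = pd.DataFrame(
    {
        "api_spec_id": [803.0, 803.0, 803.0, 803.0],
        "label": ["minor", "patch", "patch", "patch"],
        "Paths_modified": [8.0, 0.0, 3.0, 10.0],
        "Tags_modified": [3.0, 4.0, 1.0, 0.0],
        "Endpoints_added": [6, 2, 0, 4],
    },
    index=[933, 934, 935, 938],
)

# Illustrative name: the columns that take part in the similarity.
feature_cols = ["Paths_modified", "Tags_modified", "Endpoints_added"]
features = df[feature_cols].astype(float)

# After the shift, row i holds the features of row i-1; the first row is all NaN.
previous = features.shift(1)

# min_count=1 keeps the all-NaN first row as NaN instead of summing it to 0.
dot_product = (features * previous).sum(axis=1, min_count=1)
norm_product = np.linalg.norm(features, axis=1) * np.linalg.norm(previous, axis=1)

# The non-feature columns (api_spec_id, label) stay in df untouched.
df["Distance"] = dot_product / norm_product
print(df)
```

For row 934 this gives 24 / sqrt(109 * 20), which is about 0.514, and the first row is NaN because it has no previous row.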
Try this approach using cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
df['Distance'] = (df.iloc[:, 2:]
                  .apply(lambda row: cosine_similarity([row],
                                                       [df.iloc[row.name - 1, 2:]])[0][0]
                         if row.name > 0 else None, axis=1))
print(df)
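You can also compute the full pairwise similarity matrix once and pick out the consecutive pairs with a loop, but keep the size of your dataframe in mind, since this computes every pair of rows: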
similarity_df = cosine_similarity(df.iloc[:, 2:])
df['Distance'] = [None] + [similarity_df[i, i-1] for i in range(1, len(df))]
print(df)
Note: If the provided code does not produce your desired output, you might have to update your question to include the exact values of the Distance column.
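If you go the full-matrix route, the consecutive pairs are the entries just below the main diagonal, so np.diag with k=-1 can pull them out without a list comprehension. This is only a sketch on made-up data with the question's placeholder column names; feature_cols and consecutive are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up data using the placeholder column names from the question.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "commit_date": [10, 11, 12],
        "feature1": [8.0, 0.0, 3.0],
        "feature2": [3.0, 4.0, 1.0],
        "feature3": [6.0, 2.0, 0.0],
    }
)

# Selecting features by name keeps a previously added column out of the calculation.
feature_cols = ["feature1", "feature2", "feature3"]
similarity_matrix = cosine_similarity(df[feature_cols])

# k=-1 selects entries [1, 0], [2, 1], ...: each row compared with the row before it.
consecutive = np.diag(similarity_matrix, k=-1)

# The first row has no predecessor, so it gets NaN.
df["Distance"] = np.concatenate(([np.nan], consecutive))
print(df)
```

Selecting the feature columns by name rather than with df.iloc[:, 2:] also avoids accidentally including a Distance column that an earlier snippet may already have added.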
TL;DR: The goal is to compute the cosine similarity between each row and the row before it in a pandas DataFrame that has many feature columns. The cosine_similarity function from sklearn.metrics.pairwise can be used for this. Iterate over each consecutive pair of rows, reshape their feature values to 2D, compute the cosine_similarity between them, collect the values in a list, and add that list to the DataFrame as a new column.
To get the cosine similarity between consecutive rows in your DataFrame, use the cosine_similarity function from the sklearn.metrics.pairwise module.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Calculate cosine similarity for consecutive rows
similarity_list = []
for i in range(len(df) - 1):
    row1 = df.iloc[i, 2:].values.reshape(1, -1)
    row2 = df.iloc[i+1, 2:].values.reshape(1, -1)
    similarity = cosine_similarity(row1, row2)[0][0]
    similarity_list.append(similarity)
# Add similarity values to DataFrame (the first row has no previous row, hence NaN)
df['Distance'] = [np.nan] + similarity_list
This code uses the cosine_similarity function to calculate the cosine similarity between each pair of consecutive rows in the DataFrame. The similarity values are collected in a list and added to the original DataFrame as a new column named Distance.
The cosine_similarity function expects 2D input (samples by features), so each row's feature values are reshaped with reshape(1, -1).
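To make the reshape step concrete, here is a small sketch with the feature values of the first two rows from the question, showing the shapes before and after reshape(1, -1). The variable names are only illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature values of rows 933 and 934 from the question.
row1 = np.array([8.0, 3.0, 6.0])
row2 = np.array([0.0, 4.0, 2.0])

print(row1.shape)                 # (3,)
print(row1.reshape(1, -1).shape)  # (1, 3): one sample with three features

# With two single-sample inputs the result is a 1x1 matrix.
result = cosine_similarity(row1.reshape(1, -1), row2.reshape(1, -1))
print(result.shape)               # (1, 1)

# [0][0] extracts the single similarity value as a scalar.
similarity = result[0][0]
print(similarity)
```

The -1 tells NumPy to infer the number of columns, so the same call works unchanged for all 32 features.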
I was able to figure it out in the end. A loop was what I was looking for, because with the other approaches the first row of some api_spec_ids was not being assigned NaN. Instead a distance was calculated against the last row of the previous api_spec_id, which is wrong.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Feature columns to use for cosine similarity calculation
cols_to_use = labels.loc[:, "Info_contact_name_changes":"Paths_modified"].columns
# New column for cosine similarity
labels['cosine_sim'] = np.nan
# Looping through each api_spec_id
for api_spec_id in labels['api_spec_id'].unique():
    # Get the rows for the current api_spec_id
    api_rows = labels[labels['api_spec_id'] == api_spec_id].sort_values(by='commit_date')
    # Set the cosine similarity of the first row to NaN, since there is no previous row to compare to
    labels.loc[api_rows.index[0], 'cosine_sim'] = np.nan
    # Calculate the cosine similarity for consecutive rows
    for i in range(1, len(api_rows)):
        # Get the previous and current row
        prev_row = api_rows.iloc[i-1][cols_to_use]
        curr_row = api_rows.iloc[i][cols_to_use]
        # Calculate the cosine similarity and store it in the 'cosine_sim' column
        cosine_sim = cosine_similarity([prev_row], [curr_row])[0][0]
        labels.loc[api_rows.index[i], 'cosine_sim'] = cosine_sim
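If the nested loop turns out to be slow, the same per-api_spec_id behaviour can probably be had without a Python loop by shifting within each group. This is a sketch on hypothetical data, not the code I actually ran: the sample values, the three-column cols_to_use list and the second api_spec_id (804) are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two api_spec_id groups with a few feature columns.
labels = pd.DataFrame(
    {
        "api_spec_id": [803, 803, 803, 804, 804],
        "commit_date": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
        ),
        "Paths_modified": [8.0, 0.0, 3.0, 1.0, 2.0],
        "Tags_modified": [3.0, 4.0, 1.0, 0.0, 0.0],
        "Endpoints_added": [6.0, 2.0, 0.0, 1.0, 2.0],
    }
)

# Illustrative subset; in the real data this would be all 32 feature columns.
cols_to_use = ["Paths_modified", "Tags_modified", "Endpoints_added"]
labels = labels.sort_values(["api_spec_id", "commit_date"])

features = labels[cols_to_use]
# Shifting inside each group leaves the first row of every api_spec_id as NaN,
# so no similarity is computed across two different api_spec_ids.
previous = features.groupby(labels["api_spec_id"]).shift(1)

# min_count=1 keeps all-NaN rows as NaN instead of summing them to 0.
dot_product = (features * previous).sum(axis=1, min_count=1)
norm_product = np.linalg.norm(features, axis=1) * np.linalg.norm(previous, axis=1)

labels["cosine_sim"] = dot_product / norm_product
print(labels)
```

The first row of each api_spec_id ends up as NaN, which is the behaviour the loop above was written to guarantee.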