How to check each column for a value greater than a threshold and accept if at least 3 out of 6 columns exceed it

Question:

I am trying to build an article similarity checker by comparing 6 baseline articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article, one by one, with the 6 articles that I am using as a baseline.

My dataframe now looks like this:

id Article cosinesin1 cosinesin2 cosinesin3 cosinesin4 cosinesin5 cosinesin6 Similar
id1 [Article1] 0.2 0.5 0.6 0.8 0.7 0.8 True
id2 [Article2] 0.1 0.2 0.03 0.8 0.2 0.45 False

I want to add a Similar column to my dataframe that checks the values of cosinesin1 through cosinesin6 for each row and returns True if at least 3 out of the 6 are greater than 0.5, and False otherwise.

Is there any way to do this in Python?

Thanks

Asked By: Python-data

||

Answers:

In Python, you can treat True and False as integers, 1 and 0, respectively.
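
As a quick illustration of that point in plain Python (the variable names `flags` and `count` are made up for this sketch):

```python
# Each comparison yields a bool; bool is a subclass of int in Python
flags = [0.6 > 0.5, 0.2 > 0.5, 0.8 > 0.5]  # [True, False, True]

# Summing booleans counts the True values (True == 1, False == 0)
count = sum(flags)
print(count)  # 2
```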

So if you compare all the similarity metrics to 0.5, you can sum over the resulting Boolean DataFrame along the columns, to get the number of comparisons that resulted in True for each row. Comparing those numbers to 3 yields the column you want:

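# Names of the six similarity columns: cosinesin1 ... cosinesin6
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
# Per row, count how many values exceed 0.5 and flag rows with at least 3
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3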
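
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes the dataframe has the column names shown in the question and rebuilds only the two sample rows from there; the intermediate names `above` and `n_above` are introduced here just to show each step.

```python
import pandas as pd

# Sample data copied from the two rows shown in the question
df = pd.DataFrame({
    "id": ["id1", "id2"],
    "cosinesin1": [0.2, 0.1],
    "cosinesin2": [0.5, 0.2],
    "cosinesin3": [0.6, 0.03],
    "cosinesin4": [0.8, 0.8],
    "cosinesin5": [0.7, 0.2],
    "cosinesin6": [0.8, 0.45],
})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]

above = df[cos_cols] > 0.5    # Boolean DataFrame, one flag per cell
n_above = above.sum(axis=1)   # per-row count of True values
df["Similar"] = n_above >= 3  # True where at least 3 columns exceed 0.5

print(n_above.tolist())        # [4, 1]
print(df["Similar"].tolist())  # [True, False]
```

Note that the comparison is strict, so the 0.5 in cosinesin2 of the first row does not count. If a value of exactly 0.5 should count as similar, `>=` could be used in place of `>`.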
Answered By: Arne