How to find the most common phrases from a list in Python?

Question:

I struggle with the following:
I have an input list:

input_list = [
    "Beneficiile pozitive ale productName:",
    "Care sunt ingredientele product name?",
    "Ce este product name ddd?",
    "Ce face product name decât orice altă îngrijire a pielii?",
    "Cum funcționează?",
    "product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra",
    "Offer Nutra",
    "Offering Top Nutritional",
    "În cazul în care pentru a cumpara Crema product name?",
]

I need to process each list item and get the most frequent phrase or word from the whole list of items.

There are some answers here that show how to count words, but in this case I need a two-word phrase to be returned.

Expected output:

The returned output should be 'product name', because it occurs in 5 list items.

Again: I don't want to count words, but phrases that occur multiple times across the list items.

Asked By: Michal


Answers:

This is an ugly task: it boils down to starting with two-word phrases and counting them, then three-word phrases, and so on, until finally each whole input element, taken as a single phrase, is counted. (There may be additional criteria for what counts as a phrase, so some candidates could be skipped.) Per phrase, the runtime is proportional to the square of the number of input words. It becomes even worse if you are free to ignore word boundaries, i.e. if "productName" in the first element should also be counted as "product name", since then you may need a dictionary to identify valid substrings (which can easily produce huge numbers of false hits).
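
As an illustration of that brute-force idea, a minimal sketch (assuming plain whitespace tokenization, the input_list from the question, and ignoring the word-boundary complication; the helper name most_common_phrase is just for illustration): normalize each item, enumerate every contiguous phrase of two or more words, and count how many items each phrase appears in.

from collections import Counter
from string import punctuation

def most_common_phrase(items):
    # strip punctuation and lowercase so "name?" matches "name"
    normalize = str.maketrans('', '', punctuation)
    counts = Counter()
    for item in items:
        words = item.translate(normalize).lower().split()
        # every contiguous phrase of two or more words, counted once per item
        phrases = {
            ' '.join(words[i:j])
            for i in range(len(words) - 1)
            for j in range(i + 2, len(words) + 1)
        }
        counts.update(phrases)
    return counts.most_common(1)[0] if counts else (None, 0)

print(most_common_phrase(input_list))  # ('product name', 5) for the list above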

Answered By: guidot

OK, I somehow figured out how to solve it. It's best to use an NLP library like NLTK:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

input_list = [
    'Beneficiile pozitive ale productName:',
    'Care sunt ingredientele product name?',
    'Ce este product name ddd?',
    'Ce face product name decât orice altă îngrijire a pielii?',
    'Cum funcționează?',
    'product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra',
    'Offer Nutra',
    'Offering Top Nutritional',
    'În cazul în care pentru a cumpara Crema product name?',
]

def func(some_list):
    outer_list = []
    for i in some_list:
        # tokenize each item and collect its bigrams (two-word sequences)
        tokens = nltk.wordpunct_tokenize(i)
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(1)  # minimum frequency of 1, so every bigram is kept
        outer_list.append(finder.nbest(bigram_measures.pmi, 10))

    # flatten the per-item bigram lists and pick the bigram that appears
    # in the most items
    flattened_list = [item for sublist in outer_list for item in sublist]
    frequency_distribution = nltk.FreqDist(flattened_list)
    most_common_element = frequency_distribution.max()

    return ' '.join(most_common_element)

print(func(input_list))


Answered By: Michal

This is my implementation; it's a bit tricky, but it works anyway:

from string import punctuation

input_list = [
    "Beneficiile pozitive ale productName:",
    "Care sunt ingredientele product name?",
    "Ce este product name ddd?",
    "Ce face product name decât orice altă îngrijire a pielii?",
    "Cum funcționează?",
    "product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra",
    "Offer Nutra",
    "Offering Top Nutritional",
    "În cazul în care pentru a cumpara Crema product name?"]

most_common_phrase = ''
duplicates_num = 0

f = lambda x: x.translate(str.maketrans('','',punctuation)).lower() # removes punctuation
phrases = f(' 000 '.join(input_list)) # adds dividers

for i in input_list:
    phrase = f(i).split()
    # build every contiguous sub-phrase of two or more words and keep the one
    # with the highest number of occurrences in the joined text
    for j in range(len(phrase) - 1):
        for y in range(j + 2, len(phrase) + 1):
            phrase_comb = ' '.join(phrase[j:y])
            if (n := phrases.count(phrase_comb)) > duplicates_num:
                duplicates_num = n
                most_common_phrase = phrase_comb

print(f'{most_common_phrase = }\n{duplicates_num = }')

Output:

most_common_phrase = 'product name'
duplicates_num = 5
Answered By: SergFSM