Python: finding longest version of names

Question:

I am using Python to parse a news article and obtain the set of people's names contained within it.
Currently every Named Entity classified as a PERson (by Stanford’s Stanza NLP library) gets added to a set as follows:

maxnames = set()  # initialize an empty set for PER references
for entity in doc.entities:
    if entity.type == "PER":
        # a set ignores duplicates, so no membership check is needed
        maxnames.add(entity.text)

Here is a real example I end up with:

{'von der Leyen', 'Meloni', 'Lars Danielsson', 'Filippo Mannino', 'Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}

What I’m trying to achieve is to keep only the most complete version of each name. In the above example this should become:

{'Lars Danielsson', 'Filippo Mannino', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}

because in the first set:

  • ‘von der Leyen’ should be suppressed by ‘Ursula von der Leyen’
  • ‘Meloni’ suppressed by ‘Giorgia Meloni’
    and so on.

This is what I have tried so far, but I am getting lost 🙁 Can you please spot the error?

def longestname(reference: str, nameset: set[str]) -> set[str]:
    """
    Return the longest name in a set of names
    """
    for name in nameset.copy():
        lenname = len(name)
        lenref = len(reference)
        if lenref < lenname:
            if reference in name:
                nameset.add(name)
            else:
                nameset.remove(name)
    nameset.add(reference)
    return nameset


nameset = set()
nameset = longestname("von der Leyen", nameset)
nameset = longestname("Meloni", nameset)
nameset = longestname("Lars Danielsson", nameset)
nameset = longestname("Lars", nameset)
nameset = longestname("Giorgia Meloni", nameset)
nameset = longestname("Ursula von der Leyen", nameset)
nameset = longestname("Giorgia", nameset)

print(nameset)
# should contain exactly: 
# {'Lars Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen'}
Asked By: Robert Alexander

Answers:

This isn’t the most efficient solution (O(N^2)), but if the number of names isn’t huge I don’t think striving for maximum efficiency is that important.

>>> names = {'von der Leyen', 'Meloni', 'Lars Danielsson', 'Filippo Mannino', 'Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}
>>> {name for name in names if not any(
...     name in other and name != other for other in names
... )}
{'Matteo Piantedosi', 'Lars Danielsson', 'Ursula von der Leyen', 'Filippo Mannino', 'Lamberto Giannini', 'Giorgia Meloni'}
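
As for the error in the question's longestname: the else branch removes every longer name that does not contain the reference (so adding "Meloni" throws away "von der Leyen"), the reference is then added even when a longer name already covers it, and shorter names that the reference covers are never removed. Below is a minimal sketch of one possible fix. It keeps the question's signature and uses the same plain substring test as the comprehension above, so it assumes substring containment is an acceptable definition of "same person".

```python
def longestname(reference: str, nameset: set[str]) -> set[str]:
    """
    Add reference to nameset unless a stored name already contains it,
    and drop any stored names that reference itself contains.
    """
    # A stored name (possibly identical) already covers the reference:
    # leave the set unchanged.
    if any(reference in name for name in nameset):
        return nameset
    # Otherwise the reference supersedes every stored name it contains.
    nameset -= {name for name in nameset if name in reference}
    nameset.add(reference)
    return nameset


nameset = set()
for ref in ["von der Leyen", "Meloni", "Lars Danielsson", "Lars",
            "Giorgia Meloni", "Ursula von der Leyen", "Giorgia"]:
    nameset = longestname(ref, nameset)

print(nameset)
# contains 'Lars Danielsson', 'Giorgia Meloni' and 'Ursula von der Leyen'
# (a set is unordered, so the printed order may differ)
```

Each call still scans the whole set, so feeding N names through it is O(N^2) overall, the same as the comprehension.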

A more efficient solution might involve building a dictionary keyed on space-separated words, so you can narrow down the possible set of matches instead of doing an O(N) search each time. However, this gets a little tricky if you have overlaps (say you have "Jean-Claude Van Damme" and "Dick Van Dyke" both in the same article), so I leave figuring that out as an exercise for the reader.
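
One possible take on that word-keyed dictionary, as a rough sketch rather than a definitive implementation. The function name keep_longest and the helper contains_words are made up for this illustration. It also assumes matching on whole words is what you want, which differs slightly from the substring test above: "Lars" would not count as part of "Larsson". The worst case is still O(N^2) if many names share the same words, but typically each name is only compared against the few names that share all of its words.

```python
from collections import defaultdict


def contains_words(longer: str, shorter: str) -> bool:
    """True if shorter appears in longer as a run of whole words."""
    # Padding with spaces prevents matches that start or end mid-word.
    return f" {' '.join(shorter.split())} " in f" {' '.join(longer.split())} "


def keep_longest(names: set[str]) -> set[str]:
    # Map every word to the set of names that contain it.
    index: dict[str, set[str]] = defaultdict(set)
    for name in names:
        for word in name.split():
            index[word].add(name)

    result = set()
    for name in names:
        words = name.split()
        if not words:
            continue  # skip empty or whitespace-only entries
        # Only names that contain *all* of this name's words can supersede it.
        candidates = set.intersection(*(index[word] for word in words))
        if not any(other != name and contains_words(other, name)
                   for other in candidates):
            result.add(name)
    return result


names = {'von der Leyen', 'Meloni', 'Lars Danielsson', 'Filippo Mannino',
         'Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen',
         'Matteo Piantedosi', 'Lamberto Giannini'}
print(keep_longest(names))
```

With the overlap example, "Van Damme" is only compared against names that contain both "Van" and "Damme", so "Dick Van Dyke" is never considered as a match for it.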

Answered By: Samwise