Define a capture regex to recognize compound personal names joined by a connector

Question:

import re

def register_new_persons_names_to_identify_in_inputs(input_text):

    #Cases of compound human names:
    name_capture_pattern = r"(^[A-Z](?:\w+)\s*(?:del|de\s*el|de)\s*^[A-Z](?:\w+))?"
    regex_pattern = name_capture_pattern + r"\s*(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

    n0 = re.search(regex_pattern, input_text) #case-sensitive (distinguishes upper and lower case)

    if n0:
        word, = n0.groups()
        if(word == None or word == "" or word == " "): print("I think there was a problem, and although I thought you were giving me a name, I couldn't interpret it!")
        else: print(repr(word))


input_text = "Creo que María del Pilar se trata de un nombre"   #example 1
input_text = "Estoy segura que María dEl Pilar se tRatA De uN nOmbre"   #example 2
input_text = "María del Carmen es un nombre viejo"    #example 3

register_new_persons_names_to_identify_in_inputs(input_text)

In Spanish, some names are compounds joined by a connector such as "del" in the middle. The connector is sometimes written in upper case, but it is usually left in lower case (even though it is part of the name).

When I define the regex so that each part of the name must start with a capital letter, it fails and does not capture the person's name correctly. I think the error in my capture regex is in the sub-pattern used for each part of the name, ^[A-Z](?:\w+)

I would also like to know whether the connector options (?:del|de\s*el|de) can be matched regardless of case while the rest of the pattern stays case-sensitive. Something like (?i:del|de\s*el|de)?-i, but without affecting the capture group (which is the person's name)

This is the correct output that I need:

'María del Pilar'    #for example 1
'María del Pilar'    #for example 2
'María del Carmen'   #for example 3

Answers:

A few things:

  • remove the 2 ^ anchors
  • add í to \w ([\wí], only added to the first name part, but it may need to be added to the second too)
  • add E to del (d[eE]l, or make the connector case-insensitive)
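The first two points can be checked quickly in Python. The snippet below is only an illustrative sketch with made-up test strings: it shows that a ^ placed in the middle of a pattern can never match there, and that for str patterns in Python 3 \w is Unicode-aware, so the explicit í is presumably only needed in regex flavors (or with re.ASCII) where \w is limited to ASCII.

```python
import re

# "^" only matches at the start of the string (without re.MULTILINE),
# so requiring it after other characters makes the match impossible.
anchored = re.search(r"\s^[A-Z]\w+", "que María")
unanchored = re.search(r"\s[A-Z]\w+", "que María")

# For str patterns in Python 3, \w is Unicode-aware and matches "í".
accent = re.fullmatch(r"\w+", "María")

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so "í" is rejected.
ascii_only = re.fullmatch(r"\w+", "María", flags=re.ASCII)

print(anchored, unanchored, accent, ascii_only)
```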

([A-Z](?:[\wí]+)\s*(?:d[Ee]l|de\s*el|de)\s*[A-Z](?:\w+))?

which I think can be further reduced to (remove ()):

([A-Z][\wí]+\s*(d[eE]l|de\s*el|de)\s*[A-Z]\w+)

https://regex101.com/r/bpKa12/1
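Putting this together with the question's code, a minimal sketch might look like the following. It assumes Python 3.6+ (for the scoped inline flag (?i:...)) and makes two changes that are not in the pattern above: the connector group is non-capturing, so the match has a single capture group, and the connector is matched case-insensitively with (?i:...) instead of d[eE]l. The function name extract_compound_name and the constant names are hypothetical.

```python
import re

# One name part: an uppercase ASCII letter followed by word characters.
# (Names starting with an accented capital would need a wider class.)
NAME_PART = r"[A-Z]\w+"

# Connector, case-insensitive only inside this group (Python 3.6+).
CONNECTOR = r"(?i:del|de\s*el|de)"

# Single capture group holding the whole compound name.
NAME_PATTERN = rf"({NAME_PART}\s+{CONNECTOR}\s+{NAME_PART})"

# Trailing phrase from the question, also matched case-insensitively.
TRAILER = r"\s*(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

PATTERN = re.compile(NAME_PATTERN + TRAILER)


def extract_compound_name(input_text):
    """Return the captured compound name, or None if nothing matches."""
    match = PATTERN.search(input_text)
    return match.group(1) if match else None


examples = [
    "Creo que María del Pilar se trata de un nombre",
    "Estoy segura que María dEl Pilar se tRatA De uN nOmbre",
    "María del Carmen es un nombre viejo",
]
for text in examples:
    print(repr(extract_compound_name(text)))
```

The capture keeps the casing of the input, so the second example yields 'María dEl Pilar'; normalizing the connector to lower case would be a separate post-processing step.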

Answered By: depperm