Python regex findall

Question:

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags.
Here is my attempt:

regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates']?

Asked By: Ignatius

Answers:

import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)

yields

['Barack Obama', 'Bill Gates']

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same
unicode as u'[[1P].+?[/P]]+?' except harder to read.
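
To check that claim, here is a minimal sketch. It uses plain u"..." literals rather than the ur prefix (which exists only in Python 2) so that it should also run on Python 3; the variable names escaped and plain are illustrative, not from the question.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# \u005B is "[", \u005D is "]" and \u002F is "/".
escaped = u"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
plain = u"[[1P].+?[/P]]+?"
print(escaped == plain)

# Nothing is escaped for the regex engine, so [[1P] and [/P] act as
# character classes and the result is the one shown in the question.
result = re.findall(plain, line)
print(result)
```

The second print reproduces the unwanted list from the question, which suggests the Unicode escapes were never the problem; the unescaped brackets were.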

The first bracketed group [[1P] tells re that any one of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P] (the ] that follows it is then just a literal bracket). That’s not what you want at all. So,

  • Remove the outer enclosing square brackets. (Also remove the
    stray 1 in front of P.)
  • To protect the literal brackets in [P], escape the brackets with a
    backslash: \[P\].
  • To return only the words inside the tags, place grouping parentheses
    around .+?.
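
The three steps above can be sketched as follows. This is only an illustration: the variable names are made up here, and raw strings without the u prefix are used so the snippet should run on both Python 2.7 and Python 3.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Unescaped, [P] is a character class that matches the single letter "P".
as_class = re.findall(r"[P]", line)

# Escaped, \[P\] matches the literal three-character tag "[P]".
as_literal = re.findall(r"\[P\]", line)

# With grouping parentheses, findall returns only the captured text.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(as_class)
print(as_literal)
print(names)
```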
Answered By: unutbu

Try this:

for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print(match.group(1))  # text captured between the tags
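
A runnable sketch of the same idea, assuming subject holds the sentence from the question. The names tag and spans are hypothetical, and the captured text is stripped because (.*?) keeps the spaces next to the tags.

```python
import re

subject = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# [^\]]* allows extra characters inside the opening tag before its "]".
tag = re.compile(r"\[P[^\]]*\](.*?)\[/P\]")

spans = []
for match in tag.finditer(subject):
    # start() and end() delimit the whole match, tags included;
    # group(1) is only the text between the tags.
    spans.append((match.start(), match.end(), match.group(1).strip()))

for start, end, name in spans:
    print(subject[start:end], "->", name)
```

Unlike findall, finditer yields match objects, which helps when the positions of the tags are needed as well as the text.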
Answered By: FailedDev

Your question is not 100% clear, but I’m assuming you want to find every piece of text inside [P][/P] tags:

>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']
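
As a rough illustration of what the optional \s? buys, the sketch below compares it with a pattern that requires a literal space after [P] and before [/P]. The sample strings padded and tight are invented for this example.

```python
import re

padded = "[P] Barack Obama [/P] and [P] Bill Gates [/P]"
tight = "[P]Barack Obama[/P] and [P]Bill Gates[/P]"

literal_spaces = r"\[P\] (.+?) \[/P\]"       # a space is required
optional_spaces = r"\[P\]\s?(.+?)\s?\[/P\]"  # whitespace is optional

padded_literal = re.findall(literal_spaces, padded)
tight_literal = re.findall(literal_spaces, tight)
padded_optional = re.findall(optional_spaces, padded)
tight_optional = re.findall(optional_spaces, tight)

print(padded_literal, tight_literal)
print(padded_optional, tight_optional)
```

The literal-space pattern finds nothing in the unpadded string, while the \s? version returns the same names for both inputs.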
Answered By: Blair

You can replace your pattern with:

regex = ur"\[P\]([\w\s]+)\[/P\]"
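
A small sketch of how this character-class pattern behaves, written without the ur prefix so it should also run on Python 3. The second input string is an invented example, not from the question.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
regex = r"\[P\]([\w\s]+)\[/P\]"

# [\w\s]+ also consumes the spaces next to the tags, so strip them.
raw = re.findall(regex, line)
names = [m.strip() for m in raw]
print(raw)
print(names)

# Anything outside \w and \s (here an apostrophe) prevents a match.
punctuated = re.findall(regex, "[P] Conan O'Brien [/P]")
print(punctuated)
```

Because the class excludes brackets, a match can never run past a closing tag, but names containing punctuation are silently skipped.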
Answered By: pram

Use this pattern,

pattern = r'\[P\].+?\[/P\]'
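
With no capturing group, findall returns each whole match, tags included, which is the first of the two outputs the question asks for. The sketch below also shows, as an assumption-free comparison, what happens if the lazy .+? is swapped for a greedy .+; the variable names are illustrative.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Lazy: each match stops at the nearest closing tag.
lazy = re.findall(r"\[P\].+?\[/P\]", line)

# Greedy: a single match runs from the first [P] to the last [/P].
greedy = re.findall(r"\[P\].+\[/P\]", line)

print(lazy)
print(greedy)
```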

Answered By: Sohn