Python regex findall
Question:
I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags.
Here is my attempt:
import re
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]'].
What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?
Answers:
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?', except harder to read.
The first bracketed group [[1P] tells re that any one of the characters in the list ['[', '1', 'P'] should match. Similarly, the second bracketed group [/P] matches either '/' or 'P', and the trailing ]+? then matches one or more literal ] characters. That's not what you want at all. So,
- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.
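To see the difference between the character classes and the escaped brackets, here is a small sketch that runs both patterns on the sample line from the question. It uses plain raw strings rather than the ur prefix so that it also runs on Python 3, where ur is not valid syntax. The names broken and fixed are only illustrative. The leading [ inside the first class is written as \[ purely to avoid a nested-set warning that some newer Python versions emit; it matches the same characters.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Equivalent to the pattern in the question: [\[1P] and [/P] are
# character classes, so they match single characters, not whole tags.
broken = re.findall(r"[\[1P].+?[/P]]+?", line)

# Escaped brackets match the literal tags; the parentheses capture
# only the text between them.
fixed = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(broken)
print(fixed)
```

The first list reproduces the unexpected output from the question, and the second contains just the two names.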
Try this:
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    # text inside the tags: match.group(1)
    print(match.group(1))
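As a rough illustration of what finditer gives you beyond findall, the sketch below collects the offsets of each tagged span along with the inner text. It assumes subject is the sample line from the question. The names tag and spans, and the strip() call that trims the spaces around each name, are my additions.

```python
import re

subject = ("President [P] Barack Obama [/P] met Microsoft founder "
           "[P] Bill Gates [/P], yesterday.")

# [^\]]* allows optional extra text inside the opening tag.
tag = re.compile(r"\[P[^\]]*\](.*?)\[/P\]")

spans = []
for match in tag.finditer(subject):
    # start() and end() index into subject; end() is exclusive.
    spans.append((match.start(), match.end(), match.group(1).strip()))

for start, end, name in spans:
    print(start, end, repr(subject[start:end]), repr(name))
```

Slicing subject with the stored offsets gives back the whole tagged span, which is handy if you later want to replace or highlight it.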
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']
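The question also asks for the variant that keeps the tags. A minimal sketch, relying on the fact that re.findall returns the whole match when the pattern has no capturing group (the names whole and inner are only illustrative):

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# No capturing group: findall returns each full match, tags included.
whole = re.findall(r"\[P\].+?\[/P\]", line)

# One capturing group: findall returns only the captured text.
inner = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)

print(whole)
print(inner)
```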
You can replace your pattern with
regex = ur"\[P\]([\w\s]+)\[/P\]"
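One thing to watch with the [\w\s]+ version is that the class also matches the spaces next to the tags, and that it does not match punctuation. A small sketch, written with a plain raw string so it runs on Python 3 as well. The second input string and the variable names are made up for illustration.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")
pattern = r"\[P\]([\w\s]+)\[/P\]"

# The captured text keeps the surrounding spaces.
names = re.findall(pattern, line)
stripped = [name.strip() for name in names]

# A name containing an apostrophe is not matched at all, because
# the class only allows word characters and whitespace.
with_apostrophe = re.findall(pattern, "[P] Conan O'Brien [/P]")

print(names)
print(stripped)
print(with_apostrophe)
```

If names may contain hyphens, apostrophes, or periods, the lazy .+? form from the other answers is probably the safer choice.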