Extract part of a regex match
Question:
I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?
Answers:
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
I’d think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
… assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However …
Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a of extra, and redundant work when various HTML, SGML and XML parsers are already in the standard libraries).
If you’re handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml … which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.
soup = BeatifulSoup(html_doc)
titleName = soup.title.name
The provided pieces of code do not cope with Exceptions
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns an empty string by default if the pattern has not been found, or the first match.
Note that starting in Python 3.8
, and the introduction of assignment expressions (PEP 572) (:=
operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and re-use it in the condition’s body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
title = match.group(1)
# hello
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>
. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title>
(which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<titles*>(.*?)</titles*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<titlen>with newline in tags</titlen>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newlinen in title</titlen>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser – not only, but also to handle non-standard use of HTML tags.
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=</title>)
works great. It will only match whats between parentheses so you don’t have to do the whole group thing.
I needed something to match package-0.0.1
(name, version) but want to reject an invalid version such as 0.0.010
.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?).){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
name, version = match.groups()
print(f'Name: {name}')
print(f'Version: {version}')
else:
raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
I’d think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
… assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However …
Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a of extra, and redundant work when various HTML, SGML and XML parsers are already in the standard libraries).
If you’re handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml … which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.
soup = BeatifulSoup(html_doc)
titleName = soup.title.name
The provided pieces of code do not cope with Exceptions
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns an empty string by default if the pattern has not been found, or the first match.
Note that starting in Python 3.8
, and the introduction of assignment expressions (PEP 572) (:=
operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and re-use it in the condition’s body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
title = match.group(1)
# hello
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>
. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title>
(which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<titles*>(.*?)</titles*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<titlen>with newline in tags</titlen>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newlinen in title</titlen>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser – not only, but also to handle non-standard use of HTML tags.
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=</title>)
works great. It will only match whats between parentheses so you don’t have to do the whole group thing.
I needed something to match package-0.0.1
(name, version) but want to reject an invalid version such as 0.0.010
.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?).){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
name, version = match.groups()
print(f'Name: {name}')
print(f'Version: {version}')
else:
raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1