Get words between specific words in a Python string
Question:
I'm trying to extract the text between two known substrings in a string.
Following the article "Find string between two substrings", I got it working in this case:
import re
s = 'asdf=5;iwantthis123jasd'
result = re.search(r'asdf=5;(.*)123jasd', s)
print(result.group(1))
But it fails on the string below.
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
result = re.search('<span class="discount-rate">(.*)</span>', s)
print(result.group(1))
I'm trying to extract '4%'. Every other case works, and I don't know why only this one fails.
Any help would be appreciated.
Answers:
Try this (mind the whitespace and newlines):
import re
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
# \s* consumes the newlines and spaces around the value, so (.*) only has to match "4%"
result = re.search(r'<span class="discount-rate">\s*(.*)\s*</span>', s)
print(result.group(1))
Use the re.DOTALL flag so that . also matches newlines:
result = re.search('<span class="discount-rate">(.*)</span>', s, re.DOTALL)
Documentation: https://docs.python.org/3/library/re.html
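With re.DOTALL the captured group includes the surrounding newlines, and a greedy .* would run to the last </span> if the string had more than one. A minimal sketch of one way to handle both, assuming the same sample string as in the question (the names match and discount are just illustrative):

```python
import re

s = '''<div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''

# re.DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first closing tag.
match = re.search(r'<span class="discount-rate">(.*?)</span>', s, re.DOTALL)

# re.search returns None when nothing matches, so guard before calling group().
discount = match.group(1).strip() if match else None
print(discount)
```

The strip() call removes the newlines that were captured along with the value.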
Your string contains newline characters, and by default . does not match a newline, so (.*) can never span the lines between the two tags.
Daniel's solution works.
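A small sketch of that behaviour, using a shortened hypothetical string rather than the one from the question:

```python
import re

text = "<b>\n4%\n</b>"

# Without flags, "." stops at every newline, so the pattern cannot reach "</b>".
default_match = re.search(r"<b>(.*)</b>", text)

# With re.DOTALL, "." matches newlines too and the whole span is captured.
dotall_match = re.search(r"<b>(.*)</b>", text, re.DOTALL)

print(default_match)                 # None
print(repr(dotall_match.group(1)))   # '\n4%\n'
```

This is also why the original code raised an error: re.search returned None, and None has no group() method.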
This is structured data, not just a string, so we can use a library like Beautiful Soup to help us simplify such tasks:
from bs4 import BeautifulSoup
s = ''' <div class="prod-origin-price ">
<span class="discount-rate">
4%
</span>
<span class="origin-price">'''
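# Name the parser explicitly; omitting it makes Beautiful Soup guess one and emit a warning
soup = BeautifulSoup(s, 'html.parser')
value = soup.find(class_='discount-rate').get_text(strip=True)
print(value)
# Output: 4%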