Good regex to extract price
Question:
I am trying to extract price from various currency values. Here are my sample input values:
でレンタル HD(高画質) ¥ 500
で購入 HD(高画質) ¥ 2,500
Buy SD £5.99
Buy SD £14.99
HD ausleihen EUR 3,99
HD kaufen EUR 11,99
Buy Movie HD $19.99
$1,200.84
How would I get this currency value into a float, for example 19.99? The regex I have so far is:
re.findall(r'[\d|,|.]+', s)[0].replace(',', '')
But it seems insufficient. What would be a better one?
Answers:
To match an amount in a string, whether the number comes before or after a currency word/symbol, you may use
(?:\b(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)\b|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])
See the regex demo. It includes the alternation
USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR
which matches the most common world currency codes, and the character class
[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0]
which matches any currency symbol (the equivalent of \p{Sc} in PCRE).
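Python's built-in re module does not support \p{Sc}, which is why the symbols are spelled out as an explicit character class. As a rough illustration of what the Sc (currency symbol) category means, here is a minimal sketch using the standard unicodedata module. The helper name is_currency_symbol is made up for this example and is not part of the answer's code.

```python
import unicodedata


def is_currency_symbol(ch: str) -> bool:
    """Return True if the single character ch is in Unicode category Sc.

    Hypothetical helper for illustration only.
    """
    return unicodedata.category(ch) == 'Sc'


# Check a few characters taken from the sample inputs, plus two non-symbols.
for ch in '$£¥€A5':
    print(repr(ch), is_currency_symbol(ch))
```

The result depends on the Unicode database shipped with your Python version, so very recently added symbols may not be recognized on older interpreters.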
In Python, you will need a bit of code to make it work as you need:
import re
texts = ['でレンタル HD(高画質) ¥ 500',
'で購入 HD(高画質) ¥ 2,500',
'Buy SD £5.99',
'Buy SD £14.99',
'HD ausleihen EUR 3,99',
'HD kaufen EUR 11,99',
'Buy Movie HD $19.99',
'$1,200.84'
]
# Currency codes, currency symbols (explicit equivalent of \p{Sc}), and the number itself
curword = r'(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)'
cursymbol = r'[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0]'
num = r'\d+(?:[.,]\d+)*'
# The amount is captured in group 1 when it follows the currency, in group 2 when it precedes it
pattern = re.compile(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
# print(pattern.pattern)  # uncomment to inspect the assembled regex
for text in texts:
    m = pattern.search(text)
    if m:
        result = m.group(1) or m.group(2)
        print(result)
See the Python demo. It prints
500
2,500
5.99
14.99
3,99
11,99
19.99
1,200.84
If you need to convert the resulting string to an int or float, you can also capture the currency word/symbol, then convert the decimal separator to the one you need and parse the result with int or float.
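The conversion step could look something like the sketch below. Note that it does not key off the captured currency as suggested above. It swaps in a simpler positional heuristic: when both separators occur, the last one is the decimal mark, and a lone separator followed by exactly three digits is assumed to be thousands grouping. The function name parse_amount is hypothetical, and the heuristic will misread genuinely ambiguous inputs such as 1.234.

```python
import re


def parse_amount(amount: str) -> float:
    """Convert a matched amount such as '1,200.84' or '3,99' to a float.

    Hypothetical helper; relies on a positional heuristic, not on the currency.
    """
    last_dot, last_comma = amount.rfind('.'), amount.rfind(',')
    decimal_sep = None
    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever appears last is the decimal mark.
        decimal_sep = '.' if last_dot > last_comma else ','
    elif last_dot != -1 or last_comma != -1:
        sep = '.' if last_dot != -1 else ','
        tail = amount.rsplit(sep, 1)[1]
        # Assumption: a repeated separator, or one followed by exactly three
        # digits, is thousands grouping; otherwise it is the decimal mark.
        if amount.count(sep) == 1 and len(tail) != 3:
            decimal_sep = sep
    if decimal_sep is None:
        # No decimal part: drop all grouping characters.
        return float(re.sub(r'[.,]', '', amount))
    whole, frac = amount.rsplit(decimal_sep, 1)
    return float(re.sub(r'[.,]', '', whole) + '.' + frac)


for s in ['500', '2,500', '5.99', '3,99', '1,200.84']:
    print(s, '->', parse_amount(s))
```

For real monetary work, decimal.Decimal is usually a safer target type than float, and a per-currency separator table would be more reliable than this guess.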