Good regex to extract price

Question:

I am trying to extract price from various currency values. Here are my sample input values:

でレンタル HD(高画質) ¥ 500
で購入  HD(高画質) ¥ 2,500
Buy SD £5.99
Buy SD £14.99
HD ausleihen EUR 3,99
HD kaufen EUR 11,99
Buy Movie HD $19.99
$1,200.84

How would I get each currency value into a float, for example 19.99? The regex I have so far is:

re.findall(r'[\d|,|.]+', s)[0].replace(',', '')

But it seems insufficient. What would be a better one?

Asked By: David542


Answers:

To match an amount in any currency, whether the number comes before or after the currency code or symbol, you may use

(?:\b(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)\b|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])

It includes the USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR alternation, which matches the most common world currency codes, and the character class [$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0], which matches any currency symbol (the equivalent of \p{Sc} in PCRE). The number is captured in group 1 when it follows the currency and in group 2 when it precedes it.

In Python, you will need a bit of code to make it work as you need:

import re
texts = ['でレンタル HD(高画質) ¥ 500',
    'で購入  HD(高画質) ¥ 2,500',
    'Buy SD £5.99',
    'Buy SD £14.99',
    'HD ausleihen EUR 3,99',
    'HD kaufen EUR 11,99',
    'Buy Movie HD $19.99',
    '$1,200.84'
]
curword = r'(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)'
cursymbol = r'[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0]'
num = r'\d+(?:[.,]\d+)*'
pattern = re.compile(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
print(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')

for text in texts:
    m = pattern.search(text)
    if m:
        result = m.group(1) or m.group(2)
        print(result)

After the pattern itself, it prints

500
2,500
5.99
14.99
3,99
11,99
19.99
1,200.84

If you need to convert the string result to an int or float, you can also capture the currency word or symbol, use it to decide which character is the decimal separator, replace that separator with a dot (dropping any thousands separators), and then parse the result with int or float.
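
As a rough sketch of that last step, the helper below turns a matched amount into a float. Note that it swaps in a simpler technique than the currency-based one described above: it guesses the decimal separator from the shape of the number alone. The name to_float and the rule "a lone separator followed by exactly three digits groups thousands" are assumptions made for illustration, not part of the original answer, and the rule will misread values such as 1.234 when three decimal places are intended.

```python
def to_float(amount: str) -> float:
    """Convert a matched amount such as '1,200.84' or '3,99' to a float.

    Hypothetical helper: the separator is guessed from the digits only.
    """
    last_dot = amount.rfind('.')
    last_comma = amount.rfind(',')

    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever appears last is the decimal mark.
        decimal_sep = '.' if last_dot > last_comma else ','
    elif last_dot != -1 or last_comma != -1:
        sep = '.' if last_dot != -1 else ','
        digits_after = len(amount) - amount.rfind(sep) - 1
        # Assumption: a separator that repeats, or is followed by exactly
        # three digits, groups thousands (so '2,500' is 2500, not 2.5).
        if amount.count(sep) > 1 or digits_after == 3:
            decimal_sep = None
        else:
            decimal_sep = sep
    else:
        # Plain digits, nothing to normalize.
        decimal_sep = None

    if decimal_sep is None:
        return float(amount.replace('.', '').replace(',', ''))

    thousands_sep = ',' if decimal_sep == '.' else '.'
    cleaned = amount.replace(thousands_sep, '').replace(decimal_sep, '.')
    return float(cleaned)


if __name__ == '__main__':
    for s in ['500', '2,500', '5.99', '14.99', '3,99', '11,99', '19.99', '1,200.84']:
        print(s, '->', to_float(s))
```

If the currency is known to use a fixed convention, passing the decimal separator in explicitly instead of guessing it would be the more reliable design.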

Answered By: Wiktor Stribiżew