Match two regex patterns multiple times

Question:

I have the string "Energy (kWh/m²)" and I want to get "Energy__kWh_m__", that is, replace all non-word characters and sub/superscript characters with an underscore.

I have the regex for replacing the non-word characters -> re.sub("[\W]", "_", column_name) and the regex for replacing the superscript numbers -> re.sub("[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]", "", column_name)

I have tried combining these into a single regex but have had no luck. Every time I try, I only get partial replacements like "Energy (KWh_m__" with a regex like ([²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]).*(\W)

Any help? Thanks!

Asked By: J.Doe


Answers:

To combine the two regular expressions you can use the | symbol, which means "or". Here’s an example of how you can use it:

import re

column_name = "Energy (kWh/m²)"

pattern = re.compile(r"[\W]|[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]")
result = pattern.sub("_", column_name)

print(result)

Alternative:

result = re.sub(r"[\W]|[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]", "_", column_name)

Output:

Energy__kWh_m__
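
If the same cleanup has to run over many column names, the combined pattern can be compiled once and wrapped in a small function. This is only a sketch: the helper name sanitize_column_name and the extra sample names are made up for illustration, and the character class is the one from the question.

```python
import re

# Combined pattern: any non-word character OR one of the listed superscripts.
# Compiled once so it can be reused for every column name.
_UNSAFE = re.compile(r"\W|[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]")


def sanitize_column_name(name: str) -> str:
    """Hypothetical helper: replace each matched character with an underscore."""
    return _UNSAFE.sub("_", name)


# Sample names are illustrative only.
columns = ["Energy (kWh/m²)", "Area (m²)", "plain_name"]
cleaned = [sanitize_column_name(c) for c in columns]
print(cleaned)
```

Names that contain only word characters, such as "plain_name", pass through unchanged.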
Answered By: Jamiu Shaibu

As per your current code, if you plan to remove the superscript chars and replace all other non-word chars with an underscore, you can use

re.sub(r'([²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ])|\W', lambda x: '' if x.group(1) else '_', text)
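
For comparison, the single call with a callback should give the same result as running the two original substitutions one after the other (remove the superscripts first, then replace what is left). A minimal sketch, assuming the sample string from the question; the variable names are illustrative:

```python
import re

text = "Energy (kWh/m²)"
superscripts = r"[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]"

# One pass: group 1 captures a superscript (removed); otherwise \W matched (underscore).
one_pass = re.sub(
    r"(" + superscripts + r")|\W",
    lambda m: "" if m.group(1) else "_",
    text,
)

# Two passes: strip the superscripts first, then replace remaining non-word characters.
two_pass = re.sub(r"\W", "_", re.sub(superscripts, "", text))

print(one_pass)
print(two_pass)
```

The one-pass version scans the string only once, which is the main reason to prefer it.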

If you plan to match all the non-word chars and the chars in the character class you have, just merge the two:

re.sub(r'[\W²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]', '_', text)

Note that \W already matches the symbols ⁺⁻⁼⁽⁾, so you can even shorten this to r'[\W²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹ⁿ]'.
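
To see why the superscript digits and letters still have to be listed explicitly, you can check each character against \W and \w. In Python 3, \w on str patterns is Unicode-aware, so characters like ² count as word characters. A quick sketch (variable names are illustrative):

```python
import re

superscripts = "²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ"

# Characters that \W already covers (symbols and brackets).
already_non_word = "".join(ch for ch in superscripts if re.fullmatch(r"\W", ch))

# Characters that \w treats as word characters, so they must stay in the class.
still_needed = "".join(ch for ch in superscripts if re.fullmatch(r"\w", ch))

print(already_non_word)
print(still_needed)
```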

See the Python demo:

import re
text = "Energy (kWh/m²)"
print(re.sub(r'([²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ])|\W', lambda x: '' if x.group(1) else '_', text)) # => Energy__kWh_m_
print(re.sub(r'[\W²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]', '_', text)) # => Energy__kWh_m__
Answered By: Wiktor Stribiżew