Regular expression to add spaces around all occurrences of a character within parentheses in Python

Question:

My goal is to separate dashes between parentheses. For example: "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "

I want the result to be

"Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective) "

My code is

re.sub(r'(.*)(\(.*)(-)(.*\))(.*)', r'\1\2 \3 \4\5', String)

However, this code only seems to separate the last dash in the last pair of parentheses in the string.

It gives the result "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British - Detective) "

Can anyone help? I searched here for an answer, but it seems to me that my code should work the way I expect.

Asked By: ElleryL


Answers:

This code achieves the task by dividing it into two separate parts instead of relying solely on a single regular expression.

  1. It searches the string target for portions that are enclosed by (...)
  2. It then searches and replaces each - with (SPACE)-(SPACE) in each found (...) using replacement functions

Here we have the solution code:

import re

def expand_dashes(target):
    """
    replace all "-" with " - " when they are within ()

    target [string] - the original string

    return [string] - the replaced string

    * note, this function does not work with nested ()
    """
    # lookbehind for "(" and lookahead for ")" so only the text between them is matched
    return re.sub(r'(?<=\()(.*?)(?=\))', __helper_func, target)

def __helper_func(match):
    """
    a helper function meant to process individual groups
    """
    return match.group(0).replace('-', ' - ')

Here we have the demo output:

>>> x = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective)"
>>> expand_dashes(x)
'Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)'
Answered By: AlanSTACK

Most quantifiers in most regular expression implementations (including Python's) are greedy: they match as much of the input string as possible. The first .* in your regex therefore matches all of your input string up to the very last set of parentheses; it "eats up" everything it can while still leaving enough for the whole regex to match. Inside that set of parentheses there is another .*, which likewise matches everything it can while still letting the rest of the regex succeed, so it swallows every dash in that final pair of parentheses except the last one. As a result, the substitution only inserts spaces around the final dash in the final set of parentheses. Your regex has a single non-overlapping match: it matches the entire input string, but the part of it that singles out a dash between parentheses only captures that final dash.
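To make the greedy behavior concrete, here is a small sketch that prints what each group of the question's pattern captures. It assumes the pattern was meant to have its literal parentheses escaped; the variable names are only illustrative.

```python
import re

text = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "

# The pattern from the question, assuming the literal parentheses are escaped.
pattern = r'(.*)(\(.*)(-)(.*\))(.*)'

match = re.match(pattern, text)
for number, captured in enumerate(match.groups(), start=1):
    print(number, repr(captured))

# 1 'Mr. Queen (The-American-Detective, EQ), Mr. Holmes '
# 2 '(The-British'
# 3 '-'
# 4 'Detective)'
# 5 ' '
```

Group 1 has already consumed the first pair of parentheses, and group 2 has consumed the first dash of the second pair, so only one dash is left for group 3.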

To fix this, you may need to reevaluate parts of your approach, because re.sub only substitutes non-overlapping matches. It would be difficult (I'm skeptical it would even be doable) to construct a single regex that matches an arbitrary number of dashes between a given pair of parentheses, with a corresponding replacement that puts spaces around each such dash, and still keeps those matches non-overlapping. A regex system capable of capturing every repetition of a group might manage it, but as far as I am aware Python's implementation only keeps the last capture of any repeated group ((<group>)*, (<group>)+, etc.) in a given match. Checking for the parentheses that surround a dash will, short of lookaround assertions, need to include them in the match, which means a regex that matches and replaces a single dash between parentheses will have overlapping matches wherever one pair of parentheses contains several dashes.
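Both limitations can be checked in a few lines. This is only an illustrative sketch; the two patterns are hypothetical ones written for the demonstration, not taken from the question.

```python
import re

# A repeated capturing group keeps only its last repetition.
m = re.fullmatch(r'(?:(\w+)-?)+', 'The-British-Detective')
last_word = m.group(1)
print(last_word)   # Detective

# A pattern that includes both parentheses consumes the whole pair in one
# match, so re.sub gets to replace only one dash per pair.
one_dash = re.sub(r'(\([^()]*?)-([^()]*\))', r'\1 - \2', '(The-British-Detective)')
print(one_dash)    # (The - British-Detective)
```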

An incremental approach, while a bit more complicated to implement, might be a better way to get the desired behavior. You could use re.split with an appropriate regex to split the string into parenthesized sections and the non-parenthetical sections between them, then run a replacement on only the parenthesized sections using a simpler regex like r'([^-]*)(-)([^-]*)' to match any dashes*, and finally reassemble the full string with the new parenthesized sections. This breaks the problem of individually capturing all dashes within parentheses, which is hard for a single regex to get right, into two easier problems: finding the parenthesized sections and individually capturing dashes.

*Note that this regex suggestion uses the character class [^-], meaning 'any character that is not -'. This avoids the issue displayed by your current regex of .* including dashes in what it matches and "eating up" all but the last one, because [^-]* is forced to stop matching when the next character is a -. Simply replacing .* with [^-]* in your current regex won't solve the issue, however, because re.sub won't replace matches that overlap, as multiple dashes within the same parentheses would in that case.
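One possible sketch of this split-and-reassemble approach follows. The function name space_dashes_in_parens and the splitting pattern are assumptions made for illustration, and like any simple pattern of this kind it does not handle nested parentheses.

```python
import re

def space_dashes_in_parens(text):
    """Put spaces around each dash that sits inside a (non-nested) pair of parentheses."""
    # Splitting on a capturing group keeps the parenthesized sections in the
    # result: they land at the odd indexes, the text between them at the even ones.
    pieces = re.split(r'(\([^()]*\))', text)
    for i in range(1, len(pieces), 2):
        # The simpler dash regex suggested above, applied to one section at a time.
        pieces[i] = re.sub(r'([^-]*)(-)([^-]*)', r'\1 \2 \3', pieces[i])
    return ''.join(pieces)

print(space_dashes_in_parens(
    "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective)"))
# Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)
```

Dashes outside parentheses are left alone because only the odd-indexed pieces are rewritten.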

Try a simpler way:

import re
s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
s = re.sub(r'(\w+)(-)(\w+)(-)(\w+)', r'\1 \2 \3 \4 \5', s)
print(s)

Outputs:

Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)

Here is how it works:

  • \w is essentially the same as [a-zA-Z0-9_] for ASCII text; that is, it matches
    lowercase letters, uppercase letters, digits, or the underscore.

  • - matches -.

So, this regex matches any string of the form something-anything-anotherthing and replaces it with something - anything - anotherthing. Note that it does not check for parentheses, and it only matches runs with exactly two dashes.

Answered By: Austin