How can I do multiple substitutions using regex?

Question:

I can use this code below to create a new file with the substitution of a with aa using regular expressions.

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

I was wondering: do I have to use this line, new_text = re.sub("a", "aa", text.read()), multiple times, substituting a different letter each time, in order to change more than one letter in my text?

That is, a -> aa, b -> bb and c -> cc.

Do I have to write that line for every letter I want to change, or is there an easier way, perhaps by creating a "dictionary" of translations? Should I put those letters into an array? I'm not sure how to use them if I do.

Asked By: Euridice01

Answers:

You can use a capturing group and a backreference:

re.sub(r"([characters])", r"\1\1", text.read())

Put characters that you want to double up in between []. For the case of lower case a, b, c:

re.sub(r"([abc])", r"\1\1", text.read())

In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is a positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
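As a quick illustration of the two backreference notations (a minimal sketch; the sample string and variable names are made up for this example), both forms give the same result, and the \g<n> form is also useful when a digit directly follows the reference:

```python
import re

sample = "abc xyz"  # hypothetical input, not taken from the question

# \1 refers to the first capturing group
numeric = re.sub(r"([abc])", r"\1\1", sample)

# \g<1> is the same reference written in the longer notation
explicit = re.sub(r"([abc])", r"\g<1>\g<1>", sample)

# \g<0> is the whole match, so no capturing group is needed
whole = re.sub(r"[abc]", r"\g<0>\g<0>", sample)

# \g<1>0 means "group 1 followed by a literal 0"; \10 would mean group 10
with_digit = re.sub(r"(a)", r"\g<1>0", sample)

print(numeric, explicit, whole, with_digit, sep="\n")
```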


If you want to double up all characters except new line:

re.sub(r"(.)", r"\1\1", text.read())

If you want to double up all characters (new line included):

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
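
To see the difference the re.S (DOTALL) flag makes, here is a small sketch on a made-up two-line string; the variable names are only for illustration:

```python
import re

sample = "ab\ncd"  # hypothetical input containing one newline

# Without re.S, "." does not match the newline, so it is left alone
without_flag = re.sub(r"(.)", r"\1\1", sample)

# With re.S, "." also matches the newline, so it gets doubled too
with_flag = re.sub(r"(.)", r"\1\1", sample, flags=re.S)

print(repr(without_flag))
print(repr(with_flag))
```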
Answered By: nhahtdh

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.

A dictionary of translations makes sense in this context. In fact, that’s how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )

import re 

def multiple_replace(dict, text):
  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

if __name__ == "__main__": 

  text = "Larry Wall is the creator of Perl"

  dict = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  } 

  print(multiple_replace(dict, text))

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. All the function does is build one large regex that is the alternation of all the (escaped) dictionary keys; regex.sub then calls the lambda for each match, and the lambda looks up the replacement in the dictionary.

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(trans, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I’ve actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

As @nhahtdh pointed out, one downside to this approach is that it assumes the keys are prefix-free: if one key is a prefix of another (for example "a" and "ab"), the alternation may match the shorter key first, and the longer key is then never replaced.
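
One common way to sidestep the prefix problem is to put longer keys first in the alternation. The following is only a sketch of that idea, not part of the cookbook recipe; the function name and the sample dictionary are made up:

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Sort keys so that longer keys come before their own prefixes;
    # regex alternation tries the alternatives from left to right.
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # mo.group(0) is the matched text, which is always one of the keys
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

# "a" is a prefix of "ab"; insertion order puts the shorter key first
trans = {"a": "1", "ab": "2"}
result = multiple_replace_longest_first(trans, "ab a")
print(result)
```

With the keys in their original order ("a" before "ab"), the same input would come out as "1b 1" instead.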

Answered By: Emmett Butler

Using tips from how to make a ‘stringy’ class, we can make an object identical to a string but for an extra sub method:

import re
class Substitutable(str):
  def __new__(cls, *args, **kwargs):
    newobj = str.__new__(cls, *args, **kwargs)
    newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
    return newobj

This allows the use of a builder-style pattern of chained calls, which looks nicer, but it works only for a predetermined number of substitutions. If you apply the substitutions in a loop, there is no point in creating an extra class anymore. E.g.

>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
Answered By: Leo

I found I had to modify Emmett J. Butler’s code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn’t working for me; using myDict.get() also provides the benefit of a default value if a key is not found.

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func',
}

replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))

oidDescriptionStr = replacementDictRegex.sub(lambda mo:OIDNameContraction.get(mo.group(1),mo.group(1)), oidDescriptionStr)
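
One situation where the .get() fallback matters is a case-insensitive match: the matched text can then differ from every key. This is an assumption made for illustration (the re.IGNORECASE flag, the dictionary and the sample text are not from the code above):

```python
import re

contractions = {"Function": "Func", "Control": "Ctrl"}  # hypothetical subset

# IGNORECASE lets "control" and "CONTROL" match, although neither is a key
regex = re.compile("(%s)" % "|".join(map(re.escape, contractions)), re.IGNORECASE)

def contract(text):
    # Fall back to the matched text itself when it is not a key;
    # contractions[mo.group(1)] would raise KeyError here instead
    return regex.sub(lambda mo: contractions.get(mo.group(1), mo.group(1)), text)

result = contract("Function control and CONTROL Function")
print(result)
```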
Answered By: Jordan McBain

You can use the pandas library and its replace function. Here is an example with five replacements:

import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})

to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']

print(df.text.replace(to_replace, replace_with, regex=True))

And the modified text is:

0    name is going to visit city in month
1                      I was born in date
2                 I will be there at time

Answered By: George Pipis

If you are dealing with files, here is a simple Python script for this problem.

import re


def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dictionary[mo.group(0)], text)


if __name__ == "__main__":

    dictionary = {
        "Wiley Online Library": "Wiley",
        "Chemical Society Reviews": "Chem. Soc. Rev.",
    }

    with open('LightBib.bib', 'r') as bib_read:
        with open('Abbreviated.bib', 'w') as bib_write:
            # Process the file line by line and write each abbreviated line
            for line in bib_read:
                bib_write.write(multiple_replace(dictionary, line))
Answered By: Hamid Zaree

None of the other solutions work if your patterns are themselves regexes.

For that, you need:

import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

Which can be used as:

>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'

Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
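
If you only need grouping (not capturing) inside a pattern, non-capturing groups (?:...) keep the group positions aligned with the pairs. A minimal sketch, repeating the function so that it runs on its own; the example patterns and input are made up:

```python
import re

def multi_sub(pairs, s):
    def repl_func(m):
        # exactly one top-level group matched; pick its replacement
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

# (?:ab)+ groups "ab" for repetition without adding a capturing group,
# so m.groups() still has one entry per pair
result = multi_sub([
    (r'(?:ab)+', 'X'),
    (r'c', 'Y'),
], "ababc")
print(result)
```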

Answered By: Eric

Based on Eric’s great answer, I came up with a more general solution that is capable of handling capturing groups and backreferences:

import re
from itertools import islice

def multiple_replace(s, repl_dict):
    groups_no = [re.compile(pattern).groups for pattern in repl_dict]

    def repl_func(m):
        all_groups = m.groups()

        # Use 'i' as the index within 'all_groups' and 'j' as the main
        # group index.
        i, j = 0, 0

        while i < len(all_groups) and all_groups[i] is None:
            # Skip the inner groups and move on to the next group.
            i += (groups_no[j] + 1)

            # Advance the main group index.
            j += 1

        # Extract the pattern and replacement at the j-th position.
        pattern, repl = next(islice(repl_dict.items(), j, j + 1))

        return re.sub(pattern, repl, all_groups[i])

    # Create the full pattern using the keys of 'repl_dict'.
    full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    return re.sub(full_pattern, repl_func, s)

Example. Calling the above with

s = 'This is a sample string. Which is getting replaced. 1234-5678.'

REPL_DICT = {
    r'(.*?)is(.*?)ing(.*?)ch': r'\3-\2-\1',
    r'replaced': 'REPLACED',
    r'\d\d((\d)(\d)-(\d)(\d))\d\d': r'__\5\4__\3\2__',
    r'get|ing': '!@#'
}

gives:

>>> multiple_replace(s, REPL_DICT)
'. Whi- is a sample str-Th is !@#t!@# REPLACED. __65__43__.'

For a more efficient solution, one can create a simple wrapper to precompute groups_no and full_pattern, e.g.

import re
from itertools import islice

class ReplWrapper:
    def __init__(self, repl_dict):
        self.repl_dict = repl_dict
        self.groups_no = [re.compile(pattern).groups for pattern in repl_dict]
        self.full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    def get_pattern_repl(self, pos):
        return next(islice(self.repl_dict.items(), pos, pos + 1))

    def multiple_replace(self, s):
        def repl_func(m):
            all_groups = m.groups()

            # Use 'i' as the index within 'all_groups' and 'j' as the main
            # group index.
            i, j = 0, 0

            while i < len(all_groups) and all_groups[i] is None:
                # Skip the inner groups and move on to the next group.
                i += (self.groups_no[j] + 1)

                # Advance the main group index.
                j += 1

            return re.sub(*self.get_pattern_repl(j), all_groups[i])

        return re.sub(self.full_pattern, repl_func, s)

Use it as follows:

>>> ReplWrapper(REPL_DICT).multiple_replace(s)
'. Whi- is a sample str-Th is !@#t!@# REPLACED. __65__43__.'
Answered By: Constantin Mateescu

I don't know why most of the solutions try to compose a single regex pattern instead of replacing multiple times. This answer is just for the sake of completeness.

That being said, the output of this approach is different from the output of the combined regex approach. Namely, each substitution operates on the result of the previous one, so earlier replacements can be changed again by later rules. However, the following function returns the same output as a call to unix sed would:

import re

def multi_replace(rules, data: str) -> str:
    ret = data
    for pattern, repl in rules:
        ret = re.sub(pattern, repl, ret)
    return ret

usage:

RULES = [
    (r'a', r'b'),
    (r'b', r'c'),
    (r'c', r'd'),
]
multi_replace(RULES, 'ab')  # output: dd

With the same input and rules, the other solutions will output "bc". Depending on your use case you may or may not want to replace strings consecutively. In my case I wanted to rebuild the sed behavior. Also, note that the order of rules matters. If you reverse the rule order, this example would also return "bc".
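
To make the difference concrete, here is a small side-by-side sketch. The single-pass function is a simplified stand-in for the combined-regex answers (it assumes the patterns are plain literal strings), and the function names are made up for this comparison:

```python
import re

rules = [(r'a', r'b'), (r'b', r'c'), (r'c', r'd')]

def sequential(rules, data):
    # each rule sees the output of the previous rule (sed-like)
    for pattern, repl in rules:
        data = re.sub(pattern, repl, data)
    return data

def single_pass(rules, data):
    # assumption: patterns are literal strings, so the matched text
    # can be used directly as a lookup key
    lookup = dict(rules)
    combined = re.compile("|".join(re.escape(p) for p, _ in rules))
    return combined.sub(lambda mo: lookup[mo.group(0)], data)

seq_result = sequential(rules, 'ab')
one_pass_result = single_pass(rules, 'ab')
reversed_result = sequential(list(reversed(rules)), 'ab')
print(seq_result, one_pass_result, reversed_result)
```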

This solution is faster than combining the patterns into a single regex (by a factor of 100). So, if your use-case allows it, you should prefer the repeated substitution method.
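
The actual speed difference will depend on your rules and data, so it is worth measuring on your own input. A rough timing sketch with timeit follows; the rules, the test string and the function names are made up, and the rules are chosen so that both approaches produce the same output:

```python
import re
import timeit

# hypothetical, non-interacting literal rules and sample data
rules = [(r"a", "1"), (r"b", "2"), (r"c", "3")]
data = "abc xyz " * 1000

def repeated(rules, data):
    # one re.sub pass per rule
    for pattern, repl in rules:
        data = re.sub(pattern, repl, data)
    return data

lookup = dict(rules)
combined = re.compile("|".join(pattern for pattern, _ in rules))

def single_pass(data):
    # one pass over the data, with a Python callback per match
    return combined.sub(lambda mo: lookup[mo.group(0)], data)

same_output = repeated(rules, data) == single_pass(data)

t_repeated = timeit.timeit(lambda: repeated(rules, data), number=20)
t_single = timeit.timeit(lambda: single_pass(data), number=20)
print(f"repeated: {t_repeated:.4f}s, single pass: {t_single:.4f}s")
```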


Of course, you can compile the regex patterns:

class Sed:
    def __init__(self, rules) -> None:
        self._rules = [(re.compile(pattern), sub) for pattern, sub in rules]

    def replace(self, data: str) -> str:
        ret = data
        for regx, repl in self._rules:
            ret = regx.sub(repl, ret)
        return ret
Answered By: lupdidup