Regex in Python: is it possible to get the match, replacement, and final string?

Question:

To do a regex substitution, you give it three things:

  • The match pattern
  • The replacement pattern
  • The original string

The regex engine then works out three things that are of interest to me:

  • The matched string
  • The replacement string
  • The final processed string

When using re.sub, the final string is what’s returned. But is it possible to access the other two things, the matched string and replacement string?

Here’s an example:

import re

orig = "This is the original string."
matchpat = r"(orig.*?l)"
replacepat = r"not the \1"

final = re.sub(matchpat, replacepat, orig)
print(final)
# This is the not the original string.

The matched string is "original" and the replacement string is "not the original". Is there a way to get them? I'm writing a script to search and replace in many files, and I want it to print what it's finding and replacing, without printing out the entire line.

Asked By: wch


Answers:

I looked at the documentation, and it seems you can pass a function as the replacement argument to re.sub:

import re

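def re_sub_verbose(pattern, replace, string):
  def substitute(match):
    # match.expand() fills backreferences such as \1 into the template
    replacement = match.expand(replace)
    print('Matched:', match.group(0))
    print('Replacing with:', replacement)

    return replacement

  # re.sub calls substitute() once for every match it finds
  result = re.sub(pattern, substitute, string)
  print('Final string:', result)

  return result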

And I get this output when running re_sub_verbose(r"(orig.*?l)", r"not the \1", "This is the original string."):

Matched: original
Replacing with: not the original
Final string: This is the not the original string.
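
One detail to watch when trying this: in an ordinary Python string literal, "\1" is an octal escape for the character with code 1, not a backreference, so the replacement template needs to be a raw string (or be written as "\\1"). A quick check, as a minimal sketch (the variable names plain and raw are only for illustration):

```python
import re

# In a normal string literal, "\1" is the single control character chr(1).
plain = "not the \1"
# In a raw string the backslash survives, so re sees a backreference to group 1.
raw = r"not the \1"

print(plain == "not the \x01")  # True
print(len(raw) - len(plain))    # 1: the raw string keeps the backslash

result = re.sub(r"(orig.*?l)", raw, "This is the original string.")
print(result)  # This is the not the original string.
```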
Answered By: Blender

class Replacement(object):

    def __init__(self, replacement):
        self.replacement = replacement
        self.matched = None
        self.replaced = None

    def __call__(self, match):
        self.matched = match.group(0)
        self.replaced = match.expand(self.replacement)
        return self.replaced

>>> import re
>>> repl = Replacement(r'not the \1')
>>> re.sub(r'(orig.*?l)', repl, 'This is the original string.')
'This is the not the original string.'
>>> repl.matched
'original'
>>> repl.replaced
'not the original'

Edit: as @F.J has pointed out, the above will remember only the last match/replacement. This version handles multiple occurrences:

class Replacement(object):

    def __init__(self, replacement):
        self.replacement = replacement
        self.occurrences = []

    def __call__(self, match):
        matched = match.group(0)
        replaced = match.expand(self.replacement)
        self.occurrences.append((matched, replaced))
        return replaced

>>> repl = Replacement(r'[\1]')
>>> re.sub(r'\s(\d)', repl, '1 2 3')
'1[2][3]'
>>> for matched, replaced in repl.occurrences:
...     print(matched, '=>', replaced)
...
 2 => [2]
 3 => [3]
Answered By: Jakub Roztocil
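
For the use case in the question (reporting what was found and replaced without printing whole lines), the callback idea from both answers can also be wrapped in a plain function that returns the final string together with the list of (matched, replacement) pairs. This is only a sketch: sub_with_report is a hypothetical helper name, not something the re module provides, and the sample lines are made up for illustration.

```python
import re


def sub_with_report(pattern, template, text):
    """Return (final_string, [(matched, replacement), ...]).

    Hypothetical helper for illustration; not part of the re module.
    """
    occurrences = []

    def record(match):
        # Expand backreferences in the template against this particular match.
        replacement = match.expand(template)
        occurrences.append((match.group(0), replacement))
        return replacement

    final = re.sub(pattern, record, text)
    return final, occurrences


# Made-up sample input standing in for the lines of a file.
lines = ["This is the original string.", "Nothing to change here."]
for lineno, line in enumerate(lines, start=1):
    final, occurrences = sub_with_report(r"(orig.*?l)", r"not the \1", line)
    for matched, replacement in occurrences:
        # Report only the changed fragment, not the whole line.
        print(f"line {lineno}: {matched!r} => {replacement!r}")
```

If only the number of substitutions is needed, re.subn already returns the new string together with a count, so no callback is required in that case.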