Regex in Python: is it possible to get the match, replacement, and final string?
Question:
A regex substitution takes three inputs:
- The match pattern
- The replacement pattern
- The original string
The regex engine produces three things that are of interest to me:
- The matched string
- The replacement string
- The final processed string
When using re.sub, the final string is what’s returned. But is it possible to access the other two things, the matched string and the replacement string?
Here’s an example:
import re

orig = "This is the original string."
matchpat = r"(orig.*?l)"
replacepat = r"not the \1"  # raw string, so \1 is a backreference and not the character \x01
final = re.sub(matchpat, replacepat, orig)
print(final)
# This is the not the original string.
The matched string is "original" and the replacement string is "not the original". Is there a way to get them? I’m writing a script to search and replace in many files, and I want it to print what it’s finding and replacing, without printing out the entire line.
Answers:
I looked at the documentation, and re.sub accepts a function as the replacement argument. The function is called with each match object and must return the replacement string:
import re

def re_sub_verbose(pattern, replace, string):
    def substitute(match):
        # match.expand() fills backreferences such as \1 in the replacement template
        replaced = match.expand(replace)
        print('Matched:', match.group(0))
        print('Replacing with:', replaced)
        return replaced

    result = re.sub(pattern, substitute, string)
    print('Final string:', result)
    return result
And I get this output when running re_sub_verbose(r"(orig.*?l)", r"not the \1", "This is the original string."):
Matched: original
Replacing with: not the original
Final string: This is the not the original string.
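If the script only needs to report what would change, a dry run is another option. The following is a minimal sketch and not part of the original answer; the helper name preview_substitutions is hypothetical. It combines re.finditer with match.expand to list each matched string and the text that would replace it, without building the final string.

```python
import re

def preview_substitutions(pattern, replace, string):
    """Return (matched, replacement) pairs without modifying the string.

    Hypothetical helper: the name and signature are assumptions for
    illustration, not something defined by the re module.
    """
    return [(m.group(0), m.expand(replace)) for m in re.finditer(pattern, string)]

pairs = preview_substitutions(r"(orig.*?l)", r"not the \1", "This is the original string.")
print(pairs)  # [('original', 'not the original')]
```

This assumes the replacement is a template string; if the replacement is itself a function, it would have to be called on each match instead of passed to match.expand.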
Another option is a callable object that remembers what it matched and what it substituted:
import re

class Replacement:
    def __init__(self, replacement):
        self.replacement = replacement
        self.matched = None
        self.replaced = None

    def __call__(self, match):
        # Called by re.sub for each match; keeps only the most recent one
        self.matched = match.group(0)
        self.replaced = match.expand(self.replacement)
        return self.replaced

>>> repl = Replacement(r'not the \1')
>>> re.sub(r'(orig.*?l)', repl, 'This is the original string.')
'This is the not the original string.'
>>> repl.matched
'original'
>>> repl.replaced
'not the original'
Edit: as @F.J has pointed out, the above will remember only the last match/replacement. This version handles multiple occurrences:
class Replacement:
    def __init__(self, replacement):
        self.replacement = replacement
        self.occurrences = []

    def __call__(self, match):
        matched = match.group(0)
        replaced = match.expand(self.replacement)
        # Record every (matched, replaced) pair in the order re.sub finds them
        self.occurrences.append((matched, replaced))
        return replaced

>>> repl = Replacement(r'[\1]')
>>> re.sub(r'\s(\d)', repl, '1 2 3')
'1[2][3]'
>>> for matched, replaced in repl.occurrences:
...     print(repr(matched), '=>', repr(replaced))
...
' 2' => '[2]'
' 3' => '[3]'
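For the use case in the question (reporting matches and replacements across many lines without printing whole lines), the same idea can be written as a closure that returns the log alongside the result. This is a rough sketch under stated assumptions: the name sub_with_log and the sample lines are made up for illustration, and reading from actual files is left out.

```python
import re

def sub_with_log(pattern, replace, string):
    """Return (final_string, [(matched, replacement), ...]).

    Hypothetical helper; the name and return shape are assumptions.
    """
    occurrences = []

    def substitute(match):
        # Expand backreferences in the template, then record the pair
        replaced = match.expand(replace)
        occurrences.append((match.group(0), replaced))
        return replaced

    return re.sub(pattern, substitute, string), occurrences

# Sample input standing in for the lines of a file
lines = ["This is the original string.", "Nothing to see here.", "An original idea."]
for number, line in enumerate(lines, start=1):
    final, occurrences = sub_with_log(r"(orig.*?l)", r"not the \1", line)
    for matched, replaced in occurrences:
        print(f"line {number}: {matched!r} => {replaced!r}")
```

Returning the list from the function, instead of keeping it on a long-lived object, means each call starts with an empty log, so results from different lines or files do not mix.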