Replace a substring with defined region and follow up variable region in Python

Question:

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.

What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.

Here is an example:

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>

So in this example the first part, AATCGA, is always constant and is used to locate the region to replace. It is followed by a "spacer", in this case a single character, but it could be more than one and the length needs to be specified. Finally comes the "tail", in this case four characters, but it could also be more or fewer. A set-up in this case would be:

to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character

With this information I need to do something like:

new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')

print(new_seq)

<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC

But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.

Any help would be much appreciated!

Asked By: Roelof Coertze


Answers:

I’m not quite sure I understand you fully. Nevertheless, you don’t seem to be too far off. Just use regex.

import re

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'

to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character

# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured 
# in group 2, and the next 4 unknown characters are captured in group 3 
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'

# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string. But making it a raw string means we
# don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'

new_seq = re.sub(pattern, repl, sequence)

print(new_seq)
print(new_seq == expected_new_seq)

Output:

<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True

Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
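Since the spacer and tail are defined only by their lengths, the same idea can be wrapped in a small function that takes the lengths as integers. This is a minimal sketch, not a definitive implementation: the name highlight_region and its parameters are made up for illustration, and re.escape is added on the assumption that to_find might one day contain characters that are special in a regex (it makes no difference for plain DNA letters).

```python
import re

def highlight_region(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len characters, and the following
    tail_len characters in <span> tags (hypothetical helper)."""
    # re.escape makes to_find match literally, even if it contains
    # regex metacharacters such as '.' or '*'.
    pattern = f'({re.escape(to_find)})(.{{{spacer_len}}})(.{{{tail_len}}})'
    repl = (r'<span color="blue">\1</span>'
            r'<span>\2</span>'
            r'<span color="green">\3</span>')
    return re.sub(pattern, repl, sequence)

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'

# Spacer of 1 character, tail of 4 characters (the case from the question).
print(highlight_region(sequence, 'AATCGA', 1, 4))

# Spacer of 3 characters, tail of 2 characters.
print(highlight_region(sequence, 'AATCGA', 3, 2))
```

Note that re.sub replaces every non-overlapping match, so if to_find occurs more than once, each occurrence (with its spacer and tail) is wrapped. Pass count=1 to re.sub if only the first one should be replaced.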

"Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."

How do you know when to replace it in reverse instead of forward? After all, all you're doing is matching a short string that is followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
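Until that is clarified, here is one possible sketch that leaves the choice of direction to the caller. It assumes that "reverse" means the layout tail, spacer, to_find (reading left to right), which is only a guess at what the question intends. The function name highlight_oriented and the reverse flag are hypothetical. Named groups are used so the same replacement pieces work in either order.

```python
import re

def highlight_oriented(sequence, to_find, spacer_len, tail_len, reverse=False):
    """Highlight to_find plus spacer and tail; with reverse=True the tail
    and spacer are taken from the left of to_find (assumed layout)."""
    anchor = f'(?P<anchor>{re.escape(to_find)})'
    spacer = f'(?P<spacer>.{{{spacer_len}}})'
    tail = f'(?P<tail>.{{{tail_len}}})'

    blue = r'<span color="blue">\g<anchor></span>'
    plain = r'<span>\g<spacer></span>'
    green = r'<span color="green">\g<tail></span>'

    if reverse:
        # Layout: tail, spacer, to_find
        pattern, repl = tail + spacer + anchor, green + plain + blue
    else:
        # Layout: to_find, spacer, tail
        pattern, repl = anchor + spacer + tail, blue + plain + green
    return re.sub(pattern, repl, sequence)

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'

# Forward: AATCGA on the left, then 1 spacer character, then 4 tail characters.
print(highlight_oriented(sequence, 'AATCGA', 1, 4))

# Reverse: CATGC on the right, preceded by 1 spacer character and 4 tail characters.
print(highlight_oriented(sequence, 'CATGC', 1, 4, reverse=True))
```

This does not resolve the ambiguity raised above. It only makes the direction an explicit input, so whoever calls the function has to decide which orientation applies.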

Answered By: GordonAitchJay