Regex for finding a sequence

Question:

I have a string consisting of Latin letters and I need a regular expression that finds the sequences that do not contain [ABC][ABC], i.e. two consecutive letters from the set A, B, C. For example, for the string "AGSHDGSGBCHAHSNABHJDKOCA" the matches should be "AGSHDGSGB", "CHAHSNA", "BHJDKOC", "A".

I tried using (?!.*[ABC][ABC]).+ but when I ran this code

import re

pattern = r'(?!.*[ABC][ABC]).+'
text = 'AGSHDGSGBCHAHSNABHJDKOCA'
matches = re.findall(pattern, text)
print(matches)

it only output ['A'], but I would like the output to be "AGSHDGSGB", "CHAHSNA", "BHJDKOC", "A".

Asked By: ganspuzzles

Answers:

Use re.split with the regex (?<=[ABC])(?=[ABC]). It splits the text exactly the way you want.

>>> re.split(r'(?<=[ABC])(?=[ABC])','AGSHDGSGBCHAHSNABHJDKOCA')
# ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']

This regex matches the zero-width position between one letter of the set {A,B,C} and the next, using a lookbehind and a lookahead. No characters are consumed, so nothing is removed from the resulting pieces.
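As a rough sketch, the same split can be wrapped in a small function so the letter set is not hard-coded. The name split_runs and its letters parameter are made up for illustration, the set is assumed to be non-empty, and splitting on an empty match assumes Python 3.7 or later.

```python
import re

def split_runs(text, letters="ABC"):
    # Hypothetical helper: cut `text` between any two adjacent
    # characters that both belong to `letters` (assumed non-empty).
    cls = "[" + re.escape(letters) + "]"
    # Lookbehind + lookahead: a zero-width boundary, nothing is consumed.
    boundary = "(?<=" + cls + ")(?=" + cls + ")"
    return re.split(boundary, text)

print(split_runs("AGSHDGSGBCHAHSNABHJDKOCA"))  # ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']
print(split_runs("AAB"))                       # ['A', 'A', 'B']
print(split_runs("XYZ"))                       # ['XYZ']
```

A text with no two adjacent letters from the set comes back as a single piece.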

EDIT:
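Where splitting on an empty (zero-width) match is not supported (for example, re.split in Python before 3.7), you can use findall with the pattern .*?(?:[ABC](?=[ABC]|$)|$) instead. Because this pattern can also match the empty string at the very end of the text, findall returns a trailing '' that you may want to filter out.

>>> re.findall(r'.*?(?:[ABC](?=[ABC]|$)|$)', 'AGSHDGSGBCHAHSNABHJDKOCA')
# ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A', '']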
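To sanity-check the findall pattern, it can be compared against a plain loop that cuts wherever two adjacent characters are both in the set. This is only a sketch: split_plain is a made-up name, and the list comprehension drops empty matches because the pattern can also match the empty string at the end of the text.

```python
import re

ABC = set("ABC")

def split_plain(text):
    # Hypothetical reference implementation without regex:
    # start a new piece whenever two neighbours are both in ABC.
    pieces, start = [], 0
    for i in range(1, len(text)):
        if text[i - 1] in ABC and text[i] in ABC:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

text = 'AGSHDGSGBCHAHSNABHJDKOCA'
# Keep only non-empty matches from findall.
found = [m for m in re.findall(r'.*?(?:[ABC](?=[ABC]|$)|$)', text) if m]
print(found)                       # ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']
print(found == split_plain(text))  # True
```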


Answered By: markalex