Regex for finding a sequence
Question:
I have a string consisting of Latin letters and I need a regular expression that will find every sequence that does not contain [ABC][ABC]. For example, for the string "AGSHDGSGBCHAHSNABHJDKOCA" the matches should be "AGSHDGSGB", "CHAHSNA", "BHJDKOC", "A".
I tried using (?!.*[ABC][ABC]).+, but when I ran this code:
import re

pattern = r'(?!.*[ABC][ABC]).+'
text = 'AGSHDGSGBCHAHSNABHJDKOCA'
matches = re.findall(pattern, text)
print(matches)
it only printed ['A'], but I would like the output to be "AGSHDGSGB", "CHAHSNA", "BHJDKOC", "A".
Answers:
Use re.split with the regex (?<=[ABC])(?=[ABC]). It will split the text exactly the way you want.
>>> re.split(r'(?<=[ABC])(?=[ABC])','AGSHDGSGBCHAHSNABHJDKOCA')
# ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']
This regex matches the transition between one letter of the set {A, B, C} and another, using a lookbehind and a lookahead. The match is zero-width, so no characters are consumed by the split.
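To make the zero-width transitions visible, here is a minimal sketch that lists the indices where the pattern matches and slices the string at those indices by hand. The helper names `split_points` and `split_at_boundaries` are hypothetical and only serve the illustration; they are not part of `re`.

```python
import re

# Zero-width boundary: a letter from {A, B, C} behind, another one ahead.
BOUNDARY = re.compile(r'(?<=[ABC])(?=[ABC])')

def split_points(text):
    """Return the indices at which BOUNDARY matches (hypothetical helper)."""
    return [m.start() for m in BOUNDARY.finditer(text)]

def split_at_boundaries(text):
    """Slice text at the boundary indices instead of calling re.split."""
    cuts = [0] + split_points(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

text = 'AGSHDGSGBCHAHSNABHJDKOCA'
print(split_points(text))         # [9, 16, 23]
print(split_at_boundaries(text))  # ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']
```

Slicing by hand like this does not rely on re.split accepting a pattern that matches the empty string.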
EDIT:
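For cases where splitting on an empty match is not allowed (re.split only supports it since Python 3.7), you can also use findall with the pattern .*?(?:[ABC](?=[ABC]|$)|$).
>>> re.findall(r'.*?(?:[ABC](?=[ABC]|$)|$)', 'AGSHDGSGBCHAHSNABHJDKOCA')
# ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A', '']
Note the trailing empty string: the pattern also matches the empty string at the very end of the input, so filter out empty matches if you do not want them.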
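Because this pattern can itself match the empty string at the end of the input, a small wrapper that drops empty matches may be convenient. This is only a sketch; `find_chunks` is a hypothetical helper name, not something provided by `re`.

```python
import re

# Lazily consume characters up to a letter from {A, B, C} that is followed
# by another such letter or by the end of the string; otherwise run to the end.
CHUNK = re.compile(r'.*?(?:[ABC](?=[ABC]|$)|$)')

def find_chunks(text):
    """Return the non-empty matches of CHUNK in text (hypothetical helper)."""
    return [m for m in CHUNK.findall(text) if m]

print(find_chunks('AGSHDGSGBCHAHSNABHJDKOCA'))
# ['AGSHDGSGB', 'CHAHSNA', 'BHJDKOC', 'A']
```

Note that `.` does not match a newline by default, which is fine for a string of Latin letters but would need `re.DOTALL` for multi-line input.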