How to extract substrings from a string using a regular expression

Question:

I have a string s = "ATAATGCGTGGAATTATGACCGGAATC". I would like to extract all substrings starting with ATG and ending with GGA, so the results would be ATGCGTGGA and ATGACCGGA.

This is what I have tried so far, but it is not working. Thanks in advance for your help.


import re

s = "ATAATGCGTGGAATTATGACCGGAATC"
x = re.findall('^ATG.+GGA$', s)
print(x)  # prints [] instead of the two expected substrings
Asked By: ptalebic


Answers:

With ^ and $ you are anchoring the pattern to the start and end of the string (or of each line, if re.MULTILINE is set); don't do that if you want to find substrings. Also, regex quantifiers are "greedy" by default: they match the longest possible sequence.

You need to use +? for a non-greedy (aka lazy) match that matches the shortest sequences:

x = re.findall('ATG.+?GGA', s)
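
As a quick check, here is a minimal sketch comparing the greedy and lazy quantifiers on the string from the question. The variable names greedy and lazy are only illustrative.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Greedy: .+ grabs as much as possible, so the single match runs to the last GGA.
greedy = re.findall(r"ATG.+GGA", s)

# Lazy: .+? grabs as little as possible, so each match stops at the first GGA.
lazy = re.findall(r"ATG.+?GGA", s)

print(greedy)  # ['ATGCGTGGAATTATGACCGGA']
print(lazy)    # ['ATGCGTGGA', 'ATGACCGGA']
```

Note that .+? still requires at least one character between ATG and GGA.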
Answered By: vaizki

The symbols ^ and $ refer to the beginning and end of the whole string, not to the beginning and end of each substring.
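
To illustrate, here is a small sketch of how the anchored pattern behaves. The names pattern, whole and exact are made up for this example, and the second input is just one of the expected substrings used as a standalone string.

```python
import re

pattern = r"^ATG.+GGA$"

# The full string starts with "ATA", so ^ATG cannot match at position 0
# and nothing is found.
whole = re.findall(pattern, "ATAATGCGTGGAATTATGACCGGAATC")

# A string that itself starts with ATG and ends with GGA does match.
exact = re.findall(pattern, "ATGCGTGGA")

print(whole)  # []
print(exact)  # ['ATGCGTGGA']
```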

Just remove ^ and $ from your regexp: re.findall('ATG.+GGA', s).

In addition, you might want to add ? after the +, to stop at the first GGA found rather than the last: re.findall('ATG.+?GGA', s)

Refer to "Module re: regular expression syntax" in the official Python documentation for more information about ^, $ and ?.

Answered By: Stef