How to use Python re to match a string containing only several specific characters?

Question:

I want to search for DNA sequences in a file. A sequence contains only the four characters [ATGC].
I tried this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it matches every line that contains at least one of the characters A, T, G, or C.
How do I match only the lines that consist entirely of those four characters and nothing else?

Sorry for misdescribing my question. I’m not looking for an exact match of ATGC as a word, but for a string containing only the four characters A, T, C, and G.

Thanks

Asked By: WittWhite


Answers:

Currently your regex matches against any part of the line. By adding the ^ and $ anchors you force the regex to match the whole line, so the line must consist only of the four characters.

m=re.search('(^[ATGC]+$)',line_in_file) 
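
As a rough sketch of the difference the anchors make, the snippet below runs both patterns over a few made-up lines. The sample data and the names `lines`, `unanchored`, and `anchored` are assumptions for illustration; the question's actual file is not shown.

```python
import re

# Hypothetical sample lines standing in for the contents of the file.
lines = ["ATGCCATG", "STUPID", "GATTACA", "chr1 ACGT", "1234"]

# Unanchored: hits any line with at least one A, T, G, or C somewhere in it.
unanchored = [s for s in lines if re.search(r"([ATGC]+)", s)]

# Anchored: the whole line must consist of A, T, G, and C.
anchored = [s for s in lines if re.search(r"(^[ATGC]+$)", s)]

print(unanchored)  # ['ATGCCATG', 'STUPID', 'GATTACA', 'chr1 ACGT']
print(anchored)    # ['ATGCCATG', 'GATTACA']
```

Note that in Python `$` also matches just before a newline at the very end of the string, so a line read from a file as "ATGC\n" still passes the anchored pattern.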

From your clarification above:

If you want to match a sequence like AAAGGGCCCCCCT, with the letters in the order A, G, C, T, then the regex will be:

(A+G+C+T+)
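
A minimal sketch of how this ordered pattern might be used, assuming the whole string has to follow the A, G, C, T order. The helper name `is_ordered_agct` is hypothetical, and `re.fullmatch` is used here so that no explicit ^ and $ are needed.

```python
import re

# One or more A, then one or more G, then C, then T, in exactly that order.
ordered = re.compile(r"A+G+C+T+")

def is_ordered_agct(seq):
    # fullmatch succeeds only if the pattern covers the entire string.
    return ordered.fullmatch(seq) is not None

print(is_ordered_agct("AAAGGGCCCCCCT"))  # True
print(is_ordered_agct("ATGCCATG"))       # False: T follows A directly
```
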
Answered By: Sabuj Hassan

The square brackets in your search string tell the regex compiler to match any one of the letters in the set, not the full string. Remove the square brackets, and move the + outside your parentheses.

 m=re.search('(ATGC)+',a)

EDIT:
According to your comment, this won’t match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.

EDIT2:
To match “ATGCCATG” but not “STUPID”, try

re.search("[^ATGC]", s)

Then check for NO match, rather than a match.

The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.
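
One way to sketch this "search for a disallowed character and reject on a hit" idea is with a negated character class, `[^ATGC]`. The names `invalid_base` and `only_atgc` are made up for illustration, and rejecting the empty string is an extra assumption, since a string with no characters contains no disallowed ones either.

```python
import re

# [^ATGC] matches any single character that is NOT A, T, G, or C.
invalid_base = re.compile(r"[^ATGC]")

def only_atgc(seq):
    # Accept only non-empty strings in which no disallowed character is found.
    return bool(seq) and invalid_base.search(seq) is None

print(only_atgc("ATGCCATG"))  # True
print(only_atgc("STUPID"))    # False
```

With this approach a trailing newline counts as a disallowed character, so lines read from a file would need something like `line.rstrip("\n")` before the check.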

Answered By: Joan Smith

A slight modification:

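import re

def DNAcheck(dna):
    # Normalize to upper case so lower-case input is accepted too.
    y = dna.upper()
    print(y)
    # ^ and $ anchor the pattern, so every character must be A, T, G, or C.
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1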

If the entire sequence is composed only of A/T/G/C, the code above returns 2; otherwise it returns 1.
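
If a boolean is more convenient than the 2/1 return codes, a variant might look like the sketch below. The name `is_dna` is hypothetical, and stripping surrounding whitespace and ignoring case are assumptions about the input rather than anything stated in the question.

```python
import re

def is_dna(seq):
    """Return True if seq is non-empty and made up only of A, T, G, C (any case)."""
    # strip() drops surrounding whitespace such as the trailing newline of a file line.
    return re.fullmatch(r"[ATGC]+", seq.strip(), flags=re.IGNORECASE) is not None

print(is_dna("atgccatg"))  # True
print(is_dna("STUPID"))    # False
```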

Answered By: krishnan1