How to use Python re to match a string containing only several specific characters?

Question

I want to search for DNA sequences in a file. A sequence contains only the 4 characters [ATGC].
I tried this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for every line that contains at least 1 of the characters ATGC.
So how do I search for lines that contain only those 4 characters and no others?

Sorry for mis-describing my question. I’m not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters A, T, C and G.

Thanks

Asked By: WittWhite

Answer 1

Currently your regex matches any part of the line. Using the ^ and $ anchors, you can force the regex to match the whole line, so the line must consist only of those four characters.

m=re.search('(^[ATGC]+$)',line_in_file)
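As a minimal sketch of the whole-line check applied to several lines: `re.fullmatch` (Python 3.4+) does the same job as wrapping the pattern in ^ and $. The helper name `is_dna_line` and the sample lines are made up for illustration and are not from the original answer.

```python
import re

# Hypothetical helper and sample data, for illustration only.
DNA_RE = re.compile(r'[ATGC]+')

def is_dna_line(line):
    """Return True if the line (minus its trailing newline) contains only A, T, G, C."""
    return DNA_RE.fullmatch(line.rstrip('\n')) is not None

sample_lines = ['ATGCCATG\n', 'GATTACA and more\n', 'STUPID\n', '\n']

# Keep only the lines that consist entirely of the four bases.
dna_lines = [line.rstrip('\n') for line in sample_lines if is_dna_line(line)]
print(dna_lines)
```

Because the pattern uses +, an empty line is not treated as a sequence.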

From your clarification above:

If you want to match a sequence like AAAGGGCCCCCCT, where the letters appear in runs in the order A, G, C, T, then the regex will be:

(A+G+C+T+)
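A small sketch of how this ordered pattern behaves, assuming the whole string has to match (hence `re.fullmatch`). The name `ordered` is just an illustrative variable.

```python
import re

# One or more A, then G, then C, then T, in that order.
ordered = re.compile(r'A+G+C+T+')

# Runs appear in the order A, G, C, T, so this matches.
in_order = ordered.fullmatch('AAAGGGCCCCCCT') is not None

# Only the four letters, but not in A, G, C, T order, so this does not match.
out_of_order = ordered.fullmatch('ATGC') is not None

print(in_order, out_of_order)
```

Note that this is stricter than the question asks for: a string made only of A/T/G/C in any other order is rejected.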

Answered By: Sabuj Hassan

Answer 2

The square brackets in your search string tell the regex compiler to match any one of the letters in the set, not the full string. Remove the square brackets, and move the + outside your parens.

 m=re.search('(ATGC)+',a)

EDIT:
According to your comment, this won’t match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.

EDIT2:
To match “ATGCCATG” but not “STUPID”, try:

re.search("[^ATGC]", str)

Then check for a NOT match, rather than a match.

The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.
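The "look for a character outside the set, then invert the result" idea can be sketched with a negated character class, `[^ATGC]`, and `re.search`. The function name `only_atgc` is hypothetical.

```python
import re

def only_atgc(s):
    """Return True if s contains no character other than A, T, G, C."""
    # re.search finds the first offending character anywhere in s;
    # no hit means every character is in the allowed set.
    return re.search(r'[^ATGC]', s) is None

print(only_atgc('ATGCCATG'))
print(only_atgc('STUPID'))
```

One difference from the anchored `[ATGC]+` pattern: an empty string has no offending characters, so this check accepts it.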

Answered By: Joan Smith

Answer 3

A slight modification:

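import re

def DNAcheck(dna):
    # Normalize to upper case so lower-case input is accepted too.
    y = dna.upper()
    print(y)
    # Return 2 if the whole string consists only of A/T/G/C, else 1.
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1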

If the entire sequence is composed only of A/T/G/C, the code above returns 2; otherwise it returns 1.

Answered By: krishnan1
