Trying to create a sliding window that checks for repeats in a DNA sequence

Question:

I’m trying to write bioinformatics code that checks for certain repeats in a given string of nucleotides. The user inputs a pattern, and the program outputs how many times it is repeated, or even highlights where the repeats are. I’ve gotten a good start on it, but could use some help.

Below is my code so far.

while True:
    text = 'AGACGCCTGGGAACTGCGGCCGCGGGCTCGCGCTCCTCGCCAGGCCCTGCCGCCGGGCTGCCATCCTTGCCCTGCCATGTCTCGCCGGAAGCCTGCGTCGGGCGGCCTCGCTGCCTCCAGCTCAGCCCCTGCGAGGCAAGCGGTTTTGAGCCGATTCTTCCAGTCTACGGGAAGCCTGAAATCCACCTCCTCCTCCACAGGTGCAGCCGACCAGGTGGACCCTGGCGCTgcagcggctgcagcggccgcagcggccgcagcgCCCCCAGCGCCCCCAGCTCCCGCCTTCCCGCCCCAGCTGCCGCCGCACATA'
    print ("Input Pattern:")
    pattern = input("")


    def pattern_count(text, pattern):
        count = 0
        for i in range(len(text) - len(pattern) + 1):
            if text[i: i + len(pattern)] == pattern:
                count = count + 1
            return count


    print(pattern_count(text, pattern))

The issue is that I only get a match when the pattern I enter starts at the beginning of the sequence (e.g. AGA or AGAC). Any help or recommendations would be greatly appreciated. Thank you so much!

Asked By: ClarkThark

||

Answers:

Here is a modified version of your code that lets the user input a string of nucleotides and a pattern to search for. It then outputs the number of times the pattern appears in the string. Note that this code is case-sensitive, so "AGC" and "agc" will be treated as different patterns.

def pattern_count(text, pattern):
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i: i + len(pattern)] == pattern:
            count = count + 1
    return count

while True:
    print("Input the string of nucleotides:")
    text = input()

    print("Input the pattern to search for:")
    pattern = input()

    count = pattern_count(text, pattern)
    print("The pattern appears {} times in the string.".format(count))
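
If case should not matter (the sequence in the question mixes upper and lower case), one option is to normalise both strings before comparing. This is only a minimal sketch; the helper name pattern_count_ignore_case is made up for illustration and is not part of the original code.

```python
def pattern_count_ignore_case(text, pattern):
    # Hypothetical helper: upper-case both strings so "agc" matches "AGC".
    text = text.upper()
    pattern = pattern.upper()
    count = 0
    # Slide a window of len(pattern) over the text, one position at a time.
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count


print(pattern_count_ignore_case("ACGTacgt", "cg"))
```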

One potential simplification is to use the built-in str.count() method to count the number of times a pattern appears in a string. This avoids looping over the string and checking each substring manually. Note that count() only counts non-overlapping occurrences, so its result can differ from the loop above when matches overlap (which is common with DNA repeats). Here is how you could modify your code to use this method:

def pattern_count(text, pattern):
    return text.count(pattern)

while True:
    print("Input the string of nucleotides:")
    text = input()

    print("Input the pattern to search for:")
    pattern = input()

    count = pattern_count(text, pattern)
    print("The pattern appears {} times in the string.".format(count))
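
To see how the two approaches can differ on overlapping matches, here is a small comparison. It is only a sketch: sliding_count is a hypothetical name for the loop-based version, and the short test sequence is made up.

```python
def sliding_count(text, pattern):
    # Loop-based version: checks every start position, so overlaps are counted.
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count


text = "GCGCGCG"
pattern = "GCG"

print(sliding_count(text, pattern))  # matches start at 0, 2 and 4
print(text.count(pattern))           # non-overlapping: matches at 0 and 4 only
```

Which count is the right one depends on whether overlapping repeats should be treated as separate hits.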

Answered By: Cyzanfar

One possibility is to use re.findall:

import re
text = 'AGACGCCTGGGAACTGCGGCCGCGGGCTCGCGCTCCTCGCCAGGCCCTGCCGCCGGGCTGCCATCCTTGCCCTGCCATGTCTCGCCGGAAGCCTGCGTCGGGCGGCCTCGCTGCCTCCAGCTCAGCCCCTGCGAGGCAAGCGGTTTTGAGCCGATTCTTCCAGTCTACGGGAAGCCTGAAATCCACCTCCTCCTCCACAGGTGCAGCCGACCAGGTGGACCCTGGCGCTgcagcggctgcagcggccgcagcggccgcagcgCCCCCAGCGCCCCCAGCTCCCGCCTTCCCGCCCCAGCTGCCGCCGCACATA'
pattern = "CCT"
count = sum(1 for _ in re.findall(pattern, text))

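The sum(1 for ...) idiom is a common way to count the items an iterable yields. Since re.findall returns a list, len(re.findall(pattern, text)) gives the same result here. Note that re.findall treats the pattern as a regular expression and only finds non-overlapping matches. See e.g. this answer.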
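
If overlapping matches should be counted as well, one possible approach is a zero-width lookahead. The sketch below assumes the pattern should be matched literally (hence re.escape); count_overlapping is a hypothetical helper name.

```python
import re


def count_overlapping(text, pattern):
    # (?=...) is a lookahead: it matches at a position without consuming
    # characters, so every start position is tested, including overlaps.
    # re.escape makes sure characters such as "." are matched literally.
    regex = "(?=" + re.escape(pattern) + ")"
    return sum(1 for _ in re.finditer(regex, text))


print(count_overlapping("GCGCGCG", "GCG"))
print(len(re.findall("GCG", "GCGCGCG")))
```

Here sum(1 for ...) is a natural fit, because re.finditer returns an iterator rather than a list.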

Answered By: treuss

Here’s a fixed version of your code:

def pattern_count(text, pattern):
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i: i + len(pattern)] == pattern:
            count += 1
    return count


while True:
    text = 'AGACGCCTGGGAACTGCGGCCGCGGGCTCGCGCTCCTCGCCAGGCCCTGCCGCCGGGCTGCCATCCTTGCCCTGCCATGTCTCGCCGGAAGCCTGCGTCGGGCGGCCTCGCTGCCTCCAGCTCAGCCCCTGCGAGGCAAGCGGTTTTGAGCCGATTCTTCCAGTCTACGGGAAGCCTGAAATCCACCTCCTCCTCCACAGGTGCAGCCGACCAGGTGGACCCTGGCGCTgcagcggctgcagcggccgcagcggccgcagcgCCCCCAGCGCCCCCAGCTCCCGCCTTCCCGCCCCAGCTGCCGCCGCACATA'
    print("Input Pattern:")
    pattern = input("")

    print(pattern_count(text, pattern))

The issue with your code was that the return statement was indented one level too deep, placing it inside the for loop. The function therefore returned at the end of the first iteration, so only a match at the very start of the sequence was ever counted. I moved the return statement outside the for loop, so that it returns the count after all iterations have completed. I also moved the function definition out of the while loop and replaced count = count + 1 with count += 1, which is equivalent but more concise.
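
Since the question also asks about showing where the repeats are, the same sliding window can collect start positions instead of a count. This is a minimal sketch; pattern_positions is a hypothetical name and the short sequence is just the first 16 characters of the question's text.

```python
def pattern_positions(text, pattern):
    # Collect the 0-based start index of every match, overlaps included.
    positions = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            positions.append(i)
    return positions


print(pattern_positions("AGACGCCTGGGAACTG", "CTG"))  # [6, 13]
```

The number of matches is then simply len(pattern_positions(text, pattern)).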

Answered By: GAP2002