Randomly changing letters in a list of strings based on probability

Question:

Given the following

data = ['AAACGGGATT\n','CTGTGTCAGT\n','AATCTCTACT\n']

For every letter in a string, not including the newline (\n), if its probability is greater than 0.5 (i.e. there is a 50% chance for a change), I’d like to replace the letter with one randomly selected from the options (A, G, T, C), with the caveat that the replacement cannot be the same as the original.

This is what I’ve attempted thus far:

import random

def Applyerrors(string, string_length, probability):
    i=0
    while i < string_length:
        i = i + 1
        p = i/string_length
        if p > probability:
            new_var = string[i]
            options = ['A', 'G', 'T', 'C'] 
            [item.replace(new_var, '') for item in options]
            replacer = random.choice(options)
            [res.replace(new_var, replacer) for res in string]
        else:
            pass
        
# Testing
data_updated = [Applyerrors(unit, 10, 0.5) for unit in data]
data_updated

The result from this:

[None, None, None]

In addition to not getting the desired result, my probability doesn’t make sense as I’m hoping to achieve 50% overall change in the data_updated file.

Any insight would be greatly appreciated. Thanks

Asked By: newbzzs


Answers:

The Problem

There are a few problems that I can see right away.

  1. You are not returning anything in Applyerrors, so every value produced by [Applyerrors(unit, 10, 0.5) for unit in data] will be None.

  2. [res.replace(new_var, replacer) for res in string] targets every instance of the letter rather than only the one at index i, so even if a change happened 50% of the time, it could cover more than 50% of the data. The resulting list is also never assigned, so string itself is left unchanged.

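  3. [item.replace(new_var, '') for item in options] does not remove the letter from options: it builds a new list in which that letter becomes the empty string ('') and then discards it. Had the result been kept, replacer could pick '' as a replacement; as written, options is unchanged, so replacer can still pick the original letter.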

  4. You increment i before using it, so it will always skip the first character of the string.

  5. You don’t do a check to avoid changing the newline character.
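Points 2 and 3 both come down to two facts: Python strings are immutable, and a list comprehension builds a new list instead of modifying anything in place. A minimal sketch of this, reusing the names from the question (the names unchanged, kept and filtered are only illustrative):

```python
options = ['A', 'G', 'T', 'C']
new_var = 'A'

# str.replace returns a new string, and the comprehension builds a new list.
# Nothing is assigned here, so that list is simply thrown away.
[item.replace(new_var, '') for item in options]
unchanged = list(options)  # options still holds all four letters

# Even if the result is kept, 'A' turns into '' instead of being removed.
kept = [item.replace(new_var, '') for item in options]

# Filtering is one way to actually drop the original letter.
filtered = [item for item in options if item != new_var]
```

Here unchanged still contains 'A', kept contains an empty string in its place, and only filtered has three letters left.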


The Solution

import random

def Applyerrors(string, string_length, probability):
    i = 0
    while i < string_length:
        if string[i] == "\n":
            i = i + 1  # advance before skipping, otherwise the loop never ends
            continue
        p = i/string_length
        if p > probability:
            new_var = string[i]
            options = ['A', 'G', 'T', 'C']
            options.remove(new_var)  # the replacement must differ from the original
            replacer = random.choice(options)
            string = string[:i] + replacer + string[i+1:]  # replace only index i

        i = i + 1
    return string

  1. return string: when Applyerrors is done, it returns the edited string.

  2. string = string[:i] + replacer + string[i+1:] replaces just the character at index i. (string[:i] is everything from 0 up to, but not including, i. string[i+1:] is everything after i to the end of the string.)

  3. options.remove(new_var) removes the value from the list entirely rather than replacing it with an empty string.

  4. i = i + 1 was moved to the end of the loop so that i is 0 on the first iteration and the first character is included.

  5. if string[i] == "\n" adds a check to skip any newline characters. (continue skips the rest of the current iteration of the loop, so i is incremented first to keep the loop from repeating forever.)
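The slicing in point 2 can be tried on its own. The sketch below wraps it in a small helper; replace_at is a hypothetical name used only for illustration, not something from the question.

```python
def replace_at(text, index, new_char):
    """Return a copy of text with the character at index replaced (hypothetical helper)."""
    # text[:index] is everything before index, text[index + 1:] everything after it.
    return text[:index] + new_char + text[index + 1:]

original = "AAACGGGATT"
edited = replace_at(original, 3, "T")  # swaps the 'C' at index 3 for a 'T'
```

Because strings are immutable, original is left as it was and edited is a new string of the same length.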


The Solution Continued

While none of the following changes are necessary, I have recreated the function below to show some best practices.

def apply_errors(data, prob):
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data

  • for i in range(len(data)) removes the need to pass in the length of the string.
  • if current not in options or random.random() > prob
    • current not in options checks whether the value is absent from options (so it now ignores the newline and anything else that shouldn’t be changed).
    • random.random() > prob skips the rest of the iteration when the random number is greater than prob (random.random() returns a random float from 0 up to, but not including, 1). This makes the input probability represent the chance that a value is changed. The way you have it currently, the input probability would be the probability that a value is not changed.
  • random.random() > prob also gives every value the same probability of changing. The old version guarantees that the last 50% (when probability = 0.5) of the string will change.
  • string was renamed to data. It’s not great to have a variable named after a data type (even though strings are represented as str in Python).
  • The new function name apply_errors uses snake case, which is the preferred style for naming variables and functions in Python.
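One way to check that the probability now behaves as a per-character chance is to run the function on a long string and count how many positions changed. The sketch below repeats the function so it runs on its own; the seed, the string length, and the helper name fraction_changed are assumptions chosen for illustration.

```python
import random

def apply_errors(data, prob):
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data

def fraction_changed(before, after):
    """Fraction of positions whose character differs (hypothetical helper)."""
    changed = sum(a != b for a, b in zip(before, after))
    return changed / len(before)

random.seed(0)                # arbitrary seed, only to make the run repeatable
before = "ACGT" * 2500        # 10,000 bases
after = apply_errors(before, 0.5)
rate = fraction_changed(before, after)  # should land close to 0.5
```

Since a replacement is never equal to the original letter, every selected position counts as a change, so rate should approach prob as the string gets longer.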
Answered By: Zelkins

Here’s another solution. It doesn’t use the string length or track the iteration count: each character is appended to a new list, which is turned into a str with "".join(new_string). It preserves newline characters.

import random

def apply_errors(string, probability):
    new_string = []
    for nucleobase in string:
        # Leave newlines alone; change each base with the given probability
        if nucleobase != "\n" and probability > random.random():
            # Pick from the three bases that differ from the current one
            nucleobase = random.choice("ACGT".replace(nucleobase, ""))

        new_string.append(nucleobase)

    return "".join(new_string)
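A possible way to apply this to the whole list from the question is sketched below. It assumes the trailing character of each string is a newline ("\n") and repeats the function so the block runs on its own; the names all_changed and none_changed are illustrative only.

```python
import random

def apply_errors(string, probability):
    new_string = []
    for nucleobase in string:
        if nucleobase != "\n" and probability > random.random():
            nucleobase = random.choice("ACGT".replace(nucleobase, ""))
        new_string.append(nucleobase)
    return "".join(new_string)

# Assumption: each string ends with a newline character.
data = ['AAACGGGATT\n', 'CTGTGTCAGT\n', 'AATCTCTACT\n']

# random.random() is always below 1.0, so every base is replaced.
all_changed = [apply_errors(unit, 1.0) for unit in data]

# random.random() is never below 0.0, so nothing is replaced.
none_changed = [apply_errors(unit, 0.0) for unit in data]

# Each base has an independent 50% chance of being replaced.
data_updated = [apply_errors(unit, 0.5) for unit in data]
```

The two extreme probabilities make the behaviour easy to verify: with 1.0 every base differs from the original, with 0.0 the list comes back unchanged, and in all cases the trailing newline is kept.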
Answered By: GordonAitchJay