Randomly changing letters in list of string based on probability
Question:
Given the following
data = ['AAACGGGATT\n', 'CTGTGTCAGT\n', 'AATCTCTACT\n']
For every letter in a string, not including the newline (\n), if its probability is greater than 0.5 (i.e. there is a 50% chance of a change), I’d like to replace the letter with a randomly selected letter from the options (A, G, T, C), with the caveat that the replacement cannot be the same as the original.
This is what I’ve attempted thus far:
import random

def Applyerrors(string, string_length, probability):
    i=0
    while i < string_length:
        i = i + 1
        p = i/string_length
        if p > probability:
            new_var = string[i]
            options = ['A', 'G', 'T', 'C']
            [item.replace(new_var, '') for item in options]
            replacer = random.choice(options)
            [res.replace(new_var, replacer) for res in string]
        else:
            pass

# Testing
data_updated = [Applyerrors(unit, 10, 0.5) for unit in data]
data_updated
The result from this:
[None, None, None]
In addition to not getting the desired result, my probability doesn’t make sense as I’m hoping to achieve 50% overall change in the data_updated file.
Any insight would be greatly appreciated. Thanks
Answers:
The Problem
There are a few problems that I can see right away.
- You are not returning anything in Applyerrors, so every value produced by [Applyerrors(unit, 10, 0.5) for unit in data] will be None.
- When you do [res.replace(new_var, replacer) for res in string], the result is never stored, so string itself never changes. Even if it were stored, replace would swap every instance of that letter rather than only the one at index i, so a change would happen 50% of the time but could cover more than 50% of the data.
- When you do [item.replace(new_var, '') for item in options], the result is also discarded, so options still contains the original letter and replacer can pick it. Even if you kept the result, the letter would turn into the empty string ('') and could still be chosen, rather than being removed from the list of options.
- You increment i before using it, so it will always skip the first character of the string.
- You don’t do a check to avoid changing the newline character.
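The two points about replace come down to the same Python behaviour: str.replace returns a new string, and a list comprehension whose result is not assigned is simply thrown away. A small sketch of that, where the names unchanged, replaced and remaining are only for illustration:

```python
options = ['A', 'G', 'T', 'C']
new_var = 'A'

# The comprehension builds a new list, but nothing stores it,
# so options is left untouched.
[item.replace(new_var, '') for item in options]
unchanged = list(options)

# Even when the result is kept, 'A' becomes '' instead of disappearing.
replaced = [item.replace(new_var, '') for item in options]

# Filtering the letter out is what actually shrinks the choices.
remaining = [item for item in options if item != new_var]

print(unchanged)  # ['A', 'G', 'T', 'C']
print(replaced)   # ['', 'G', 'T', 'C']
print(remaining)  # ['G', 'T', 'C']
```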
The Solution
import random

def Applyerrors(string, string_length, probability):
    i = 0
    while i < string_length:
        if string[i] == "\n":
            i = i + 1  # advance before skipping, otherwise the loop never ends
            continue
        p = i/string_length
        if p > probability:
            new_var = string[i]
            options = ['A', 'G', 'T', 'C']
            options.remove(new_var)
            replacer = random.choice(options)
            # rebuild the string with only the character at index i replaced
            string = string[:i] + replacer + string[i+1:]
        i = i + 1
    return string
- return string makes Applyerrors return the edited string once the loop is done.
- string = string[:i] + replacer + string[i+1:] replaces just the character at index i. (string[:i] is everything from 0 up to, but not including, i; string[i+1:] is everything after i to the end of the string.)
- options.remove(new_var) removes the value from the list entirely rather than replacing it with an empty string.
- i = i + 1 was moved to the end of the loop so that i is 0 on the first iteration and the first character is included.
- if string[i] == "\n" adds a check to skip any newline characters. (continue skips the rest of the current iteration of the loop.)
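To make the slicing step concrete, here is a minimal sketch using the first string from the question, assuming its last character is a newline. The index and replacement letter are picked by hand for illustration, not produced by the function:

```python
string = 'AAACGGGATT\n'
i = 3            # index of the 'C'
replacer = 'T'   # hand-picked replacement for this example

before = string[:i]      # 'AAA'
after = string[i + 1:]   # 'GGGATT\n'
edited = before + replacer + after

print(repr(edited))  # 'AAATGGGATT\n'
```

Strings are immutable, so this builds a new string each time rather than editing the old one in place.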
The Solution Continued
While none of the following changes are necessary, I have recreated the function below to show some best practices.
def apply_errors(data, prob):
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data
- for i in range(len(data)) removes the need to pass in the length of the string.
- if current not in options or random.random() > prob combines two checks:
  - current not in options checks whether the value is missing from options (so it now ignores \n and anything else that shouldn’t be changed).
  - random.random() > prob skips the iteration when a random number is greater than the probability (random.random() returns a random float from 0 up to, but not including, 1). This makes the input probability the chance that a value is changed. The way you have it currently, a higher probability means fewer values are changed.
  - random.random() > prob also gives every position the same probability of changing. The old version always changes the same positions at the end of the string (roughly the last half when probability = 0.5).
- string was renamed to data. It’s not great to have a variable named after a data type (even though strings are represented as str in Python).
- The new function name apply_errors uses snake case, which is the preferred style for naming variables and functions in Python.
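Since the question asks for roughly 50% overall change, a quick empirical check may help. The sketch below repeats apply_errors so it runs on its own; the test string, the seed and the name fraction are assumptions made only for this illustration:

```python
import random

def apply_errors(data, prob):
    # Same approach as the function above, repeated so this runs on its own.
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data

random.seed(0)  # arbitrary seed, only to make the run repeatable
original = 'ACGT' * 2500 + '\n'   # 10,000 letters plus a newline
mutated = apply_errors(original, 0.5)

# Count positions where the letter differs from the original.
changed = sum(a != b for a, b in zip(original, mutated))
fraction = changed / 10000
print(fraction)  # close to 0.5, but generally not exactly 0.5
```

Each letter is decided independently, so the share of changed letters only approaches prob on average. If an exact share were required, one option would be to pick the positions up front with random.sample, which is a different technique from the one shown here.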
Here’s another solution. It doesn’t use the string length or track the iteration count, as each character is appended to a new list, which is turned into a str with "".join(new_string). It preserves newline characters.
import random

def apply_errors(string, probability):
    new_string = []
    for nucleobase in string:
        if nucleobase != "\n" and probability > random.random():
            nucleobase = random.choice("ACGT".replace(nucleobase, ""))
        new_string.append(nucleobase)
    return "".join(new_string)
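As a usage sketch, this is how the function could be applied to the list from the question, assuming each string ends in a newline. The function is repeated so the block runs on its own, and the output is random, so none is shown:

```python
import random

def apply_errors(string, probability):
    new_string = []
    for nucleobase in string:
        if nucleobase != "\n" and probability > random.random():
            nucleobase = random.choice("ACGT".replace(nucleobase, ""))
        new_string.append(nucleobase)
    return "".join(new_string)

data = ['AAACGGGATT\n', 'CTGTGTCAGT\n', 'AATCTCTACT\n']
data_updated = [apply_errors(unit, 0.5) for unit in data]

for before, after in zip(data, data_updated):
    # Lengths match and the trailing newline is untouched.
    print(repr(before), '->', repr(after))
```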