Removing characters from a list of strings if they don't follow a list
Question:
I have a python code and I’m working with a list of sequences
seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
nucleotides = ['G','A','C','T', 'U']
if any(x not in nucleotides for x in NTs):
print("ERROR: non-nucleotide characters present")
so this works so far and it does tell me if there are non-nucleotide characters present, but I also need the code to remove those non-nucleotide characters so I can further process the sequences. I was wondering how I would do this
Answers:
Line 3 and 5 has two different way of doing it. Line 3: gather all the characters that are also present in nucleotides irrespective of their case and then join them together. Line 5: similar as before but only match if they are upper case.
In [1]: seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgcc
...: attatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCC
...: CCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCC
...: GCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCG
...: CTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
...:
...: NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
...:
...: nucleotides = ['G','A','C','T', 'U']
In [2]: # if lower case gatcu allowed
In [3]: [''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
Out[3]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
In [4]: # if lower case is not allowed
In [5]: [''.join(i for i in x if i in nucleotides) for x in NTs]
Out[5]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'ACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
You can not modify an existing string, but you can create a new string and append to it. Off the top of my head you could do something like
valid_seqs = []
for NT in NTs:
new_seq = ''
for x in sequence:
if x in nucleotides:
new_seq += x
valid_seqs.append(new_seq)
You would only add the valid nucleotides to the new_seq string. You would end up with a list of sequences in valid_seqs that only contains valid nucleotides.
I have a python code and I’m working with a list of sequences
seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
nucleotides = ['G','A','C','T', 'U']
if any(x not in nucleotides for x in NTs):
print("ERROR: non-nucleotide characters present")
so this works so far and it does tell me if there are non-nucleotide characters present, but I also need the code to remove those non-nucleotide characters so I can further process the sequences. I was wondering how I would do this
Line 3 and 5 has two different way of doing it. Line 3: gather all the characters that are also present in nucleotides irrespective of their case and then join them together. Line 5: similar as before but only match if they are upper case.
In [1]: seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgcc
...: attatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCC
...: CCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCC
...: GCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCG
...: CTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
...:
...: NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
...:
...: nucleotides = ['G','A','C','T', 'U']
In [2]: # if lower case gatcu allowed
In [3]: [''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
Out[3]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
In [4]: # if lower case is not allowed
In [5]: [''.join(i for i in x if i in nucleotides) for x in NTs]
Out[5]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'ACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
You can not modify an existing string, but you can create a new string and append to it. Off the top of my head you could do something like
valid_seqs = []
for NT in NTs:
new_seq = ''
for x in sequence:
if x in nucleotides:
new_seq += x
valid_seqs.append(new_seq)
You would only add the valid nucleotides to the new_seq string. You would end up with a list of sequences in valid_seqs that only contains valid nucleotides.