Removing characters from a list of strings if they don't follow a list

Question:

I have a python code and I’m working with a list of sequences

seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'

NTs = [seq0,seq1,seq2,seq3,seq4,seq5]

nucleotides = ['G','A','C','T', 'U']

if any(x not in nucleotides for x in NTs):
    print("ERROR: non-nucleotide characters present")

so this works so far and it does tell me if there are non-nucleotide characters present, but I also need the code to remove those non-nucleotide characters so I can further process the sequences. I was wondering how I would do this

Asked By: Inan Khan

||

Answers:

Line 3 and 5 has two different way of doing it. Line 3: gather all the characters that are also present in nucleotides irrespective of their case and then join them together. Line 5: similar as before but only match if they are upper case.

In [1]: seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgcc
   ...: attatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCC
   ...: CCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCC
   ...: GCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCG
   ...: CTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
   ...: 
   ...: NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
   ...: 
   ...: nucleotides = ['G','A','C','T', 'U']

In [2]: # if lower case gatcu allowed

In [3]: [''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
Out[3]: 
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
 'attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
 'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
 'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG',
 'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
 'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']

In [4]: # if lower case is not allowed

In [5]: [''.join(i for i in x if i in nucleotides) for x in NTs]
Out[5]: 
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
 'ACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
 'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
 'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAGTCAGCCCCGCAGGGG',
 'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
 'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
Answered By: Osman Mamun

You can not modify an existing string, but you can create a new string and append to it. Off the top of my head you could do something like

valid_seqs = []
for NT in NTs:
    new_seq = ''
    for x in sequence:
        if x in nucleotides:
            new_seq += x
    valid_seqs.append(new_seq)

You would only add the valid nucleotides to the new_seq string. You would end up with a list of sequences in valid_seqs that only contains valid nucleotides.

Answered By: CSEngiNerd
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.