Remove duplicated sequences in FASTA with Python

Question:

I apologize if the question has been asked before, but I have been searching for days and could not find a solution in Python.

I have a large fasta file, containing headers and sequences.

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

>cavPor3_rmsk_tRNA-Ser-TCA(m) range=chrM:6875-6940 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

This is a very small fragment of what the file looks like. I want to keep only the first entry (header and sequence) if, as you can see for the last two entries, the sequences are the same.

The output would look like this:

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

The problem is that the FASTA file is over one gigabyte in size. I have found ways to solve this by removing duplicates based on duplicate IDs, or by using bash, but sadly I can't use those on my computer.
This task is for a research project, not homework.

Thank you in advance for your help!

Asked By: Marco Badici


Answers:

If each record is a header line followed by its (possibly wrapped) sequence lines, this is easy: group the lines into records and deduplicate on the full sequence. You will have to store all of the unique sequences in memory.

memory = set()
header, seq = None, []
with open('x.txt') as fh:
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            # flush the previous record if its sequence is new
            if header and ''.join(seq) not in memory:
                memory.add(''.join(seq))
                print(header, *seq, sep='\n')
            header, seq = line, []
        elif line:
            seq.append(line)
if header and ''.join(seq) not in memory:
    print(header, *seq, sep='\n')
Answered By: Tim Roberts
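Since the file is over a gigabyte, keeping every unique sequence in a set can use a lot of memory; a variation is to store only a fixed-size hash of each sequence instead. A minimal sketch, not from the original answer (the function name and the sample data are invented for illustration), assuming each record is a header line followed by its sequence lines:

```python
import hashlib

def dedupe_fasta(lines):
    """Yield header and sequence lines, skipping any record whose
    full sequence (joined across wrapped lines) was seen before."""
    seen = set()
    header, seq = None, []
    for line in list(lines) + [">"]:  # ">" sentinel flushes the last record
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                # 20-byte digest instead of the full sequence string
                digest = hashlib.sha1("".join(seq).encode()).digest()
                if digest not in seen:
                    seen.add(digest)
                    yield header
                    yield from seq
            header, seq = line, []
        elif line:
            seq.append(line)

fasta = [">a", "GTTAAG", ">b", "AGAGGG", ">c", "AGAGGG"]
print(list(dedupe_fasta(fasta)))  # ['>a', 'GTTAAG', '>b', 'AGAGGG']
```

With SHA-1 the per-record cost is a constant 20 bytes regardless of sequence length; an accidental hash collision is astronomically unlikely at this scale, but if that matters, fall back to storing the sequences themselves.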

This is copied from "Remove Redundant Sequences from FASTA file in Python".

It uses Biopython and works with FASTA files whose headers follow the standard '>header' format (see the FASTA format Wikipedia article).

from Bio import SeqIO
import time

start = time.time()

seen = []     # sequences already encountered
records = []  # records to keep

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# write the deduplicated records to a fasta file
SeqIO.write(records, "OUTPUT-FILE", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60} minutes")


Faster, as suggested by MattMDo, using a set instead of a list:

seen = set()
records = []

for record in SeqIO.parse("b4r2.fasta", "fasta"):
    if str(record.seq) not in seen:
        seen.add(str(record.seq))
        records.append(record)
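The speedup comes from membership testing: `x in a_list` scans every element (O(n)), while `x in a_set` is a hash lookup (average O(1)). A small Biopython-free illustration (the sizes are arbitrary):

```python
import timeit

# membership test on the last element: worst case for the list scan
seqs = [f"SEQ{i}" for i in range(20000)]
as_list = list(seqs)
as_set = set(seqs)

t_list = timeit.timeit(lambda: "SEQ19999" in as_list, number=200)
t_set = timeit.timeit(lambda: "SEQ19999" in as_set, number=200)
print(t_list > t_set)  # True: the set lookup wins by orders of magnitude
```

With thousands of sequences, each `not in seen` check against a list rescans everything seen so far, so the whole loop degrades to quadratic time; the set version stays roughly linear.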

I've got a longer version that uses argparse, but it's slower because it also counts the sequences; I can post it if needed.

Answered By: pippo1980

If you want to keep both headers while removing duplicates, you can use:

from Bio import SeqIO

input1 = open("fasta.fasta")
dict_fasta = {record.id: record.seq for record in SeqIO.parse(input1, "fasta")}

# invert the mapping: sequence -> list of headers that share it
tmp = {}
for key, value in dict_fasta.items():
    if value in tmp:
        tmp[value].append(key)
    else:
        tmp[value] = [key]
Answered By: The Biotech
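A follow-up not in the original answer: to turn a sequence-to-headers mapping like `tmp` above into output, the headers that share a sequence can be joined into a single record. A minimal Biopython-free sketch (the function name and the sample data are invented for illustration):

```python
def merge_headers(records):
    """records: iterable of (header, sequence) pairs.
    Returns one (merged_header, sequence) pair per unique sequence,
    with the headers of identical sequences joined by '; '."""
    by_seq = {}
    for header, seq in records:
        by_seq.setdefault(seq, []).append(header)
    return [("; ".join(headers), seq) for seq, headers in by_seq.items()]

recs = [("tRNA-Leu", "GTTAAG"), ("tRNA-Gln", "AGAGGG"), ("tRNA-Ser", "AGAGGG")]
print(merge_headers(recs))
# [('tRNA-Leu', 'GTTAAG'), ('tRNA-Gln; tRNA-Ser', 'AGAGGG')]
```

Because Python dicts preserve insertion order, the output keeps the records in their original file order, with the first header of each duplicate group listed first.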

I applied the code copied from "Remove Redundant Sequences from FASTA file in Python" to unique sequences, and 96 entries out of 2680 were lost. I don't think it is a valid solution.

Answered By: Ivan Pchelin