Change the name of a sequence by adding a number for duplicates
Question:
I have a dataframe like this:
old_name new_name pident length
gene1_0035_0042 geneA 100 560
gene2_0035_0042 geneA 100 545
gene3_0042_0035 geneB 99 356
gene4_0042_0035 geneB 97 256
gene6_0035_0042 geneB 96 567
And here is an example of the FASTA file:
>gene1_0035_0042
ATTGAC
>gene2_0035_0042
ATGAGCC
>gene3_0042_0035
AGCCAG
>gene4_0042_0035
AGCCAT
>gene6_0035_0042
AGCCATG
I wrote a script that replaces the old_name of each sequence in the FASTA file with the new_name (in the script, qseqid = old_name and Busco_ID = new_name):
import pandas as pd
from Bio import SeqIO

blast = pd.read_table("matches_Busco_0035_0042_best_hit.m8", header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
# keep only the hits with more than 95% identity
repl = blast[blast.pident > 95]
repl.to_csv("busco_blast_non-rename.txt", sep='\t')
# .ix was removed from pandas; .iloc selects columns by position
qseqid = repl.iloc[:, 0]
Busco_ID = repl.iloc[:, 1]
newfile = []
count = 0
running_inds = {}
for rec in SeqIO.parse("concatenate_0035_0042_dna2.fa", "fasta"):
    # get corresponding value for record ID from dataframe
    # repl["qseqid"] and repl["Busco_ID"] are the pandas columns with the old and new sequence names, respectively
    x = repl.loc[repl["qseqid"] == rec.id, "Busco_ID"]
    # change record, if not empty
    if not x.empty:
        running = running_inds.get(x.iloc[0], 1)  # Get the running index for this sequence
        running_inds[x.iloc[0]] = running + 1
        # append old identifier number to the new id name
        rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
        count += 1
    # append record to list
    newfile.append(rec)
# write list into new fasta file
SeqIO.write(newfile, "concatenate_with_busco_names_0035_0042_dna.fa", "fasta")
# tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
As you can see, I only keep the sequences with a pident > 95. The problem is that sequences sharing the same new_name and the same suffix all end up with the same name. Instead, I would like to add a number at the end of the new name. For the example above, the FASTA file would become:
>geneA_0035_0042_1
ATTGAC
>geneA_0035_0042_2
ATGAGCC
>geneB_0042_0035_1
AGCCAG
>geneB_0042_0035_2
AGCCAT
>geneB_0035_0042_1
AGCCATG
and so on
instead of:
>geneA_0035_0042
ATTGAC
>geneA_0035_0042
ATGAGCC
>geneB_0042_0035
AGCCAG
>geneB_0042_0035
AGCCAT
>geneB_0035_0042
AGCCATG
as my script does
Thanks for your help
Issue:
I got:
>EOG090X0FA0_0042_0042_1
>EOG090X0FA0_0042_0035_2
>EOG090X0FA0_0035_0035_3
>EOG090X0FA0_0035_0042_4
but since these names are all different, each one should get the number 1:
>EOG090X0FA0_0042_0042_1
>EOG090X0FA0_0042_0035_1
>EOG090X0FA0_0035_0035_1
>EOG090X0FA0_0035_0042_1
Answers:
Add a dictionary before the start of the loop to hold a running counter for each new name:
running_inds = {}
for rec in SeqIO.parse("concatenate_0035_0042_dna2.fa", "fasta"):
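Then, just before the line
rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
do the following, using the full new name (new prefix plus old suffix) as the dictionary key so that each distinct name gets its own counter: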
running = running_inds.get(x.iloc[0] + rec.id[rec.id.index("_"):], 1) # Get the running index for this sequence
running_inds[x.iloc[0] + rec.id[rec.id.index("_"):]] = running + 1
Now simply append the running index to the name:
rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):] + '_' + str(running)
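To check the counting logic on its own, here is a minimal sketch that leaves out pandas and Biopython and works on plain strings. The function name rename_ids and the new_names dictionary are made up for illustration, and it assumes every old id contains at least one underscore (otherwise index("_") raises a ValueError).

```python
def rename_ids(old_ids, new_names):
    """Return the renamed ids; ids without an entry in new_names are kept as-is."""
    running_inds = {}  # full new name (prefix + suffix) -> next running index
    renamed = []
    for old_id in old_ids:
        prefix = new_names.get(old_id)
        if prefix is None:
            renamed.append(old_id)
            continue
        # keep everything from the first underscore on, e.g. "_0035_0042"
        base = prefix + old_id[old_id.index("_"):]
        running = running_inds.get(base, 1)
        running_inds[base] = running + 1
        renamed.append(base + "_" + str(running))
    return renamed


# hypothetical mapping built from the example dataframe in the question
new_names = {
    "gene1_0035_0042": "geneA",
    "gene2_0035_0042": "geneA",
    "gene3_0042_0035": "geneB",
    "gene4_0042_0035": "geneB",
    "gene6_0035_0042": "geneB",
}
old_ids = ["gene1_0035_0042", "gene2_0035_0042", "gene3_0042_0035",
           "gene4_0042_0035", "gene6_0035_0042"]
result = rename_ids(old_ids, new_names)
print("\n".join(result))
```

Because the dictionary is keyed on the full new name rather than on the Busco_ID alone, geneB_0042_0035 and geneB_0035_0042 are counted separately, which is the behaviour asked for in the issue above.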