Extracting ID from file using python script?

Question

I’m trying to extract just the sequence ID from the the file in Linux server.
To give you few examples TRINITY_DN0_c0_g1_i1.p1 and TRINITY_DN0_c0_g1_i3.p1 and sequence IDs. The sequence ID lengths are not same but they all start with TRINITY and ends with .p1.

I tried using awk '{print$1}' filename.cdhit > seq_id.fasta but instead I got this

I just want the ID but it would also extract non-interest information as well (the lengthy alphabetic protein seq).
I have attempted to create a python script in hopes to just extract the IDs:

import re

file_path = '/var2/user/de_novo/data/transdecoder_dir/Trinity.fasta.transdecoder.pep.cdhi
t'
new_file_path = '/var2/user/de_novo/data/transdecoder_dir/seqID.fasta'

with open(file_path, 'rt') as file:
    for myline in file:
        if ".p1" in file:
            with open(new_file_path, 'w') as new_file:
                new_file.write()
        else:
            print('No match found.')

Tried creating python script, running linux command
But comes out as not match found. Not sure where I went wrong.
Would appreciate with any help, thank you.

Asked By: Caroline

||

Source

Answer 1

This should do the trick:

awk '{print$1}' filename.cdhit | grep TRINITY | cut -c2-

As for python, check out https://biopython.org/. It’s a collection of bioinformatics-related goodies.

OR

Here a ‘.cdhit’-file reader:
https://pypi.org/project/cdhit-reader/

Answered By: cybe

Answer 2

you can combine that awk | grep | cut into :

mawk 'sub(".",_, $(NF = /TRINITY/))'

using regex to set NF = 1/0, which is same as print $1 | grep
when it’s false, NF is set to 0, so whole line is blanked out, thus sub() will fail and return a 0 (# instances replaced), thus skipping ineligible lines
the sub() does same thing as cut

Answered By: RARE Kpop Manifesto

Extracting ID from file using python script?

Question:

Answers: