How to tokenize a block of text as one token in Python?

Question:

Recently I have been working on a genome data set that consists of many blocks of genome sequences. In previous natural language processing work, I used sent_tokenize and word_tokenize from nltk to split text into sentences and words. But when I use these functions on the genome data set, they do not tokenize the sequences correctly. The text below shows part of the data set.

>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
aatgttttatataaattgcagtatgtgtcacccaaaatagcaaaccccat
aaccaaccagattattatgatacataatgcttatatgaaactaagacatt
tcgcaacatttattttaggtatataaatacatttattgaaggaattgata
tatgccagtaaaatggtgtatttttaatttctttcaataaaaacataatt
gacattatataaaaatgaattataaaactctaagcggtggatcactcggc
tcatgggtcgatgaagaacgcagcaaactgtgcgtcatcgtgtgaactgc
aggacacatgaacatcgacattttgaacgcatatcgcagtccatgctgtt
atgtactttaattaattttatagtgctgcttggactacatatggttgagg
gttgtaagactatgctaattaagttgcttataaatttttataagcatatg
gtatattattggataaatataataatttttattcataatattaaaaaata
aatgaaaaacattatctcacatttgaatgt
>NR_004047 1
atattcaggttcatcgggcttaacctctaagcagtttcacgtactgttta
actctctattcagagttcttttcaactttccctcacggtacttgtttact
atcggtctcatggttatatttagtgtttagatggagtttaccacccactt
agtgctgcactatcaagcaacactgactctttggaaacatcatctagtaa
tcattaacgttatacgggcctggcaccctctatgggtaaatggcctcatt
taagaaggacttaaatcgctaatttctcatactagaatattgacgctcca
tacactgcatctcacatttgccatatagacaaagtgacttagtgctgaac
tgtcttctttacggtcgccgctactaagaaaatccttggtagttactttt
cctcccctaattaatatgcttaaattcagggggtagtcccatatgagttg
>NR_004052 1

When the nltk tokenizer is applied to this data set, each line of text (for example tattattatacacaatcccggggcgttctatatagttatgtataatgtat) becomes one token, which is not correct. Instead, a whole block of sequence lines should be considered one token. For example, in this case the contents between >NR_004049 1 and >NR_004048 1 should be considered one token:

>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1 

So each block, from a header line such as >NR_004049 1 up to the next header line, should be considered one token. The problem is tokenizing this kind of data set, and I don't have any idea how to do it correctly.
I would really appreciate any answers that help me solve this.

Update:
One way to solve this problem is to append all lines within each block and then use the nltk tokenizer. For example, this means joining all lines between >NR_004049 1 and >NR_004048 1 into one string, so that the nltk tokenizer will consider it one token. Can anyone help me with how to append the lines within each block?

Asked By: Orca


Answers:

Apparently you just need to concatenate the lines between two ids. There should be no need for nltk or any tokenizer, just a bit of programming 😉


patterns = {}
with open('data', "r") as f:
    seq_id = None
    current = ""
    for raw_line in f:
        line = raw_line.strip()
        if not line:  # skip blank lines (line[0] would fail on them)
            continue
        if line.startswith('>'):  # header line: a new block starts
            if current:
                # store the block that just ended
                patterns[seq_id] = current
                current = ""
            # the id is the first field of the header, without the leading '>'
            seq_id = line.split()[0][1:]
        else:  # sequence line: extend the current block
            current += line
    # store the last block, which has no header after it
    if current:
        patterns[seq_id] = current


# do whatever with the patterns:
for seq_id, pattern in patterns.items():
    print(f"{seq_id}\t{pattern}")
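
If you would rather not read from a file directly, the same idea can be wrapped in a function that accepts any iterable of lines, which makes it easy to try on a small sample. This is only a sketch: the names read_blocks and sample are made up for illustration, the sample sequences are shortened from the question, and it assumes every block starts with a header line beginning with '>'. Like the loop above, it skips headers that have no sequence lines after them.

```python
from typing import Dict, Iterable


def read_blocks(lines: Iterable[str]) -> Dict[str, str]:
    """Join the sequence lines under each '>' header into one string per id."""
    blocks: Dict[str, str] = {}
    seq_id = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # a new header closes the previous block, if it had any sequence
            if seq_id is not None and parts:
                blocks[seq_id] = "".join(parts)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            parts = []
        elif seq_id is not None:
            parts.append(line)
    # the last block has no header after it, so store it here
    if seq_id is not None and parts:
        blocks[seq_id] = "".join(parts)
    return blocks


# shortened, made-up sample in the same layout as the question's data
sample = """>NR_004049 1
tattattatac
caact
>NR_004048 1
aatgtttt
gaatgt
""".splitlines()

blocks = read_blocks(sample)
tokens = list(blocks.values())  # one token per block
print(tokens)
```

For a real file, something like read_blocks(open('data')) should work the same way, since a file object is also an iterable of lines.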
Answered By: Erwan