Read Clustal file in Python

Question:

I have a multiple sequence alignment (MSA) file derived from mafft in clustal format which I want to import into Python and save into a PDF file. I need to import the file and then highlight some specific words. I’ve tried to simply import the pdf of the MSA but after the highlight command doesn’t work.

I need to print the file like this:
enter image description here

CLUSTAL format alignment by MAFFT FFT-NS-i (v7.453)


Consensus       --------------------------------------acgttttcgatatttatgccat
AMP             tttatattttctcctttttatgatggaacaagtctgcgacgttttcgatatttatgccat
                                                      **********************

Consensus       atgtgcatgttgtaaggttgaaagcaaaaatgaggggaaaaaaaatgaggtttttaataa
AMP             atgtgcatgttgtaaggttgaaagcaaaaatgaggggaaaaaaaatgaggtttttaataa
                ************************************************************

Consensus       ctacacatttagaggtctaggaaataaaggagtattaccatggaaatgtatttccctaga
AMP             ctacacatttagaggtctaggaaataaaggagtattaccatggaaatgtaattccctaga
                ************************************************** *********

Consensus       tatgaaatattttcgtgcagttacaacatatgtgaatgaatcaaaatatgaaaaattgaa
AMP             tatgaaatattttcgtgcagttacaacatatgtgaatgaatcaaaatatgaaaaattgaa
                ************************************************************

Consensus       atataagagatgtaaatatttaaacaaagaaactgtggataatgtaaatgatatgcctaa
AMP             atataagagatgtaaatatttaaacaaagaaactgtggataatgtaaatgatatgcctaa
                ************************************************************

Consensus       ttctaaaaaattacaaaatgttgtagttatgggaagaacaaactgggaaagcattccaaa
AMP             ttctaaaaaattacaaaatgttgtagttatgggaagaacaaactgggaaagcattccaaa
                ************************************************************

Consensus       aaaatttaaacctttaagcaataggataaatgttatattgtctagaaccttaaaaaaaga
AMP             aaaatttaaacctttaagcaataggataaatgttatattgtctagaaccttaaaaaaaga
                ************************************************************

Consensus       agattttgatgaagatgtttatatcattaacaaagttgaagatctaatagttttacttgg
AMP             agattttgatgaagatgtttatatcattaacaaagttgaagatctaatagttttacttgg
                ************************************************************

Consensus       gaaattaaattactataaatgttttattataggaggttccgttgtttatcaagaattttt
AMP             gaaattaaattactataaatgttttattataggaggttccgttgtttatcaagaattttt
                ************************************************************

Consensus       agaaaagaaattaataaaaaaaatatattttactagaataaatagtacatatgaatgtga
AMP             agaaaagaaattaataaaaaaaatatattttactagaataaatagtacatatgaatgtga
                ************************************************************

Consensus       tgtattttttccagaaataaatgaaaatgagtatcaaattatttctgttagcgatgtata
AMP             tgtattttttccagaaataaatgaaaatgagtatcaaattatttctgttagcgatgtata
                ************************************************************

Consensus       tactagtaacaatacaacattgga----------------------------------
AMP             tactagtaacaatacaacattggattttatcatttataagaaaacgaataataaaatg
                ************************                                  

How can I import the alignment and print in the new PDF with the right alignment of the sequences.

Thanks

Multi.txt

Asked By: Denise Lavezzari

||

Answers:

Ok, figured out a way, not sure its the best one,

nedd to install fpdf2 (pip install fpdf2)

from io import StringIO

from Bio import AlignIO  # Biopython 1.80

from fpdf import FPDF # pip install fpdf2
 
alignment = AlignIO.read("Multi.txt", "clustal")


stri = StringIO()
    
AlignIO.write(alignment, stri, 'clustal' )

# print(stri.getvalue())

stri_lines = [ i for i in stri.getvalue().split('n')]

# print(stri_lines)

pdf = FPDF(orientation="P", unit="mm", format="A4")
 
# Add a page
pdf.add_page()

pdf.add_font('FreeMono', '', 'FreeMono.ttf')

pdf.set_font("FreeMono", size = 8)

for x in stri_lines:
    pdf.cell(0, 5, txt = x, border = 0,  new_x="LMARGIN" , new_y="NEXT",  align = 'L', fill = False)
    
    # print(len(x))

pdf.output("out.pdf") 

output pdf out.pdf :

enter image description here

Not sure why the file Header is changed, think is something within Biopythion (!!! ???) you can check adding:

with open('file_output.txt', 'w') as filehandler:
    AlignIO.write(alignment, filehandler, 'clustal')

I had to place a tff FreeMono font (Mono spaced font in my script directory) see: pdf.add_font('FreeMono', '', 'FreeMono.ttf') otherwise the alignement won’t be printed in the correct way
Which fonts have the same width for every character?.

Attached a png of my pdf. See that you can highlight it,

using:

pdf.set_fill_color(255, 255, 0)

filling = False
for x in stri_lines:
    if 'Consensus' in x:
        filling = True
    else:
        filling = False
    pdf.cell(0, 5, txt = x, border = 0,  new_x="LMARGIN" , new_y="NEXT",  align = 'L', fill = filling)

or something similar you can highlight while printing:

enter image description here

Answered By: pippo1980

adding, a second answer because the OP request about

I need to import the file and then highlight some specific word

was making me uneasy. I kept using the Pyfpdf2 library and made use of the Biopython SeqRecord.letter_annotations attribute to track Nucleotides index composing the words found by RegEx. Tried to use the page.search_for() of the PyMuPDF library directly on the output pdf of my previous answer but was not getting word accross line breaks [if anybody know better, please help].

Here my code, it’s a two half problem, first find the word (in my example is ‘atat‘), then write everything to pdf, I stole it from from /master/Bio/AlignIO/ClustalIO.py/ def write_alignment(self, alignment): from class ClustalWriter(SequentialAlignmentWriter) with some modding [I believe but not absolutely sure that …A cell by definition is uniform in font and color… ]:

import re

from Bio import AlignIO  # Biopython 1.80

from fpdf import FPDF # pip install fpdf2
 

alignment = AlignIO.read("Multi.txt", "clustal")


word_to_find = 'atat'

for rec in alignment:
    
    rec.letter_annotations['highlight'] = 'N'*len(rec.seq)

# for rec in alignment:    
#     print('rec.id : ', rec.id, 'highlight : ', rec.letter_annotations['highlight'])
    
    
for rec in alignment:
    
    # print(str(rec.seq))
    
    found = [m.start() for m in re.finditer( '(?='+re.escape(word_to_find)+')' , re.escape(str(rec.seq.replace('-', 'Z'))))]
    
    # print(rec.id , ' found : ', found)
    
    for i in found :
        
            a = list(rec.letter_annotations['highlight'])
    
            a[i:i+len(word_to_find)] = 'Y'*len(word_to_find)
            
            rec.letter_annotations['highlight'] = ''.join(i for i in a)
            
            # print(rec.letter_annotations['highlight'])
            
            

max_length = alignment.get_alignment_length()


# stolen from /master/Bio/AlignIO/ClustalIO.py/   def write_alignment(self, alignment): from class ClustalWriter(SequentialAlignmentWriter):
def write_alignment(alignment):
    
        x = 2
        
        y = 2

        pdf = FPDF(orientation="P", unit="mm", format="A4")
         
        # Add a page
        pdf.add_page()

        pdf.add_font('FreeMono', '', 'FreeMono.ttf')

        pdf.set_font("FreeMono", size = 12)
        
        pdf.set_fill_color(255, 255, 0)

    
        """Use this to write (another) single alignment to an open file."""
        if len(alignment) == 0:
            raise ValueError("Must have at least one sequence")
        if alignment.get_alignment_length() == 0:
            # This doubles as a check for an alignment object
            raise ValueError("Non-empty sequences are required")

        # Old versions of the parser in Bio.Clustalw used a ._version property
        try:
            version = str(alignment._version)
        except AttributeError:
            version = ""
        if not version:
            version = "1.81"
        if version.startswith("2."):
            # e.g. 2.0.x
            output = f"CLUSTAL {version} multiple sequence alignment"
        else:
            # e.g. 1.81 or 1.83
            output = f"CLUSTAL X ({version}) multiple sequence alignment"
            
        pdf.cell(0, y, txt = output , border = 0,  new_x='LMARGIN'  , new_y="NEXT" ,  align = 'L', fill = False)
        pdf.cell(0, 4*y, txt = '' , border = 0,  new_x='LMARGIN'  , new_y="NEXT" ,  align = 'L', fill = False)
        

        cur_char = 0
        # # max_length = len(alignment[0])
        max_length = alignment.get_alignment_length()


        if max_length <= 0:
            raise ValueError("Non-empty sequences are required")

        if "clustal_consensus" in alignment.column_annotations:
            star_info = alignment.column_annotations["clustal_consensus"]
        else:
            try:
                # This was originally stored by Bio.Clustalw as ._star_info
                star_info = alignment._star_info
            except AttributeError:
                star_info = None

        # keep displaying sequences until we reach the end
        
        rec_cnt = 0
        
        while cur_char != max_length:
            # calculate the number of sequences to show, which will
            # be less if we are at the end of the sequence
            if (cur_char + 50) > max_length:
                show_num = max_length - cur_char
            else:
                show_num = 50

            # go through all of the records and print out the sequences
            # when we output, we do a nice 80 column output, although this
            # may result in truncation of the ids.
            for record in alignment:
                
                rec_cnt += 1 
                
                # Make sure we don't get any spaces in the record
                # identifier when output in the file by replacing
                # them with underscores:
                line = record.id[0:30].replace(" ", "_").ljust(50)
                
                pdf.cell(50, y, txt = line, border = 0,  new_x='RIGHT' , align = 'L', fill = False)

                
                line_seq = list(str(record.seq[cur_char : (cur_char + show_num)]))
                
                line_seq_highlight = list(str(record.letter_annotations['highlight'][cur_char : (cur_char + show_num)]))
                
                
                for i in range(len(line_seq)):
                    
                    if line_seq_highlight[i] == 'N' :
                        
                        filling = False
                    
                    if line_seq_highlight[i] == 'Y' :
                        
                        filling = True
                        
                    else:
                        
                        filling = False
                        
                    pdf.cell(x, y, txt = line_seq[i], border = 0,  new_x= 'RIGHT' ,  align = 'C', fill = filling)
                    
                pdf.cell(x, 2*y, txt = ' ' , border = 0,  new_x="LMARGIN" , new_y="NEXT" ,  align = 'L', fill = False)
                
                # now we need to print out the star info, if we've got it
                
                if star_info and rec_cnt == len(alignment):
                    
                    pdf.cell(x, 2*y, txt = ' ' , border = 0,  new_x="LMARGIN" , new_y="NEXT" ,  align = 'L', fill = False)
                    
                    rec_cnt = 0
                    
                    # print(star_info)
                    
                    line = (' ' * 50)
                    
                    pdf.cell(50, y, txt = line, border = 0,  new_x='RIGHT' , align = 'L', fill = False)
                    
                    star_info_seq = list(str(star_info[cur_char : (cur_char + show_num)]))

                    for i in range(len(star_info_seq)):
                        
                        pdf.cell(x, y, txt = star_info_seq[i], border = 0,  new_x= 'RIGHT' ,  align = 'C', fill = False)


                    
                pdf.cell(x, 2*y, txt = ' ' , border = 0,  new_x="LMARGIN" , new_y="NEXT" ,  align = 'L', fill = False)
                
            
                
            cur_char += show_num  
            
            pdf.cell(x, 3*y, txt = ' ' , border = 0,  new_x="LMARGIN" , new_y="NEXT" ,  align = 'L', fill = False)


        
        pdf.output("out.pdf") 


write_alignment(alignment)


I tested it, with the Multi.txt Clustal alignment from above, output is

out.pdf :

enter image description here

.

One last note:

needed to replace - with Z in the Consensus string to be able to get reliable results from the RegEX line : found = [m.start() for m in re.finditer( '(?='+re.escape(word_to_find)+')' , re.escape(str(rec.seq.replace('-', 'Z'))))] , not sure what is going on here, got the snippet from a SO post.

Answered By: pippo1980