Parsing a .pdb file in Python

Question:

I’m trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb

If you scroll down far enough you will get to a line that looks like this:
ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N

The first element “atom” means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, “MET” is the name of the residue, “A” relates to the the type of chain, 1 (the second “1”) is the atom count and then the next 3 numbers are the x-y-z positions in space.

What I need output is something like this (where the “1” below corresponds to the atom count, not the general count):
MET A 1 63.645 97.355 31.526

To make matters more complicated, sometimes the atom count (the second “1” in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:

ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C

While they both refer to residue 21, the biochem isn’t precise enough to get an exact position, so they give two options. Ideally, I would specify “1”, “2” or whatever, but if I just take the first option that would be ok. Finally, for the type of atom (“N”) in my original example, I only want to get those lines with a “CA”.

I’m a newbie to python, and my training is in biostats so I was wondering what’s the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?

I realize it’s a bit to ask, but some guidance would be a ton of help! I’ve programmed all the statistics bits in using R, but now I just need to get my files in the right format!

Thanks!

Asked By: user1357015

||

Answers:

It looks like a library to parse PDB files already exists. Check out:

http://www.csb.pitt.edu/prody/index.html

Briefly looking through the tutorials, it seems like this is what you’re talking about no?

http://www.csb.pitt.edu/prody/tutorial.html#atomic-data

Answered By: Jeff_C

That’s a long description. I am not sure I got all that correctly 🙂
If the fields (for the lines start with ATOM) are fixed,
you could use a split and some comaprisons.
I’ve used a hash to see if the entry is already seen to eliminate
the duplicate as you wanted.
Hope this would give you a start,

visited = {}
for line in open('3GFT.pdb'):
    list = line.split()
    id = list[0]
    if id == 'ATOM':
        type = list[2]
        if type == 'CA':
            residue = list[3]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:8]
            if atom_count >= 0:
                if type_of_chain not in visited:
                    visited[type_of_chain] = 1
                    print residue,type_of_chain,atom_count,' '.join(position)

Will output,

MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590
Answered By: dpp

For large pdb files with a lot of atoms, there is no blank left between fields, so you cannot use the split command. Instead, you can use the Protein Data Bank format definition to parse the pdb file:

with open('min.pdb') as pdbfile:
    for line in pdbfile:
        if line[:4] == 'ATOM' or line[:6] == "HETATM":
            print line
            # Split the line
            splitted_line = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]
            print splitted_line
            # To format again the pdb file with the fields extracted
            print "%-6s%5s %4s %3s %s%4s    %8s%8s%8sn"%tuple(splitted_line)

Below is a sample output:

# Original line:
HETATM10000  O   HOH B3257       2.509  40.006  -4.097  1.00  0.00           O

# Fields extracted:
['HETATM', '10000', ' O  ', 'HOH', 'B', '3257', '   2.509', '  40.006', '  -4.097']

# Reformatted line:
HETATM10000  O   HOH B3257       2.509  40.006  -4.097
Answered By: bougui

I am mildly surprised that no one mentioned the Bio.PDB package from BioPython. Writing a PDB parser on your own is a rather serious case of unnecessarily reinventing, I mean reimplementing the wheel.

BioPython is a useful collection of packages for working with other kinds of biological data (e.g. protein or nucleic acid sequences) as well.

Update

I added a few functions I use for downloading and reading PDB files by way of example:

import os
import sys
import urllib.request

import Bio
import Bio.PDB
import Bio.SeqRecord

def download_read_pdb(pdbcode, datadir, keepfile=True):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    Then it reads and returns the structure inside.
    :param pdbcode: The standard PDB ID e.g. '3ICB'
    :param datadir: The directory where the downloaded file will be saved
    :param keepfile: if False, then the downloaded file will be deleted (default: keep the downloaded file)
    :return: a Bio.PDB Structure object or None if something went wrong
    """
    pdbfilenm = download_pdb(pdbcode, datadir)
    if pdbfilenm is None:
        return None
    struct = read_pdb(pdbcode, pdbfilenm)
    if not keepfile:
        os.remove(pdbfilenm)
    return struct
        
def download_pdb(pdbcode, datadir, downloadurl="http://files.rcsb.org/download/"):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    :param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
    :param datadir: The directory where the downloaded file will be saved
    :param downloadurl: The base PDB download URL, cf.
        `https://www.rcsb.org/pages/download/http#structures` for details
        Note that the unencrypted HTTP protocol is used by default
        to avoid spurious OpenSSL errors...
    :return: the full path to the downloaded PDB file or None if something went wrong
    """
    pdbfn = pdbcode + ".pdb"
    url = downloadurl + pdbfn
    outfnm = os.path.join(datadir, pdbfn)
    try:
        urllib.request.urlretrieve(url, outfnm)
        return outfnm
    except Exception as err:
        # all sorts of things could have gone wrong...
        print(str(err), file=sys.stderr)
        return None

def read_pdb(pdbcode, pdbfilenm):
    """
    Read a PDB structure from a file.
    :param pdbcode: A PDB ID string
    :param pdbfilenm: The PDB file
    :return: a Bio.PDB.Structure object or None if something went wrong
    """
    try:
        pdbparser = Bio.PDB.PDBParser(QUIET=True)   # suppress PDBConstructionWarning
        struct = pdbparser.get_structure(pdbcode, pdbfilenm)
        return struct
    except Exception as err:
        print(str(err), file=sys.stderr)
        return None 

def extract_seqrecords(pdbcode, struct):
    """
    Extracts the sequence records from a Bio.PDB structure.
    :param pdbcode: the PDB ID of the structure, needed to add a sequence ID to the result
    :param struct: a Bio.PDB.Structure object
    :return: a list of Bio.SeqRecord objects
    """
    ppb = Bio.PDB.PPBuilder()
    seqrecords = []
    for i, chain in enumerate(struct.get_chains()):
        # extract and store sequences as list of SeqRecord objects
        pps = ppb.build_peptides(chain)    # polypeptides
        seq = pps[0].get_sequence() # just take the first, hope there's no chain break
        seqid = pdbcode + chain.id
        seqrec = Bio.SeqRecord.SeqRecord(seq, id=seqid, 
            description="Sequence #{}, {}".format(i+1, seqid))
        seqrecords.append(seqrec)
    return seqrecords

def get_calphas(struct):
    """
    Extracts the C-alpha atoms from a PDB structure.
    :param struct: A Bio.PDB.Structure object.
    :return: A list of Bio.PDB.Atom objects representing the C-alpha atoms in `struct`.
    """
    calphas = [ atom for atom in struct.get_atoms() if atom.get_fullname() == " CA " ]
    return calphas

I am aware that this was an old question but maybe someone still finds this pointer helpful.

Answered By: András Aszódi
Categories: questions Tags:
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.