Parsing a .pdb file in Python
Question:
I’m trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb
If you scroll down far enough you will get to a line that looks like this:
ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N
The first element “atom” means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, “MET” is the name of the residue, “A” relates to the the type of chain, 1 (the second “1”) is the atom count and then the next 3 numbers are the x-y-z positions in space.
What I need output is something like this (where the “1” below corresponds to the atom count, not the general count):
MET A 1 63.645 97.355 31.526
To make matters more complicated, sometimes the atom count (the second “1” in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:
ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C
While they both refer to residue 21, the biochem isn’t precise enough to get an exact position, so they give two options. Ideally, I would specify “1”, “2” or whatever, but if I just take the first option that would be ok. Finally, for the type of atom (“N”) in my original example, I only want to get those lines with a “CA”.
I’m a newbie to python, and my training is in biostats so I was wondering what’s the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?
I realize it’s a bit to ask, but some guidance would be a ton of help! I’ve programmed all the statistics bits in using R, but now I just need to get my files in the right format!
Thanks!
Answers:
It looks like a library to parse PDB files already exists. Check out:
http://www.csb.pitt.edu/prody/index.html
Briefly looking through the tutorials, it seems like this is what you’re talking about no?
That’s a long description. I am not sure I got all that correctly 🙂
If the fields (for the lines start with ATOM) are fixed,
you could use a split and some comaprisons.
I’ve used a hash to see if the entry is already seen to eliminate
the duplicate as you wanted.
Hope this would give you a start,
visited = {}
for line in open('3GFT.pdb'):
list = line.split()
id = list[0]
if id == 'ATOM':
type = list[2]
if type == 'CA':
residue = list[3]
type_of_chain = list[4]
atom_count = int(list[5])
position = list[6:8]
if atom_count >= 0:
if type_of_chain not in visited:
visited[type_of_chain] = 1
print residue,type_of_chain,atom_count,' '.join(position)
Will output,
MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590
For large pdb
files with a lot of atoms, there is no blank left between fields, so you cannot use the split
command. Instead, you can use the Protein Data Bank format definition to parse the pdb
file:
with open('min.pdb') as pdbfile:
for line in pdbfile:
if line[:4] == 'ATOM' or line[:6] == "HETATM":
print line
# Split the line
splitted_line = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]
print splitted_line
# To format again the pdb file with the fields extracted
print "%-6s%5s %4s %3s %s%4s %8s%8s%8sn"%tuple(splitted_line)
Below is a sample output:
# Original line:
HETATM10000 O HOH B3257 2.509 40.006 -4.097 1.00 0.00 O
# Fields extracted:
['HETATM', '10000', ' O ', 'HOH', 'B', '3257', ' 2.509', ' 40.006', ' -4.097']
# Reformatted line:
HETATM10000 O HOH B3257 2.509 40.006 -4.097
I am mildly surprised that no one mentioned the Bio.PDB package from BioPython. Writing a PDB parser on your own is a rather serious case of unnecessarily reinventing, I mean reimplementing the wheel.
BioPython is a useful collection of packages for working with other kinds of biological data (e.g. protein or nucleic acid sequences) as well.
Update
I added a few functions I use for downloading and reading PDB files by way of example:
import os
import sys
import urllib.request
import Bio
import Bio.PDB
import Bio.SeqRecord
def download_read_pdb(pdbcode, datadir, keepfile=True):
"""
Downloads a PDB file from the Internet and saves it in a data directory.
Then it reads and returns the structure inside.
:param pdbcode: The standard PDB ID e.g. '3ICB'
:param datadir: The directory where the downloaded file will be saved
:param keepfile: if False, then the downloaded file will be deleted (default: keep the downloaded file)
:return: a Bio.PDB Structure object or None if something went wrong
"""
pdbfilenm = download_pdb(pdbcode, datadir)
if pdbfilenm is None:
return None
struct = read_pdb(pdbcode, pdbfilenm)
if not keepfile:
os.remove(pdbfilenm)
return struct
def download_pdb(pdbcode, datadir, downloadurl="http://files.rcsb.org/download/"):
"""
Downloads a PDB file from the Internet and saves it in a data directory.
:param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
:param datadir: The directory where the downloaded file will be saved
:param downloadurl: The base PDB download URL, cf.
`https://www.rcsb.org/pages/download/http#structures` for details
Note that the unencrypted HTTP protocol is used by default
to avoid spurious OpenSSL errors...
:return: the full path to the downloaded PDB file or None if something went wrong
"""
pdbfn = pdbcode + ".pdb"
url = downloadurl + pdbfn
outfnm = os.path.join(datadir, pdbfn)
try:
urllib.request.urlretrieve(url, outfnm)
return outfnm
except Exception as err:
# all sorts of things could have gone wrong...
print(str(err), file=sys.stderr)
return None
def read_pdb(pdbcode, pdbfilenm):
"""
Read a PDB structure from a file.
:param pdbcode: A PDB ID string
:param pdbfilenm: The PDB file
:return: a Bio.PDB.Structure object or None if something went wrong
"""
try:
pdbparser = Bio.PDB.PDBParser(QUIET=True) # suppress PDBConstructionWarning
struct = pdbparser.get_structure(pdbcode, pdbfilenm)
return struct
except Exception as err:
print(str(err), file=sys.stderr)
return None
def extract_seqrecords(pdbcode, struct):
"""
Extracts the sequence records from a Bio.PDB structure.
:param pdbcode: the PDB ID of the structure, needed to add a sequence ID to the result
:param struct: a Bio.PDB.Structure object
:return: a list of Bio.SeqRecord objects
"""
ppb = Bio.PDB.PPBuilder()
seqrecords = []
for i, chain in enumerate(struct.get_chains()):
# extract and store sequences as list of SeqRecord objects
pps = ppb.build_peptides(chain) # polypeptides
seq = pps[0].get_sequence() # just take the first, hope there's no chain break
seqid = pdbcode + chain.id
seqrec = Bio.SeqRecord.SeqRecord(seq, id=seqid,
description="Sequence #{}, {}".format(i+1, seqid))
seqrecords.append(seqrec)
return seqrecords
def get_calphas(struct):
"""
Extracts the C-alpha atoms from a PDB structure.
:param struct: A Bio.PDB.Structure object.
:return: A list of Bio.PDB.Atom objects representing the C-alpha atoms in `struct`.
"""
calphas = [ atom for atom in struct.get_atoms() if atom.get_fullname() == " CA " ]
return calphas
I am aware that this was an old question but maybe someone still finds this pointer helpful.
I’m trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb
If you scroll down far enough you will get to a line that looks like this:
ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N
The first element “atom” means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, “MET” is the name of the residue, “A” relates to the the type of chain, 1 (the second “1”) is the atom count and then the next 3 numbers are the x-y-z positions in space.
What I need output is something like this (where the “1” below corresponds to the atom count, not the general count):
MET A 1 63.645 97.355 31.526
To make matters more complicated, sometimes the atom count (the second “1” in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:
ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C
While they both refer to residue 21, the biochem isn’t precise enough to get an exact position, so they give two options. Ideally, I would specify “1”, “2” or whatever, but if I just take the first option that would be ok. Finally, for the type of atom (“N”) in my original example, I only want to get those lines with a “CA”.
I’m a newbie to python, and my training is in biostats so I was wondering what’s the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?
I realize it’s a bit to ask, but some guidance would be a ton of help! I’ve programmed all the statistics bits in using R, but now I just need to get my files in the right format!
Thanks!
It looks like a library to parse PDB files already exists. Check out:
http://www.csb.pitt.edu/prody/index.html
Briefly looking through the tutorials, it seems like this is what you’re talking about no?
That’s a long description. I am not sure I got all that correctly 🙂
If the fields (for the lines start with ATOM) are fixed,
you could use a split and some comaprisons.
I’ve used a hash to see if the entry is already seen to eliminate
the duplicate as you wanted.
Hope this would give you a start,
visited = {}
for line in open('3GFT.pdb'):
list = line.split()
id = list[0]
if id == 'ATOM':
type = list[2]
if type == 'CA':
residue = list[3]
type_of_chain = list[4]
atom_count = int(list[5])
position = list[6:8]
if atom_count >= 0:
if type_of_chain not in visited:
visited[type_of_chain] = 1
print residue,type_of_chain,atom_count,' '.join(position)
Will output,
MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590
For large pdb
files with a lot of atoms, there is no blank left between fields, so you cannot use the split
command. Instead, you can use the Protein Data Bank format definition to parse the pdb
file:
with open('min.pdb') as pdbfile:
for line in pdbfile:
if line[:4] == 'ATOM' or line[:6] == "HETATM":
print line
# Split the line
splitted_line = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]
print splitted_line
# To format again the pdb file with the fields extracted
print "%-6s%5s %4s %3s %s%4s %8s%8s%8sn"%tuple(splitted_line)
Below is a sample output:
# Original line:
HETATM10000 O HOH B3257 2.509 40.006 -4.097 1.00 0.00 O
# Fields extracted:
['HETATM', '10000', ' O ', 'HOH', 'B', '3257', ' 2.509', ' 40.006', ' -4.097']
# Reformatted line:
HETATM10000 O HOH B3257 2.509 40.006 -4.097
I am mildly surprised that no one mentioned the Bio.PDB package from BioPython. Writing a PDB parser on your own is a rather serious case of unnecessarily reinventing, I mean reimplementing the wheel.
BioPython is a useful collection of packages for working with other kinds of biological data (e.g. protein or nucleic acid sequences) as well.
Update
I added a few functions I use for downloading and reading PDB files by way of example:
import os
import sys
import urllib.request
import Bio
import Bio.PDB
import Bio.SeqRecord
def download_read_pdb(pdbcode, datadir, keepfile=True):
"""
Downloads a PDB file from the Internet and saves it in a data directory.
Then it reads and returns the structure inside.
:param pdbcode: The standard PDB ID e.g. '3ICB'
:param datadir: The directory where the downloaded file will be saved
:param keepfile: if False, then the downloaded file will be deleted (default: keep the downloaded file)
:return: a Bio.PDB Structure object or None if something went wrong
"""
pdbfilenm = download_pdb(pdbcode, datadir)
if pdbfilenm is None:
return None
struct = read_pdb(pdbcode, pdbfilenm)
if not keepfile:
os.remove(pdbfilenm)
return struct
def download_pdb(pdbcode, datadir, downloadurl="http://files.rcsb.org/download/"):
"""
Downloads a PDB file from the Internet and saves it in a data directory.
:param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
:param datadir: The directory where the downloaded file will be saved
:param downloadurl: The base PDB download URL, cf.
`https://www.rcsb.org/pages/download/http#structures` for details
Note that the unencrypted HTTP protocol is used by default
to avoid spurious OpenSSL errors...
:return: the full path to the downloaded PDB file or None if something went wrong
"""
pdbfn = pdbcode + ".pdb"
url = downloadurl + pdbfn
outfnm = os.path.join(datadir, pdbfn)
try:
urllib.request.urlretrieve(url, outfnm)
return outfnm
except Exception as err:
# all sorts of things could have gone wrong...
print(str(err), file=sys.stderr)
return None
def read_pdb(pdbcode, pdbfilenm):
"""
Read a PDB structure from a file.
:param pdbcode: A PDB ID string
:param pdbfilenm: The PDB file
:return: a Bio.PDB.Structure object or None if something went wrong
"""
try:
pdbparser = Bio.PDB.PDBParser(QUIET=True) # suppress PDBConstructionWarning
struct = pdbparser.get_structure(pdbcode, pdbfilenm)
return struct
except Exception as err:
print(str(err), file=sys.stderr)
return None
def extract_seqrecords(pdbcode, struct):
"""
Extracts the sequence records from a Bio.PDB structure.
:param pdbcode: the PDB ID of the structure, needed to add a sequence ID to the result
:param struct: a Bio.PDB.Structure object
:return: a list of Bio.SeqRecord objects
"""
ppb = Bio.PDB.PPBuilder()
seqrecords = []
for i, chain in enumerate(struct.get_chains()):
# extract and store sequences as list of SeqRecord objects
pps = ppb.build_peptides(chain) # polypeptides
seq = pps[0].get_sequence() # just take the first, hope there's no chain break
seqid = pdbcode + chain.id
seqrec = Bio.SeqRecord.SeqRecord(seq, id=seqid,
description="Sequence #{}, {}".format(i+1, seqid))
seqrecords.append(seqrec)
return seqrecords
def get_calphas(struct):
"""
Extracts the C-alpha atoms from a PDB structure.
:param struct: A Bio.PDB.Structure object.
:return: A list of Bio.PDB.Atom objects representing the C-alpha atoms in `struct`.
"""
calphas = [ atom for atom in struct.get_atoms() if atom.get_fullname() == " CA " ]
return calphas
I am aware that this was an old question but maybe someone still finds this pointer helpful.