Converting an Isometric SMILE into its atoms and non-hydrogen neighbours
Question:
Hope whoever is reading this is well.
I have an issue with my code. I am trying to convert an isometric SMILE, a descriptor of a molecule, into its atomic groups and neightbours.
My code is below.
import rdkit
from rdkit import Chem
def collect_bonded_atoms(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
for atom in mol.GetAtoms():
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}-{}".format(atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
return atom_counts
smile = "CC(C)(C)C(=O)O"
print(collect_bonded_atoms(smile))
And output is
{'C-C-': 3, 'C-C-C-C-C-': 1, 'C-C-O-O=': 1, 'O-C=': 1, 'O-C-': 1}
Whilst this works well for this molecule’s SMILE, though preferably I would’ve liked it to be structured as,
{'C-C-': 3, 'C-C(-C)(-C)-C-': 1, 'C-C-O(=O)': 1, 'O=C': 1, 'O-C-': 1}
I can’t figure out how to fix this. This is a side issue.
The main issue I have is when using this molecule
smile = "CCCCCCCCN1C=C[N+](=C1)C.F[P-](F)(F)(F)(F)F"
My output is very wrong. This is my output.
{'C-C-': 1, 'C-C-C-': 6, 'C-C-N-': 1, 'N-C-C#C#': 2, 'C-C#N#': 2, 'C-N#N#': 1, 'C-N-': 1, 'F-P-': 6, 'P-F-F-F-F-F-F-': 1}
First is that double bonds (bond_order == 2) are shown as a #. Second where it shows the number 1 in the molecule SMILE, that represents a ring. This means that it connects to the next 1. In the output, it is all over the place.
Can I please have some guidance on this?
Thanks
Advice on how to fix it, or even better a modification. The side issue isn’t as important, but if possible same for it.
Answers:
You’re getting the #
because those bonds are part of an aromatic ring, making their bond type AROMATIC instead of a SINGLE or DOUBLE. In RdKit, AROMATIC bonds have a bond order of 1.5 so all those bonds are ending up in the else
loop.
To fix this, you can do two things:
- Change your
if-else
condition to acknowledge the 1.5 bond order
- Kekulize the
mol
object to remove conjugation due to the aromatic ring and make all the bonds static. You can do this by updating your code as:
mol = Chem.MolFromSmiles(smile)
Chem.Kekulize(mol)
Hope whoever is reading this is well.
I have an issue with my code. I am trying to convert an isometric SMILE, a descriptor of a molecule, into its atomic groups and neightbours.
My code is below.
import rdkit
from rdkit import Chem
def collect_bonded_atoms(smile):
mol = Chem.MolFromSmiles(smile)
atom_counts = {}
for atom in mol.GetAtoms():
neighbors = [(neighbor.GetSymbol(), bond.GetBondType())
for neighbor, bond in zip(atom.GetNeighbors(), atom.GetBonds())]
neighbors.sort()
key = "{}-{}".format(atom.GetSymbol(), "".join(f"{symbol}{'-' if bond_order == 1 else '=' if bond_order == 2 else '#'}" for symbol, bond_order in neighbors))
atom_counts[key] = atom_counts.get(key, 0) + 1
return atom_counts
smile = "CC(C)(C)C(=O)O"
print(collect_bonded_atoms(smile))
And output is
{'C-C-': 3, 'C-C-C-C-C-': 1, 'C-C-O-O=': 1, 'O-C=': 1, 'O-C-': 1}
Whilst this works well for this molecule’s SMILE, though preferably I would’ve liked it to be structured as,
{'C-C-': 3, 'C-C(-C)(-C)-C-': 1, 'C-C-O(=O)': 1, 'O=C': 1, 'O-C-': 1}
I can’t figure out how to fix this. This is a side issue.
The main issue I have is when using this molecule
smile = "CCCCCCCCN1C=C[N+](=C1)C.F[P-](F)(F)(F)(F)F"
My output is very wrong. This is my output.
{'C-C-': 1, 'C-C-C-': 6, 'C-C-N-': 1, 'N-C-C#C#': 2, 'C-C#N#': 2, 'C-N#N#': 1, 'C-N-': 1, 'F-P-': 6, 'P-F-F-F-F-F-F-': 1}
First is that double bonds (bond_order == 2) are shown as a #. Second where it shows the number 1 in the molecule SMILE, that represents a ring. This means that it connects to the next 1. In the output, it is all over the place.
Can I please have some guidance on this?
Thanks
Advice on how to fix it, or even better a modification. The side issue isn’t as important, but if possible same for it.
You’re getting the #
because those bonds are part of an aromatic ring, making their bond type AROMATIC instead of a SINGLE or DOUBLE. In RdKit, AROMATIC bonds have a bond order of 1.5 so all those bonds are ending up in the else
loop.
To fix this, you can do two things:
- Change your
if-else
condition to acknowledge the 1.5 bond order - Kekulize the
mol
object to remove conjugation due to the aromatic ring and make all the bonds static. You can do this by updating your code as:
mol = Chem.MolFromSmiles(smile)
Chem.Kekulize(mol)