Indexing of strings (molecule SMILES)

Question

In addition to the question below, could someone advise on how to handle second-order parentheses? The process is the same, but only atoms inside the second level of nesting are indexed (the code below handles the first level). Ideally the order would be easy to adjust. By second order I mean (()): for example, in C(C(C))C, every atom is 0 except the C enclosed by two pairs of parentheses. The same rules otherwise apply. Much appreciated.

Hope you are well. I have code that indexes the atoms inside parentheses (branches). Every atom that is not inside parentheses should be 0.
For example:

CCC(C)CC
[0 0 0 1 0 0]

CC(CCC)CC
[0 0 1 2 3 0 0]

CC(C)(C)C
[0 0 1 1 0]

CC(CCC)(C)C
[0 0 1 2 3 1 0]

As the examples above show, I number the atoms inside each pair of parentheses in order. An atom inside nested parentheses (a branch within a branch) takes the value of the atom just before the nested branch, that is, the last atom directly inside the outer parentheses.

For example, C(C(C)C)C would give [0 1 1 2 0].

The code works for all cases except ones like the following. Below are my desired output, the incorrect output, and my code. Thanks.

Desired output

CCC(CC(C)CC)(C)C  [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
             ^                             ^

Incorrect output

CCC(CC(C)CC)(C)C  [0, 0, 0, 1, 2, 2, 3, 4, 4, 0]
             ^                             ^

import pandas as pd
from rdkit import Chem

def smile_grouping(s):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
        elif letter == ')':
            if open_brackets == 1:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == 0:
                branch_start_index = None
        elif letter not in ['[', ']', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'i', 'e']:
            if open_brackets == 1:
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets == 0:
                res.append(0)
            elif open_brackets > 1:
                res.append(last_non_nested_group)

    mol = Chem.MolFromSmiles(s)
    num_atoms = mol.GetNumAtoms()
    while len(res) < num_atoms:
        res.append(0)

    return res

df = pd.DataFrame(
    {'SMILES': [
        "CCC(CC(C)CC)(C)C",
        "CCC[I+](C)(C)C",
        "CCC(CCC(C))C"
    ]})
df['Indexed SMILES'] = df['SMILES'].apply(smile_grouping)
print(df)

Asked By: YZman

Answer 1

You need a way to reset the group counter once the last closing bracket has been left. I added an else branch to your open-bracket logic:

        if open_brackets == 1:
            branch_start_index = i
            if last_non_nested_group is not None:
                group_counter = last_non_nested_group + 1
            else:
                group_counter = 1

This resets the group counter to 1 when a new top-level branch opens after the previous one has closed.
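For reference, here is a sketch of the question's function with that else branch merged in. It is not the poster's exact code: the RDKit padding step is left out so the snippet runs with the standard library alone, and the names smile_grouping_fixed and SKIPPED are made up for this example.

```python
# Characters the question's code ignores (brackets, charges, digits, 'i', 'e').
SKIPPED = set("[]+-0123456789ie")

def smile_grouping_fixed(s):
    """Question's first-order logic plus the else branch from this answer.

    Assumption: the RDKit-based zero padding is omitted here.
    """
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
                else:
                    # The added reset: a new top-level branch starts at 1 again.
                    group_counter = 1
        elif letter == ')':
            if open_brackets == 1:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == 0:
                branch_start_index = None
        elif letter not in SKIPPED:
            if open_brackets == 1:
                # Every atom after the first one in the branch bumps the counter.
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets == 0:
                res.append(0)
            else:
                # Nested atoms repeat the value of the last first-order atom.
                res.append(last_non_nested_group)
    return res

print(smile_grouping_fixed("CCC(CC(C)CC)(C)C"))  # [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
print(smile_grouping_fixed("C(C(C)C)C"))         # [0, 1, 1, 2, 0]
```

The first printed list matches the desired output from the question; the second matches the nested example given there.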

Answered By: beh aaron

Answer 2

Here is my solution. It works for any order of 1 or higher (I have separate code for order 0).

import pandas as pd
from rdkit import Chem

def smile_grouping(s, order):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == order:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
        elif letter == ')':
            if open_brackets == order:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == order - 1:
                branch_start_index = None
                group_counter = 1
                last_non_nested_group = None
            continue
        elif letter not in ['[', ']', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'i', 'e']:
            if open_brackets == order:
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets < order:
                res.append(0)
            elif open_brackets > order:
                res.append(last_non_nested_group)

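    # Pad with zeros if RDKit reports more atoms than were indexed above.
    # MolFromSmiles returns None for SMILES it cannot parse, so guard against that.
    mol = Chem.MolFromSmiles(s)
    num_atoms = mol.GetNumAtoms() if mol is not None else 0
    while len(res) < num_atoms:
        res.append(0)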

    return res

df = pd.DataFrame(
    {'SMILES': [
        "CCC(CC(C)CC)(C)C",
        "CCC[I+](C)(C)C",
        "CCC(CCC(C))C"
    ]})
df['Indexed SMILES'] = df['SMILES'].apply(lambda x: smile_grouping(x, order=1))
print(df)
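
If the per-atom bookkeeping gets hard to follow, the same idea can be sketched by tracking only the nesting depth and a running position inside the current branch at the chosen order. This is an alternative sketch, not the code above: branch_positions and SKIPPED are hypothetical names, RDKit is not used, order is assumed to be 1 or higher, and positions are counted per atom character rather than by string index, so results may differ from the code above for bracket atoms such as [I+] placed directly after an opening parenthesis.

```python
# Same ignored characters as in the code above (brackets, charges, digits, 'i', 'e').
SKIPPED = set("[]+-0123456789ie")

def branch_positions(smiles, order=1):
    """Index atoms inside branches at nesting level `order` (assumed >= 1).

    Atoms shallower than `order` get 0; atoms nested deeper repeat the
    position of the last atom seen at exactly `order`.
    """
    res = []
    depth = 0       # how many parentheses are currently open
    position = 0    # running count inside the current branch at `order`
    for ch in smiles:
        if ch == '(':
            depth += 1
            if depth == order:
                position = 0  # each new branch at the target depth restarts
        elif ch == ')':
            depth -= 1
        elif ch not in SKIPPED:
            if depth < order:
                res.append(0)
            else:
                if depth == order:
                    position += 1
                res.append(position)
    return res

print(branch_positions("CCC(CC(C)CC)(C)C", order=1))  # [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
print(branch_positions("C(C(C))C", order=2))          # [0, 0, 1, 0]
```

With order=2, only the atom wrapped in two pairs of parentheses is non-zero, which is the second-order behaviour asked for in the question.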

Answered By: YZman
