Indexing of strings

Question:

I am trying to produce an output using a procedure that is easiest to explain with an example.

For example, for this SMILES string,

C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]
this is the output I am trying to get.

It counts the branches (which are represented by brackets). In the example above, the first (N) is counted as 1 and the second (N) as 2. The count resets to 0 once it reaches an atom that is not branched (not inside brackets), and it starts again at the next branch. The problem is that I am not getting the expected output. My actual output, the expected output and my code are below. Thanks

Also, I need to make sure cases like CC(CC(C)) are not indexed incorrectly: a nested bracket should neither increase the count nor reset it. That SMILES should give
[0 0 1 1 1].

Another example:
CC(CCC)CCCC
[0 0 1 1 1 0 0 0 0]

For nested brackets I will rerun this process and just start counting from 1.

I am getting this

          SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]

when it should be this

          SMILES        branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC           [0, 0, 0]
2          C1CC1           [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]


import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0
    
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
            branch_count[i] = current_count
            
    return branch_count

def collect_branch_count(smile_list):
    rows = []

    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}

        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)

Asked By: YZman

||

Answers:

This is my solution.

First I replace all C1 with C so that the ring-closure digit is not counted as an atom. Then I count the open brackets. If exactly one bracket is open, a new group starts. Whenever no bracket is open (for example right after a closing bracket), I check whether the next character is an opening bracket, which means a consecutive group follows. If it is not, I reset the counter to 0.

import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0

    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)

        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res

This is the result

df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
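
A possible variation, offered only as a sketch and not as part of the answer above: the `replace('C1', 'C')` step handles ring-closure digit 1 on carbon only, so a string such as "C2CC2(C)C" or "N1CC1(C)C" would still get an extra entry for the digit. Assuming every letter is one atom (which is also what the function above assumes, so two-letter elements like Cl are counted twice), the same loop can skip anything that is not a letter via `str.isalpha()`. The name `branch_index` is hypothetical.

```python
def branch_index(smiles):
    """Return one branch index per letter; brackets and digits are skipped."""
    depth = 0   # number of currently open brackets
    group = 0   # index of the current top-level branch (0 = main chain)
    result = []
    for i, ch in enumerate(smiles):
        if ch == "(":
            depth += 1
            if depth == 1:        # only a top-level bracket starts a new group
                group += 1
        elif ch == ")":
            depth -= 1
        elif ch.isalpha():        # ring-closure digits fall through untouched
            result.append(group)
        # Outside all brackets: reset unless a consecutive branch follows.
        if depth == 0 and i + 1 < len(smiles) and smiles[i + 1] != "(":
            group = 0
    return result


print(branch_index("C2CC2(C)C"))  # [0, 0, 0, 1, 0]
```

This sketch expects the ring-closure digit to come directly after its atom and before any branch, as in "C1(N)(N)".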
Answered By: mosc9575

The loop treats the brackets as characters, so every opening and closing bracket is counted as an atom. You should check whether each character is a letter by using .isalpha(). You also need a separate index (mine is n) that advances only when a character actually gets a value in the output. In your original code the brackets and digits were also given a 0/1, which produced extra entries that you didn’t want. Read my comments for further explanation and run this code yourself to make sure it is correct (though I have already checked it multiple times).
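
To see the mismatch concretely, here is a small illustrative check (a sketch, separate from the corrected function) using the first SMILES from the question. Sizing the output with `len()` gives one slot per character, while filtering with `str.isalpha()` keeps only the letters.

```python
smile = "C(N)(N)CC(N)C"

# Every character, including the brackets, gets a slot when sizing by len().
total_chars = len(smile)

# Only letters survive a str.isalpha() filter.
letters = [ch for ch in smile if ch.isalpha()]

print(total_chars)   # 13
print(len(letters))  # 7
```

The 13 versus 7 difference is exactly the six bracket characters, which matches the extra entries in the output shown in the question. The full corrected function follows.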

import pandas as pd
import numpy as np
from rdkit import Chem


# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0 # Index of the next output slot; it only advances for letters, so it doesn't include brackets or digits
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    bracket_together = 0 # Use this variable for when the brackets are next to each other for less confusing code
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1
        
        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c ==')':
            # Guard the look-ahead so a trailing ')' does not raise an IndexError
            if i + 1 >= len(smile) or smile[i+1] != '(':
                bracket_count = 0
            
            
        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1
            
    return branch_count

def collect_branch_count(smile_list):
    rows = []

    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}

        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)


As you can see, I changed a few things:

  • Instead of doing branch_count = [0] * len(smile) I changed it to:

     ```python
     # This is to make sure that there are no extra numbers (for example for the brackets and other non-alphabetical characters).
     length_smile = 0
     for char in smile:
         if char.isalpha():
             length_smile += 1
     branch_count = [0] * length_smile
     ```
    
Answered By: TZhao