Indexing of strings
Question:
I am trying to compute an output using the following procedure, which is easiest to explain with an example.
For example, for the SMILES string
C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]
this is the output I am trying to get.
It counts the branching (which is represented by brackets). So for the above example, it counts the first (N) as 1, then the second (N) as 2. The count is reset once it reaches an atom that is not branched (not bracketed). Unbranched atoms get 0, and the count begins and resets again at the next branch. The problem is that I am not getting the expected output. Below are my outputs, the expected outputs, and my code. Thanks.
Also, I need to ensure that cases like CC(CC(C)) are not incorrectly indexed. The nested bracket should not add to the count, reset it, or keep it counting up. That SMILES string should have an output of
[0 0 1 1 1].
Another example:
CC(CCC)CCCC
[0 0 1 1 1 0 0 0 0]
For nested brackets I will rerun this process and just start counting from 1.
I am getting this:
          SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]
when it should be this:
          SMILES           branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC              [0, 0, 0]
2          C1CC1              [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]
import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
            branch_count[i] = current_count
    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)
Answers:
This is my solution.
First I replace all C1 with C, so that each atom is a single letter. Then I count the open brackets. If exactly one bracket is open, I have a new group. If I have a closing bracket, I check whether the next character is an opening one, to see if there is a consecutive group. If not, I reset the counter to 0.
import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0
    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)
        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res
This is the result:
df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
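One limitation worth noting: the replace('C1', 'C') step only removes the ring-closure digit 1 when it follows a carbon, so a string such as N1CCC1(C)C would still get an entry for the leftover digit. A minimal sketch of a variant that ignores every non-letter character instead is shown below. The function name branch_groups is made up for illustration, and the sketch assumes single-letter atoms, no square-bracket atoms, and that digits only appear as ring-closure labels.

```python
def branch_groups(smiles):
    """Label each atom letter with its top-level branch group (0 = main chain).

    Hypothetical helper: assumes single-letter atoms and that digits are
    ring-closure labels, which are skipped rather than labelled.
    """
    depth = 0    # number of currently open brackets
    group = 0    # running count of consecutive top-level branches
    labels = []
    for i, ch in enumerate(smiles):
        if ch == "(":
            depth += 1
            if depth == 1:  # a new top-level branch starts here
                group += 1
        elif ch == ")":
            depth -= 1
        elif ch.isalpha():
            labels.append(group)
        # Any other character (for example a digit) gets no label.
        # Back on the main chain: reset unless another branch opens next.
        if depth == 0 and smiles[i + 1 : i + 2] != "(":
            group = 0
    return labels


print(branch_groups("C(N)(N)CC(N)C"))  # [0, 1, 2, 0, 0, 1, 0]
print(branch_groups("N1CCC1(C)C"))     # [0, 0, 0, 0, 1, 0]
print(branch_groups("CC(CC(C))"))      # [0, 0, 1, 1, 1]
```

The slice smiles[i + 1 : i + 2] is used so that the look-ahead is safe on the last character, where it simply yields an empty string.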
The loop treats the brackets as characters, so your code counts every opening and closing bracket as if it were an atom. You should check whether each character is a letter by using .isalpha(). You should also keep a separate counter (mine is n) that decides whether a character gets a number in the output. In your original code, the brackets and digits were also given a 0 or 1, which meant you had extra entries that you didn't want. Read my comments for further explanation, and run this code yourself to make sure it is correct (though I have already checked it several times).
import pandas as pd
import numpy as np
from rdkit import Chem

# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0  # Index of the next slot in branch_count, so brackets and digits get no entry
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1
        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c == ')':
            # Guard the look-ahead so a trailing ')' does not raise IndexError
            if i + 1 >= len(smile) or smile[i + 1] != '(':
                bracket_count = 0
        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count  # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1
    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)
As you can see, I changed a few things:
- Instead of doing branch_count = [0] * len(smile), I changed it to:
```python
# This makes sure there are no extra numbers (for example, for the brackets and other non-alphabetical characters).
length_smile = 0
for char in smile:
    if char.isalpha():
        length_smile += 1
branch_count = [0] * length_smile
```
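The length computation above can also be written as a single expression. The short sketch below (with a made-up helper name, count_atom_letters) compares it with len() and shows one caveat: .isalpha() counts letters, not atoms, so a two-letter element symbol such as Cl would take up two entries.

```python
def count_atom_letters(smiles):
    """Count alphabetic characters, ignoring brackets and ring-closure digits."""
    return sum(ch.isalpha() for ch in smiles)


smiles = "C1CC1(C)C"
print(len(smiles))                 # 9: digits and brackets are included
print(count_atom_letters(smiles))  # 5: letters only
print(count_atom_letters("CCl"))   # 3: "Cl" is one atom but two letters
```

For the single-letter atoms used in this question the letter count and the atom count agree, so the caveat only matters if the input may contain elements like Cl or Br.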