How can I label a column of strings into numbered groups based on another column containing substrings?

Question:

The 1st column contains around 4920 different chemical compounds.

For example:

0              Ag(AuS)2      
1            Ag(W3Br7)2      
2      Ag0.5Ge1Pb1.75S4     
3     Ag0.5Ge1Pb1.75Se4     
4                Ag2BBr      
...                 ...      
4916             ZrTaN3     
4917               ZrTe      
4918             ZrTi2O      
4919             ZrTiF6      
4920               ZrW2  

The 2nd column lists all the elements of the periodic table in order of atomic number:

0      H
1     He
2     Li
3     Be
4      B
..   ...
113   Fl
114  Uup
115   Lv
116  Uus
117  Uuo

How can I classify the first column into groups, replacing each compound with the atomic number of its first element as given by column 2?

The atomic number of Ag = 47
The atomic number of Zr = 40

    0            47      
    1            47      
    2            47     
    3            47    
    4            47      
    ...                 ...      
    4916         40    
    4917         40       
    4918         40         
    4919         40        
    4920         40     
Asked By: asdf123

||

Answers:

Since the first element's symbol can be one or more letters long, the simplest solution is a regex that grabs the leading capital letter and any lowercase letters that follow it.
For example:

import re

compounds = ["Ag(AuS)2", "HTiF", "ZrTaN3"]

for compound in compounds:
    # One capital letter followed by any lowercase letters, anchored at the start
    match = re.match(r"[A-Z][a-z]*", compound)
    if match:
        first_element = match.group(0)
        print(first_element)

This prints the first element of each compound: Ag, H and Zr.
Note: If there are some more complex compounds and you need to adjust your regex, I recommend using https://regex101.com/ as a playground.
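As one illustration of such an adjustment, the sketch below assumes some formulas might start with a bracket, e.g. "(NH4)2SO4" (a hypothetical input, not taken from the question's data). It swaps re.match for re.search so that leading non-letter characters are skipped. The helper name first_element is made up for this example.

```python
import re

# First capital letter followed by any lowercase letters, e.g. "Ag" or "H".
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]*")

def first_element(compound):
    """Return the first element symbol found in a formula, or None."""
    # search() scans the whole string, so a leading "(" or "[" is skipped;
    # match() would only look at position 0.
    match = ELEMENT_PATTERN.search(compound)
    return match.group(0) if match else None

for compound in ["Ag(AuS)2", "(NH4)2SO4", "1234"]:
    print(compound, "->", first_element(compound))
```

Returning None for strings without any element symbol makes bad rows easy to spot later instead of raising in the middle of a loop.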

Once you have the first element, it just needs to be matched against the second column. The easiest way is to turn that column into a dictionary resembling:

{"H": 0, "He": 1, "Li": 2, ...}

which lets you get the element's index by calling dict_with_elements.get(first_element). Because the column is indexed from 0, the atomic number is that index plus 1 (Ag sits at index 46 and has atomic number 47).
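A minimal sketch of the lookup step, assuming both columns are available as plain Python lists. The element list is truncated to five symbols, the compounds are made-up examples, and the names atomic_number and classify are hypothetical. Here the dictionary stores the atomic number directly by starting enumerate at 1.

```python
import re

# Stand-in for the second column: symbols in atomic-number order (truncated).
elements = ["H", "He", "Li", "Be", "B"]

# Map each symbol to its atomic number; enumerate starts at 1 because
# the column's 0-based index is one less than the atomic number.
atomic_number = {symbol: number for number, symbol in enumerate(elements, start=1)}

def classify(compound):
    """Return the atomic number of a compound's first element, or None."""
    match = re.match(r"[A-Z][a-z]*", compound)
    if match is None:
        return None
    # .get() returns None for symbols missing from the (truncated) list.
    return atomic_number.get(match.group(0))

compounds = ["LiH", "BeH2", "B2H6", "ZrW2"]
groups = [classify(c) for c in compounds]
print(groups)  # [3, 4, 5, None]
```

If the columns live in a pandas DataFrame, Series.str.extract with the same pattern followed by Series.map with this dictionary should give the same result without an explicit loop, though that variant is not shown or tested here.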

From there on, the rest is just looping and writing data. I hope this helps.

Answered By: Matija Pul