How can I classify a column of strings with true and false values by comparing it with another column of strings?

Question:

I have a column of strings, which I'll call "compounds":

Composition (column title)

ZrMo3

Gd(CuS)3

Ba2DyInTe5

I have another column that contains the symbols of metal elements from the periodic table; I'll call that column "metals":

Elements (column title)

Li

Be

Na

The objective is to check each string in "compounds" against every string in "metals"; if any string from "metals" appears in the compound, the compound is classified as True. Any ideas on how I can code this?

Example: (if "metals" has Zr, Ag, and Te)

ZrMo3 True

Gd(CuS)3 False

Ba2DyInTe5 True

I recently tried the code below, but I got all False:

asd = subset['composition'].isin(metals['Elements'])
    
print(asd)

I also tried this code and got all False as well:

subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Asked By: asdf123


Answers:

Assuming you are using pandas, you can loop over the element list inside your lambda with any(), since you need to check every entry of the elements list against each compound string:

import pandas as pd

elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']

df = pd.DataFrame(compounds, columns=['compounds'])
print(df)

output

    compounds
0       ZrMo3
1    Gd(CuS)3
2  Ba2DyInTe5

df['boolean'] = df.compounds.apply(lambda x: any(el in x for el in elements))
print(df)

output

    compounds  boolean
0       ZrMo3    False
1    Gd(CuS)3    False
2  Ba2DyInTe5     True
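
For context on why the two attempts in the question returned all False, here is a minimal sketch. It assumes `metals` is a DataFrame with an `Elements` column, as the question's code suggests; the sample values are made up for illustration. `Series.isin` tests whether each whole value equals one of the listed values, and iterating over a DataFrame yields its column labels rather than its cell values.

```python
import pandas as pd

# Hypothetical sample data mirroring the layout described in the question
subset = pd.DataFrame({'composition': ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']})
metals = pd.DataFrame({'Elements': ['Zr', 'Ag', 'Te']})

# isin compares whole values: 'ZrMo3' is not equal to 'Zr', 'Ag' or 'Te'
exact = subset['composition'].isin(metals['Elements']).tolist()

# Iterating over a DataFrame yields its column labels, not its cell values
labels = list(metals)

# Iterating over the column itself gives the intended substring test
fixed = subset['composition'].apply(
    lambda s: any(el in s for el in metals['Elements'])
).tolist()

print(exact)
print(labels)
print(fixed)
```

So the second attempt was probably testing whether the text 'Elements' occurs in each compound, which it never does.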

If you are not using pandas, you can apply the same lambda function to plain lists with the built-in map function:

out = list(map(lambda x: any(el in x for el in elements), compounds))
print(out)

output

[False, False, True]
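
If the data is in pandas, the same substring test can also be written without a Python-level lambda, using `Series.str.contains` with a regular expression alternation. This is a sketch of an alternative, not the method shown above. Like the lambda version it matches substrings, so a one-letter symbol such as 'S' would also match inside 'Si'.

```python
import re
import pandas as pd

elements = ['Li', 'Be', 'Na', 'Te']
compounds = pd.Series(['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5'])

# Build the pattern 'Li|Be|Na|Te'; re.escape guards against regex metacharacters
pattern = '|'.join(re.escape(el) for el in elements)

# regex=True is the default and is spelled out here for clarity
mask = compounds.str.contains(pattern, regex=True)
print(mask.tolist())
```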

Here is a more complex version, based on the regular expression module re, which also tackles the potential errors @Ezon mentioned. Because this approach loops not only over the elements to compare against a single compound string but also over each constituent of the compound, I wrote two helper functions to make it more readable.

import re
import pandas as pd


def split_compounds(c):
    # remove every character that is not a letter (digits, parentheses)
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split the string at capital letters and join the symbols with '-'
    c_split = '-'.join(re.findall(r'[A-Z][^A-Z]*', c_split))
    return c_split


def compare_compound(compound, element):
    # split the '-'-joined string back into a list of element symbols
    compound_list = compound.split('-')
    # compare whole symbols, so 'S' does not match 'Si'
    return any(element == c for c in compound_list)
    
    
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])

# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]

print(df)

output

    compounds compounds_elements
0        SiO2               Si-O
1  Ba2DyInTe5        Ba-Dy-In-Te
2       ZrMo3              Zr-Mo
3    Gd(CuS)3            Gd-Cu-S


# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any(compare_compound(x, el) for el in elements)
)

print(df)

output

    compounds compounds_elements  boolean
0        SiO2               Si-O    False
1  Ba2DyInTe5        Ba-Dy-In-Te     True
2       ZrMo3              Zr-Mo    False
3    Gd(CuS)3            Gd-Cu-S     True
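
As a sketch, the two helper functions can also be condensed by extracting the element symbols with a single `re.findall` and testing for set overlap. The pattern `[A-Z][a-z]*` assumes every symbol is one capital letter followed by lowercase letters, and the helper name `contains_any` is made up for this illustration.

```python
import re


def contains_any(compound, elements):
    # Extract element symbols: a capital letter followed by lowercase letters
    symbols = set(re.findall(r'[A-Z][a-z]*', compound))
    # True when at least one extracted symbol is also in `elements`
    return not symbols.isdisjoint(elements)


compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']

result = [contains_any(c, elements) for c in compounds]
print(result)
```

With pandas, the same function could be applied to a column with `df.compounds.apply(lambda c: contains_any(c, elements))`.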
Answered By: AlexWach