Search columns with list of string for a specific set of text and if the text is found enter new a new string of text in a new column

Question:

I want to search for names in column col_one where I have a list of names in the variable list20. When searching, if the value of col_one matches in list20, put the same name in a new column named new_col

Most of the time, the name will be at the front, such as ZEN, W, WICE, but there will be some names.
with a symbol after the name again, such as ZEN-R, ZEN-W2, ZEN13P2302A

my data

import pandas as pd

list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG',  'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']

data = {
  "col_one": ["ZEN", "WPH", "WICE", "YONG", "K" "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)


# The code you provided will give the result like the picture below. and it's not right
# or--------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or--------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
# or----------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na' #adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)

The result obtained from the code above and it’s not right

enter image description here

Expected Output

enter image description here

Asked By: T_Ner

||

Answers:

Try with str.extract

df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]: 
      col_one   new
0        CFER  CFER
1  ABCP6P45C9   ABC
2      LOU-W5   LOU
3      CFER-R  CFER
4      ABC-W1   ABC
5  LOU13C2465   LOU
Answered By: BENY

One way to do this, less attractive in terms of efficiency, is to use a simple function with a lambda such that:

def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na' #adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df

expected results:

    col_one     new_col
0   CFER        CFER
1   ABCP6P45C9  ABC
2   LOU-W5      LOU
3   CFER-R      CFER
4   ABC-W1      ABC
5   LOU13C2465  LOU
Answered By: adhg

Another approach:

pattern = re.compile(r"|".join(x for x in list20))

(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
Answered By: William Rosenbaum

Try to sort the list first:

pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))

(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
Answered By: William Rosenbaum
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.