Python – Create indicator for whether a variable includes a name

Question:

I have a two-column dataframe with a "text" variable and a "name" variable. I am trying to create a new variable that takes a value of 1 if the name appears in the text, and 0 otherwise. Here is an example of the dataframe:

text name
(d) On August 24, 2004, the Board of Directors of Corinthian Colleges, Inc. (the "Company") elected a new director, Hank Adler. Adler, 58, is a certified public accountant who retired in 2003 after 30 years with Deloitte & Touche, LLP, where he served as both a Tax and Primary Client Service Partner. Adler currently teaches accounting courses at Chapman University in Orange, California. A press release dated August 30, 2004, announcing Adler’s election to the Companys Board of Directors, is attached as Exhibit 99.1 to this report and is incorporated herein by reference. Adler
On March 28, 2005, the Company announced that acting chief executive officer Richard D. Spurr has been appointed the Companys Chief Executive Officer. Mr. Spurr accepted the role of acting CEO in February 2005 as the result of former CEO John A. Ryan relinquishing the role, as described in the Companys report on Form 8-K, filed February 10, 2005. Mr. Spurr will also retain his prior responsibilities as the Companys President and Chief Operating Officer. Smith

Here is what I would like to create:

text name match
(d) On August 24, 2004, the Board of Directors of Corinthian Colleges, Inc. (the "Company") elected a new director, Hank Adler. Adler, 58, is a certified public accountant who retired in 2003 after 30 years with Deloitte & Touche, LLP, where he served as both a Tax and Primary Client Service Partner. Adler currently teaches accounting courses at Chapman University in Orange, California. A press release dated August 30, 2004, announcing Adler’s election to the Companys Board of Directors, is attached as Exhibit 99.1 to this report and is incorporated herein by reference. Adler 1
On March 28, 2005, the Company announced that acting chief executive officer Richard D. Spurr has been appointed the Companys Chief Executive Officer. Mr. Spurr accepted the role of acting CEO in February 2005 as the result of former CEO John A. Ryan relinquishing the role, as described in the Companys report on Form 8-K, filed February 10, 2005. Mr. Spurr will also retain his prior responsibilities as the Companys President and Chief Operating Officer. Smith 0

Here is my code as of now:

    pd_05['match'] = np.where(pd_05['text'].str.contains(pd_05['name']), 0, 1)

I am getting the following error: "unhashable type: ‘Series’"

Please let me know how I can fix my code, or if there is an easier way to get this done.

Asked By: nillawafer

Answers:

I’m not sure if there is a more efficient way, but apply() should work. With axis=1, apply() calls the lambda function once per row and passes that row of pd_05 in as a Series, so x["name"] and x["text"] are ordinary strings and the in operator does a plain substring test.

    pd_05["match"] = pd_05.apply(lambda x: x["name"] in x["text"], axis=1).astype(int)
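
To try the line above without the original data, here is a minimal sketch on a two-row toy frame. The shortened text values are made up for illustration; only the column names and the variable name pd_05 come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's pd_05
pd_05 = pd.DataFrame({
    "text": [
        "The Board elected a new director, Hank Adler.",
        "Richard D. Spurr has been appointed Chief Executive Officer.",
    ],
    "name": ["Adler", "Smith"],
})

# Each row reaches the lambda as a Series; `in` is a plain substring test
pd_05["match"] = pd_05.apply(lambda x: x["name"] in x["text"], axis=1).astype(int)

print(pd_05[["name", "match"]])
```

The first row should get 1 and the second 0, which matches the desired output in the question.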

To handle cases where the value of name is an empty string (an empty string counts as a substring of any string, so these rows would otherwise get a 1):
If you want these instances to have a value of 0 in match, you can change the lambda function to lambda x: (x["name"] in x["text"]) and (x["name"] != "").

If you want these instances to have a missing value in match instead, you can use where(). Note that the column then becomes float, because a default integer column cannot hold NaN:

    pd_05["match"] = pd_05.apply(lambda x: x["name"] in x["text"], axis=1).astype(int).where(pd_05["name"] != "")
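
The two empty-name options can be compared side by side. This is only a sketch on made-up rows; the names found, match_zero and match_nan are hypothetical helpers, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data; the last row has an empty name
pd_05 = pd.DataFrame({
    "text": ["Hank Adler was elected.", "Mr. Spurr was appointed.", "No name on this row."],
    "name": ["Adler", "Smith", ""],
})

# Raw substring test: the empty name "matches" because "" is in every string
found = pd_05.apply(lambda x: x["name"] in x["text"], axis=1)

# Option 1: treat empty names as no match (0)
match_zero = (found & (pd_05["name"] != "")).astype(int)

# Option 2: treat empty names as missing (NaN); the result is a float column
match_nan = found.astype(int).where(pd_05["name"] != "")

print(match_zero.tolist())
print(match_nan.tolist())
```

Which option fits depends on whether an empty name should mean "no match" or "unknown" in the later analysis.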

To make the whole process case-insensitive, you can change everything to lowercase. You can do this inside the lambda function with the plain string method (x["name"].lower(), not .str.lower(), since each value in the row is an ordinary string), but it’s probably more efficient to create new lowercase columns once and then use them in the lambda function, assuming no memory constraints.

    pd_05["name_lowercase"] = pd_05["name"].str.lower()
    pd_05["text_lowercase"] = pd_05["text"].str.lower()
    pd_05["match"] = pd_05.apply(lambda x: x["name_lowercase"] in x["text_lowercase"], axis=1).astype(int)
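
If the extra columns are not wanted, the lowercasing can also happen inside the lambda. A minimal sketch, assuming every value in both columns is a string (a missing value would raise an error here); the toy rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with names in a different case than the text
pd_05 = pd.DataFrame({
    "text": [
        "The Board elected a new director, Hank Adler.",
        "Mr. Spurr will retain his prior responsibilities.",
    ],
    "name": ["ADLER", "smith"],
})

# Inside the lambda each value is a plain str, so use str.lower() directly
pd_05["match"] = pd_05.apply(
    lambda x: x["name"].lower() in x["text"].lower(), axis=1
).astype(int)

print(pd_05["match"].tolist())
```

This lowercases both strings on every row, so it trades some speed for not keeping two extra columns around.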
Answered By: Leo

Use a list comprehension and convert the boolean to an integer; this should be the fastest option:

    df['match'] = [int(n in t) for n, t in zip(df['name'], df['text'])]
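
One caveat: the comprehension raises a TypeError if either column holds a missing value, because in needs strings on both sides. The following is a sketch of a guarded variant, assuming missing values should simply count as no match; the isinstance checks and the toy rows are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data; the last row has a missing name
df = pd.DataFrame({
    "text": ["Hank Adler was elected.", "Mr. Spurr was appointed.", "Mr. Spurr stays on."],
    "name": ["Adler", "Smith", None],
})

# Only run the substring test when both values are real strings
df["match"] = [
    int(isinstance(n, str) and isinstance(t, str) and n in t)
    for n, t in zip(df["name"], df["text"])
]

print(df["match"].tolist())
```

If the columns are known to contain only strings, the shorter one-liner above is enough.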
Answered By: mozway