Adding column by substring from another column in Pandas

Question:

I have a data frame with one column,

import pandas as pd

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

I want to add another column containing a substring of files. The final data frame should look like this:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"], 'stain': ["PAS", "HE1", "HE1"]})

I tried:

DF["Stain"] = DF.apply(lambda row: row.files[re.search(r'[a-zA-Z]{2,}', row.files).start():], axis=1)

But it raised:

AttributeError: 'NoneType' object has no attribute 'start'

What should I do?

Asked By: pill45


Answers:

If you want to extract the last three characters from the files column, you can do:

DF["stain"] = DF["files"].str[-3:]
print(DF)

Prints:

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1

EDIT: Using a regular expression to extract the stain:

DF["stain"] = DF["files"].str.extract(r"^(?:.{2,})-\d*(.+)")
print(DF)
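
As for the AttributeError in the question: re.search returns None when nothing matches, and None has no .start(). The three sample rows all contain a run of two or more letters, so the error presumably comes from other rows in the real data. The sketch below uses a made-up third file name without a stain suffix to reproduce that situation, and shows that str.extract yields NaN for such rows instead of raising. The pattern is an assumption modelled on the one in the question.

```python
import re

import pandas as pd

# Hypothetical data: the last name has no run of two or more letters.
df = pd.DataFrame({"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344"]})

# re.search returns None on a miss, which is what breaks .start().
match = re.search(r"[a-zA-Z]{2,}", "S18-00344")
print(match)

# Capture from the first run of 2+ letters to the end of the string.
# Rows that do not match become NaN rather than raising an error.
df["stain"] = df["files"].str.extract(r"([a-zA-Z]{2,}.*)$", expand=False)
print(df)
```

Rows left as NaN can then be inspected with df[df["stain"].isna()] to see which file names do not follow the expected naming scheme.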

Answered By: Andrej Kesely

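Here’s one approach using the str accessor. It splits files into the part before the stain and the stain itself, overwriting the files column:

DF[["files", "stain"]] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")
print(DF)

        files stain
0  S18-000344   PAS
1  S18-001850   HE1
2   S18-00344   HE1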

If you need to keep the original values in the files column, select only the second capture group:

DF["stain"] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")[1]
print(DF)

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
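
A possible variation, offered only as a sketch: with named capture groups, str.extract returns a frame whose columns carry the group names, so the stain can be selected by name instead of by position with [1]. The group name case_id is an illustrative choice and does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

# Named groups become the column names of the extracted frame.
# "case_id" is a hypothetical name for the part before the stain.
parts = df["files"].str.extract(r"(?P<case_id>.+\d)(?P<stain>\D.+)")

# Keep the files column untouched and add only the stain.
df["stain"] = parts["stain"]
print(df)
```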

Answered By: Just James