Adding column by substring from another column in Pandas

Question

I have a DataFrame with one column:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

I want to add another column containing a substring of files. The final DataFrame should look like this:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"], 'stain': ["PAS", "HE1", "HE1"]})

I tried

DF["Stain"] = DF.apply(lambda row: row.files[re.search(r'[a-zA-Z]{2,}', row.files).start():], axis=1)

But it raised

AttributeError: 'NoneType' object has no attribute 'start'

What should I do?

Asked By: pill45


Answer 1

If you want to extract the last three characters from the files column, you can slice with the str accessor:

DF["stain"] = DF["files"].str[-3:]
print(DF)

Prints:

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1

EDIT: Using a regular expression to extract the stain:

DF["stain"] = DF["files"].str.extract(r"^(?:.{2,})-\d*(.+)")
print(DF)
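
The AttributeError in the question arises because re.search returns None for any value that has no match, and None has no .start() method. As a hedged sketch, str.extract sidesteps this by yielding NaN for non-matching rows. The sample data below (whose third value deliberately lacks a stain code) and the pattern are assumptions made for illustration, not taken from the asker's real data:

```python
import pandas as pd

# Hypothetical data: the last value has no trailing stain code.
DF = pd.DataFrame({"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344"]})

# Capture the code that follows the digits after the hyphen.
# The code must start with a letter and run to the end of the string.
# expand=False returns a Series rather than a one-column DataFrame.
DF["stain"] = DF["files"].str.extract(r"-\d+([A-Za-z]\w*)$", expand=False)

# Rows that did not match hold NaN, so they can be inspected or filled explicitly.
unmatched = DF[DF["stain"].isna()]
print(DF)
print(unmatched)
```

Checking the unmatched rows is usually the quickest way to see which file names break the assumed naming scheme.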

Answer 2

Here's one approach using the str accessor. Note that it overwrites files with the part before the stain:

DF[["files", "stain"]] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")

        files stain
0  S18-000344   PAS
1  S18-001850   HE1
2   S18-00344   HE1

If you need to keep the original value in the first column, assign only the second capture group:

DF["stain"] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")[1]

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
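
A possible variation, sketched here under the assumption that both parts are wanted alongside the original column: named capture groups make str.extract return a DataFrame whose columns carry the group names. The name case_id is hypothetical and chosen only for this illustration.

```python
import pandas as pd

DF = pd.DataFrame({"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

# Named groups become the column names of the DataFrame that str.extract returns.
# "case_id" is a hypothetical name; "stain" matches the column the question asks for.
parts = DF["files"].str.extract(r"(?P<case_id>.+\d)(?P<stain>\D.+)")

# Attach the new columns while leaving the original "files" column untouched.
DF = DF.join(parts)
print(DF)
```

This avoids positional indexing such as [1], at the cost of a slightly longer pattern.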

Answered By: Just James
