Python Vectorization Split String

Question

I want to use vectorization to create a column in a pandas DataFrame that holds the second (last) part of each string in another column, where each string is split on '_'. I tried this code:

df = pd.DataFrame()

df['Var1'] = ["test1_test2","test3_test4"]
df['Var2'] = [[df['Var1'].str.split('_')][0]][0]
df

           Var1  Var2
0   test1_test2 test3
1   test3_test4 test4

This is incorrect: I should get test2 and test4 in rows 0 and 1 of column Var2, respectively.

Asked By: Alan

Answer 1

One option is to use str.extract:

df['Var2'] = df['Var1'].str.extract("_([^_]+)$")
print(df)

Output

          Var1   Var2
0  test1_test2  test2
1  test3_test4  test4

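The regular expression "_([^_]+)$" matches the final underscore and captures the run of non-underscore characters between it and the end of the string, i.e. the last piece of the split.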
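To see what the pattern captures on its own, here is a minimal sketch using the standard-library re module rather than pandas. The helper name last_part and the two extra sample strings ("a_b_c" and "no-underscore") are made up for illustration and do not come from the question.

```python
import re

# Same pattern as in the str.extract call above.
pattern = re.compile(r"_([^_]+)$")

def last_part(text):
    """Return the text after the final underscore, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None

# The first two strings are from the question; the last two are illustrative.
samples = ["test1_test2", "test3_test4", "a_b_c", "no-underscore"]
results = {s: last_part(s) for s in samples}
print(results)
```

A string without an underscore does not match at all. In pandas, str.extract reports such rows as NaN rather than returning the original string.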

Answered By: Dani Mesejo

Answer 2

Use the .str.split('_') method along with .str[-1] to retrieve the second/last part of each string in the column.

Here is the updated code:

import pandas as pd

df = pd.DataFrame()

df['Var1'] = ["test1_test2", "test3_test4"]
df['Var2'] = df['Var1'].str.split('_').str[-1]

print(df)

Output:

          Var1   Var2
0  test1_test2  test2
1  test3_test4  test4

In the above code, df['Var1'].str.split('_') splits each string in the 'Var1' column on the '_' delimiter into a list, and .str[-1] selects the last element of that list for each row.
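If only the last piece is needed, one possible variation is to split from the right at most once with str.rsplit, so strings with several underscores are not split into more pieces than necessary. This is a sketch, not part of the original answer; the sample values "a_b_c" and "plain" are made up for illustration.

```python
import pandas as pd

# "test1_test2" is from the question; the other two values are illustrative.
df = pd.DataFrame({"Var1": ["test1_test2", "a_b_c", "plain"]})

# Split from the right at most once, then take the last element of each list.
df["Var2"] = df["Var1"].str.rsplit("_", n=1).str[-1]

print(df["Var2"].tolist())
```

Unlike the str.extract approach, a string with no underscore comes back unchanged here, because the split yields a one-element list.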

Answered By: Bilesh Ganguly

Answer 3

You can also use apply(), which calls a Python function on each value in the column:

df["Var2"] = df['Var1'].apply(lambda x: x.split("_")[-1])

Output:

          Var1   Var2
0  test1_test2  test2
1  test3_test4  test4

Answered By: Marcelo Paco
