Python Vectorization Split String
Question:
I want to use vectorization to create a column in a pandas data frame that retrieves the second/last part of a string, from each row in a column, that is split on ‘_’. I tried this code:
df = pd.DataFrame()
df['Var1'] = ["test1_test2","test3_test4"]
df['Var2'] = [[df['Var1'].str.split('_')][0]][0]
df
Var1 Var2
0 test1_test2 test3
1 test3_test4 test4
This is obviously incorrect: I should get test2 and test4 in rows 0 and 1 of column Var2, respectively.
Answers:
One option is to use str.extract:
df['Var2'] = df['Var1'].str.extract("_([^_]+)$")
print(df)
Output:
Var1 Var2
0 test1_test2 test2
1 test3_test4 test4
The regular expression "_([^_]+)$" captures the run of non-underscore characters that follows the last underscore, i.e. the final piece of the split.
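To see how that pattern behaves on other inputs, here is a minimal sketch. The sample strings are hypothetical, and expand=False is passed only so that a Series is returned instead of a one-column DataFrame:

```python
import pandas as pd

# Hypothetical sample data: one underscore, two underscores, no underscore
s = pd.Series(["test1_test2", "a_b_c", "nounderscore"])

# With a single capture group, expand=False returns a Series
last_part = s.str.extract(r"_([^_]+)$", expand=False)

# Rows without an underscore do not match and come back as missing values
print(last_part.tolist())
```

Note that a string with no underscore yields a missing value here, whereas splitting and taking the last element would return the whole string.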
Use the .str.split('_') method along with .str[-1] to retrieve the second/last part of each string in the column. The updated code follows:
import pandas as pd
df = pd.DataFrame()
df['Var1'] = ["test1_test2", "test3_test4"]
df['Var2'] = df['Var1'].str.split('_').str[-1]
print(df)
Output:
Var1 Var2
0 test1_test2 test2
1 test3_test4 test4
In the above code, df['Var1'].str.split('_') splits each string in the ‘Var1’ column by the ‘_’ delimiter, and .str[-1] selects the last part of the split string for each row.
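In the question's data the second part and the last part are the same, but they differ once a string contains more than one underscore. A small sketch with hypothetical sample data shows the difference, along with str.rsplit as an alternative that splits only once, from the right:

```python
import pandas as pd

# Hypothetical sample data: the second row has three parts
s = pd.Series(["test1_test2", "a_b_c"])

parts = s.str.split("_")
second = parts.str[1]    # second part of each string
last = parts.str[-1]     # last part of each string

# rsplit with n=1 splits only at the right-most underscore
last_rsplit = s.str.rsplit("_", n=1).str[-1]

print(second.tolist())
print(last.tolist())
print(last_rsplit.tolist())
```

Which one to use depends on whether "second" or "last" is really what is wanted when a string has more than two parts.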
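You can also use apply(), although it calls a Python function on each element rather than using the vectorized string methods: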
df["Var2"] = df['Var1'].apply(lambda x: x.split("_")[-1])
df
Output:
Var1 Var2
0 test1_test2 test2
1 test3_test4 test4
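One practical difference between the two approaches is how they treat missing values. The sketch below assumes a column that contains a None entry (not part of the original question). The .str accessor passes missing values through, while a plain lambda calling x.split would raise an error on them, so the function given to apply() needs a guard. The helper name last_part is hypothetical:

```python
import pandas as pd

# Hypothetical sample data with a missing value in the middle
s = pd.Series(["test1_test2", None, "test3_test4"])

# The .str accessor propagates missing values without raising
vectorized = s.str.split("_").str[-1]

# With apply(), non-string entries must be handled explicitly
def last_part(x):
    return x.split("_")[-1] if isinstance(x, str) else x

applied = s.apply(last_part)

print(vectorized.tolist())
print(applied.tolist())
```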