Separating column string values with varying delimiters
Question:
I have a column in a dataframe that I want to split into two columns. The values in the column are strings with a players’ name followed by their position. Because players have different numbers of names, this becomes a bigger issue.
For example:
- 1 name:
Jorginho Defensive Midfield
- 2 names:
Heung-min Son Left Winger
- 3 names:
Bilal El Khannouss Attacking Midfield
The desired output would be:
Player Position
Jorginho Defensive Midfield
Heung-min Son Left Winger
Bilal El Khannouss Attacking Midfield
I believe this can be done by listing the player positions, however I don’t know how to approach that problem. I tried separating using split()
with a space character as the delimiter, but that doesn’t work unfortunately.
import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
'Heung-min Son Left Winger',
'Harry Wilson Right Winger',
'Bilal El Khannouss Attacking Midfield',
'Eduardo Camavinga Central Midfield',
'Jorginho Defensive Midfield',
'Lewis Patterson Centre-Back',
'Layvin Kurzawa Left-Back',
'Kyle Walker Right-Back',
'Jordan Pickford Goalkeeper']})
positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']
Is this possible to do?
Answers:
You can craft a regex.
import re
regex = '|'.join(map(re.escape, positions))
df['Player'].str.extract(fr'(.*)s*({regex})')
NB. changed 'Central Midfielder'
to 'Central Midfield'
in the list of positions.
Another approach that does not require any list, would be to extract the last 2 words (either separated by spaces, or a dash):
df['Player'].str.extract(r'(.*)s(w+(?:-|s+)w+)')
output:
0 1
0 Richarlison Centre-Forward
1 Heung-min Son Left Winger
2 Harry Wilson Right Winger
3 Bilal El Khannouss Attacking Midfield
4 Eduardo Camavinga Central Midfield
5 Jorginho Defensive Midfield
6 Lewis Patterson Centre-Back
7 Layvin Kurzawa Left-Back
8 Kyle Walker Right-Back
9 Jordan Pickford Goalkeeper
You can use re.sub() function to delete the positions from the player list and then create another list of player. Here the positions are indexed with player so I don’t use for loop for matching the position.
import re
player = [item for item in range(len(positions))]
for i in range(len(positions)):
player[i]=re.sub(positions[i],'',str(df['Player'][i]))
d_frame = {'Player':player,'Position':positions}
df = pd.DataFrame(d_frame)
You can use it to make the new dataframe
I have a column in a dataframe that I want to split into two columns. The values in the column are strings with a players’ name followed by their position. Because players have different numbers of names, this becomes a bigger issue.
For example:
- 1 name:
Jorginho Defensive Midfield
- 2 names:
Heung-min Son Left Winger
- 3 names:
Bilal El Khannouss Attacking Midfield
The desired output would be:
Player Position
Jorginho Defensive Midfield
Heung-min Son Left Winger
Bilal El Khannouss Attacking Midfield
I believe this can be done by listing the player positions, however I don’t know how to approach that problem. I tried separating using split()
with a space character as the delimiter, but that doesn’t work unfortunately.
import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
'Heung-min Son Left Winger',
'Harry Wilson Right Winger',
'Bilal El Khannouss Attacking Midfield',
'Eduardo Camavinga Central Midfield',
'Jorginho Defensive Midfield',
'Lewis Patterson Centre-Back',
'Layvin Kurzawa Left-Back',
'Kyle Walker Right-Back',
'Jordan Pickford Goalkeeper']})
positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']
Is this possible to do?
You can craft a regex.
import re
regex = '|'.join(map(re.escape, positions))
df['Player'].str.extract(fr'(.*)s*({regex})')
NB. changed 'Central Midfielder'
to 'Central Midfield'
in the list of positions.
Another approach that does not require any list, would be to extract the last 2 words (either separated by spaces, or a dash):
df['Player'].str.extract(r'(.*)s(w+(?:-|s+)w+)')
output:
0 1
0 Richarlison Centre-Forward
1 Heung-min Son Left Winger
2 Harry Wilson Right Winger
3 Bilal El Khannouss Attacking Midfield
4 Eduardo Camavinga Central Midfield
5 Jorginho Defensive Midfield
6 Lewis Patterson Centre-Back
7 Layvin Kurzawa Left-Back
8 Kyle Walker Right-Back
9 Jordan Pickford Goalkeeper
You can use re.sub() function to delete the positions from the player list and then create another list of player. Here the positions are indexed with player so I don’t use for loop for matching the position.
import re
player = [item for item in range(len(positions))]
for i in range(len(positions)):
player[i]=re.sub(positions[i],'',str(df['Player'][i]))
d_frame = {'Player':player,'Position':positions}
df = pd.DataFrame(d_frame)
You can use it to make the new dataframe