Split string using Python when same delimiter has different meaning in different records
Question:
I have data that looks like this.
Record number | level 1 person | level 2 person | date | time spent on job |
---|---|---|---|---|
1 | | Tim David, Cameron Green – (Division 1) | 01/01/2023 | 5 |
2 | Tim David – (Division 1) | Mitch, Eli Kin Marsh – (Division 2) | 02/02/2023 | 3 |
3 | | David Warner – (Division 2), Travis Head – (Division 3) | 03/04/2023 | 1 |
4 | Cameron Green – (Division 1) | Tim David – (Division 1) | 07/01/2023 | 2 |
The final aim is to get the total time each person spends on jobs per month, categorised by division, regardless of the person's level. The result should look something like this:
Division | Person | Month | time spent on job |
---|---|---|---|
Division 1 | Tim David | Jan-23 | 7 |
Division 1 | Tim David | Feb-23 | 3 |
Division 1 | Cameron Green | Jan-23 | 7 |
Division 2 | Mitch, Eli Kin Marsh | Feb-23 | 3 |
Division 2 | David Warner | Apr-23 | 1 |
Division 3 | Travis Head | Apr-23 | 1 |
To achieve this, I am first trying to clean the ‘level 2 person’ column. In this column, record 1 contains two people, both in Division 1: Tim David and Cameron Green. Record 2 contains only one person, Mitch, Eli Kin Marsh, who is in Division 2. Record 3 contains two people in two separate divisions: David Warner in Division 2 and Travis Head in Division 3. Record 4 contains only one person, Tim David, in Division 1.
- I am trying to create a new column that captures all the people involved in a particular record, but I am having trouble splitting the names in the ‘level 2 person’ column. I cannot simply split on commas: in Record 2 there is only one person, yet a comma separates the last name from the other names. The list I want for Record 1 is ['Tim David', 'Cameron Green'], and for Record 2 ['Mitch Eli Kin Marsh'].
This is what I did to attempt this part:
import re

def split_names(row):
    string = row['level 2 person']
    pattern = r'([\w\s,-]+)'
    names = re.split(pattern, string)
    name_list = list()
    for name in names:
        replacements = [('-', ''), ('(', ''), (')', '')]
        for char, replacement in replacements:
            if char in name:
                name = name.replace(char, replacement)
        name_list.append(name)
    while "" in name_list:  # remove empty elements
        name_list.remove("")
    return name_list

df['names'] = df.apply(split_names, axis=1)
- I also want to assign a division to people who do not have one listed. This happens when multiple people are in the same division, as in Record 1. I am thinking of creating another column holding a list in which each element is the division of the corresponding person, so for Record 1 the list would be ['Division 1', 'Division 1'].
Answers:
As Mitch is "1 word", would the rule that "2 words" must come before the comma solve this issue?
That can be done with the regex module from PyPI, as it supports variable-length lookbehind assertions.
>>> import regex
>>>
>>> pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
>>> regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
>>> regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']
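For context, the standard-library re module should reject this pattern, since its lookbehinds have to be fixed-width. A quick check, as a sketch (the variable name stdlib_ok is only for illustration):

```python
import re

# Same pattern as above: two "words" must precede the comma.
pattern = r'(?<=[^,\s]+\s+[^,\s]+), '

try:
    re.compile(pattern)
    stdlib_ok = True
except re.error as exc:
    # The stdlib engine only accepts fixed-width lookbehinds.
    stdlib_ok = False
    print(f"re rejected the pattern: {exc}")
```

That is the only reason the third-party module is needed here; the rest of the approach does not depend on it.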
(You could also implement this without regex by splitting on just the comma and merging "1 word" cells with their neighbor.)
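A minimal sketch of that non-regex variant, assuming the same "2 words before the comma" rule. The function name split_people is hypothetical, and the sample strings are taken from the question's table:

```python
def split_people(cell):
    """Split a cell on ', ' and merge one-word pieces with the piece after them."""
    if not cell.strip():
        return []
    parts = cell.split(", ")
    people = []
    buffer = []  # pieces that belong to the same person
    for i, part in enumerate(parts):
        buffer.append(part)
        is_last = i == len(parts) - 1
        # A comma only separates two people when at least two words precede it.
        if is_last or len(part.split()) >= 2:
            people.append(", ".join(buffer))
            buffer = []
    return people


print(split_people("Tim David, Cameron Green – (Division 1)"))
print(split_people("Mitch, Eli Kin Marsh – (Division 2)"))
```

The first call should give two pieces and the second a single piece, because "Mitch" on its own is only one word.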
Modifying your example:
import regex

def split_names(cols):
    # must be 2 "words" before comma space
    pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
    people = {}
    for names in cols:
        names = regex.split(pattern, names)
        if names == ['']:  # empty cell
            continue
        same_division = False
        level = {}
        for name in names:
            if ' - ' in name:
                name, division = name.split(' - ')
                division = division.strip('()')
            else:
                # no division given: it is filled in after the loop
                division = None
                same_division = True
            level[name] = division
        if same_division:
            # give everyone in the cell the last division seen
            level = dict.fromkeys(level, division)
        people.update(level)
    return [
        {'Division': division, 'Person': person} for person, division in people.items()
    ]
Example usage:
columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)
0 [{'Division': 'Division 1', 'Person': 'Tim Dav...
1 [{'Division': 'Division 1', 'Person': 'Tim Dav...
2 [{'Division': 'Division 2', 'Person': 'David W...
3 [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object
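One caveat: the sample data in the question is shown with an en dash (–) before the division, while split_names splits on ' - '. If your real data does contain en dashes, here is a rough sketch of a separator-tolerant variant. The helper names split_person_division and fill_divisions are made up for illustration, and the back-fill rule (inherit the next stated division in the same cell) is my assumption about what you want:

```python
import re

# Hyphen, en dash or em dash surrounded by whitespace.
DASH = re.compile(r"\s+[-\u2013\u2014]\s+")


def split_person_division(piece):
    """Return (person, division); division is None when the piece has none."""
    parts = DASH.split(piece, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip().strip("()")
    return piece.strip(), None


def fill_divisions(pieces):
    """Give people without a division the next division stated in the same cell."""
    pairs = [split_person_division(p) for p in pieces]
    filled = []
    current = None
    # Walk right to left so a missing division picks up the one stated after it.
    for person, division in reversed(pairs):
        current = division or current
        filled.append((person, current))
    return filled[::-1]


print(fill_divisions(["Tim David", "Cameron Green \u2013 (Division 1)"]))
```

For Record 1 this should pair both Tim David and Cameron Green with Division 1.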
You could .explode and .join to turn the result into columns.
import pandas as pd

columns = ['level 1 person', 'level 2 person']
df = (
    df.drop(columns=columns)
      .join(df[columns].apply(split_names, axis=1).rename('People'))
      .explode('People', ignore_index=True)  # one row per person
)
# expand each {'Division': ..., 'Person': ...} dict into its own columns
df = df.join(pd.DataFrame(df.pop('People').values.tolist()))
   Record number       date  time spent on job    Division                Person
0              1 2023-01-01                  5  Division 1             Tim David
1              1 2023-01-01                  5  Division 1         Cameron Green
2              2 2023-02-02                  3  Division 1             Tim David
3              2 2023-02-02                  3  Division 2  Mitch, Eli Kin Marsh
4              3 2023-03-04                  1  Division 2          David Warner
5              3 2023-03-04                  1  Division 3           Travis Head
6              4 2023-07-01                  2  Division 1         Cameron Green
7              4 2023-07-01                  2  Division 1             Tim David
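From there, the monthly totals the question asks for could be sketched with a groupby. The frame below is a hand-built stand-in for the exploded result (the name exploded is hypothetical). I am assuming the dates are day-first, since the expected table maps 03/04/2023 to Apr-23, so they are parsed with an explicit format:

```python
import pandas as pd

# Stand-in for the exploded frame, with dates kept as the question's strings.
exploded = pd.DataFrame(
    {
        "Division": ["Division 1", "Division 1", "Division 1", "Division 2",
                     "Division 2", "Division 3", "Division 1", "Division 1"],
        "Person": ["Tim David", "Cameron Green", "Tim David", "Mitch, Eli Kin Marsh",
                   "David Warner", "Travis Head", "Cameron Green", "Tim David"],
        "date": ["01/01/2023", "01/01/2023", "02/02/2023", "02/02/2023",
                 "03/04/2023", "03/04/2023", "07/01/2023", "07/01/2023"],
        "time spent on job": [5, 5, 3, 3, 1, 1, 2, 2],
    }
)

# Parse as day/month/year so 03/04/2023 is read as 3 April, not 4 March.
exploded["date"] = pd.to_datetime(exploded["date"], format="%d/%m/%Y")
# Month label in the style of the expected table, e.g. "Jan-23".
exploded["Month"] = exploded["date"].dt.strftime("%b-%y")

totals = (
    exploded.groupby(["Division", "Person", "Month"], as_index=False)["time spent on job"]
    .sum()
)
print(totals)
```

Note that the Month labels sort alphabetically; if you need chronological order, grouping on exploded["date"].dt.to_period("M") instead may be a better fit.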