Split string using Python when same delimiter has different meaning in different records

Question

I have data that looks like this.

Record number	level 1 person	level 2 person	date	time spent on job
1		Tim David, Cameron Green - (Division 1)	01/01/2023	5
2	Tim David - (Division 1)	Mitch, Eli Kin Marsh - (Division 2)	02/02/2023	3
3		David Warner - (Division 2), Travis Head - (Division 3)	03/04/2023	1
4	Cameron Green - (Division 1)	Tim David - (Division 1)	07/01/2023	2

The final aim is to get the total time each person spends on jobs per month, broken down by division, regardless of the person's level. The result should look something like this:

Division	Person	Month	time spent on job
Division 1	Tim David	Jan-23	7
Division 1	Tim David	Feb-23	3
Division 1	Cameron Green	Jan-23	7
Division 2	Mitch, Eli Kin Marsh	Feb-23	3
Division 2	David Warner	Apr-23	1
Division 3	Travis Head	Apr-23	1

To achieve this, I am first trying to clean the 'level 2 person' column. In this column, record 1 has two people, both in Division 1: Tim David and Cameron Green. Record 2 has only one person, Mitch, Eli Kin Marsh, who is in Division 2. Record 3 has two people in two separate divisions: David Warner is in Division 2 and Travis Head is in Division 3. Record 4 has only one person, Tim David, in Division 1.

I am trying to create a new column that captures all the people involved in a particular record, but I am having trouble splitting the names in the 'level 2 person' column. I cannot simply split on a comma: in Record 2 there is only one person, yet a comma separates the last name from the other names. The list I want for Record 1 is ['Tim David', 'Cameron Green'] and for Record 2 ['Mitch Eli Kin Marsh'].

This is what I did to attempt this part:

import re

def split_names(row):
    string = row['level 2 person']

    # character class: word characters, whitespace, commas and hyphens
    pattern = r'([\w\s,-]+)'

    names = re.split(pattern, string)

    name_list = list()

    for name in names:
        replacements = [('-', ''), ('(', ''), (')', '')]

        for char, replacement in replacements:
            if char in name:
                name = name.replace(char, replacement)
        name_list.append(name)

    while "" in name_list:  # remove empty elements
        name_list.remove("")

    return name_list

df['names'] = df.apply(split_names, axis=1)

Then I also want to assign a division to those who do not have one. This happens when multiple people are in the same division, as in Record 1. So I am thinking of creating another column holding a list in which each element is the division of the corresponding person. For Record 1 this list would be ['Division 1', 'Division 1'].

Asked By: sam_rox

Answer 1

Since "Mitch" is a single word, would this rule solve the issue: a comma only separates two people if at least two "words" come before it?

That can be done with the regex module from PyPI, as it supports variable-length lookbehind assertions.

>>> import regex
>>>
>>> pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
>>> regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
>>> regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']
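For context, the standard-library re module cannot be used for this pattern, because its lookbehind assertions must have a fixed width. A minimal check, assuming CPython's re (the variable name stdlib_supports_pattern is only for illustration):

```python
import re

# Same pattern as above: two "words" must precede the comma and space.
pattern = r'(?<=[^,\s]+\s+[^,\s]+), '

try:
    re.compile(pattern)
    stdlib_supports_pattern = True
except re.error as exc:
    # The stdlib engine only accepts fixed-width lookbehind patterns.
    stdlib_supports_pattern = False
    print(f"re cannot compile the pattern: {exc}")
```

This is why the third-party regex package is imported in the session above.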

(You could also implement this without regex by splitting on just the comma and merging "1 word" cells with their neighbor.)
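A rough sketch of that regex-free idea, using only str.split. The helper name split_people is hypothetical, and it assumes that a piece containing no space is a one-word fragment that belongs to the piece after it:

```python
def split_people(cell):
    """Split a cell on ', ' and merge one-word pieces with the piece that follows."""
    if not cell:
        return []

    people = []
    pending = []  # one-word pieces waiting for the rest of the name

    for part in cell.split(', '):
        pending.append(part)
        if ' ' in part:  # at least two words, so the name is complete
            people.append(', '.join(pending))
            pending = []

    # a trailing one-word piece has no neighbour to merge with
    if pending:
        people.append(', '.join(pending))

    return people


print(split_people('Tim David, Cameron Green - (Division 1)'))
print(split_people('Mitch, Eli Kin Marsh - (Division 2)'))
```

On the sample rows this gives the same pieces as the lookbehind pattern; the division suffix still has to be stripped afterwards, as in the function below.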

Modifying your example:

import regex

def split_names(cols):
    # must be 2 "words" before comma space
    pattern = r'(?<=[^,\s]+\s+[^,\s]+), '

    people = {}

    for names in cols:
        names = regex.split(pattern, names)

        # empty cell: nobody at this level
        if names == ['']:
            continue

        same_division = False
        level = {}

        for name in names:
            if ' - ' in name:
                name, division = name.split(' - ')
                division = division.strip('()')
            else:
                # no division on this name: it shares one stated later in the cell
                division = None
                same_division = True

            level[name] = division

        # give everyone in the cell the division of the last name in it
        if same_division:
            level = dict.fromkeys(level, division)

        people.update(level)

    return [
        {'Division': division, 'Person': person} for person, division in people.items()
    ]

Example usage:

columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)

0    [{'Division': 'Division 1', 'Person': 'Tim Dav...
1    [{'Division': 'Division 1', 'Person': 'Tim Dav...
2    [{'Division': 'Division 2', 'Person': 'David W...
3    [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object

You could use .explode and .join to turn the result into columns.

columns = ['level 1 person', 'level 2 person'] 

df = (
   df.drop(columns=columns)
     .join(df[columns].apply(split_names, axis=1).rename('People'))
     .explode('People', ignore_index=True)
)

df = df.join(pd.DataFrame(df.pop('People').values.tolist()))

   Record number       date  time spent on job    Division                Person
0              1 2023-01-01                  5  Division 1             Tim David
1              1 2023-01-01                  5  Division 1         Cameron Green
2              2 2023-02-02                  3  Division 1             Tim David
3              2 2023-02-02                  3  Division 2  Mitch, Eli Kin Marsh
4              3 2023-03-04                  1  Division 2          David Warner
5              3 2023-03-04                  1  Division 3           Travis Head
6              4 2023-07-01                  2  Division 1         Cameron Green
7              4 2023-07-01                  2  Division 1             Tim David
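One possible way to get from this frame to the monthly totals asked for in the question is a groupby on division, person and month. The sketch below rebuilds the exploded rows by hand so that it runs on its own. It assumes the dates are day-first (dd/mm/yyyy), which is what the question's expected output implies (03/04/2023 is counted as Apr-23); the names exploded, Month and result are illustrative.

```python
import pandas as pd

# Exploded rows as above, with dates kept as the original dd/mm/yyyy strings.
rows = [
    ('Division 1', 'Tim David', '01/01/2023', 5),
    ('Division 1', 'Cameron Green', '01/01/2023', 5),
    ('Division 1', 'Tim David', '02/02/2023', 3),
    ('Division 2', 'Mitch, Eli Kin Marsh', '02/02/2023', 3),
    ('Division 2', 'David Warner', '03/04/2023', 1),
    ('Division 3', 'Travis Head', '03/04/2023', 1),
    ('Division 1', 'Cameron Green', '07/01/2023', 2),
    ('Division 1', 'Tim David', '07/01/2023', 2),
]
exploded = pd.DataFrame(rows, columns=['Division', 'Person', 'date', 'time spent on job'])

# Assumption: day-first dates, so parse with an explicit format.
dates = pd.to_datetime(exploded['date'], format='%d/%m/%Y')
exploded['Month'] = dates.dt.strftime('%b-%y')  # e.g. 'Jan-23'

# Sum the time per division, person and month, keeping first-seen order.
result = (
    exploded.groupby(['Division', 'Person', 'Month'], sort=False, as_index=False)['time spent on job']
            .sum()
)
print(result)
```

Because Month is a string label, it does not sort chronologically; if ordering matters, sorting on the parsed dates before formatting is one option.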

Answered By: jqurious
