How can I avoid superfluous leading whitespace when splitting a string of words and numbers with a RegEx pattern?
Question:
My question came up while trying to help in this post: split an enumerated text list into multiple columns
I'm looking for a regex pattern that splits this string at 1., 2. and 3., or in general: at one or more digits (in case the list gets longer) followed by a dot. The problem is that the string contains other numbers that must be kept.
import re
test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'
With this pattern I managed to do so, but I got an empty string at the start and didn’t know how to change that.
l1 = re.split(r"\s?\d{1,2}\.", test_string)
# Output l1:
['', ' Fruit 12 oranges', ' vegetables 7 carrot', ' NFL 246 SHIRTS']
So I changed from "split it" to "search something that finds the pattern":
l2 = re.findall(r"(?:^|(?<=\d\.))([\sa-zA-Z0-9]+)(?:\d\.|$)", test_string)
# Output l2:
[' Fruit 12 oranges ', ' vegetables 7 carrot ', ' NFL 246 SHIRTS']
This is really close to what I want; the only remaining issue is the leading whitespace at the start of every element in the list.
What would be a good and efficient approach for my task? Should I stick with splitting via re.split(), or build a pattern and use re.findall()? Is my pattern good as it is, or is it far too complicated?
Answers:
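By adding (?:\s) twice to your expression, once before the capture group and once before the closing \d\., so that the surrounding whitespace is matched but not captured:
re.findall(r"(?:^|(?<=\d\.))(?:\s)([\sa-zA-Z0-9]+)(?:\s\d\.|$)", test_string)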
The output is: ['Fruit 12 oranges', 'vegetables 7 carrot', 'NFL 246 SHIRTS']
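If you would rather stick with re.split(), here is a minimal sketch of that route. It is not the pattern from the question: it assumes every "digits followed by a dot" in the string is a list marker (so a value like 3.50 inside an item would also be split), lets the delimiter swallow the whitespace around it, and then drops the empty first element. The names parts and items are only illustrative.

```python
import re

test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'

# Delimiter: optional whitespace, one or more digits, a literal dot,
# then any whitespace that follows. Numbers without a dot (12, 7, 246)
# are not delimiters and stay inside the items.
parts = re.split(r"\s?\d+\.\s*", test_string)

# The string starts with a delimiter, so re.split() returns an empty
# string as the first element; filter out empty pieces.
items = [p for p in parts if p]

print(items)
# ['Fruit 12 oranges', 'vegetables 7 carrot', 'NFL 246 SHIRTS']
```

Because the whitespace is consumed by the delimiter itself, no strip() call is needed afterwards, and \d+ instead of \d{1,2} keeps the pattern working for lists with more than 99 entries.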