A regex pattern that matches all words starting from a word with an s and stopping before a word that starts with an s
Question:
I’m trying to capture words in a string such that the first word starts with an s, and the regex stops matching if the next word also starts with an s.
For example, I have the string " Stack, Code and StackOverflow". I want to capture only " Stack, Code and " and not include "StackOverflow" in the match.
This is what I am thinking:
- Start with a space followed by an s.
- Match everything except if the group is a space and an s (I’m using negative lookahead).
The regex I have tried:
(?<=\s)S[a-z -,]*(?!(\sS))
I don’t know how to make it work.
Answers:
I think this should work. I adapted the regex from this thread. You can also test it out here. I have also included a non-regex solution: it tracks the first word starting with an 's' and the next word starting with an 's', then takes the words in that range.
import re

teststring = " Stack, Code and StackOverflow"

# Regex solution: a whitespace character, an s/S, the rest of that word,
# then everything up to (not including) the next s or S.
extractText = re.search(r"(\s)[sS][^*\s]*[^sS]*", teststring)
print(extractText[0])

# Non-regex solution
listwords = teststring.split(' ')
start = None  # index of the first word starting with s/S (None until found)
end = 0       # index of the next word starting with s/S
for i, word in enumerate(listwords):
    if word.startswith(('s', 'S')):
        if start is None:
            start = i
        else:
            end = i
            break
newstring = " " + " ".join(listwords[start:end])
print(newstring)
Output
Stack, Code and
Stack, Code and
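One thing worth checking with the regex above: its last part, [^sS]*, stops at any s or S, not only at one that begins a word. A minimal sketch of that behaviour, using a made-up test string (not from the question) in which a word has an s in the middle:

```python
import re

# Hypothetical input, chosen so that one word ("uses") contains an s mid-word.
teststring = " Stack uses Code and StackOverflow"

# Same pattern as in the answer above.
match = re.search(r"(\s)[sS][^*\s]*[^sS]*", teststring)
result = match[0]
print(repr(result))  # ' Stack u'
```

If the input can contain such words, a pattern that checks for a word start (a whitespace boundary) before the s is probably the safer choice.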
You could, for example, use a capture group:
(S(?<!\S.).*?)\s*S(?<!\S.)
Explanation
- ( opens capture group 1
- S(?<!\S.) matches S and asserts that it is not preceded by a non-whitespace character, so the S sits at a whitespace boundary
- .*? matches any character, as few times as possible
- ) closes the group
- \s* matches optional whitespace characters
- S(?<!\S.) again matches an S at a whitespace boundary
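To see the whitespace-boundary check in isolation, here is a small sketch. The sample string and the variable names are made up for illustration; only the S(?<!\S.) part comes from the pattern above.

```python
import re

# Hypothetical sample with S at the start of words and inside words.
sample = "Stack maSS Sea aSk"

# An S that is not preceded by a non-whitespace character.
boundary_s = re.compile(r"S(?<!\S.)")

# Start index of every S that passes the lookbehind check.
positions = [m.start() for m in boundary_s.finditer(sample)]
print(positions)  # [0, 11]
```

Only the S in "Stack" and the S in "Sea" are reported. The ones inside "maSS" and "aSk" are rejected because a non-whitespace character comes right before them.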
See a regex demo and a Python demo.
Example code:
import re

pattern = r"(S(?<!\S.).*?)\s*S(?<!\S.)"
s = "Stack, Code and StackOverflow"

m = re.search(pattern, s)
if m:
    # Group 1 holds the text before the next word that starts with S
    print(m.group(1))
Output
Stack, Code and
Another option is to use a lookahead to assert the S to the right without consuming it, which allows multiple matches one after another:
S(?<!\S.).*?(?=\s*S(?<!\S.))
import re

pattern = r"S(?<!\S.).*?(?=\s*S(?<!\S.))"
s = "Stack, Code and StackOverflow test Stack"

print(re.findall(pattern, s))
Output
['Stack, Code and', 'StackOverflow test']
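The patterns above only look for an uppercase S, while the question says "starts with an s". A possible variant, assuming lowercase s should count as well, swaps S for the character class [sS]. The function name and the test string below are illustrative, not part of the original answer.

```python
import re

# Assumption: a word "starts with an s" if its first letter is s or S.
S_WORD_RUN = re.compile(r"[sS](?<!\S.).*?(?=\s*[sS](?<!\S.))")


def runs_between_s_words(text):
    """Return each run that starts at an s-word and ends before the next s-word."""
    return S_WORD_RUN.findall(text)


runs = runs_between_s_words("stack, Code and StackOverflow test Stack")
print(runs)  # ['stack, Code and', 'StackOverflow test']
```

The s inside "test" does not end the second run, because the lookbehind only accepts an s that follows whitespace or the start of the string.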