Find the first/last n words of a string with a maximum of 20 characters using regex
Question:
I’m trying to find any number of words at the beginning or end of a string with a maximum of 20 characters.
This is what I have right now:
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"^(\b.{0,20}\b)", s1)
print(f"'{match.group(0)}'") # 'Hello, World! This '
My problem is the extra space that it adds at the end. I believe this is because \b matches at either the beginning or the end of a word, but I’m not sure what to do about it.
I run into the same issue if I try to do the same with the end of the string but with a leading space instead:
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"(\b.{0,20}\b)$", s1)
print(f"'{match.group(0)}'") # ' reallly long string'
I know I can just use rstrip and lstrip to get rid of the leading/trailing whitespace but I was just wondering if there’s a way to do it with regex.
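To see where the trailing space comes from, it may help to list the offsets at which \b matches. This is only a small sketch; the name boundaries is made up for illustration. The last boundary within the 20-character limit sits just after the space that follows "This", so the greedy .{0,20} stops there.

```python
import re

s1 = "Hello, World! This is a reallly long string"

# 0-based offsets where \b matches, limited to the first 20 characters.
boundaries = [m.start() for m in re.finditer(r"\b", s1) if m.start() <= 20]
print(boundaries)

# The greedy match ends at the last boundary in range,
# which lies after the space following "This".
print(repr(s1[:boundaries[-1]]))
```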
Answers:
You can use r"^(.{0,19}\S\b|)". The \S ensures that the match ends on a non-whitespace character right at the boundary. Because \S consumes one character itself, the count for the preceding characters drops to 19, and the | with an empty alternative lets the pattern match zero characters if needed:
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
Output:
'Hello, World! This' 18
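If the limit needs to vary, the same pattern can be built from a parameter. A minimal sketch, assuming a helper named first_words with a limit argument (both names are hypothetical, not part of the answer above):

```python
import re

def first_words(text: str, limit: int = 20) -> str:
    """Longest prefix of at most `limit` characters that ends on a
    non-whitespace character at a word boundary ('' if there is none)."""
    if limit < 1:
        return ""
    # \S consumes one character, so the dot may take at most limit - 1.
    pattern = rf"^(.{{0,{limit - 1}}}\S\b|)"
    # The empty alternative guarantees a match at position 0.
    return re.search(pattern, text).group(0)

s1 = "Hello, World! This is a reallly long string"
print(repr(first_words(s1)))
print(repr(first_words(s1, 12)))
```

Note that `.` does not match a newline by default, so this sketch assumes single-line input.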
For the end of the string, use r"(|\b\S.{0,19})$":
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"(|\b\S.{0,19})$", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
Output:
'reallly long string' 19
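The end-of-string version can be wrapped the same way. Again just a sketch under assumptions: last_words and limit are hypothetical names chosen for illustration.

```python
import re

def last_words(text: str, limit: int = 20) -> str:
    """Longest suffix of at most `limit` characters that starts with a
    non-whitespace character at a word boundary ('' if there is none)."""
    if limit < 1:
        return ""
    # \S consumes the first character, so the dot may take at most limit - 1.
    pattern = rf"(|\b\S.{{0,{limit - 1}}})$"
    # The empty alternative always matches at the very end of the string.
    return re.search(pattern, text).group(0)

s1 = "Hello, World! This is a reallly long string"
print(repr(last_words(s1)))
print(repr(last_words(s1, 11)))
```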
Why (...|)?
The empty alternative allows a match of zero characters. Without it, ^(.{0,19}\S\b) finds no match for the example below, so re.search would return None:
import re
s1 = "X"*21
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
Output:
'' 0
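To make the difference concrete, here is a small side-by-side sketch. The variable names without_alt and with_alt are made up for illustration.

```python
import re

s1 = "X" * 21  # a single 21-character word, longer than the limit

# Without the empty alternative there is no word boundary within
# the first 20 characters, so the search finds nothing.
without_alt = re.search(r"^(.{0,19}\S\b)", s1)
print(without_alt)

# With the empty alternative the pattern falls back to an empty match.
with_alt = re.search(r"^(.{0,19}\S\b|)", s1)
print(repr(with_alt.group(0)))
```

Calling .group(0) on the first result would raise AttributeError, which is why the fallback is handy.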
You may use this regex:
^\S.{0,18}\S\b|\b\S.{0,18}\S$
\S (not a whitespace character) at the start and end guarantees that your matches start and end with a non-whitespace character.
Code:
import re
s = "Hello, World! This is a reallly long string"
print(re.findall(r'^\S.{0,18}\S\b|\b\S.{0,18}\S$', s))
# ['Hello, World! This', 'reallly long string']
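One thing to be aware of, as far as I can tell: because each alternative has a \S at both ends, this pattern needs at least two characters to match. If single-character input matters, the inner part could be made optional. The sketch below is a hypothetical variant, not part of the answer above; the names strict and relaxed are mine.

```python
import re

# Pattern from the answer: needs a \S at both ends, so at least 2 characters.
strict = r"^\S.{0,18}\S\b|\b\S.{0,18}\S$"

# Hypothetical variant: the inner part is optional, so one character can match.
relaxed = r"^\S(?:.{0,18}\S)?\b|\b\S(?:.{0,18}\S)?$"

s = "Hello, World! This is a reallly long string"

print(re.findall(strict, "a"))
print(re.findall(relaxed, "a"))
print(re.findall(relaxed, s))
```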