How to split on a delimiter in Python while preserving the delimiter
Question:
I have a file with a list of URL endpoints, and I want to split each link in the file on the slash delimiter, generating the sub-endpoints of each endpoint. For example, given:
https://www.somesite.com/path1/path2/path3
I would want to get this:
https://www.somesite.com/path1/
https://www.somesite.com/path1/path2/
https://www.somesite.com/path1/path2/path3
I know how to achieve this in bash, but not with Python. I tried using the split function, but I could not get very far with it. I hope I can get some help here, thank you.
Answers:
For a generic "split" that keeps the delimiter, you can use the str.partition method: https://docs.python.org/3/library/stdtypes.html#str.partition
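As a rough illustration of how str.partition keeps the delimiter: it returns the text before the first delimiter, the delimiter itself, and the remainder. The helper name split_keep below is made up for this sketch and is not part of the standard library.

```python
url_path = "path1/path2/path3"
print(url_path.partition("/"))  # ('path1', '/', 'path2/path3')


def split_keep(text, delim):
    """Split text on delim, leaving each delimiter attached to the piece before it.

    Hypothetical helper for illustration only.
    """
    pieces = []
    while text:
        # partition returns (head, sep, rest); sep is "" once no delimiter is left
        head, sep, text = text.partition(delim)
        pieces.append(head + sep)
    return pieces


print(split_keep(url_path, "/"))  # ['path1/', 'path2/', 'path3']
```

Calling partition in a loop like this is one way to get split-like behaviour without losing the separators.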
Now, for your specific use case, where you want the full intermediate strings as a list, you can write some code that starts with urllib.parse to extract the initial part of the URL without worrying about corner cases, and then manipulates the path with for, split and join.
from urllib.parse import urlparse, urlunparse

url = "https://www.somesite.com/path1/path2/path3"
# urlparse returns a 6-tuple; index 2 is the path component
path = (components := list(urlparse(url)))[2]
path_comps_str = ""
# Accumulate "/path1", "/path1/path2", ... one segment at a time
path_comps = [path_comps_str := path_comps_str + f"/{comp}" for comp in path.split("/")[1:]]
all_urls = []
for sub_path in path_comps:
    url_parts = components[:]
    url_parts[2] = sub_path
    all_urls.append(urlunparse(url_parts))
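To show the kind of corner case that urllib.parse takes care of, here is a small sketch of the same idea wrapped in a function. The name path_prefixes and the decision to drop the query string and fragment from the results are assumptions made for this example, not something the question asked for.

```python
from urllib.parse import urlparse


def path_prefixes(url):
    """Return one URL per path prefix, shortest first (hypothetical helper)."""
    parsed = urlparse(url)
    # Ignore empty segments produced by leading, trailing or doubled slashes
    segments = [seg for seg in parsed.path.split("/") if seg]
    prefixes = []
    current = ""
    for seg in segments:
        current += "/" + seg
        # _replace keeps scheme and netloc (including any port) untouched
        prefixes.append(parsed._replace(path=current, query="", fragment="").geturl())
    return prefixes


for prefix in path_prefixes("https://www.somesite.com:8080/path1/path2?x=1#top"):
    print(prefix)
```

Because the host part is handled by urlparse, the port survives unchanged and the query string never ends up glued to a path segment.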
One option is to split by a /, then slice the result and join back:
>>> url = 'https://www.somesite.com/path1/path2/path3'
>>> parts = url.split('/')
>>> ['/'.join(parts[:p+1]) for p in range(3, len(parts))]
['https://www.somesite.com/path1', 'https://www.somesite.com/path1/path2', 'https://www.somesite.com/path1/path2/path3']
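Since the question mentions a file of URLs, the split/slice/join approach could be applied line by line roughly as follows. The function names are invented for this sketch, and it assumes every line is an absolute URL of the form scheme://host/..., so that the path segments start at index 3 after splitting.

```python
def sub_endpoints(url):
    """Return every path prefix of url, shortest first, without trailing slashes."""
    parts = url.rstrip("/").split("/")
    # parts[0:3] is ['https:', '', 'host'], so path segments start at index 3
    return ["/".join(parts[:p + 1]) for p in range(3, len(parts))]


def expand_lines(lines):
    """Yield the sub-endpoints for each non-blank line of an iterable of URLs."""
    for line in lines:
        url = line.strip()
        if url:
            yield from sub_endpoints(url)


# Any iterable of lines works here, including an open file object.
sample = ["https://www.somesite.com/path1/path2/path3\n", "\n"]
for endpoint in expand_lines(sample):
    print(endpoint)
```

With a real file, the list named sample would be replaced by the file object returned by open(), since iterating over a file yields its lines.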
Try something like this:
link = "https://www.somesite.com/path1/path2/path3"
splitted = link.split('/')
newLink = splitted[0] + "//" + splitted[2] + "/"
for i in range(3, len(splitted)):
    newLink += splitted[i]
    if i != len(splitted)-1:
        newLink += "/"
    print(newLink)
The output is:
https://www.somesite.com/path1/
https://www.somesite.com/path1/path2/
https://www.somesite.com/path1/path2/path3
But if the trailing / on the links is not needed, you could write it as:
link = "https://www.somesite.com/path1/path2/path3"
splitted = link.split('/')
newLink = splitted[0] + "//" + splitted[2]
for i in range(3, len(splitted)):
    newLink += "/" + splitted[i]
    print(newLink)