Use Regex to Separate Pairs of Title and URL
Question:
I copied a list of books and their URL from a website that becomes one string when pasted in a word doc and I’d like to separate each Title and URL on new lines:
Copied list:
Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO …etc…
I figure the regex pattern can be something like:
any number of words (might have a hypthon or something) then http or https :// then a
mix of alphanumerics, forward slashes and periods and then a final space after the http
address to mark the location to split on.
I have this aweful looking pattern that is not capturing all the pairs:
w*sw*sw*://w*.w*/w*s
Also, in python it returns only the first match and I can’t figure out how to group it with () and then get all of them with *
Answers:
Try (regex101):
import re
s = """Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO"""
pat = re.compile(r"s*(.*?)s+(https?://S+)")
print(pat.findall(s))
Prints:
[
("Elementary Algebra", "https://amzn.to/3S7yG0Y"),
("Pre-Algebra", "https://amzn.to/3TpW8HK"),
("Discrete Mathematical Structures", "https://amzn.to/3eBYogb"),
("Discrete Mathematics and its Applications", "https://amzn.to/3TvfThe"),
("Discrete and Combinatorial Mathematics", "https://amzn.to/3CELUfO"),
]
I copied a list of books and their URL from a website that becomes one string when pasted in a word doc and I’d like to separate each Title and URL on new lines:
Copied list:
Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO …etc…
I figure the regex pattern can be something like:
any number of words (might have a hypthon or something) then http or https :// then a
mix of alphanumerics, forward slashes and periods and then a final space after the http
address to mark the location to split on.
I have this aweful looking pattern that is not capturing all the pairs:
w*sw*sw*://w*.w*/w*s
Also, in python it returns only the first match and I can’t figure out how to group it with () and then get all of them with *
Try (regex101):
import re
s = """Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO"""
pat = re.compile(r"s*(.*?)s+(https?://S+)")
print(pat.findall(s))
Prints:
[
("Elementary Algebra", "https://amzn.to/3S7yG0Y"),
("Pre-Algebra", "https://amzn.to/3TpW8HK"),
("Discrete Mathematical Structures", "https://amzn.to/3eBYogb"),
("Discrete Mathematics and its Applications", "https://amzn.to/3TvfThe"),
("Discrete and Combinatorial Mathematics", "https://amzn.to/3CELUfO"),
]