Use Regex to Separate Pairs of Title and URL

Question

I copied a list of books and their URLs from a website. When pasted into a Word doc it becomes one string, and I’d like to put each Title and URL on a new line:

Copied list:

Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO …etc…

I figure the regex pattern can be something like:
any number of words (might have a hyphen or something), then http or https, then ://, then a
mix of alphanumerics, forward slashes and periods, and then a final space after the http
address to mark the location to split on.

I have this awful-looking pattern that is not capturing all the pairs:

\w*\s\w*\s\w*://\w*.\w*/\w*\s

https://regexr.com/

Also, in Python it returns only the first match, and I can’t figure out how to group it with () and then get all of them with *

Asked By: Matan


Answer 1

Try (regex101):

import re

s = """Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK Discrete Mathematical Structures https://amzn.to/3eBYogb Discrete Mathematics and its Applications https://amzn.to/3TvfThe Discrete and Combinatorial Mathematics https://amzn.to/3CELUfO"""

pat = re.compile(r"\s*(.*?)\s+(https?://\S+)")

print(pat.findall(s))

Prints:

[
    ("Elementary Algebra", "https://amzn.to/3S7yG0Y"),
    ("Pre-Algebra", "https://amzn.to/3TpW8HK"),
    ("Discrete Mathematical Structures", "https://amzn.to/3eBYogb"),
    ("Discrete Mathematics and its Applications", "https://amzn.to/3TvfThe"),
    ("Discrete and Combinatorial Mathematics", "https://amzn.to/3CELUfO"),
]
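
To get each title and URL on its own line, as the question asks, the pairs can be looped over and printed. The following is only a minimal sketch: the pattern is the same one written in verbose mode so that each part carries a comment, and the names PAIR, split_pairs and sample are made up for illustration. It uses finditer, which (like findall) yields every non-overlapping match, whereas re.search and re.match stop at the first one. It assumes that titles never contain "http://" or "https://" and that URLs contain no whitespace.

```python
import re

# Same pattern as above, in verbose mode so each part can be commented.
PAIR = re.compile(
    r"""
    \s*                # skip whitespace left over from the previous pair
    (.*?)              # group 1, the title: as few characters as possible
    \s+                # the whitespace between the title and the URL
    (https?://\S+)     # group 2, the URL: runs until the next whitespace
    """,
    re.VERBOSE,
)


def split_pairs(text):
    """Return a list of (title, url) tuples found in text (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]


# Shortened sample taken from the list in the question.
sample = "Elementary Algebra https://amzn.to/3S7yG0Y Pre-Algebra https://amzn.to/3TpW8HK"

# Print every title and every URL on a separate line.
for title, url in split_pairs(sample):
    print(title)
    print(url)
```

The non-greedy (.*?) is what keeps a title from swallowing the following URLs: it grows one character at a time and stops as soon as the rest of the pattern can match.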

Answered By: Andrej Kesely
