Python link scraper regex works when only searching for 1 extension type, but fails when matching more than one extension type

Question

This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0

The code below works when matching .mp3 only (line 8) and writes the plain links to a text file, as intended.

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')

The issue is that the test link above contains not only .mp3 files but also .flac and .wav files.

When I change the code (line 8) to the following, to scrape and return all links with any of those extensions (.mp3, .flac, .wav), it outputs a text file containing only "mp3", "flac" and "wav". No links.

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')

I’ve been trying to understand where the error is. Regex, or something else? I can’t figure it out.

Thank you.

Asked By: Sealfan69


Answer 1

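That’s because in your second snippet the parentheses capture only the extension of the URL/file, and when a pattern contains a capturing group, re.findall returns the captured groups instead of the whole match. One way to fix that is to wrap the whole pattern in another capturing group, as below (read the comments):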

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parentheses

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] instead of link, since each match is now a (url, extension) tuple
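
To make the behaviour concrete, here is a small sketch of what re.findall returns with zero, one and two capturing groups. The sample string and the example.com URLs are made up for illustration; they are not taken from the Dropbox page in the question.

```python
import re

# Hypothetical sample text standing in for the downloaded page.
sample = 'a https://example.com/one.mp3 b https://example.com/two.flac'

# No capturing group: findall returns each whole match as a string.
no_group = re.findall(r'http[s]?://[^\s]+\.mp3', sample)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'http[s]?://[^\s]+\.(mp3|flac|wav)', sample)

# Two capturing groups: findall returns one tuple per match,
# with one item per group (outer group first).
two_groups = re.findall(r'(http[s]?://[^\s]+\.(mp3|flac|wav))', sample)

print(no_group)
print(one_group)
print(two_groups)
```

With this sample, the first call yields the single .mp3 URL, the second yields just the extensions, and the third yields (url, extension) tuples, which is why the fix above indexes link[0].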


Answered By: Timeless

Answer 2

Another way to solve this is to use a non-capturing group, written with ?: right after the opening parenthesis, so that re.findall returns the whole match:

pattern = re.compile(r'http[s]?://[^\s]+\.(?:mp3|flac|wav)')

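
As a rough sketch of how the non-capturing version could be used end to end, the helper below is hypothetical (extract_audio_links is not from the question), and the sample text uses made-up example.com URLs separated by whitespace. It swaps set() for dict.fromkeys so duplicates are dropped while the first-seen order is kept; that is a design choice of this sketch, not something the original code does.

```python
import re

# (?:...) groups the alternatives without capturing them,
# so findall returns each full URL as a plain string.
AUDIO_LINK = re.compile(r'http[s]?://[^\s]+\.(?:mp3|flac|wav)')


def extract_audio_links(text):
    """Return the unique audio links found in text, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(AUDIO_LINK.findall(text)))


# Hypothetical sample text; a real page would come from requests.get(url).text.
sample = (
    'https://example.com/a.mp3 https://example.com/b.wav '
    'https://example.com/a.mp3 https://example.com/c.txt'
)
links = extract_audio_links(sample)
print(links)
```

Since each item is already a string, the write loop from the question (file.write(link + '\n')) works unchanged with this pattern.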

Answered By: mike.slomczynski
