Python link scraper regex works when only searching for 1 extension type, but fails when matching more than one extension type
Question:
This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0
Now, the code below works when matching .mp3 only (see the pattern line), and writes the plain links to a text file as intended.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file, one per line
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
The issue is that the test link above contains not only .mp3 files but also .flac and .wav files.
When I change the pattern line to the following, to scrape and return all links with those extensions (.mp3, .flac, .wav), it outputs a text file containing "mp3", "flac" and "wav". No links.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file, one per line
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
I’ve been trying to understand where the error is. Regex, or something else? I can’t figure it out.
Thank you.
Answers:
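That’s because in your second snippet the parentheses capture only the extension of the URL, and when a pattern contains a capturing group, re.findall returns the group’s contents instead of the whole match. One way to fix that is to wrap the whole pattern in another capturing group, as below (see the comments):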
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parentheses
# Find all matches in the HTML content (each match is now a tuple: full URL, extension)
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file, one per line
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] (the full URL) instead of link
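To see why the extra parentheses matter, here is a minimal sketch of how re.findall behaves with zero, one and two capturing groups. The sample string and the example.com URLs are made up for illustration and stand in for the fetched HTML.

```python
import re

# Hypothetical sample text standing in for the fetched HTML.
sample = 'a https://example.com/a.mp3 b https://example.com/b.flac c'

# No capturing group: findall returns each whole match.
no_group = re.findall(r'https?://[^\s]+\.mp3', sample)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'https?://[^\s]+\.(mp3|flac|wav)', sample)

# Two capturing groups: findall returns tuples, outer group first.
two_groups = re.findall(r'(https?://[^\s]+\.(mp3|flac|wav))', sample)

print(no_group)
print(one_group)
print(two_groups)
```

The second call reproduces the symptom from the question (only extensions come back), and the third shows why link[0] is needed once the outer group is added.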
Another way to solve this is to make the group non-capturing with ?:, so that re.findall returns the whole match again and link can be written as-is:
pattern = re.compile(r'http[s]?://[^\s]+\.(?:mp3|flac|wav)')
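The non-capturing approach could be wrapped in a small helper, roughly as sketched below. The function name extract_links, the sample HTML and the example.com URLs are hypothetical. The URL character class is also an assumption: it excludes quotes and angle brackets as well as whitespace, so a match stops at the end of an HTML attribute.

```python
import re

def extract_links(html, extensions=('mp3', 'flac', 'wav')):
    """Return sorted, de-duplicated URLs in html that end in one of extensions."""
    # re.escape guards against regex metacharacters in an extension name.
    alternatives = '|'.join(re.escape(ext) for ext in extensions)
    # (?:...) groups the alternatives without capturing, so findall
    # returns the whole URL rather than just the extension.
    pattern = re.compile(r'https?://[^\s"\'<>]+\.(?:' + alternatives + r')')
    return sorted(set(pattern.findall(html)))

# Made-up HTML with a duplicate link and a non-audio link.
sample_html = (
    '<a href="https://example.com/one.mp3">one</a> '
    '<a href="https://example.com/two.flac">two</a> '
    '<a href="https://example.com/one.mp3">one again</a> '
    '<a href="https://example.com/notes.txt">notes</a>'
)

for link in extract_links(sample_html):
    print(link)
```

Because the helper takes the HTML as a string, it can be tried without a network request; in the original script it would be called with html_content.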