Python REGEX remove string containing substring

Question:

I am writing a script that will scrape a newsletter for URLs. There are some URLs in the newsletter that are irrelevant (e.g. links to articles, mailto links, social links, etc.). I added some logic to remove those links, but for some reason not all of them are being removed. Here is my code:

from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

for link in termSheetLinks:
    if "fortune.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "forbes.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "twitter.com" in link in termSheetLinks:
        termSheetLinks.remove(link)

print(termSheetLinks)

When I ran it most recently, this was my output, despite trying to remove all links containing "fortune.com":

['https://fortune.com/company/blackstone-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://fortune.com/company/tpg?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://casproviders.org/asd-guidelines/', 'https://fortune.com/company/carlyle-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5', 'mailto:[email protected]', 'https://www.afresh.com/', 'https://www.geopagos.com/', 'https://montana-renewables.com/', 'https://descarteslabs.com/', 'https://www.dealer-pay.com/', 'https://www.sequeldm.com/', 'https://pueblo-mechanical.com/', 'https://dealcloud.com/future-proof-your-firm/', 'https://apartmentdata.com/', 'https://www.irobot.com/', 'https://www.martin-bencher.com/', 'https://cell-matters.com/', 'https://www.lever.co/', 'https://www.sigulerguff.com/']

Any help would be greatly appreciated!

Asked By: user18871432

||

Answers:

In my opinion this does not need a regex. Instead of removing the URLs afterwards, append only those that do not contain any of your substrings to the list, e.g. with a list comprehension:

[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") if not any(x in companyURL.get('href') for x in ["fortune.com","forbes.com","twitter.com"])]

Example

from bs4 import BeautifulSoup
import requests

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")

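myList = ["fortune.com","forbes.com","twitter.com"]
# Keep only the hrefs that contain none of the unwanted substrings
termSheetLinks = [companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a")
     if not any(x in companyURL.get('href') for x in myList)]
print(termSheetLinks)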

Output

['https://casproviders.org/asd-guidelines/',
 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5',
 'https://www.afresh.com/',
 'https://www.geopagos.com/',
 'https://montana-renewables.com/',
 'https://descarteslabs.com/',
 'https://www.dealer-pay.com/',
 'https://www.sequeldm.com/',
 'https://pueblo-mechanical.com/',
 'https://dealcloud.com/future-proof-your-firm/',
 'https://apartmentdata.com/',
 'https://www.irobot.com/',
 'https://www.martin-bencher.com/',
 'https://cell-matters.com/',
 'https://www.lever.co/',
 'https://www.sigulerguff.com/']
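
The same filter can be tried without a network request. The following is only a minimal sketch: `filter_links` and the sample URLs are hypothetical names and data made up for illustration, not part of the original script. It also skips `None` values, on the assumption that some `<a>` tags may have no `href`, in which case `.get('href')` returns `None` and the `in` test would raise a `TypeError`.

```python
def filter_links(links, blocked):
    """Return the links that contain none of the blocked substrings."""
    return [
        link for link in links
        # Skip missing hrefs (None) and anything containing a blocked substring
        if link and not any(term in link for term in blocked)
    ]


# Hypothetical sample data standing in for the scraped hrefs
sample_links = [
    "https://fortune.com/company/tpg",
    "https://www.afresh.com/",
    None,  # an <a> tag without an href attribute
    "https://twitter.com/example",
    "https://www.lever.co/",
]
blocked = ["fortune.com", "forbes.com", "twitter.com"]

kept = filter_links(sample_links, blocked)
print(kept)  # ['https://www.afresh.com/', 'https://www.lever.co/']
```

Building a new list this way never mutates the list being iterated, so no entry can be skipped.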
Answered By: HedgeHog

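Removing items from a list while you are iterating over it makes the loop skip the element that follows each removal. If you collect the unwanted links first and remove them only after the for loop has finished, no entry is skipped.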

from bs4 import BeautifulSoup
import requests

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

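# Collect the unwanted links first instead of removing them during the loop
lRemove = []
for link in termSheetLinks:
    # any() adds each link at most once, even if it matches several substrings
    if any(s in link for s in ("fortune.com", "forbes.com", "twitter.com")):
        lRemove.append(link)
# Remove only after the iteration over termSheetLinks has finished
for unwanted in lRemove:
    termSheetLinks.remove(unwanted)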

print(termSheetLinks)
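
To see why the loop in the question left some fortune.com links behind, here is a small sketch on made-up data (the host names are placeholders, not real scraped links). It compares removing during iteration with iterating over a copy of the list.

```python
links = ["a.fortune.com", "b.fortune.com", "c.example.com", "d.fortune.com"]

# Buggy: removing from the list that the for loop is iterating over.
# After "a.fortune.com" is removed, every item shifts left by one, so the
# loop's next index lands on "c.example.com" and "b.fortune.com" is never checked.
buggy = list(links)
for link in buggy:
    if "fortune.com" in link:
        buggy.remove(link)
print(buggy)  # ['b.fortune.com', 'c.example.com']

# Safe: iterate over a copy and remove from the original list
fixed = list(links)
for link in list(fixed):
    if "fortune.com" in link:
        fixed.remove(link)
print(fixed)  # ['c.example.com']
```

As a side note, the condition `"fortune.com" in link in termSheetLinks` from the question is a chained comparison: Python evaluates it as `("fortune.com" in link) and (link in termSheetLinks)`. The second half is always true inside that loop, so the plain `"fortune.com" in link` is enough.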
Answered By: Hezi Shahmoon