need to limit BeautifulSoup href result to first occurrence – or – account for an open parenthesis in href string

Question:

I want ONLY the <a href> link whose text contains "NPPES Data Dissemination" in the Full Replacement Monthly NPI File section of https://download.cms.gov/nppes/NPI_Files.html. There are other "NPPES Data Dissemination" links in the Weekly Incremental NPI Files section that I do NOT want. Here is the code, which currently gets ALL "NPPES Data Dissemination" links from both the monthly and weekly sections:

import re
from bs4 import BeautifulSoup
import requests
import wget

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
        if ul != []:
            urls.append(a)
    print('done scraping the url...')
    return urls

def download_and_extract(urls):
    for texts in urls:
        text = str(texts)
        file = text[55:99]
        print('zip file :', file)
        zip_link = texts['href']
        print('Downloading %s :' % zip_link)
        slashurl = zip_link.split('/')
        print(slashurl)
        wget.download("https://download.cms.gov/nppes/" + slashurl[1])

r = requests.get('https://download.cms.gov/nppes/NPI_Files.html')
soup = BeautifulSoup(r.content, 'html.parser')
urls = get_urls(soup)
download_and_extract(urls)

Tried:
limit=1 does not work as I have it below; all NPPES Data Dissemination links are still collected:

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls

Tried:
If I include the open parenthesis, 'NPPES Data Dissemination (', which appears only in the Full Replacement Monthly NPI File section, I get errors (below):

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination ('), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls 

Thank you for any assistance you may provide!

Asked By: sherri pytorch


Answers:

If what you need is only the first link:

What happens here is that limit=1 only limits how many matches are taken from each individual link's text; you are still looping over, and collecting, every link on the page that matches.

The simple solution to get only the first link is to add a break once a match is found, so the loop stops:

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
        if ul != []:
            urls.append(a)
            # stop the loop after the first matching link is found
            break
    print('done scraping the url......!!!!')
    return urls
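
Note that break only returns the Full Replacement Monthly link because that link happens to be the first "NPPES Data Dissemination" match in the page source; if CMS reordered the sections, this approach would pick up a different link.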

Update: looking at the website, you can actually do this with the regex alone (no break needed). Note that ( is a regex metacharacter, so it must be escaped as \( or re.compile will raise an error, which is why your second attempt failed:

Full Replacement Monthly NPI File -> re.compile(r'NPPES Data Dissemination \(')

Full Replacement Monthly NPI Deactivation File -> re.compile('NPPES Data Dissemination - Monthly Deactivation Update')

Weekly Incremental NPI Files -> re.compile('NPPES Data Dissemination - Weekly Update')
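
For example, here is a minimal sketch of get_urls using the first pattern. This assumes the link text on the page still follows the "NPPES Data Dissemination (Month Year)" format:

import re
import requests
from bs4 import BeautifulSoup

def get_urls(soup):
    # '(' is a regex metacharacter, so it is escaped with a backslash;
    # the pattern then matches only the Full Replacement Monthly link text,
    # e.g. "NPPES Data Dissemination (April 2024)"
    pattern = re.compile(r'NPPES Data Dissemination \(')
    urls = []
    for a in soup.find_all('a', href=True):
        if a.find_all(text=pattern):
            urls.append(a)
    return urls

r = requests.get('https://download.cms.gov/nppes/NPI_Files.html')
soup = BeautifulSoup(r.content, 'html.parser')
urls = get_urls(soup)  # should now contain only the monthly full-replacement link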

Answered By: d_frEak