Scraping links from the Abstract, and Early Life sections separately

Question

I want to scrape all the links from the Abstract and Early Life section of this page. https://en.wikipedia.org/wiki/Barack_Obama
I want to store the links I get from the two sections separately. However, I am having issues with isolating that tag/class. I tried to use class "mw-headline" for the early life section but that is returning only the header text. Any hints are really appreciated.

I couldn’t figure out how to the get abstract and early life sections separately.

url='https://en.wikipedia.org/wiki/Barack_Obama'
response = requests.get(url)
soup=bs(response.content,'html.parser')
page=soup.find('div',attrs={'id':'bodyContent'})

early_life=page.findAll('span',attrs={'class':'mw-headline'})

Asked By: Minda

||

Source

Answer 1

It’s not very clear what format you want your output to be in, but the following will produce a list of dictionaries with the sections in several different formats:

First, the sections and abstract are all inside this div, and are not nested any further into separate elements – so this starts by selecting the whole outer element and then going through its children:

content = soup.select_one('#mw-content-text > .mw-parser-output').children
splitContent = []

(splitContent is the list that will be filled up with a dictionary for each section.)

for c in content:
    if c.name == 'h2' or splitContent == []:
        sectionName = 'Abstract' if splitContent == [] else c.text
        splitContent.append({
            'section': sectionName, 
            'listSoups': [], 'HTML': '', 'asText': ''
        }) 

    splitContent[-1]['listSoups'].append(c)
    splitContent[-1]['HTML'] += str(c)
    if c.name not in ['style', 'script']: 
        splitContent[-1]['asText'] += c.text

Each section header is wrapped as h2*, so every time the loop gets to a child tag that’s h2, a new dictionary is started, and the child object itself is always added to listSoups in the last dictionary of the splitContent list.

HTML is saved too, so if you want a single bs4 object to be created for each section, splitContent can be looped through:

for i in range(len(splitContent)):
    splitContent[i]['asSoup'] = BeautifulSoup(splitContent[i]['HTML'], 'html.parser')

Now, you can see any of the sections in any of the formats added to the dictionaries.

Note that listSoups is not the same as asSoup. listSoups is a list, and each item within is still connected to the original soup variable and you can view its parent, nextSibling, etc in ways that would not be possible with asSoup, which is a single object.

*Btw, using {'class':'mw-headline'} will give you not just the main section headers, but also the subheaders. You can actually get something like a tree of the article structure with:

for h in soup.findAll('span',attrs={'class':'mw-headline'}):
    hLevel = int(h.parent.name.replace('h', ''))
    print(('t'*(hLevel-2))+'↳', f'[{h.parent.name}] {h.text}')

Additional EDIT:

To get a dictionary of section texts, just use

sectnTexts_dict = dict([(
    sc['section'].replace(' ', '_'), # section name to key
    sc['asText'] # section text as value
) for sc in splitContent])

to view a truncated version, print dict((k, v[:50]+'...') for k, v in sectnTexts_dict.items()), which looks like

{
   "Abstract": "44th President of the United Statesn"Barack" and "...",
   "Early_life_and_career": "Early life and careernMain article: Early life and...",
   "Legal_career": "Legal careernCivil Rights attorneynHe joined Davis...",
   "Legislative_career": "Legislative careernIllinois Senate (1997u20132004)nMai...",
   "Presidential_campaigns": "Presidential campaignsn2008nMain articles: 2008 Un...",
   "Presidency_(2009u20132017)": "Presidency (2009u20132017)n First official portrait of...",
   "Cultural_and_political_image": "Cultural and political imagenMain article: Public ...",
   "Post-presidency_(2017u2013present)": "Post-presidency (2017u2013present)n Obama with his the...",
   "Legacy": "Legacyn Job growth during the presidency of Obama ...",
   "Bibliography": "BibliographynMain article: Bibliography of Barack ...",
   "See_also": "See alsonnnBiography portalnUnited States portalnC...",
   "References": "Referencesnn^ "Barack Hussein Obama Takes The Oath...",
   "Further_reading": "Further readingnnDe Zutter, Hank (December 8, 1995...",
   "External_links": "External linksnLibrary resources about  Barack Oba..."
}

Answered By: Driftr95

Scraping links from the Abstract, and Early Life sections separately

Question:

Answers: