How to extract paragraph content and mark the text embedded to a link using BeautifulSoup

Question:

Given a paragraph on a Wikipedia page (i.e. Lana Del Ray) using BeautifulSoup I need to extract the text from the paragraph and mark the word or set of words that have enclosed a link.

For example, consider the first sentences of the first paragraph:

Elizabeth Woolridge Grant (born June 21, 1985), known professionally
as Lana Del Rey, is an American singer and songwriter. Her music is
noted for its cinematic quality and exploration of tragic romance,
glamour, and melancholia, containing references to contemporary pop
culture
and 1950s–1960s Americana.

I need to have it in this format:

Elizabeth Woolridge Grant (born June 21, 1985), known professionally
as Lana Del Rey, is an American singer and songwriter. Her music is
noted for its cinematic quality and exploration of tragic romance,
START_A glamour END_A, and START_A melancholia END_A, containing references to contemporary START_A pop
culture END_A and START_A 1950s–1960s Americana END_A.

Until now I could extract either one paragraph or link separately using something like this:

from urllib import request
from bs4 import BeautifulSoup
soup = BeautifulSoup(request.urlopen("https://en.wikipedia.org/wiki/Lana_Del_Rey").read())

 for tag in soup.select('p a[href]'):
     if tag['href'].startswith('/wiki/'):
         text = tag.text.strip()
         print(text)

for tag in soup.select('p'):
        text = tag.text.strip()
        print(text)

Is it possible to extract the text of the paragraph

tag and identify which words have a link attached to them?

Asked By: Aiha

||

Answers:

You could use replaceWith() to convert the <a> and its text to your expected output:

a.replaceWith('START_A '+a.text+' END_A')

Example

from urllib import request
from bs4 import BeautifulSoup
soup = BeautifulSoup(request.urlopen("https://en.wikipedia.org/wiki/Lana_Del_Rey").read())

for tag in soup.select('p'):
    for a in tag.select('a'):
        a.replaceWith('START_A '+a.text+' END_A')
    print(tag.text)
Output

Elizabeth Woolridge Grant (born June 21, 1985), known professionally
as Lana Del Rey, is an American singer and songwriter. Her music is
noted for its cinematic quality and exploration of tragic romance,
START_A glamour END_A, and START_A melancholia END_A, containing
references to contemporary START_A pop culture END_A and START_A 1950s
END_A–START_A 1960s END_A START_A Americana END_A.START_A [1] END_A
She is the recipient of START_A various accolades END_A, including two
START_A Brit Awards END_A, two START_A MTV Europe Music Awards END_A,
and a START_A Satellite Award END_A, in addition to nominations for
six START_A Grammy Awards END_A and a START_A Golden Globe Award
END_A.START_A [2] END_A START_A Variety END_A honored her at their
START_A Hitmakers Awards END_A for being "one of the most influential
singer-songwriters of the 21st century."START_A [3] END_ASTART_A [4]
END_A

Answered By: HedgeHog