BeautifulSoup Extract Text from a Paragraph and Split Text by <br/>

Question

I am very new to BeautifulSoup.

How can I extract the text of a paragraph from HTML source code, split the text wherever there is a <br/>, and store it in an array so that each element is one chunk of the paragraph text (as split by the <br/> tags)?

For example, for the following paragraph:

<p>
    <strong>Pancakes</strong>
    <br/> 
    A <strong>delicious</strong> type of food
    <br/>
</p>

I would like it to be stored into the following array:

['Pancakes', 'A delicious type of food']

What I have tried is:

import bs4 as bs

soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)

but this outputs an array with only one element:

['Pancakes A delicious type of food']

How can I code it so that I get an array containing the paragraph text split at every <br/> in the paragraph?

Asked By: FoxEvolved

Answer 1

Try this:

from bs4 import BeautifulSoup, NavigableString

html = '<p>Pancakes<br/> A delicious type of food<br/></p>'

soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
result = [str(child).strip() for child in p[0].children
            if isinstance(child, NavigableString)]

Update for nested tags (recursive search):

from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"

soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
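
As a small illustration of what the recursive search above returns (a sketch, not part of the original answer): every text node comes back as its own element, so the text is split at each tag boundary and not only at <br/>. The name `pieces` is just for this example, and `string=True` is used here as the newer spelling of `text=True`.

```python
from bs4 import BeautifulSoup

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, "html.parser")

# string=True matches every text node under <p>, at any depth.
pieces = [str(s) for s in soup.find("p").find_all(string=True)]
print(pieces)
```

With this input the second line of the paragraph arrives in three separate pieces, which is why a further step is needed to split only at <br/>.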

Update again, to split the text only at <br/>:

from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"

soup = BeautifulSoup(html, 'html.parser')
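text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        # Keep the surrounding spaces so words next to inline tags stay separated
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            # Mark each <br/> with a newline to split on later
            text += '\n'

# Split at the <br/> markers, then trim each chunk
result = [chunk.strip() for chunk in text.strip().split('\n')]
print(result)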

Answered By: Jason Yang

Answer 2

I stumbled across this whilst having a similar issue. This was my solution…
A simple way is to replace the line

p[0] = p[0].getText()

with

p[0].getText('#').split('#')

Result is:
['Pancakes', ' A delicious type of food']

Obviously, choose a character (or characters) that won't appear in the text.
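
One caveat: when the paragraph contains nested tags such as <strong>, passing a separator to getText() inserts it at every tag boundary, not only at <br/>. A possible variation, sketched here under assumptions and not taken from the answers, is to replace each <br/> with a marker first and then split on that marker. The names `split_on_br` and `SENTINEL` are hypothetical, and the marker is assumed never to occur in the page text.

```python
from bs4 import BeautifulSoup

# Hypothetical marker; assumed never to appear in the real text.
SENTINEL = "\x00"

def split_on_br(html):
    """Return the text of the first <p>, split only where a <br> occurs."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    # Swap every <br> for the marker so it survives get_text().
    for br in paragraph.find_all("br"):
        br.replace_with(SENTINEL)
    chunks = paragraph.get_text().split(SENTINEL)
    # Collapse runs of whitespace (including source newlines), drop empty chunks.
    cleaned = [" ".join(chunk.split()) for chunk in chunks]
    return [chunk for chunk in cleaned if chunk]

html = """<p>
    <strong>Pancakes</strong>
    <br/>
    A <strong>delicious</strong> type of food
    <br/>
</p>"""
print(split_on_br(html))
```

Because only the <br/> tags are turned into markers, the text inside <strong> stays attached to its neighbours, and the whitespace cleanup copes with indented, multi-line HTML like the example in the question.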

Answered By: Paul Hester
