How to extract a specific text when web scraping for this situation

Question:

I need to scrape text from a website, but I could not figure out how to extract specific pieces of text in this situation:

<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>W. Richard Bowen</i>
        <br>
        "Water engineering for the promotion of peace"  
        <br>
        "1(2009)1-6"
        <br>
        "DOI: "
        <br>
        "Received:26/08/2008; Accepted: 25/11/2008; "

In the example above, I only want to get Water engineering for the promotion of peace and 1(2009)1-6.

I tried to do that all day, but I either get all of the text around the <br> tags:

"W. Richard Bowen"

    "Water engineering for the promotion of peace"  

    "1(2009)1-6"

  "DOI: "
  "Received:26/08/2008; Accepted: 25/11/2008;"

or I get empty output.

The page I am trying to scrape is the one requested in the code below.

This is my code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009')
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('td', class_='testo_normale')

lines = s.find_all('br')

for line in lines:
    print(line.text.strip())
Asked By: user17356493

||

Answers:

To extract any text in the position of "Water engineering", which is what I think you want, you can write a regex function like the following:

import re

def extract_text(string):
    pattern = r'<br>\s*(.*?)\s*(?:<br>|<)'
    regex = re.compile(pattern)
    matches = regex.finditer(string)
    texts = []
    for match in matches:
        texts.append(match.group(1))
    return texts

string = """
<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>Mariam B</i>
        <br>
        "some other text" 
        <br>
        "1(2009)1-6"
        <br>"""

text = extract_text(string)
print(text)

The regular expression consists of the following parts:

<br>: This matches the <br> tag literally. This indicates that the text we are looking for is preceded by this tag in the string.

\s*: This matches any whitespace characters (space, tab, newline, etc.) zero or more times. This allows the <br> tag to be followed by any amount of whitespace, including none at all.

(.*?): This is a capturing group that matches any sequence of characters (except a newline) zero or more times, as few times as possible. This is the part of the regular expression that actually captures the text we are looking for. The ? after the * makes the * "lazy", which means it will match as few characters as possible. This is necessary to prevent the regular expression from matching too much text.

\s*: This is the same as the first \s* in the pattern, and it allows the text we are looking for to be followed by any amount of whitespace, including none at all.

(?:<br>|<): This is a non-capturing group that matches either a <br> tag or a < character. This indicates that the text we are looking for is followed by one of these two patterns in the string.

This regular expression matches any sequence of characters that is preceded by a <br> tag and followed by either a <br> tag or a < character. For example, in a string such as <td valign="top" class="testo_normale"> ... <br>"Water engineering"<br>"1(2009)1-6"<br>, it captures "Water engineering" because that text is preceded by <br> and followed by <br>.

Note that this regular expression is not perfect and may not work in all cases. For example, if the text you are looking for itself contains a < character, the match stops there and the text is cut short. You may need to adjust the pattern to handle such cases.
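One more limitation worth sketching: because the pattern above consumes the closing <br> as part of each match, that same tag cannot also open the next match, so consecutive fragments are not all returned. A possible variant, assuming the text never contains a < character, only peeks at the following < with a lookahead. The names PATTERN, extract_all_texts and sample are made up for this illustration, and the sample HTML is a simplified stand-in for the real page.

```python
import re

# The trailing "<" is checked with a lookahead (?=<) instead of being consumed,
# so every <br> remains available as the start of the next match.
PATTERN = re.compile(r'<br>\s*(.*?)\s*(?=<)')

def extract_all_texts(html):
    """Return every non-empty text fragment that directly follows a <br>."""
    return [m.group(1) for m in PATTERN.finditer(html) if m.group(1)]

# Simplified, hypothetical sample (not fetched from the real site).
sample = """
<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>Mariam B</i>
        <br>
        Some other text
        <br>
        1(2009)1-6
        <br>
        DOI:
    </font>
</td>"""

print(extract_all_texts(sample))
```

Slicing the result with [:2] would keep only the title and the page range, provided those two always come first.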

Answered By: mike16889

You can apply the split() method like this:

from bs4 import BeautifulSoup

html = '''

<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>W. Richard Bowen</i>
        <br>
        "Water engineering for the promotion of peace"  
        <br>
        "1(2009)1-6"
        <br>
        "DOI: "
        <br>
        "Received:26/08/2008; Accepted: 25/11/2008; "
 
'''

soup = BeautifulSoup(html, 'lxml')

txt = soup.select_one('.testo_normale font')
print(' '.join(' '.join(txt.get_text(strip=True).split('"')).strip().split(':')[0].split()[3:-1]))

# OR

for u in soup.select('.testo_normale font'):
    txt = ' '.join(' '.join(u.get_text(strip=True).split('"')).strip().split(':')[0].split()[3:-1])
    print(txt)

Output:

Water engineering for the promotion of peace 1(2009)1-6
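The chained split() call is dense, so here is a rough alternative sketch (not part of the original answer) that reads the text nodes one by one with .stripped_strings and picks the two that follow the author. It assumes the author is always the first text node and that the quotes shown in the question may or may not be present in the real markup, which is why they are stripped. The names font, parts, title and pages are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

html = '''
<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>W. Richard Bowen</i>
        <br>
        "Water engineering for the promotion of peace"
        <br>
        "1(2009)1-6"
        <br>
        "DOI: "
        <br>
        "Received:26/08/2008; Accepted: 25/11/2008; "
    </font>
</td>
'''

soup = BeautifulSoup(html, 'html.parser')
font = soup.select_one('.testo_normale font')

# stripped_strings yields each text node without surrounding whitespace
# and skips nodes that contain only whitespace.
parts = [s.strip('" ') for s in font.stripped_strings]

# Assumption: parts[0] is the author (inside <i>); the next two entries
# are the title and the page range.
title, pages = parts[1:3]
print(title, pages)
```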

Update with full working code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009')
soup = BeautifulSoup(r.content, 'html.parser')

for u in soup.select('font[face="Geneva, Arial, Helvetica, san-serif"]')[6:]:
    txt = u.contents[2:-1]
    for i in txt:
        print(i.get_text(strip=True))

Output:

Editorial and Obituary for Sidney Loeb by Miriam Balaban

1(2009)vii-viii
Water engineering for the promotion of peace

1(2009)1-6
Modeling the permeate transient response to perturbations from steady state in a nanofiltration process

1(2009)7-16
Modeling the effect of anti-scalant on CaCO3 precipitation in continuous flow

1(2009)17-24
Alternative primary energy for power desalting plants in Kuwait: the nuclear option I

1(2009)25-41
Alternative primary energy for power desalting plants in Kuwait: the nuclear
option II  The steam cycle and its combination with desalting units

1(2009)42-57
Potential applications of quarry dolomite for post treatment of desalinated water

1(2009)58-67
Salinity tolerance evaluation methodology for desalination plant discharge

1(2009)68-74
Studies on a water-based absortion heat transformer for desalination using MED

1(2009)75-81
Estimation of stream compositions in reverse osmosis seawater desalination systems

1(2009)82-87
Genetic algorithm-based optimization of a multi-stage flash desalination plant

1(2009)88-106
Numerical simulation on a dynamic mixing process in ducts of a rotary pressure exchanger for SWRO

1(2009)107-113
Simulation of an autonomous, two-stage solar organic Rankine cycle system for reverse osmosis desalination

1(2009)114-127
Experiment and optimal parameters of a solar heating system study on an absorption solar desalination unit

1(2009)128-138
Roles of various mixed liquor constituents in membrane filtration of activated sludge

1(2009)139-149
Natural organic matter fouling using a cellulose acetate copolymer ultrafiltration membrane

1(2009)150-156
Progress of enzyme immobilization and its potential application

1(2009)157-171
Investigating microbial activities of constructed wetlands with respect to nitrate and sulfate reduction

1(2009)172-179
Membrane fouling caused by soluble microbial products in an activated sludge system under starvation

1(2009)180-185
Characterization of an ultrafiltration membrane modified by sorption of branched polyethyleneimine

1(2009)186-193
Combined humic substance coagulation and membrane filtration under saline conditions

1(2009)194-200
Preparation, characterization and performance of phenolphthalein polyethersulfone ultrafiltration hollow fiber membranes

1(2009)201-207
Application of coagulants in pretreatment of fish wastewater using factorial design

1(2009)208-214
Performance analysis of a trihybrid NF/RO/MSF desalination plant

1(2009)215-222
Nitrogen speciation by microstill flow injection analysis

1(2009)223-231
Wastewater from a mountain village treated with a constructed wetland

1(2009)232-236
The influence of various operating conditions on specific cake resistance in the crossflow microfiltration of yeast suspensions

1(2009)237-247
On-line monitoring of floc formation in various flocculants for piggery wastewater treatment

1(2009)248-258
Rigorous steady-state modeling of MSFBR desalination systems

1(2009)259-276
Detailed numerical simulations of flow mechanics and membrane performance in spacer-filled channels, flat and curved

1(2009)277-288
Removal of polycyclic aromatic hydrocarbons from Ismailia Canal water by chlorine, chlorine dioxide and ozone

1(2009)289-298
Water resources management to satisfy high water demand in the arid Sharm El Sheikh, the Red Sea, Egypt

1(2009)299-306
Effect of storage of NF membranes on fouling deposits and cleaning efficiency

1(2009)307-311
Laboratory studies and CFD modeling of photocatalytic degradation of colored textile wastewater by titania nanoparticles

1(2009)312-317
Startup operation and process control of a two-stage sequencing batch reactor (TSSBR) for biological nitrogen removal via nitrite

1(2009)318-325
Answered By: Md. Fazlul Hoque

Using split() is one option and seems legitimate, but the more indexing and slicing is involved, the greater the risk of picking up the wrong content or running into the error list index out of range.
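To make that risk concrete, here is a small sketch that applies the word-slicing step from the split() answer to plain strings. The input strings imitate what get_text(strip=True) returns for one entry; the helper name slice_words and the authors "Mariam Balaban" and "A. B. Author" are hypothetical and used only for illustration.

```python
def slice_words(text):
    """The split/slice chain from the split() answer, applied to plain text."""
    return ' '.join(' '.join(text.split('"')).strip().split(':')[0].split()[3:-1])

# Three-word author: the [3:-1] slice happens to start at the title.
ok = slice_words(
    'W. Richard Bowen"Water engineering for the promotion of peace""1(2009)1-6""DOI: "'
)

# Two-word author (hypothetical): the first word of the title is cut off.
shifted = slice_words(
    'Mariam Balaban"Water engineering for the promotion of peace""1(2009)1-6""DOI: "'
)

# Title containing a colon: split(':')[0] truncates the title and the
# page range is lost entirely.
truncated = slice_words(
    'A. B. Author"Alternative primary energy for power desalting plants in Kuwait: '
    'the nuclear option I""1(2009)25-41""DOI: "'
)

print(ok)         # Water engineering for the promotion of peace 1(2009)1-6
print(shifted)    # engineering for the promotion of peace 1(2009)1-6
print(truncated)  # Alternative primary energy for power desalting plants in
```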

Therefore, I recommend keeping indexing to a minimum. The following approach is limited to the first two text nodes, which are always present as siblings after the author. It uses CSS selectors and .find_next_siblings():

for e in soup.select('.testo_normale i'):
    print(', '.join([s.strip() for s in e.find_next_siblings(string=True)[:2]]))

Example

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009').content, 'html.parser')

for e in soup.select('.testo_normale i'):
    print(', '.join([s.strip() for s in e.find_next_siblings(string=True)[:2]]))

Output

Editorial and Obituary for Sidney Loeb by Miriam Balaban, 1(2009)vii-viii
Water engineering for the promotion of peace, 1(2009)1-6
Modeling the permeate transient response to perturbations from steady state in a nanofiltration process, 1(2009)7-16
Modeling the effect of anti-scalant on CaCO3 precipitation in continuous flow, 1(2009)17-24
Alternative primary energy for power desalting plants in Kuwait: the nuclear option I, 1(2009)25-41
Alternative primary energy for power desalting plants in Kuwait: the nuclear
option II — The steam cycle and its combination with desalting units, 1(2009)42-57
Potential applications of quarry dolomite for post treatment of desalinated water, 1(2009)58-67
...
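
For trying this out without a network request, the same idea can be sketched against an inline snippet. The HTML below is a simplified stand-in for one entry of the page, written without the quotes shown in the question on the assumption that they do not appear in the real text nodes. The `if s.strip()` filter is an addition for this sketch: in indented markup like this, the whitespace between </i> and <br> is itself a text node and would otherwise take up one of the two slots. The names author, texts and rows are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<td valign="top" class="testo_normale">
    <font face="Geneva">
        <i>W. Richard Bowen</i>
        <br>
        Water engineering for the promotion of peace
        <br>
        1(2009)1-6
        <br>
        DOI:
        <br>
        Received:26/08/2008; Accepted: 25/11/2008;
    </font>
</td>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for author in soup.select('.testo_normale i'):
    # Collect the text nodes that follow the <i> tag, drop whitespace-only
    # nodes, then keep the first two (title and page range).
    texts = [s.strip() for s in author.find_next_siblings(string=True) if s.strip()]
    rows.append(texts[:2])

for row in rows:
    print(', '.join(row))
```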
Answered By: HedgeHog