Scraping: How to exclude a specific tag with bs4

Question

I hope you’re well. Do you know how I can exclude a specific tag when scraping?

# Retrieve the ingredients
try:
    ingredientsdiv = soup.find("div", class_="c-recipe-ingredients")
    ingredientsbloc = ingredientsdiv.find("ul", class_="c-recipe-ingredients__list")
    ingredients = [re.findall(r'^(?:(\d+)\s([^\W\d_]*))?(.*)', item.text.replace("\n", "").strip()) for item in ingredientsbloc.find_all("li", {"class": ""})]
except Exception as e:
    ingredients = None

Here is the HTML:

<div class="c-recipe-ingredients"><ul class="c-recipe-ingredients__list" data-id="258101"><li>10 cl de lait de coco</li><li>1 cuillère à café de poivre vert</li><li>Huile de pépin de raisin</li><li>Fleur de sel</li><li>4 brins de menthe</li><li>2 c&amp;œligurs de laitue</li><li>4 citrons verts</li><li>12 tomates cerise</li><li>4 oignons nouveaux</li><li>600 g de filets de bar                                <span class="c-recipe-ingredients__item--sponso u-relative"><span><a target="_blank" href="https://www.pourdebon.com/bar-sauvage-d38?utm_source=750g&amp;utm_medium=autopromo&amp;utm_content=Top10_750g_Autopromo&amp;utm_campaign=750g_autopromo_recette" class="u-some-link u-color-pourdebon xXx" onclick="ga('send', 'event', 'autopromo-pdb-ingredient', 'clic', '600\x20g\x20de\x20filets\x20de\x20bar')">                                            En direct des producteurs sur
                                            <img src="/bundles/cuisinewebsite/img/partner/logo-pourdebon.png" alt="Logo Pourdebon" itemprop="logo"></a></span></span><script>
                                    document.addEventListener('DOMContentLoaded', function() {
                                        ga('send', 'event', 'autopromo-pdb-ingredient', 'view', '600\x20g\x20de\x20filets\x20de\x20bar', {
                                            nonInteraction : true
                                        });
                                    });
                                </script></li></ul></div>

It contains a sponsored link, like this:

<a target="_blank" href="" class="u-some-link u-color-pourdebon xXx" onclick="ga('send', 'event', 'autopromo-pdb-ingredient', 'clic', '600\x20g\x20de\x20filets\x20de\x20bar')">                                            Lorem ipsum 
                                            <img src="/bundles/cuisinewebsite/img/partner/logo-pourdebon.png" alt="Logo Pourdebon" itemprop="logo"></a>

I would like to exclude the sponsored link's text from my scraped output (a JSON file) 🙂 Do you have any idea?

Asked By: Louis

Answer 1

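What exactly do you want to extract from the li tags?
If you want the text contained in each li tag, read its .text attribute (it is an attribute, not a method, and it must be read on each item because find_all returns a list):

[item.text for item in ingredientsbloc.find_all("li", {"class": ""})]

This strips the tags themselves and returns only the text content. Note that .text also includes the text of nested elements, so the label of the sponsored link is still part of the result.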
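
A minimal sketch of what .text returns for each li, assuming a simplified, hypothetical fragment in place of the real page (the "Sponsored label" link is made up for illustration). It shows that the text of a nested link is kept:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment standing in for the ingredient list.
html = (
    '<ul><li>4 oignons nouveaux</li>'
    '<li>600 g de filets de bar <span><a href="#">Sponsored label</a></span></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of tags, so .text is read on each item in turn.
texts = [item.text for item in soup.find_all("li")]
print(texts)
# ['4 oignons nouveaux', '600 g de filets de bar Sponsored label']
```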

Answered By: TechyKajal

Answer 2

If you want the text of an element such as li but not the text of the nested a and script elements, work with its NavigableString children.

So, instead of .text, use this function:

import bs4

...

def get_only_text(elem):
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item
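
As a quick check, here is a minimal sketch that runs the function on a simplified fragment. The markup, the "Sponsored label" text, and the track() call are illustrative assumptions, not the real page:

```python
import bs4
from bs4 import BeautifulSoup


def get_only_text(elem):
    # Yield only the strings that are direct children of elem,
    # skipping nested tags such as span, a, and script.
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item


# Hypothetical fragment: one ingredient followed by a sponsor link and a script.
html = (
    '<ul><li>600 g de filets de bar '
    '<span><a href="#">Sponsored label</a></span>'
    '<script>track();</script></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

direct_text = "".join(get_only_text(li)).strip()
print(direct_text)  # 600 g de filets de bar
```

Only the string sitting directly inside the li survives; anything wrapped in a child tag is skipped.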

Then call this function and join the generator's output to get the final string:

ingredients = [re.findall(r'^(?:(\d+)\s([^\W\d_]*))?(.*)', ''.join(get_only_text(item)).strip()) for item in ingredientsbloc.find_all("li", {"class": ""})]

Output of ingredients:

[[('10', 'cl', ' de lait de coco')],
 [('1', 'cuillère', ' à café de poivre vert')],
 [('', '', 'Huile de pépin de raisin')],
 [('', '', 'Fleur de sel')],
 [('4', 'brins', ' de menthe')],
 [('2', 'c', '&œligurs de laitue')],
 [('4', 'citrons', ' verts')],
 [('12', 'tomates', ' cerise')],
 [('4', 'oignons', ' nouveaux')],
 [('600', 'g', ' de filets de bar')]]
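
A different technique, sketched here as an alternative rather than as part of the approach above, is to remove the unwanted tags from the tree with decompose() before reading the text. The fragment below is a simplified, hypothetical version of the page; only the class names are taken from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment; the link label and track() are made up.
html = """<ul class="c-recipe-ingredients__list">
<li>4 oignons nouveaux</li>
<li>600 g de filets de bar <span class="c-recipe-ingredients__item--sponso u-relative"><a href="#">Sponsored label</a></span><script>track();</script></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
ingredients_list = soup.find("ul", class_="c-recipe-ingredients__list")

# Drop the sponsor spans and any scripts from the parsed tree.
for sponsor in ingredients_list.find_all("span", class_="c-recipe-ingredients__item--sponso"):
    sponsor.decompose()
for script in ingredients_list.find_all("script"):
    script.decompose()

# With the unwanted tags gone, the remaining text can be read normally.
cleaned = [li.get_text(strip=True) for li in ingredients_list.find_all("li")]
print(cleaned)  # ['4 oignons nouveaux', '600 g de filets de bar']
```

Unlike the NavigableString approach, this keeps text inside any other nested tags (for example emphasis markup) and removes only the tags named explicitly. It also modifies the parsed tree in place.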

Answered By: Chase
