How to scrape many children nested under a common tag with BeautifulSoup?

Question:

I am trying to use BeautifulSoup to scrape thesaurus.com so I can quickly find synonyms for certain words. However, the synonyms are in a list whose ids and classes differ per word, so the best I can do is access a grandparent that is the same for every word. Here is a simplified example:

<div data-testid="same_between_words">
    <ul class="different_between_words">
        <li>
            <a data-linkid="same_between_words_2">Word 1</a>
        </li>
        <li>
            <a data-linkid="same_between_words_2">Word 2</a>
        </li>
    </ul>
</div>

There are also similar words, which are fine to include if necessary, and antonyms, which are obviously not fine to include. In case it matters, the synonym links share the same data-linkid with each other and across different words, but the antonyms use that same data-linkid too, so I haven’t gotten that to work. My current code is:

from bs4 import BeautifulSoup
import requests

url = "https://www.thesaurus.com/browse/EXAMPLE WORD"
page = requests.get(url)
html = page.text

soup = BeautifulSoup(html,"html.parser")
ele = soup.find('div', attrs={'data-testid': 'word-grid-container'})
syn = ele.findChildren('ul', recursive=False)
print(syn)

That gives all of the HTML for the data-testid element in one big mess. Adding .text doesn’t seem to work: the error says I’m treating a list of results like a single one (which I don’t think I am, since I’m not using find_all). Besides, I think adding it would only give me the first synonym, which isn’t ideal.

I’d like to get a list of synonyms for a word. I’ve managed to get one big string with all the words, but I would love to have them in a list I can work with, since some synonyms contain spaces (like ‘fine and dandy’ for ‘good’), so I can’t just split the string on spaces.

Asked By: tlars25

||

Answers:

Each word is in an a tag with a font-weight="inherit" attribute, so you can select on that, or simply select all a tags inside the container.

from bs4 import BeautifulSoup
import requests

url = "https://www.thesaurus.com/browse/smile"
page = requests.get(url)
html = page.text

soup = BeautifulSoup(html,"html.parser")
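# Narrower alternative: only the anchors that carry font-weight="inherit"
# words = soup.select_one('div[data-testid="word-grid-container"]').select('a[font-weight="inherit"]')
# Select every <a> inside the word grid container
words = soup.select_one('div[data-testid="word-grid-container"]').select('a')
for word in words:
    print(word.get_text())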
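
Since the question asks for a list rather than printed lines, the same selection can be collected with a list comprehension. The following is only a minimal sketch: it runs offline against hypothetical markup modeled on the question's simplified example (the live page may be structured differently), and the helper name extract_words is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's simplified example;
# the real thesaurus.com page may look different.
SAMPLE_HTML = """
<div data-testid="word-grid-container">
    <ul class="different_between_words">
        <li><a data-linkid="same_between_words_2">fine and dandy</a></li>
        <li><a data-linkid="same_between_words_2">Word 2</a></li>
    </ul>
</div>
"""


def extract_words(html, testid="word-grid-container"):
    """Return the text of every <a> inside the div with the given data-testid."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(f'div[data-testid="{testid}"]')
    if container is None:
        # select_one returns None when nothing matches, so avoid calling .select on it
        return []
    # One list entry per link, so multi-word synonyms stay intact
    return [a.get_text(strip=True) for a in container.select("a")]


synonyms = extract_words(SAMPLE_HTML)
print(synonyms)
```

Because each entry comes from its own tag, a phrase such as 'fine and dandy' remains a single list item and no splitting on spaces is needed.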
Answered By: lex

You are close to your goal. As a general idea, try to select by things that stay static, such as an id or the HTML structure; CSS selectors can make this more convenient.

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.thesaurus.com/browse/idea"
page = requests.get(url)
html = page.text

soup = BeautifulSoup(html,"html.parser")
print([e.get_text(strip=True) for e in soup.select('#meanings ul li>a')])

Output

['belief', 'concept', 'conclusion', 'design', 'feeling', 'form', 'intention', 'interpretation', 'meaning', 'notion', 'objective', 'opinion', 'perception', 'plan', 'scheme', 'sense', 'solution', 'suggestion', 'theory', 'thought', 'understanding', 'view', 'aim', 'approximation', 'brainstorm', 'clue', 'conception', 'conviction', 'doctrine', 'end', 'essence', 'estimate', 'fancy', 'flash', 'guess', 'hint', 'hypothesis', 'import', 'impression', 'inkling', 'intimation', 'judgment', 'object', 'pattern', 'purpose', 'reason', 'significance', 'suspicion', 'teaching', 'viewpoint', 'believed abstraction']
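
As for why adding .text to the code in the question fails: findChildren, like find_all, returns a list-like ResultSet even when only one element matches, so the results have to be iterated. Here is a small offline sketch using the simplified markup from the question (not the live page), with a structural CSS selector shown next to it for comparison.

```python
from bs4 import BeautifulSoup

# The simplified markup from the question, not the real page
SAMPLE_HTML = """<div data-testid="same_between_words">
    <ul class="different_between_words">
        <li><a data-linkid="same_between_words_2">Word 1</a></li>
        <li><a data-linkid="same_between_words_2">Word 2</a></li>
    </ul>
</div>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
ele = soup.find("div", attrs={"data-testid": "same_between_words"})

# find_all (and its older alias findChildren) returns a ResultSet,
# i.e. a list of tags, even if there is just one direct <ul> child
lists = ele.find_all("ul", recursive=False)

try:
    lists.text  # a ResultSet has no .text, only the individual tags do
    text_failed = False
except AttributeError:
    text_failed = True

# Iterate over the ResultSet and collect the text of each link
words = [a.get_text(strip=True) for ul in lists for a in ul.find_all("a")]

# The same result by describing the structure with CSS child combinators
css_words = [
    a.get_text(strip=True)
    for a in soup.select('div[data-testid="same_between_words"] > ul > li > a')
]

print(words)
print(css_words)
```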
Answered By: HedgeHog