Get all HTML tags with Beautiful Soup

Question:

I am trying to get a list of all HTML tags from Beautiful Soup.

I see find_all(), but I have to know the name of the tag before I search.

If there is text like

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but am trying to learn BS4

Asked By: humanbeing


Answers:

You don’t have to specify any arguments to find_all(). With no arguments, BeautifulSoup finds every tag in the tree, recursively.

Sample:

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")

print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']

print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
Answered By: alecxe
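
The question asks for the opening tags only, while str(tag) also includes each tag's contents and closing tag. One possible way to get closer to the requested list is to rebuild each opening tag from tag.name and tag.attrs. The sketch below is an illustration, not part of the answer above: opening_tag is a hypothetical helper, and it assumes attribute values contain no single quotes.

```python
from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""


def opening_tag(tag):
    """Rebuild an opening tag string from the tag's name and attributes (hypothetical helper)."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists
        if isinstance(value, list):
            value = " ".join(value)
        # Assumption: the value contains no single quotes
        parts.append(f"{key}='{value}'")
    return "<" + " ".join(parts) + ">"


soup = BeautifulSoup(html, "html.parser")
list_of_tags = [opening_tag(tag) for tag in soup.find_all()]
print(list_of_tags)
# ['<div>', '<div>', "<div class='magical'>", '<p>']
```

The strings are reconstructed from the parsed tree, so quoting and whitespace may differ from the original markup.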

I thought I’d share my solution to a very similar question for those that find themselves here, later.

Example

I needed to find all tags quickly but only wanted unique values. I’ll use the Python calendar module to demonstrate.

We’ll generate an html calendar then parse it, finding all and only those unique tags present.

The structure below is very similar to the answer above, but collects the tag names into a set:

from bs4 import BeautifulSoup
import calendar

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
unique_tags = {tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all()}
print(unique_tags)

# Result (set order may vary)
# {'table', 'td', 'th', 'tr'}
Answered By: Jason R Stevens CFA
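
If the number of occurrences matters as well as the unique names, the same generator can be fed to collections.Counter instead of a set. This is a minimal sketch that reuses the calendar example; the variable names are illustrative.

```python
import calendar
from collections import Counter

from bs4 import BeautifulSoup

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
soup = BeautifulSoup(html_cal, "html.parser")

# Count how many times each tag name occurs in the parsed tree
tag_counts = Counter(tag.name for tag in soup.find_all())

# Print the tag names from most to least frequent
for name, count in tag_counts.most_common():
    print(name, count)
```

The keys of tag_counts are the same unique names that the set version returns.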

Please try the following:

# find_all(True) matches every tag; findAll is the older alias for the same method
for tag in soup.find_all(True):
    print(tag.name)
Answered By: Anjan
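
To make the behaviour concrete, the sketch below (with made-up sample markup) checks that find_all(True) and find_all() return the same tags, that nested tags come back in document order, and that recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html = "<div><p>one <b>two</b></p></div><span>three</span>"
soup = BeautifulSoup(html, "html.parser")

names_default = [tag.name for tag in soup.find_all()]
names_true = [tag.name for tag in soup.find_all(True)]
print(names_default)                # ['div', 'p', 'b', 'span']
print(names_default == names_true)  # True

# recursive=False only looks at the direct children of the object being searched
top_level = [tag.name for tag in soup.find_all(recursive=False)]
print(top_level)                    # ['div', 'span']
```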

Here is a function that I use to extract the text from HTML and plain-text documents:

from bs4 import BeautifulSoup
from tqdm import tqdm

# get_list_of_files() is a helper from the author's own code (not shown here);
# it is assumed to return the paths of the files under `path`.


def parse_docs(path, format, tags):
    """
    Parse the files in path, in html or txt format, and extract their text content.
    Returns a list of strings, where every string is the text content of one document.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """

    docs = []
    if format == "html":
        for document in tqdm(get_list_of_files(path)):
            with open(document, encoding='utf-8') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')
            # Join the text of each tag whose name is in `tags`, one tag per line
            text = '\n'.join(''.join(s.find_all(string=True)) for s in soup.find_all(tags))
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            with open(document, encoding='utf-8') as f:
                docs.append(f.read())
    return docs

A simple call such as parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']) returns a list of text strings. Note that 'h' only matches a literal <h> tag, so pass 'h1' to 'h6' if you want headings.

Answered By: Belkacem Thiziri
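
When headings should be part of the tag list, the names h1 through h6 have to be listed explicitly, because find_all() compares each list entry against the full tag name. A minimal sketch with made-up sample markup, independent of the parse_docs function above:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><div>Body</div><span>skip</span>"

# Build the heading names h1..h6 and combine them with the other tags of interest
heading_tags = [f"h{level}" for level in range(1, 7)]
tags = ["p", "div"] + heading_tags

soup = BeautifulSoup(html, "html.parser")
text = "\n".join(tag.get_text() for tag in soup.find_all(tags))
print(text)
# Title
# Intro
# Section
# Body
```

If the selected tags are nested inside each other (for example a p inside a div), the inner text appears once for each matching ancestor, so the tag list is worth choosing with that in mind.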

If you want to find some specific HTML tags then try this:

# driver is assumed to be an existing Selenium WebDriver instance
html = driver.page_source
# driver.page_source: "<div>something</div>\n<div>something else</div>\n<div class='magical'>hi there</div>\n<p>ok</p>\n"
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(['a','div']):  # Mention HTML tag names here.
    print(tag.text)

# Result:
# something
# something else
# hi there
Answered By: Amar Kumar