How to print the text and certain specified tags of an XML file using BeautifulSoup

Question:

I’m parsing the XML of a Microsoft Word .docx file with BeautifulSoup. I’d like to be able to extract the text of the XML file while still printing certain tags that I choose.

I can get the text of the file easily with soup.text.

For example, for the following XML content, soup.text would output (ignoring whitespace) "Here is some text inserted into a documente", where the trailing "e" is the deleted text:

<w:body>
    <w:p>
        <w:r>
            <w:t>Here is some text</w:t>
        </w:r>
    
        <w:pPr>
            <w:spacing w_line="480" w_lineRule="auto"/>
            <w:jc w_val="center"/>
        </w:pPr>
    
        <w:ins w_id="2" w_author="Author">
            <w:r>
                <w:t>inserted</w:t>
            </w:r>
        </w:ins>
    
        <w:spacing w_line="480" w_lineRule="auto"/>
    
        <w:jc w_val="center"/>
    
        <w:r w_rsidRPr="00406F87">
            <w:t>into a document</w:t>
        </w:r>
    
        <w:del w_id="4" w_author="Author">
            <w:r w_rsidRPr="00406F87" w_rsidDel="00B30E79">
                <w:delText>e</w:delText>
            </w:r>
        </w:del>
    </w:p>
</w:body>

However, I want the output to also include the <w:ins> and <w:del> tags, so it would look like this:

Here is some text <w:ins>inserted</w:ins> into a document<w:del>e</w:del>

Is there a way to accomplish this with Beautiful Soup? I’ve also considered just writing a regular expression to remove all the tags except the ones I want, but I’d like to see if Beautiful Soup can do this first.

I’ve tried finding the answer by looking at the bs4 documentation as well as other posts on StackOverflow, but I’m coming up short.

Thank you for your help!

Asked By: Jordan Smith


Answers:

Try:

from bs4 import BeautifulSoup

xml_doc = '''
<w:body>
    <w:p>
        <w:r>
            <w:t>Here is some text</w:t>
        </w:r>

        <w:pPr>
            <w:spacing w_line="480" w_lineRule="auto"/>
            <w:jc w_val="center"/>
        </w:pPr>

        <w:ins w_id="2" w_author="Author">
            <w:r>
                <w:t>inserted</w:t>
            </w:r>
        </w:ins>

        <w:spacing w_line="480" w_lineRule="auto"/>

        <w:jc w_val="center"/>

        <w:r w_rsidRPr="00406F87">
            <w:t>into a document</w:t>
        </w:r>

        <w:del w_id="4" w_author="Author">
            <w:r w_rsidRPr="00406F87" w_rsidDel="00B30E79">
                <w:delText>e</w:delText>
            </w:r>
        </w:del>
    </w:p>
</w:body>
'''

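# html.parser keeps prefixed names such as "w:ins" as ordinary tag names
soup = BeautifulSoup(xml_doc, 'html.parser')

# Replace each <w:ins>/<w:del> element with a same-named tag whose only
# content is the escaped markup, so get_text() returns the tags literally
for t in soup.find_all(['w:ins', 'w:del']):
    t.replace_with(BeautifulSoup(f'<{t.name}>&lt;{t.name}&gt;{t.text.strip()}&lt;/{t.name}&gt;</{t.name}>', 'html.parser'))

# Join the remaining strings with single spaces, dropping blank ones
print(soup.get_text(strip=True, separator=' '))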

Prints:

Here is some text <w:ins>inserted</w:ins> into a document <w:del>e</w:del>
Answered By: Andrej Kesely
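
A possible variation on the answer above, offered only as a minimal sketch: instead of parsing a second BeautifulSoup fragment for each element, the element can be swapped for a plain string, which replace_with() also accepts. The helper name text_with_tags and its keep parameter are hypothetical (they are not part of bs4), and the shortened sample below is an assumption modelled on the XML in the question.

```python
from bs4 import BeautifulSoup


def text_with_tags(xml, keep=("w:ins", "w:del")):
    """Return the text of `xml`, keeping the tags named in `keep` as literal markers.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(xml, "html.parser")
    for tag in soup.find_all(list(keep)):
        # Collect the element's own text, then swap the whole element
        # for a single plain string such as "<w:ins>inserted</w:ins>".
        inner = tag.get_text(" ", strip=True)
        tag.replace_with(f"<{tag.name}>{inner}</{tag.name}>")
    # Strip each remaining string and join the non-empty ones with spaces.
    return soup.get_text(separator=" ", strip=True)


# Shortened sample, assumed to mirror the structure in the question.
sample = """
<w:p>
  <w:r><w:t>Here is some text</w:t></w:r>
  <w:ins w_id="2"><w:r><w:t>inserted</w:t></w:r></w:ins>
  <w:r><w:t>into a document</w:t></w:r>
  <w:del w_id="4"><w:r><w:delText>e</w:delText></w:r></w:del>
</w:p>
"""

result = text_with_tags(sample)
print(result)
```

On this sample it should print the same line as the answer above. Because the replacement is a string rather than parsed markup, no &lt;/&gt; escaping is needed; get_text() returns the string exactly as it was inserted.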