How to print the text and certain specified tags of an XML file using BeautifulSoup
Question:
I’m parsing the XML of a Microsoft Word .docx file with BeautifulSoup. I’d like to be able to extract the text of the XML file while still printing certain tags that I choose.
I can get the text of the file easily with soup.text.
For example, for the following XML content, soup.text would output "Here is some text inserted into a documente":
<w:body>
<w:p>
<w:r>
<w:t>Here is some text</w:t>
</w:r>
<w:pPr>
<w:spacing w_line="480" w_lineRule="auto"/>
<w:jc w_val="center"/>
</w:pPr>
<w:ins w_id="2" w_author="Author">
<w:r>
<w:t>inserted</w:t>
</w:r>
</w:ins>
<w:spacing w_line="480" w_lineRule="auto"/>
<w:jc w_val="center"/>
<w:r w_rsidRPr="00406F87">
<w:t>into a document</w:t>
</w:r>
<w:del w_id="4" w_author="Author">
<w:r w_rsidRPr="00406F87" w_rsidDel="00B30E79">
<w:delText>e</w:delText>
</w:r>
</w:del>
</w:p>
</w:body>
However, I want the output to also include the <w:ins> and <w:del> tags. So it would look like this:
Here is some text <w:ins>inserted</w:ins> into a document<w:del>e</w:del>
Is there a way to accomplish this with Beautiful Soup? I’ve also considered just writing a regular expression to remove all the tags except the ones I want, but I’d like to see if Beautiful Soup can do this first.
I’ve tried finding the answer in the bs4 documentation and in other posts on Stack Overflow, but I’m coming up short.
Thank you for your help!
Answers:
Try:
from bs4 import BeautifulSoup
xml_doc = '''
<w:body>
<w:p>
<w:r>
<w:t>Here is some text</w:t>
</w:r>
<w:pPr>
<w:spacing w_line="480" w_lineRule="auto"/>
<w:jc w_val="center"/>
</w:pPr>
<w:ins w_id="2" w_author="Author">
<w:r>
<w:t>inserted</w:t>
</w:r>
</w:ins>
<w:spacing w_line="480" w_lineRule="auto"/>
<w:jc w_val="center"/>
<w:r w_rsidRPr="00406F87">
<w:t>into a document</w:t>
</w:r>
<w:del w_id="4" w_author="Author">
<w:r w_rsidRPr="00406F87" w_rsidDel="00B30E79">
<w:delText>e</w:delText>
</w:r>
</w:del>
</w:p>
</w:body>
'''
soup = BeautifulSoup(xml_doc, 'html.parser')
# Swap each <w:ins>/<w:del> element for a plain string that spells out the tag,
# so get_text() returns the markers as ordinary text instead of dropping them.
for t in soup.find_all(['w:ins', 'w:del']):
    t.replace_with(f'<{t.name}>{t.text.strip()}</{t.name}>')
print(soup.get_text(strip=True, separator=' '))
Prints:
Here is some text <w:ins>inserted</w:ins> into a document <w:del>e</w:del>
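If you would rather not modify the parsed tree, a recursive walk over the nodes is another option. This is only a sketch and not part of the answer above: text_with_tags and extract are hypothetical helper names, and it assumes the 'html.parser' backend, which lowercases tag names (so w:delText is seen as w:deltext).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def text_with_tags(node, keep):
    """Collect the text under `node`, wrapping tags named in `keep`
    in bare <name>...</name> markers (attributes are dropped)."""
    parts = []
    for child in node.children:
        if isinstance(child, NavigableString):
            # Plain text node: keep it as-is for now.
            parts.append(str(child))
        elif isinstance(child, Tag):
            inner = text_with_tags(child, keep)
            if child.name in keep:
                parts.append(f"<{child.name}>{inner.strip()}</{child.name}>")
            else:
                parts.append(inner)
    return "".join(parts)


def extract(xml, keep=("w:ins", "w:del")):
    """Parse `xml` and return its text with the `keep` tags preserved.
    Runs of whitespace (including newlines between elements) become one space."""
    soup = BeautifulSoup(xml, "html.parser")
    raw = text_with_tags(soup, set(keep))
    return " ".join(raw.split())


# Shortened version of the sample from the question.
sample = """
<w:p>
  <w:r><w:t>Here is some text</w:t></w:r>
  <w:ins w_id="2" w_author="Author"><w:r><w:t>inserted</w:t></w:r></w:ins>
  <w:r><w:t>into a document</w:t></w:r>
  <w:del w_id="4" w_author="Author"><w:r><w:delText>e</w:delText></w:r></w:del>
</w:p>
"""
result = extract(sample)
print(result)
# Here is some text <w:ins>inserted</w:ins> into a document <w:del>e</w:del>
```

The space before <w:del> comes from the whitespace between elements in the sample, which is collapsed to a single space. The soup itself is left untouched, so it can still be queried afterwards.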