How to extract tag offsets in an XML document using Python BeautifulSoup

Question:

I need help finding the text offsets of certain tags in an XML document. My data set follows the format illustrated below: the ROOT element contains several RECORDs, and each RECORD contains exactly one TEXT element. The text may contain several TAG elements that annotate parts of it. Using Python, I need to convert these annotations to another format that requires the begin and end offsets of each tag.

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
        </TEXT>
    </RECORD>
</ROOT>

Basically, I would like to convert the format above into the following:

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at December 29th to illustrate the problem.
        </TEXT>
        <TAG TYPE="DATE" BEGIN="36" END="49"/>
    </RECORD>
</ROOT>

I’ve tried using BeautifulSoup but could not find a way to extract the tag offsets. Any ideas?

Asked By: jaxah

Answers:

Using lxml.etree:

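from lxml import etree

# `data` is the XML document from the question, as a string.
root = etree.fromstring(data)

# Copy the TEXT elements into a list first, because the tree is modified below.
for text_el in list(root.iter("TEXT")):
    record = text_el.getparent()
    # Record each TAG's type and text before strip_tags() removes the elements.
    tags = [(t.get("TYPE"), t.text.strip()) for t in text_el.iter("TAG")]
    etree.strip_tags(text_el, "TAG")
    p_text = text_el.text.strip()

    position = record.index(text_el) + 1
    for tag_type, tag_text in tags:
        # find() returns the first match; offsets are relative to the stripped text.
        begin = p_text.find(tag_text)
        end = begin + len(tag_text)
        # Build a new element per annotation; a single reused element would only be moved.
        new_tag = etree.Element("TAG", TYPE=tag_type)
        new_tag.set("BEGIN", str(begin))
        new_tag.set("END", str(end))
        record.insert(position, new_tag)
        position += 1

print(etree.tostring(root).decode())

Output:
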
<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at December 29th to illustrate the problem.
        </TEXT>
    <TAG TYPE="DATE" BEGIN="35" END="48"/></RECORD>
</ROOT>
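
A quick way to sanity-check offsets like these is to slice the text with them. The sketch below is only illustrative (the variable names are not part of the answer). It shows that 35 and 48 index into the stripped sentence, and that keeping a single leading newline shifts both values by one, to 36 and 49.

```python
# Illustrative check: offsets only make sense relative to one specific string.
sentence = "This is an example text written at December 29th to illustrate the problem."
annotated = "December 29th"

begin = sentence.find(annotated)
end = begin + len(annotated)
print(begin, end)           # 35 48
print(sentence[begin:end])  # December 29th

# Keeping one leading newline shifts both offsets by one.
with_newline = "\n" + sentence
shifted = with_newline.find(annotated)
print(shifted, shifted + len(annotated))  # 36 49
```

Whichever convention is chosen, the consumer of the converted format has to apply the same whitespace handling to the text.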

Answered By: Vivek Sable

The idea is to iterate over all TEXT nodes, find the TAG nodes inside each one, get the position of each TAG's text within the TEXT's text, create a new tag at the RECORD level, and then unwrap() the TAG from TEXT:

from bs4 import BeautifulSoup

data = """
<ROOT>
    <RECORD ID="123">
        <TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
        </TEXT>
    </RECORD>
</ROOT>
"""

# The "xml" parser requires lxml to be installed.
soup = BeautifulSoup(data, "xml")

for text in soup.find_all('TEXT'):
    record = text.parent
    for tag in text.find_all('TAG'):
        # index() returns the first occurrence of the tag's text in TEXT.
        begin = text.text.index(tag.text)
        end = begin + len(tag.text)

        # Add an empty TAG carrying the original TYPE plus the offsets to RECORD.
        record.append(soup.new_tag(tag.name, TYPE=tag['TYPE'], BEGIN=str(begin), END=str(end)))

        # Replace the inline TAG with its contents.
        tag.unwrap()

print(soup)

Prints:

<?xml version="1.0" encoding="utf-8"?>
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at December 29th to illustrate the problem.
        </TEXT>
<TAG BEGIN="36" END="49" TYPE="DATE"/></RECORD>
</ROOT>

Note: I haven't tested this with multiple TAGs inside a single TEXT, but it should at least give you a starting point.
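
For the untested case of several TAGs in one TEXT, searching by string can report the wrong position when the same text is annotated twice, because index() returns the first match. One possible alternative is to add up lengths while walking the children of TEXT. It is sketched here with the standard library's xml.etree.ElementTree rather than BeautifulSoup. The helper name extract_offsets and the sample string are made up for illustration, and the sketch assumes TAG elements are direct, non-nested children of TEXT.

```python
import xml.etree.ElementTree as ET


def extract_offsets(text_el):
    """Return (plain_text, [(type, begin, end), ...]) for one TEXT element.

    Offsets are counted in the unstripped text of the element.
    """
    parts = [text_el.text or ""]
    offset = len(parts[0])
    spans = []
    for child in text_el:  # direct TAG children, in document order
        inner = "".join(child.itertext())
        spans.append((child.get("TYPE"), offset, offset + len(inner)))
        tail = child.tail or ""
        parts.extend([inner, tail])
        offset += len(inner) + len(tail)
    return "".join(parts), spans


# Hypothetical sample in which the same string is annotated twice.
sample = (
    '<TEXT>From <TAG TYPE="DATE">May 1</TAG> until '
    '<TAG TYPE="DATE">May 1</TAG> next year.</TEXT>'
)
plain, spans = extract_offsets(ET.fromstring(sample))
print(plain)  # From May 1 until May 1 next year.
print(spans)  # [('DATE', 5, 10), ('DATE', 17, 22)]
```

Because the offsets come from accumulated lengths and not from a search, each annotation gets its own position even when the annotated strings are identical.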

Answered By: alecxe