Extracting text from XML using Python
Question:
I have this example XML file:
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>
I would like to extract the contents of the title tags and content tags.
Which method is better for extracting the data: pattern matching or an XML module? Or is there a better way altogether?
Answers:
Python already ships with a built-in XML library, ElementTree. For example (cElementTree was removed in Python 3.9, so import ElementTree directly):
>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
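Note that the XML in the question has no single root element, so ET.fromstring would reject it as written, which is why the example above adds <root>. A minimal sketch of doing that wrapping in code, assuming the fragment has no XML declaration or DOCTYPE of its own (the function name extract_pages is made up for illustration):

```python
from xml.etree import ElementTree as ET

# The fragment from the question: two <page> elements, no enclosing root.
fragment = """
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>
"""


def extract_pages(xml_fragment):
    """Return (title, content) tuples for every <page> in a rootless fragment."""
    # Wrapping in a synthetic <root> is an assumption; it only works when the
    # fragment carries no XML declaration or DOCTYPE of its own.
    root = ET.fromstring("<root>%s</root>" % xml_fragment)
    # findtext() returns the default instead of raising when a child is missing.
    return [
        (page.findtext("title", default=""), page.findtext("content", default=""))
        for page in root.iter("page")
    ]


pages = extract_pages(fragment)
print(pages)
```

Using findtext() instead of find(...).text avoids an AttributeError on pages that lack a title or content element.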
I personally prefer parsing with xml.dom.minidom, like so:
In [1]: import xml.dom.minidom
In [2]: x = """
<root><page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page></root>"""
In [3]: doc = xml.dom.minidom.parseString(x)
In [4]: doc.getElementsByTagName("page")
Out[4]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]
In [5]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[5]: ['Chapter 1', 'Chapter 2']
In [6]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[6]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']
In [7]: for node in doc.childNodes:              # the <root> element
   ...:     for cn in node.childNodes:           # <page> elements and whitespace
   ...:         for cn2 in cn.childNodes:        # <title>/<content> and whitespace
   ...:             for cn3 in cn2.childNodes:   # the text inside each element
   ...:                 if cn3.nodeType == cn3.TEXT_NODE:
   ...:                     print(cn3.wholeText)
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
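The list comprehensions above rely on firstChild, which is None for an empty element such as <title/>. As a rough sketch of a more defensive variant, the two helpers below (element_text and pages_as_dicts are hypothetical names, not part of minidom) collect the text per page and return None for missing elements:

```python
import xml.dom.minidom


def element_text(element):
    """Concatenate the direct text-node children of a minidom element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def pages_as_dicts(xml_text):
    """Return one {'title': ..., 'content': ...} dict per <page> element."""
    doc = xml.dom.minidom.parseString(xml_text)
    result = []
    for page in doc.getElementsByTagName("page"):
        record = {}
        for tag in ("title", "content"):
            matches = page.getElementsByTagName(tag)
            # Missing elements become None rather than raising IndexError.
            record[tag] = element_text(matches[0]) if matches else None
        result.append(record)
    return result


sample = (
    "<root><page><title>Chapter 1</title>"
    "<content>Welcome to Chapter 1</content></page></root>"
)
records = pages_as_dicts(sample)
print(records)
```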
You can also try this code to extract the text with BeautifulSoup:
from bs4 import BeautifulSoup
data = """<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>"""
soup = BeautifulSoup(data, "html.parser")
########### Title #############
title = []
for tag in soup.find_all("title"):
    title.append(tag.get_text())
########### Content #############
content = []
for tag in soup.find_all("content"):
    content.append(tag.get_text())
# Pair each title with the content at the same position.
doc1 = list(zip(title, content))
for pair in doc1:
    print(pair)
Output:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
Code (this assumes test.xml wraps the <page> elements in a single root element):
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml")
root = tree.getroot()
for page in root.findall('page'):
    print("Title:", page.find('title').text)
    print("Content:", page.find('content').text)
Output:
Title: Chapter 1
Content: Welcome to Chapter 1
Title: Chapter 2
Content: Welcome to Chapter 2
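If the file is large, loading the whole tree with ET.parse may use more memory than necessary. A possible alternative, sketched here under the assumption that each <page> is small and can be discarded once read, is ET.iterparse (the generator name iter_pages and the in-memory sample standing in for test.xml are illustrative only):

```python
import io
from xml.etree import ElementTree as ET

# In-memory stand-in for test.xml; a filename can be passed to iterparse instead.
xml_bytes = b"""<root>
<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
<page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>
</root>"""


def iter_pages(source):
    """Yield (title, content) for each <page>, one page at a time."""
    # An element is reported on its "end" event, once its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title"), elem.findtext("content")
            elem.clear()  # drop the page's children once they have been read


streamed = list(iter_pages(io.BytesIO(xml_bytes)))
print(streamed)
```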
For working with XML or HTML data (navigating, searching, and modifying it), I have found the BeautifulSoup library very useful. For installation problems or more detail, see the BeautifulSoup documentation.
To read attribute values, including elements where an attribute is missing:
from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""
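# features="xml" requires the lxml package to be installed.
soup = BeautifulSoup(data, features="xml")
page_tag = soup.find_all('page')
for each_page in page_tag:
    text_tag = each_page.find_all('text')
    for text_data in text_tag:
        # Collapse the line break inside the first <text> element.
        print("Text :", " ".join(text_data.text.split()))
        # get() returns None when the attribute is absent.
        print("Left attribute :", text_data.get("left"))
Output:
Text : PALS SOCIETY OF CANADA
Left attribute : 135
Text : 13479 77 AVE
Left attribute : None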
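For readers who would rather stay within the standard library, roughly the same attribute lookup can be sketched with ElementTree. This is an illustration, not the answer's method; the XML declaration and DOCTYPE lines are left out of the sample here for brevity:

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)
rows = []
for text_elem in root.iter("text"):
    # Collapse internal line breaks so multi-line text reads as one string.
    text = " ".join((text_elem.text or "").split())
    # Element.get() returns None when the attribute is absent.
    rows.append((text, text_elem.get("left"), text_elem.get("top")))

for row in rows:
    print(row)
```

Attribute values come back as strings, so convert them with int() if numeric comparisons are needed.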
I can recommend a simple library, simplified_scrapy. Examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>'''
doc = SimplifiedDoc(html)
pages = doc.pages
print ([(page.title.text,page.content.text) for page in pages])
Result:
[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]