Parsing XML node and substring from value
Question:
I’m trying to parse an XML file to find the path of an image file and store that path somewhere (so I can resize the image later), but for now I’m stuck on how to get the path out of a node’s text value.
So far my code is:
import glob
import os
import re
import sys
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

pathNow = 'C:\\'
textPath = []
items = []

# change path directory
for item in sys.argv[1:]:
    items.append(item)
newPath = pathNow + items[0]
os.chdir(newPath)
print("New path is:" + newPath)
# end

# get argument for location
for item in items:
    docxml = items[1]
# docxml = sys.argv[2:]
print(docxml)

# search for file
for file in glob.glob(docxml + ".xml"):
    tree = ET.parse(file)
    rootFile = tree.getroot()
    for i in rootFile.iter('TextElement'):
        if "src" in i.text:
            textPath = i.text.split('src="')
            print(textPath)

The last two lines are where I get stuck: I manage to find the img tags and store them, but how do I get the src="(value)" from the XML?
Here is an XML for testing:
<catalog>
  <book id="bk101">
    <TextElement name="cme_cmb_acd_chart_tit_1" elementId="1375" max_word_count="0" displayName="cme_cmb_acd_chart_tit_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_tit_1">
      <![CDATA[
      <h1>S&P 500 EPS growth ex-energy remains solid</h1>
      ]]>
    </TextElement>
    <TextElement name="cme_cmb_acd_chart_sub_1" elementId="1374" max_word_count="0" displayName="cme_cmb_acd_chart_sub_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_sub_1">
      <![CDATA[ S&P 500 EPS ex-energy growth, year over year ]]>
    </TextElement>
    <TextElement name="cme_cmb_acd_chart_image_1" elementId="1371" max_word_count="0" displayName="cme_cmb_acd_chart_image_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_image_1">
      <![CDATA[
      <img pdf="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" src="https://nas/web/image_upload//image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/>
      ]]>
    </TextElement>
    <TextElement name="cme_cmb_acd_chart_src_1" elementId="1373" max_word_count="0" displayName="cme_cmb_acd_chart_src_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_src_1">
      <![CDATA[
      <h3><img pdf="/nas/web/clients/ubsprod/images/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/clients/ubsprod/images/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.svg" src="https:///nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/></h3><br/>
      ]]>
    </TextElement>
  </book>
</catalog>
How can I get the value inside src="..."? I’ve run out of ideas and knowledge.
Answers:
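In your case, the img markup sits inside a CDATA section, so it reaches you as plain text and you can use a regular expression to pull the value out of the string. Replace the code inside your if statement with this:
try:
    # [^"]+ stops at the closing quote, so src need not be the last attribute
    textPath = re.search(r'src="([^"]+)"', i.text).group(1)
except AttributeError:
    textPath = ''  # no match found; apply your own error handling
print(textPath)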
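To make the regular-expression approach easier to try outside the original script, here is a minimal, self-contained sketch. The trimmed-down sample XML, the example.invalid URL, and the names SRC_PATTERN and extract_src_values are assumptions made for illustration and are not part of the original question. It also guards against elements whose text is None, which the snippet above does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample: the <img> markup sits inside CDATA,
# so ElementTree hands it back as plain text, not as child elements.
SAMPLE = """<catalog>
  <book id="bk101">
    <TextElement name="title"><![CDATA[<h1>No image here</h1>]]></TextElement>
    <TextElement name="image"><![CDATA[
      <img pdf="/nas/a.pdf" src="https://example.invalid/a.png"/>
    ]]></TextElement>
    <TextElement name="empty"/>
  </book>
</catalog>"""

# Capture everything between src=" and the next double quote.
SRC_PATTERN = re.compile(r'\bsrc="([^"]+)"')


def extract_src_values(xml_text):
    """Return every src value found in the text of TextElement nodes."""
    root = ET.fromstring(xml_text)
    paths = []
    for element in root.iter("TextElement"):
        text = element.text or ""  # element.text is None for empty elements
        paths.extend(SRC_PATTERN.findall(text))
    return paths


print(extract_src_values(SAMPLE))
```

Using findall rather than search collects every src in an element, which may be useful if a single TextElement ever holds more than one image.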
For reference, if the img element were real markup in the XML file rather than text inside a CDATA section, like this:
<catalog>
  <book id="bk101">
    <TextElement name="cme_cmb_acd_chart_image_1" elementId="1371" max_word_count="0" displayName="cme_cmb_acd_chart_image_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_image_1">
      <img pdf="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" src="https://nas/web/image_upload//image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/>
    </TextElement>
  </book>
</catalog>
Then in your loop you could use:
img = i.find('img')  # returns None if there is no img child element
if img is not None:
    textPath = img.get('src')
Because src is an attribute of the img element, you can use get() to retrieve its value.
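The difference between the two cases can be checked with a short sketch. The two sample strings, the example.invalid URL, and the function name src_from_child_elements are assumptions made for illustration. The point it shows is that find() only sees real child elements, so it returns nothing when the img markup is wrapped in CDATA.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample where <img> is a real child element of TextElement.
AS_ELEMENT = """<catalog><book id="bk101">
  <TextElement name="image"><img src="https://example.invalid/a.png"/></TextElement>
</book></catalog>"""

# Same content, but wrapped in CDATA: the parser treats it as plain text.
AS_CDATA = """<catalog><book id="bk101">
  <TextElement name="image"><![CDATA[<img src="https://example.invalid/a.png"/>]]></TextElement>
</book></catalog>"""


def src_from_child_elements(xml_text):
    """Collect src attributes from <img> children of TextElement nodes."""
    root = ET.fromstring(xml_text)
    sources = []
    for element in root.iter("TextElement"):
        img = element.find("img")  # None when there is no <img> child element
        if img is not None:
            src = img.get("src")   # None when the attribute is missing
            if src is not None:
                sources.append(src)
    return sources


print(src_from_child_elements(AS_ELEMENT))
print(src_from_child_elements(AS_CDATA))
```

The second call prints an empty list, which is why the CDATA version of the file calls for text-based extraction instead.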