Parsing XML node and substring from value

Question

I’m trying to parse an XML file to find the path of a file (an image) and store that path somewhere, so I can later resize the image. For now I’m stuck on how to get the path out of a node’s text value.

So far my code is:

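import os, glob
import sys
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import re
pathNow = 'C:\\'  # the backslash must be escaped, otherwise it escapes the closing quote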
textPath = []
items = []
#change path directory 
for item in sys.argv[1:]: 
    items.append(item)
    newPath = pathNow + items[0]  
os.chdir(newPath) 
print("New path is:"+newPath) 
#end 

#get argument for location
for item in items:
    docxml = items[1]
#docxml = sys.argv[2:] 
print(docxml)

#search for file 
for file in glob.glob(docxml + ".xml"): 
    tree = ET.parse(file) 
    rootFile = tree.getroot() 
    for i in rootFile.iter('TextElement'): 
      if "src" in i.text:
        textPath = i.text.split('src="')
        print(textPath)

This is where I get stuck. I manage to get the text containing the img tags and store it, but how do I get the src="(value)" from the XML?

Here is an XML for testing:

<catalog>
   <book id="bk101">
                    <TextElement name="cme_cmb_acd_chart_tit_1" elementId="1375" max_word_count="0" displayName="cme_cmb_acd_chart_tit_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_tit_1">
                    <![CDATA[
                    <h1>S&amp;P 500 EPS growth ex-energy remains solid</h1>
                    ]]>
                    </TextElement>
                    <TextElement name="cme_cmb_acd_chart_sub_1" elementId="1374" max_word_count="0" displayName="cme_cmb_acd_chart_sub_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_sub_1">
                    <![CDATA[ S&amp;P 500 EPS ex-energy growth, year over year ]]>
                    </TextElement>
                    <TextElement name="cme_cmb_acd_chart_image_1" elementId="1371" max_word_count="0" displayName="cme_cmb_acd_chart_image_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_image_1">
                    <![CDATA[
                    <img pdf="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" src="https://nas/web/image_upload//image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/>
                    ]]>
                    </TextElement>
                    <TextElement name="cme_cmb_acd_chart_src_1" elementId="1373" max_word_count="0" displayName="cme_cmb_acd_chart_src_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_src_1">
                    <![CDATA[
                    <h3><img pdf="/nas/web/clients/ubsprod/images/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/clients/ubsprod/images/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.svg" src="https:///nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/></h3><br/>
                    ]]>
                    </TextElement>
   </book>
</catalog>

How can I get the value inside src="..."? I’ve run out of ideas and knowledge.

Asked By: Victor


Answer 1

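In your case, the img tag sits inside a CDATA section, so the parser gives it to you as plain text in i.text rather than as a child element. You can use a regular expression to pull the value out of that string. Replace the code inside your if statement with this: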

try:
    textPath = re.search('src="(.+?)"/>', i.text).group(1)
except AttributeError:
    textPath = '' # apply your error handling
print(textPath)
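
For a self-contained illustration of the same idea, here is a minimal sketch that parses a trimmed-down copy of the XML from a string and collects every src value it finds. The function name extract_src_values, the shortened chart.pdf / chart.png paths, and the pattern src="([^"]*)" are my own assumptions for the example, not part of your code. The pattern stops at the closing quote, so it should not depend on src being the last attribute before "/>".

```python
import re
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on the XML in the question.
# The file names (chart.pdf, chart.png) are hypothetical placeholders.
SAMPLE_XML = """<catalog>
   <book id="bk101">
      <TextElement name="cme_cmb_acd_chart_tit_1">
      <![CDATA[
      <h1>S&amp;P 500 EPS growth ex-energy remains solid</h1>
      ]]>
      </TextElement>
      <TextElement name="cme_cmb_acd_chart_image_1">
      <![CDATA[
      <img pdf="/nas/web/image_upload/chart.pdf" src="https://nas/web/image_upload/chart.png"/>
      ]]>
      </TextElement>
   </book>
</catalog>"""

# Capture everything between src=" and the next double quote.
SRC_PATTERN = re.compile(r'src="([^"]*)"')


def extract_src_values(xml_text):
    """Return all src="..." values found in the text of TextElement nodes."""
    root = ET.fromstring(xml_text)
    values = []
    for element in root.iter('TextElement'):
        # CDATA content is exposed as ordinary text; it may be None for empty nodes.
        text = element.text or ''
        values.extend(SRC_PATTERN.findall(text))
    return values


print(extract_src_values(SAMPLE_XML))
```

Using findall instead of search means an element with several img tags yields all of their src values, and an element without any simply contributes nothing, so no try/except is needed.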

For your information, if the img tag were a real child element instead of CDATA text, i.e. if your XML file looked like this:

<catalog>
   <book id="bk101">

    <TextElement name="cme_cmb_acd_chart_image_1" elementId="1371" max_word_count="0" displayName="cme_cmb_acd_chart_image_1" status="optional" rixmlName="" clientCode="cme_cmb_acd_chart_image_1">
        <img pdf="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" svg="/nas/web/image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.pdf" src="https://nas/web/image_upload//image_upload/1190126_2fe42dfa-56af-4893-b81b-af5f85ed8d2f.png"/>
    </TextElement>

   </book>
</catalog>

Then in your loop you can use:

img = i.find('img')
textPath = img.get('src')

Because src is an attribute of the img tag, you can use get to retrieve its value.
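
As a rough, runnable sketch of that second case: find returns None when a TextElement has no img child, and get returns None when the attribute is missing, so it is worth checking both. The function name image_sources, the shortened chart.pdf / chart.png paths, and the second TextElement without an image are assumptions I added for the example.

```python
import xml.etree.ElementTree as ET

# Sample where <img> is a real child element (no CDATA).
# File names and the second TextElement are hypothetical.
SAMPLE_XML = """<catalog>
   <book id="bk101">
      <TextElement name="cme_cmb_acd_chart_image_1">
         <img pdf="/nas/web/image_upload/chart.pdf" src="https://nas/web/image_upload/chart.png"/>
      </TextElement>
      <TextElement name="cme_cmb_acd_chart_sub_1">plain text, no image</TextElement>
   </book>
</catalog>"""


def image_sources(xml_text):
    """Return the src attribute of the first <img> child of each TextElement."""
    root = ET.fromstring(xml_text)
    sources = []
    for element in root.iter('TextElement'):
        img = element.find('img')   # None if there is no direct <img> child
        if img is None:
            continue
        src = img.get('src')        # None if the attribute is absent
        if src is not None:
            sources.append(src)
    return sources


print(image_sources(SAMPLE_XML))
```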

Answered By: noname
