Parse XML file and output JSON with Python

Question:

I am quite new to Python. I’m currently trying to parse xml files getting their information and printing them as JSON.

I have managed to parse the xml file, but I cannot print them as JSON. In addition, in my printjson function, the function did not run through all results and only print one time. The parse function worked and run through all input files while printjson didn’t.
My code is as follow.

from xml.dom import minidom
import os
import json

#input multiple files
def get_files(d):
        return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]

#parse xml
def parse(files):
    for xml_file in files:
        
        #indentify all xml files
        tree = minidom.parse(xml_file)

        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)

        return NCT_ID,brief_title,official_title

#print result in json
def printjson(results):
        for result in results:
                output_json = json.dumps(result)
                print(output_json)

printjson(parse(get_files('my files path')))

Output when running the file

"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"

Expected output

{
"NCT ID" : "NCT00571389",
"brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
"official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}

The sample indexed xml file that I used is named as COVID-19 Clinical Trials dataset and can be found in kaggle

Asked By: Quan Hoang Nguyen

||

Answers:

I don’t know much about xml.dom library but you can generate the json with a dictionary, because the dumps function is only for convert json to string.
Some like this.


def parse(files):
    for xml_file in files:
        
        #indentify all xml files
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson.setdefault("NCT ID",tree.getElementsByTagName("nct_id")[0].firstChild.data)
        dicJson.setdefault("brief title",tree.getElementsByTagName("brief_title")[0].firstChild.data)
        dicJson.setdefault("official title", tree.getElementsByTagName("official_title")[0].firstChild.data)
    return dicJson

and in the function prinJson:

def printJson(results):
    # This function return the dictionary but in string, how to write to a JSON file.
    print(json.dumps(results))

Answered By: Josephdavid07

The issue is that your parse function is returning too early (it’s returning after getting the details from the first XML file. Instead, you should return a list of dictionaries that stores this information, so each item in the list represents a different file, and each dictionary contains the necessary information regarding the corresponding XML file.

Here’s the updated code:

def parse(files):
    xml_information = []
    for xml_file in files:
        
        #indentify all xml files
        tree = minidom.parse(xml_file)

        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)

        xml_information.append({"NCT_ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    return xml_information

def printresults(results):
        for result in results:
                print(result)

printresults(parse(get_files('my files path')))

If you absolutely want to return format to be json, you can similarly use json.dumps on each dictionary.

Note: If you have a lot of XML files, I would recommend using yield in the function instead of returning a whole list of dictionaries in order to improve speed and performance.

Answered By: Lee Kai Xuan
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.