Pandas "read_xml" returns NaNs

Question

This is my first time parsing an XML file, and I’m following both this pandas explanation and this SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file is far more deeply nested than the examples in the SO/pandas explanations, and I can’t work out how to parse it.

What I tried is:

import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()

Output:

   MedlineCitation  PubmedData  PMID
0              NaN         NaN   NaN
1              NaN         NaN   NaN
2              NaN         NaN   NaN
3              NaN         NaN   NaN

All the other examples I found for parsing XML files were specific to the structure of one particular file, and I can’t adapt them to mine.
The only two things I need from this file are PMID and AbstractText. The expected output is a pandas DataFrame that looks like this:

   PMID  AbstractText
0  1212         text1
1  1233         text2

Asked By: Penguin

Answer 1

You need to drill down into that huge XML file to reach the relevant data. In pandas you do this with the xpath argument of read_xml, like so (this example uses a different XML file downloaded from that link):

import pandas as pd

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//PMID")
print(df)

This prints the following in the terminal:

       Version      PMID
0            1  14584002
1            1  16916636
2            1  34919821
3            1  17541330
4            1  17643379
...        ...       ...
18359        1  34919510
18360        1  34919742
18361        1  34919747
18362        1  34919751
18363        1  34919752
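
As for why the call in the question returned only NaNs: as far as I can tell, it comes down to the default xpath of read_xml, './*'. That selects the children of the root and builds columns from each selected node's attributes and immediate children only. In a PubMed file those immediate children (MedlineCitation, PubmedData) are containers with no text of their own, so every cell comes back empty. Here is a minimal sketch on a made-up two-article document. The sample XML is an assumption for illustration, heavily simplified from the real layout, and parser="etree" is used only so the snippet runs without lxml:

```python
from io import StringIO

import pandas as pd

# Hypothetical, heavily simplified stand-in for a PubMed file (not real data).
sample_xml = (
    "<PubmedArticleSet>"
    "<PubmedArticle>"
    '<MedlineCitation><PMID Version="1">1212</PMID></MedlineCitation>'
    "<PubmedData><PublicationStatus>ppublish</PublicationStatus></PubmedData>"
    "</PubmedArticle>"
    "<PubmedArticle>"
    '<MedlineCitation><PMID Version="1">1233</PMID></MedlineCitation>'
    "<PubmedData><PublicationStatus>ppublish</PublicationStatus></PubmedData>"
    "</PubmedArticle>"
    "</PubmedArticleSet>"
)

# Default xpath ("./*"): one row per PubmedArticle, one column per direct child.
# Those children hold only nested elements, so there is no text to put in the cells.
default_df = pd.read_xml(StringIO(sample_xml), parser="etree")
print(default_df)

# Pointing xpath at the element that actually carries the value returns real data.
pmid_df = pd.read_xml(StringIO(sample_xml), xpath=".//PMID", parser="etree")
print(pmid_df)
```

The first frame has the columns MedlineCitation and PubmedData with nothing in them, which mirrors the output in the question; the second has the Version and PMID columns.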

The following pandas documentation might be helpful:

https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html

EDIT: You can get AbstractText with:

df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//AbstractText")
print(df)

Resulting in:

                            Label  NlmCategory                                       AbstractText   sup     i   sub     b     u  {http://www.w3.org/1998/Math/MathML}math
0                      BACKGROUND   BACKGROUND  Kawasaki disease is the most common cause of a...  None  None  None  None  None                                       NaN
1                      OBJECTIVES    OBJECTIVE  The objective of this review was to evaluate t...  None  None  None  None  None                                       NaN
2                 SEARCH STRATEGY      METHODS  Electronic searches of the Cochrane Peripheral...  None  None  None  None  None                                       NaN
3              SELECTION CRITERIA      METHODS  Randomised controlled trials of intravenous im...  None  None  None  None  None                                       NaN
4    DATA COLLECTION AND ANALYSIS      METHODS  Fifty-nine trials were identified in the initi...  None  None  None  None  None                                       NaN
...                           ...          ...                                                ...   ...   ...   ...   ...   ...                                       ...
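
Note that the two calls above return separate frames: one row per PMID element and one row per AbstractText element, so their rows do not line up article by article. To get the PMID / AbstractText table asked for in the question, one option is to walk the articles with the standard library's xml.etree.ElementTree and build the frame yourself. This is only a sketch and has not been run against the full file: the sample XML is made up, and the MedlineCitation/PMID path, the articles_to_frame helper name, and the choice to join multiple AbstractText sections with a space are my assumptions.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up sample that mimics the nesting of a PubMed file (not real data).
sample_xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First <i>part</i>.</AbstractText>
          <AbstractText Label="METHODS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
    <PubmedData/>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article/>
    </MedlineCitation>
    <PubmedData/>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()


def articles_to_frame(root):
    """Return one row per PubmedArticle with its PMID and joined abstract text."""
    rows = []
    for article in root.iter("PubmedArticle"):
        # Assumed location of the article's own PMID (other PMIDs can appear deeper).
        pmid = article.findtext("MedlineCitation/PMID")
        # itertext() also collects text inside inline tags such as <i> or <sub>.
        sections = [
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        ]
        # Articles without any AbstractText get None instead of an empty string.
        rows.append({"PMID": pmid, "AbstractText": " ".join(sections) or None})
    return pd.DataFrame(rows, columns=["PMID", "AbstractText"])


# For a real file, ET.parse(path).getroot() would replace ET.fromstring(...).
df = articles_to_frame(ET.fromstring(sample_xml))
print(df)
```

PMID stays a string here; df["PMID"].astype(int) would convert it if needed.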

Answered By: platipus_on_fire
