Pandas "read_xml" returns NaNs
Question:
This is my first time parsing an XML file, and I’m following both this pandas explanation and this SO question. I have an XML file from PubMed (any should work, but I downloaded the first one: pubmed22n1115.xml). The file seems much more convoluted than the examples in the SO/pandas explanations, and I can’t parse it.
What I tried is:
import pandas as pd
df = pd.read_xml('../../Downloads/pubmed22n1115.xml')
df.head()
>>>
MedlineCitation PubmedData PMID
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
All the other examples I looked at for parsing XML files were very specific to the structure of the file in question, and I can’t follow them.
The only two things I need from this file are PMID and AbstractText. The expected output is a pandas DataFrame that looks like:
PMID AbstractText
0 1212 text1
1 1233 text2
Answers:
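You need to drill down into that huge XML file to reach the relevant data. By default, read_xml uses xpath='./*', so it only looks one level below each top-level record, and in this file those children are container elements with no text of their own, hence the NaNs. You can point it at the nested elements with the xpath argument, like so (this is on a random XML doc downloaded from that link):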
import pandas as pd
df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//PMID")
print(df)
This prints the following to the terminal:
Version PMID
0 1 14584002
1 1 16916636
2 1 34919821
3 1 17541330
4 1 17643379
... ... ...
18359 1 34919510
18360 1 34919742
18361 1 34919747
18362 1 34919751
18363 1 34919752
The following pandas documentation might be helpful:
https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html
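To see the effect of the xpath argument without downloading a PubMed file, here is a minimal sketch on a tiny hand-written document. The XML string is a made-up stand-in that only mimics the nesting of the real file, and parser="etree" is used just so the example does not depend on lxml being installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature document; only the nesting resembles a PubMed file.
xml = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
    </MedlineCitation>
    <PubmedData><PublicationStatus>ppublish</PublicationStatus></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
    </MedlineCitation>
    <PubmedData><PublicationStatus>ppublish</PublicationStatus></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

# Default xpath ("./*"): rows are the PubmedArticle elements and columns are
# their direct children, which are containers without text of their own.
shallow = pd.read_xml(StringIO(xml), parser="etree")

# Explicit xpath: rows are the PMID elements themselves, so the element text
# and the Version attribute become columns.
deep = pd.read_xml(StringIO(xml), xpath=".//PMID", parser="etree")

print(shallow)
print(deep)
```

The first frame should only have the container columns (MedlineCitation, PubmedData), while the second has Version and PMID, matching the shape of the output above.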
EDIT: You can get AbstractText with:
df = pd.read_xml('pubmed22n1123.xml/pubmed22n1123.xml', xpath=".//AbstractText")
print(df)
Resulting in:
     Label                         NlmCategory  AbstractText                                       sup   i     sub   b     u     {http://www.w3.org/1998/Math/MathML}math
0    BACKGROUND                    BACKGROUND   Kawasaki disease is the most common cause of a...  None  None  None  None  None  NaN
1    OBJECTIVES                    OBJECTIVE    The objective of this review was to evaluate t...  None  None  None  None  None  NaN
2    SEARCH STRATEGY               METHODS      Electronic searches of the Cochrane Peripheral...  None  None  None  None  None  NaN
3    SELECTION CRITERIA            METHODS      Randomised controlled trials of intravenous im...  None  None  None  None  None  NaN
4    DATA COLLECTION AND ANALYSIS  METHODS      Fifty-nine trials were identified in the initi...  None  None  None  None  None  NaN
...  ...                           ...          ...                                                ...   ...   ...   ...   ...   ...
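The two read_xml calls above return frames with different row counts (one row per PMID element versus one row per AbstractText section), so they cannot simply be placed side by side. One way to get the PMID / AbstractText table from the question is to walk the tree with the standard library and build the rows by hand. The following is a sketch, not a tested solution for the full file: the function name pubmed_abstracts and the SAMPLE string are made up for illustration, and the element paths assume the usual PubmedArticle > MedlineCitation > Article > Abstract nesting.

```python
import xml.etree.ElementTree as ET
from io import StringIO

import pandas as pd

# Hypothetical miniature document standing in for a real PubMed file.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1212</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="METHODS">Second <i>part</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1233</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def pubmed_abstracts(source):
    """Return one row per article with its PMID and the joined abstract text."""
    root = ET.parse(source).getroot()
    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # An abstract may be split into several labelled sections and may
        # contain inline markup (<i>, <sup>, ...); itertext() flattens both.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        # Articles without an abstract get None instead of an empty string.
        rows.append({"PMID": pmid, "AbstractText": " ".join(parts) or None})
    return pd.DataFrame(rows, columns=["PMID", "AbstractText"])


# ET.parse accepts a file path as well as a file-like object.
df = pubmed_abstracts(StringIO(SAMPLE))
print(df)
```

For a real download you would pass the file path instead of StringIO(SAMPLE). Note that ET.parse loads the whole document into memory, and that PMID stays a string here; convert it with astype(int) if you need numbers.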