Getting the List Numbers of List Items in docx file using Python-Docx
Question:
When I access a paragraph's text, it does not include the list numbering.
Current code:
from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)
List in docx file:
I am expecting:
(1) The naturalization of both …
(2) The naturalization of the …
(3) The naturalization of the …
What I get:
The naturalization of both …
The naturalization of the …
The naturalization of the …
Upon checking the XML of the document, the list numbers are stored in w:abstractNum, but I have no idea how to access them or connect them to the proper list item.
How can I access the number for each list item in python-docx so that it can be included in my output?
Is there also a way to determine the proper nesting of these lists using python-docx?
Answers:
This worked for me, using the docx2python module:
from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)
According to [ReadThedocs.Python-DocX]: Style-related objects – _NumberingStyle objects, this functionality is not implemented yet.
The alternative (at least one of them), [PyPI]: docx2python, handles these elements rather poorly (mainly because it returns everything converted to strings).
So, one solution is to parse the XML files manually (I worked out how empirically, on this very example). A good place for documentation is Office Open XML (I don't know whether it is a standard followed by all the tools that deal with .docx files, especially MS Word):
- Get each paragraph (w:p node) from word/document.xml
- Check whether it's a numbered item (it has a w:pPr -> w:numPr subnode)
- Get the numbering Id and level: the w:val attribute of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
- Match the 2 values with (in word/numbering.xml):
  - the w:abstractNumId attribute of a w:abstractNum node (reached through the w:num node whose w:numId attribute equals the paragraph's w:numId value; that node's w:abstractNumId subnode holds the reference)
  - the w:ilvl attribute of a w:lvl subnode
  and get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the bullet value of the aforementioned w:numFmt's w:val attribute)
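The steps above can be sketched with the standard library alone. This is only a rough illustration under several assumptions, not part of the original answer: the function names (load_levels, render, labelled_paragraphs, read_parts) are made up here; start values (w:start), w:lvlOverride, numbering inherited from paragraph styles and the special w:numId value 0 are all ignored; only a handful of w:numFmt values are rendered; and the "aa, bb, ..." continuation after "z" is assumed to match Word's behaviour.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Clark-notation name for a w:-prefixed tag or attribute."""
    return "{%s}%s" % (W, name)


def load_levels(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText) using word/numbering.xml."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for an in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in an.findall("w:lvl", NS):
            fmt, txt = lvl.find("w:numFmt", NS), lvl.find("w:lvlText", NS)
            levels[int(lvl.get(w("ilvl")))] = (
                fmt.get(w("val")) if fmt is not None else "decimal",
                txt.get(w("val")) if txt is not None else "",
            )
        abstract[an.get(w("abstractNumId"))] = levels
    result = {}
    for num in root.findall("w:num", NS):  # w:num links a numId to an abstractNum
        ref = num.find("w:abstractNumId", NS).get(w("val"))
        for ilvl, data in abstract.get(ref, {}).items():
            result[(num.get(w("numId")), ilvl)] = data
    return result


def render(fmt, n):
    """Format counter n for a few common numFmt values; others fall back to decimal."""
    if fmt in ("lowerLetter", "upperLetter"):
        s = chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)  # a..z, aa, bb, ...
        return s if fmt == "lowerLetter" else s.upper()
    if fmt in ("lowerRoman", "upperRoman"):
        s = ""
        for value, sym in ((1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
                           (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
                           (5, "V"), (4, "IV"), (1, "I")):
            while n >= value:
                s, n = s + sym, n - value
        return s.lower() if fmt == "lowerRoman" else s
    return str(n)


def labelled_paragraphs(document_xml, levels):
    """Yield (label, ilvl, text) per paragraph; label is None for non-list paragraphs."""
    counters = {}  # (numId, ilvl) -> running count
    for p in ET.fromstring(document_xml).iter(w("p")):
        text = "".join(t.text or "" for t in p.iter(w("t")))
        num_id_el = p.find("w:pPr/w:numPr/w:numId", NS)
        if num_id_el is None:
            yield None, None, text
            continue
        num_id = num_id_el.get(w("val"))
        ilvl_el = p.find("w:pPr/w:numPr/w:ilvl", NS)
        ilvl = int(ilvl_el.get(w("val"))) if ilvl_el is not None else 0
        counters[(num_id, ilvl)] = counters.get((num_id, ilvl), 0) + 1
        for key in [k for k in counters if k[0] == num_id and k[1] > ilvl]:
            del counters[key]  # deeper levels restart under a new parent item
        fmt, label = levels.get((num_id, ilvl), ("decimal", "%%%d." % (ilvl + 1)))
        if fmt != "bullet":  # for bullets, lvlText is the bullet character itself
            for i in range(ilvl + 1):  # %1, %2, ... refer to levels 0, 1, ...
                lvl_fmt = levels.get((num_id, i), ("decimal", ""))[0]
                label = label.replace("%%%d" % (i + 1), render(lvl_fmt, counters.get((num_id, i), 1)))
        yield label, ilvl, text


def read_parts(path):
    """Read the two XML parts from a .docx (assumes word/numbering.xml exists)."""
    with zipfile.ZipFile(path) as z:
        return z.read("word/document.xml"), z.read("word/numbering.xml")


# Possible usage (the file name is only an example):
# doc_xml, num_xml = read_parts("sample.docx")
# for label, ilvl, text in labelled_paragraphs(doc_xml, load_levels(num_xml)):
#     print(text if label is None else "    " * ilvl + label + " " + text)
```

The ilvl value returned for each paragraph gives the nesting depth directly, so indentation can be derived from it as in the commented usage lines.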
However, that seems extremely complex, so I'm proposing a workaround (gainarie, i.e. a hack) that makes use of docx2python's partial support.
Test document (sample.docx – created with LibreOffice):
code00.py:
#!/usr/bin/env python

import sys

import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    # Build the namespace-qualified ("{uri}name") form of a tag / attribute name
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    # Walk down the tree one level per desc_strs entry, collecting matching nodes
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    # Flatten the nested dict returned by descendants into a list of leaf nodes
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]
    level = int(desc.attrib[ns_tag_name(desc, "val")])  # w:ilvl value (nesting level)
    # docx2python prefixes list items with tabs + the rendered number
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return "    " * level + elem + " "  # indent one step per nesting level


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32
Doc title
doc subtitle
heading1 text0
Paragr0 line0
Paragr0 line1
Paragr0 line2
space Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
a) w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)
1) paragrx 1x (numbered)
a) paragrx 1ax (numbered)
I) paragrx 1aIx (numbered)
b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)
-- paragrx bullet 0
-- paragrx bullet 00
paragxx text
Done.
Notes:
- Only nodes from word/document.xml are processed (via the paragraph's _element (LXML node) attribute)
- Some list attributes are not captured (due to docx2python's limitations)
- This is far from robust
- descendants and simplified_descendants can be simplified a lot, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)
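As a rough illustration of how the two helpers could be collapsed for this specific case, the sketch below reads w:numId and w:ilvl with one path lookup each. It is not the answer's code: list_info, NS and VAL are names made up here, and it assumes the element offers find(path, namespaces), which both lxml elements (such as a python-docx paragraph's _element) and the standard library's ElementTree elements do. Treating a missing w:ilvl as level 0 is also an assumption.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W  # Clark-notation name of the w:val attribute


def list_info(p_element):
    """Return (numId, ilvl) for a w:p element, or None if it has no w:numPr."""
    num_pr = p_element.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    num_id = num_pr.find("w:numId", NS)
    ilvl = num_pr.find("w:ilvl", NS)
    return (
        num_id.get(VAL) if num_id is not None else None,
        int(ilvl.get(VAL)) if ilvl is not None else 0,  # missing w:ilvl -> level 0 (assumption)
    )


# Stand-alone demonstration on a hand-written paragraph
sample = ET.fromstring(
    '<w:p xmlns:w="' + W + '"><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="5"/>'
    '</w:numPr></w:pPr><w:r><w:t>item</w:t></w:r></w:p>'
)
print(list_info(sample))
```

Inside the loop over docd.paragraphs this would be called as list_info(par._element).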
There is another path, which consists of converting the numbering to text in the first place. After that you can use python-docx as usual, without the hassle of handling the numbers yourself.
Open the document in Word, open the Visual Basic editor (Alt+F11), open the Immediate window (Ctrl+G), type the following macro and press Enter:
ActiveDocument.Range.ListFormat.ConvertNumbersToText
At this point, you can save the document and run it through python-docx.