Parse and traverse elements from a Markdown file

Question:

I want to parse and then traverse a Markdown file. I’m looking for something like xml.etree.ElementTree but for Markdown.

One option would be to convert to HTML and then use another library to parse the HTML. But I’d like to avoid that step.

Thanks.

Asked By: jpemberthy

Answers:

There are Markdown parsing modules, but unlike XML and HTML processing modules, they tend to be embedded within Markdown rendering packages, rather than presented for arbitrary Markdown parsing work.

So option one would be to look into Markdown processors in Python, of which there are a ton, find the parser you like most, and adopt that.

Depending on what you want to accomplish, however, it might be easier to find a Markdown processing module that’s already extensible, and build a processing extension. Python-Markdown, for example, has a complete extension mechanism.

Answered By: Jonathan Eunice

As another comment mentioned, Python-Markdown has an extension API and it happens to use xml.etree.ElementTree under the hood. You could theoretically create an extension that accesses that internal ElementTree object and do what you want with it. However, if you use raw HTML (including HTML entities) and/or the codehilite extension, you will get an incomplete document, because a few postprocessors run on the serialized string after the tree has been built. So I wouldn’t really recommend it for your intended purpose (full disclosure: I’m the developer of Python-Markdown).
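To make the extension route concrete, here is a minimal sketch of a tree processor that walks the internal ElementTree and collects headings. The names HeadingCollector, HeadingCollectorExtension, and collect_headings are hypothetical, and the priority value of 5 is an assumption chosen so the processor runs after the built-in inline processing (Python-Markdown 3.x). The caveat above still applies: raw HTML and codehilite output will not be in the tree at this point.

```python
import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(Treeprocessor):
    """Hypothetical tree processor that records every heading it sees."""

    def __init__(self, md, found):
        super().__init__(md)
        self.found = found

    def run(self, root):
        # `root` is the xml.etree.ElementTree.Element for the whole document.
        for element in root.iter():
            if element.tag in HEADING_TAGS:
                self.found.append((element.tag, "".join(element.itertext())))
        # Returning None tells Python-Markdown to keep using the same tree.
        return None


class HeadingCollectorExtension(Extension):
    def __init__(self, **kwargs):
        self.found = []
        super().__init__(**kwargs)

    def extendMarkdown(self, md):
        # Priority 5 is an assumption: low enough to run after inline processing.
        md.treeprocessors.register(
            HeadingCollector(md, self.found), "heading_collector", 5
        )


def collect_headings(text):
    """Convert `text` and return the (tag, text) pairs seen in the tree."""
    extension = HeadingCollectorExtension()
    markdown.markdown(text, extensions=[extension])
    return extension.found
```

Calling collect_headings on a document returns the headings in document order; the rendered HTML is simply discarded.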

A rather lengthy list of Markdown implementations exists here. Of the pure Python implementations in that list, Mistune is the only one that I am aware of that uses a two-step process (step one returns a parse tree, step two serializes the parse tree; you only need step one). I have never used Mistune personally and cannot speak to its stability or accuracy, but it is supposed to be a Python clone of the very good JavaScript library Marked.

*** Edit ***

A few newer Python packages have become available which all use the parser/renderer pattern and/or parse tree/token stream to varying degrees. I don’t have any personal experience with any of them, but they may be useful for this purpose. See mistletoe, markdown-it-py, and marko.
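For example, with markdown-it-py the token stream can be wrapped in a tree and traversed much like an ElementTree. This is only a sketch: heading_texts is a hypothetical helper, and it assumes each heading node has a single inline child holding the raw heading text.

```python
from markdown_it import MarkdownIt
from markdown_it.tree import SyntaxTreeNode


def heading_texts(text):
    """Return (tag, raw text) for each heading in a Markdown string."""
    tokens = MarkdownIt().parse(text)   # flat token stream
    root = SyntaxTreeNode(tokens)       # nested view of the same tokens
    headings = []
    for node in root.walk():
        if node.type == "heading":
            # Assumption: the first child is the inline node with the text.
            inline = node.children[0]
            headings.append((node.tag, inline.content))
    return headings
```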

*** End Edit ***

If you search around, I believe that a few of the C implementations use a similar pattern. Some of them might even already have a Python wrapper. If not, it shouldn’t be too difficult to create a wrapper with ctypes.

If for some reason you want to use an implementation that does not give you a full parse tree, then I would suggest parsing the resulting HTML using lxml (a Python wrapper of the C lib) or html5lib (pure Python), both of which can return an ElementTree object and are much faster (especially lxml) and more forgiving of invalid HTML (especially html5lib, which acts more like real browsers in the real world). Remember that Markdown can contain raw HTML and most Markdown parsers simply pass it through, valid or not. If you then try to parse it with an XML-based parser (like xml.etree) or a strict HTML parser (like html.parser in the standard lib), a single invalid tag can crash the parser.
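A small sketch of that difference: the first helper shows a strict XML parser rejecting an unclosed tag of the kind that raw HTML in Markdown often contains, and the second shows a tolerant parse with lxml. Both helper names are hypothetical, and lxml is a third-party package that is only imported when the second helper is called.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(html):
    """Return True if the strict XML parser accepts `html`."""
    try:
        ET.fromstring(html)
    except ET.ParseError:
        return False
    return True


def tolerant_tags(html):
    """List element tags using lxml's forgiving HTML parser."""
    import lxml.html  # third-party; imported lazily so the rest runs without it

    return [element.tag for element in lxml.html.fromstring(html).iter()]
```

A fragment such as `<p>line one<br>line two</p>` is rejected by parses_as_xml because `<br>` is never closed, while tolerant_tags is expected to accept the same input.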

Answered By: Waylan