Which tool to use to parse programming languages in Python?

Question:

Which Python tool can you recommend to parse programming languages? It should allow for a readable representation of the language grammar inside the source, and it should be able to scale to complicated languages (something with a grammar as complex as e.g. Python itself).

When I search, I mostly find pyparsing, which I will be evaluating, but of course I’m interested in other alternatives.

Edit: Bonus points if it comes with good error reporting and source code locations attached to syntax tree elements.

Asked By: Stefan Majewsky


Answers:

ANTLR is what you should look at: http://www.antlr.org

In particular, take a look at its Python target: http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget

Answered By: Ankur Gupta

For simple tasks I tend to use the shlex module.

See http://wiki.python.org/moin/LanguageParsing for an evaluation of language parsing tools in Python.
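
As a rough illustration of what shlex covers, here is a minimal sketch using only the standard library. The input strings are made up for the example. shlex does shell-style tokenizing only, so it suits small command-like languages rather than a full programming-language grammar.

```python
import shlex

# shlex.split applies POSIX shell rules: whitespace separates tokens,
# and quotes group words into a single token.
line = 'copy "my file.txt" backup/ --mode=fast'
tokens = shlex.split(line)
print(tokens)  # ['copy', 'my file.txt', 'backup/', '--mode=fast']

# The shlex class is an iterable lexer. By default it strips '#' comments
# and returns punctuation such as '=' as separate one-character tokens.
lexer = shlex.shlex("x = 42  # trailing comment", posix=True)
simple = list(lexer)
print(simple)  # ['x', '=', '42']
```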

Answered By: Fredrik Pihl

For a more complicated parser I would use pyparsing.

Here is the example from their home page:

from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here

hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))

Answered By: Jakob Bowyer

If you’re evaluating PyParsing, I think you should look at funcparserlib: http://pypi.python.org/pypi/funcparserlib

It’s a bit similar, but in my experience the resulting code is much cleaner.

Answered By: Alexander Solovyov

I really like pyPEG. Its error reporting isn’t very friendly, but it can add source code locations to the AST.

pyPEG doesn’t have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I’ve used pyPEG to build a parser for a subset of C# with surprisingly little work.
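
On the indent/dedent point: Python's lexical analysis does define INDENT and DEDENT tokens, and the standard-library tokenize module makes them visible. The snippet below is a small sketch with a made-up three-line source string. It shows why a tool without a separate lexer stage has more work to do for indentation-sensitive languages.

```python
import io
import tokenize

source = "if x:\n    y = 1\nz = 2\n"

# generate_tokens takes a readline callable and yields TokenInfo tuples
# carrying the token type, its text, and (row, column) start/end positions.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start)

# Keep only the layout tokens that the lexer synthesizes from leading whitespace.
layout = [
    tokenize.tok_name[tok.type]
    for tok in tokens
    if tok.type in (tokenize.INDENT, tokenize.DEDENT)
]
print(layout)  # ['INDENT', 'DEDENT']
```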

An example adapted from fdik.org/pyPEG/: A simple language like this:

function fak(n) {
    if (n==0) { // 0! is 1 by definition
        return 1;
    } else {
        return n * fak(n - 1);
    };
}

A pyPEG parser for that language:

import re
from pyPEG import keyword  # pyPEG 1.x; keyword() marks reserved words

def comment():          return [re.compile(r"//.*"),
                                re.compile(r"/\*.*?\*/", re.S)]
def literal():          return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol():           return re.compile(r"\w+")
def operator():         return re.compile(r"\+|-|\*|/|==")
def operation():        return symbol, operator, [literal, functioncall]
def expression():       return [literal, operation, functioncall]
def expressionlist():   return expression, -1, (",", expression)
def returnstatement():  return keyword("return"), expression
def ifstatement():      return (keyword("if"), "(", expression, ")", block,
                                keyword("else"), block)
def statement():        return [ifstatement, returnstatement], ";"
def block():            return "{", -2, statement, "}"
def parameterlist():    return "(", symbol, -1, (",", symbol), ")"
def functioncall():     return symbol, "(", expressionlist, ")"
def function():         return keyword("function"), symbol, parameterlist, block
def simpleLanguage():   return function

Answered By: Will Harris

Antlr generates LL(*) parsers. That can be good, but sometimes removing all left recursion can be cumbersome.

If you are LALR(1)-savvy, you can use PyBison. Its syntax is similar to Yacc's, if you know what that is. Plus, a lot of people out there know how Yacc works.
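
To make the left-recursion point concrete: a rule such as `expr : expr '-' NUMBER | NUMBER` is fine for an LALR(1) tool, but a top-down parser would recurse forever on it, so it has to be rewritten as `expr : NUMBER ('-' NUMBER)*`. The sketch below is a hand-written illustration of that rewrite, not ANTLR or PyBison output. The function names are hypothetical, and there is no error handling for malformed input.

```python
import re

def tokenize(text):
    """Split the input into integer and '-' tokens, ignoring whitespace."""
    return re.findall(r"\d+|-", text)

def parse_expr(tokens):
    """expr : NUMBER ('-' NUMBER)*   (left recursion replaced by a loop)"""
    pos = 0
    tree = int(tokens[pos])
    pos += 1
    while pos < len(tokens) and tokens[pos] == "-":
        right = int(tokens[pos + 1])
        # Wrap the tree built so far as the LEFT operand, so the result is
        # still left-associative even though the grammar is no longer
        # left-recursive.
        tree = ("-", tree, right)
        pos += 2
    return tree

def evaluate(tree):
    """Evaluate the nested ('-', left, right) tuples."""
    if isinstance(tree, int):
        return tree
    _, left, right = tree
    return evaluate(left) - evaluate(right)

tree = parse_expr(tokenize("10 - 3 - 2"))
print(tree)            # ('-', ('-', 10, 3), 2)
print(evaluate(tree))  # 5, i.e. (10 - 3) - 2 rather than 10 - (3 - 2)
```

With one operator the rewrite is trivial. The cumbersome part mentioned above shows up when many mutually recursive rules need the same treatment.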

Answered By: Thaddee Tyl

Ned Batchelder did a survey of Python parsing tools, which apparently he keeps updated (last updated July 2010):

http://nedbatchelder.com/text/python-parsers.html

If I was going to need a parser today, I would either roll my own recursive descent parser, or possibly use PLY or LEPL — depending on my needs and whether or not I was willing to introduce an external dependency. I wouldn’t personally use PyParsing for anything very complicated.
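
For a sense of what rolling your own involves, here is a minimal recursive descent sketch for arithmetic expressions, using only the standard library. It attaches line and column numbers to every tree node and reports them in errors, which is what the question's edit asks for. The names (Token, Node, Parser, ParseError) and the tiny grammar are assumptions made for this illustration.

```python
import re
from dataclasses import dataclass

TOKEN_RE = re.compile(
    r"(?P<NUM>\d+)|(?P<OP>[-+*/()])|(?P<NL>\n)|(?P<WS>[ \t]+)|(?P<BAD>.)"
)

@dataclass
class Token:
    kind: str
    text: str
    line: int
    col: int

@dataclass
class Node:
    kind: str        # "num" or an operator character
    value: object    # int for "num", None for operators
    children: tuple
    line: int        # 1-based source position of the defining token
    col: int

class ParseError(Exception):
    pass

def tokenize(src):
    tokens, line, line_start = [], 1, 0
    for m in TOKEN_RE.finditer(src):
        kind, col = m.lastgroup, m.start() - line_start + 1
        if kind == "NL":
            line, line_start = line + 1, m.end()
        elif kind == "BAD":
            raise ParseError(f"{line}:{col}: unexpected character {m.group()!r}")
        elif kind != "WS":
            tokens.append(Token(kind, m.group(), line, col))
    tokens.append(Token("EOF", "", line, len(src) - line_start + 1))
    return tokens

class Parser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse(self):
        node = self.expr()
        tok = self.peek()
        if tok.kind != "EOF":
            raise ParseError(f"{tok.line}:{tok.col}: unexpected {tok.text!r}")
        return node

    def expr(self):   # expr : term (('+' | '-') term)*
        node = self.term()
        while self.peek().text in ("+", "-"):
            op = self.advance()
            node = Node(op.text, None, (node, self.term()), op.line, op.col)
        return node

    def term(self):   # term : atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek().text in ("*", "/"):
            op = self.advance()
            node = Node(op.text, None, (node, self.atom()), op.line, op.col)
        return node

    def atom(self):   # atom : NUM | '(' expr ')'
        tok = self.advance()
        if tok.kind == "NUM":
            return Node("num", int(tok.text), (), tok.line, tok.col)
        if tok.text == "(":
            node = self.expr()
            close = self.advance()
            if close.text != ")":
                raise ParseError(
                    f"{close.line}:{close.col}: expected ')', got {close.text!r}")
            return node
        raise ParseError(
            f"{tok.line}:{tok.col}: expected number or '(', got {tok.text!r}")

tree = Parser("1 +\n  2 * 3").parse()
product = tree.children[1]
print(tree.kind, tree.line, tree.col)           # + 1 3
print(product.kind, product.line, product.col)  # * 2 5
```

One method per grammar rule keeps the parser readable, and the error messages are entirely under your control. The cost is that the grammar lives in code and comments rather than in a declarative form.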

Answered By: Matt Anderson

pyPEG (a tool I authored) has a tracing facility for error reporting.

Just set pyPEG.print_trace = True and pyPEG will give you a full trace of what’s happening inside.

Answered By: Volker Birk

I would recommend that you check out my library: https://github.com/erezsh/lark

It can parse ALL context-free grammars, automatically builds an AST (with line & column numbers), and accepts the grammar in EBNF format, which is considered the standard.

It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.

Answered By: Erez