Which tool to use to parse programming languages in Python?
Question:
Which Python tool can you recommend to parse programming languages? It should allow for a readable representation of the language grammar inside the source, and it should be able to scale to complicated languages (something with a grammar as complex as e.g. Python itself).
When I search, I mostly find pyparsing, which I will be evaluating, but of course I’m interested in other alternatives.
Edit: Bonus points if it comes with good error reporting and source code locations attached to syntax tree elements.
Answers:
Antlr is what you should look at http://www.antlr.org
Take a look at this http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget
For simple task I tend to use the shlex module.
See http://wiki.python.org/moin/LanguageParsing for evaluation of language parsing in python.
For a more complicated parser I would use pyparsing.
Pyparsing
Here is the parsed example from there home page
from pyparsing import Word, alphas
greet = Word(alphas) + "," + Word(alphas) + "!" # <-- grammar
defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
If you’re evaluating PyParsing, I think you should look at funcparserlib: http://pypi.python.org/pypi/funcparserlib
It’s a bit similar, but in my experience resulting code is much cleaner.
I really like pyPEG. Its error reporting isn’t very friendly, but it can add source code locations to the AST.
pyPEG doesn’t have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I’ve used pyPEG to build a parser for subset of C# with surprisingly little work.
An example adapted from fdik.org/pyPEG/: A simple language like this:
function fak(n) {
if (n==0) { // 0! is 1 by definition
return 1;
} else {
return n * fak(n - 1);
};
}
A pyPEG parser for that language:
def comment(): return [re.compile(r"//.*"),
re.compile("/*.*?*/", re.S)]
def literal(): return re.compile(r'd*.d*|d+|".*?"')
def symbol(): return re.compile(r"w+")
def operator(): return re.compile(r"+|-|*|/|==")
def operation(): return symbol, operator, [literal, functioncall]
def expression(): return [literal, operation, functioncall]
def expressionlist(): return expression, -1, (",", expression)
def returnstatement(): return keyword("return"), expression
def ifstatement(): return (keyword("if"), "(", expression, ")", block,
keyword("else"), block)
def statement(): return [ifstatement, returnstatement], ";"
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
def functioncall(): return symbol, "(", expressionlist, ")"
def function(): return keyword("function"), symbol, parameterlist, block
def simpleLanguage(): return function
Antlr generates LL(*) parsers. That can be good, but sometimes removing all left recursion can be cumbersome.
If you are LALR(1)-savvy, you can use PyBison. It has similar syntax to Yacc, if you know what it is. Plus, there are a lot of people out there that know how yacc works.
Ned Batchelder did a survey of python parsing tools, which apparently he keeps updated (last updated July 2010):
http://nedbatchelder.com/text/python-parsers.html
If I was going to need a parser today, I would either roll my own recursive descent parser, or possibly use PLY or LEPL — depending on my needs and whether or not I was willing to introduce an external dependency. I wouldn’t personally use PyParsing for anything very complicated.
pyPEG (a tool I authored) has a tracing facility for error reporting.
Just set pyPEG.print_trace = True
and pyPEG will give you a full trace of what’s happening inside.
I would recommend that you check out my library: https://github.com/erezsh/lark
It can parse ALL context-free grammars, automatically builds an AST (with line & column numbers), and accepts the grammar in EBNF format, which is considered the standard.
It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.
Which Python tool can you recommend to parse programming languages? It should allow for a readable representation of the language grammar inside the source, and it should be able to scale to complicated languages (something with a grammar as complex as e.g. Python itself).
When I search, I mostly find pyparsing, which I will be evaluating, but of course I’m interested in other alternatives.
Edit: Bonus points if it comes with good error reporting and source code locations attached to syntax tree elements.
Antlr is what you should look at http://www.antlr.org
Take a look at this http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget
For simple task I tend to use the shlex module.
See http://wiki.python.org/moin/LanguageParsing for evaluation of language parsing in python.
For a more complicated parser I would use pyparsing.
Pyparsing
Here is the parsed example from there home page
from pyparsing import Word, alphas
greet = Word(alphas) + "," + Word(alphas) + "!" # <-- grammar
defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
If you’re evaluating PyParsing, I think you should look at funcparserlib: http://pypi.python.org/pypi/funcparserlib
It’s a bit similar, but in my experience resulting code is much cleaner.
I really like pyPEG. Its error reporting isn’t very friendly, but it can add source code locations to the AST.
pyPEG doesn’t have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I’ve used pyPEG to build a parser for subset of C# with surprisingly little work.
An example adapted from fdik.org/pyPEG/: A simple language like this:
function fak(n) {
if (n==0) { // 0! is 1 by definition
return 1;
} else {
return n * fak(n - 1);
};
}
A pyPEG parser for that language:
def comment(): return [re.compile(r"//.*"),
re.compile("/*.*?*/", re.S)]
def literal(): return re.compile(r'd*.d*|d+|".*?"')
def symbol(): return re.compile(r"w+")
def operator(): return re.compile(r"+|-|*|/|==")
def operation(): return symbol, operator, [literal, functioncall]
def expression(): return [literal, operation, functioncall]
def expressionlist(): return expression, -1, (",", expression)
def returnstatement(): return keyword("return"), expression
def ifstatement(): return (keyword("if"), "(", expression, ")", block,
keyword("else"), block)
def statement(): return [ifstatement, returnstatement], ";"
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
def functioncall(): return symbol, "(", expressionlist, ")"
def function(): return keyword("function"), symbol, parameterlist, block
def simpleLanguage(): return function
Antlr generates LL(*) parsers. That can be good, but sometimes removing all left recursion can be cumbersome.
If you are LALR(1)-savvy, you can use PyBison. It has similar syntax to Yacc, if you know what it is. Plus, there are a lot of people out there that know how yacc works.
Ned Batchelder did a survey of python parsing tools, which apparently he keeps updated (last updated July 2010):
http://nedbatchelder.com/text/python-parsers.html
If I was going to need a parser today, I would either roll my own recursive descent parser, or possibly use PLY or LEPL — depending on my needs and whether or not I was willing to introduce an external dependency. I wouldn’t personally use PyParsing for anything very complicated.
pyPEG (a tool I authored) has a tracing facility for error reporting.
Just set pyPEG.print_trace = True
and pyPEG will give you a full trace of what’s happening inside.
I would recommend that you check out my library: https://github.com/erezsh/lark
It can parse ALL context-free grammars, automatically builds an AST (with line & column numbers), and accepts the grammar in EBNF format, which is considered the standard.
It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.