what next after pyparsing?

Question:

I have a huge grammar developed for pyparsing as part of a large, pure Python application.
I have reached the limit of performance tweaking and I’m at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I’ve profiled my grammar and my application to dust.

What next?

I hope to find a parser that gives me the same readability, usability (I’m using many advanced features of pyparsing, such as parse actions to start post-processing the input as it is parsed) and Python integration, but at 10× the performance.

I love the fact that the grammar is pure Python.

All my basic blocks are regular expressions, so reusing them would be nice.
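To give a flavour of the style, here is a toy sketch (illustrative only – the names and rules are made up, not from my real grammar): regex terminals with parse actions doing the first pass of post-processing.

    # Toy illustration of the style, not the real grammar.
    from pyparsing import Regex, Suppress, delimitedList

    integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
    ident = Regex(r"[A-Za-z_]\w*")

    # "x = 1, y = 2" -> pairs, converted as they are parsed
    assignment = ident + Suppress("=") + integer
    assignment.setParseAction(lambda t: (t[0], t[1]))
    assignments = delimitedList(assignment)

    print(assignments.parseString("x = 1, y = 2"))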

I know I can’t have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.

Where do I go from here?

Asked By: Tal Weiss


Answers:

It looks like the pyparsing folks have anticipated your problem. From https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst :

Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic – observed improvements have been in the 20-50% range.

However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.

If you’re really a speed demon, but want to keep some of the legibility and declarative syntax, I’d suggest having a look at ANTLR. Probably not the Python-generating backend; I’m skeptical that it’s mature or high-performance enough for your needs. I’m talking about the goods: the C backend that started it all.

Wrap a Python C extension module around the entry point to the parser, and turn it loose.
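As a rough sketch of the Python side (hedged: the library name and the parse_input entry point below are invented for illustration, they are not part of ANTLR), ctypes is the quickest way to try the wiring before committing to a full extension module:

    # Sketch only: assumes the ANTLR-generated C parser has been built into a
    # shared library that exports a hypothetical entry point:
    #     char *parse_input(const char *text);
    import ctypes

    lib = ctypes.CDLL("./libmyparser.so")        # hypothetical library name
    lib.parse_input.argtypes = [ctypes.c_char_p]
    lib.parse_input.restype = ctypes.c_char_p    # e.g. a serialized AST

    def parse(text):
        result = lib.parse_input(text.encode("utf-8"))
        return result.decode("utf-8") if result else None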

Having said that, you’ll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you’ll have to get used to very different ways of doing things. ANTLR has its charms, but it’s not based on combinators, so there’s not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it’s its own DSL, much like lex/yacc, which can present a learning curve – but, because it’s LL based, you’ll probably find it easier to adapt to your needs.

Answered By: Owen S.

There’s no way to know what kind of benefit you’ll get without just testing it, but it’s within the range of possibility that you could get a 10× benefit from Unladen Swallow alone if your process is long-running and repetitive. (Also, if you have many things to parse and you currently start a new interpreter for each one, consider keeping one process alive: Unladen Swallow gets faster – to a point – the longer it runs, so while the first input might not show much gain, you could see significant gains on the 2nd and 3rd inputs in the same process.)

(Note: pull the latest out of SVN – you’ll get far better performance than the latest tarball)

Answered By: Nick Bastin

Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code, pass it back to your Python code via something like SWIG, and process it with your current action rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this buys you nothing unless you write your action rules in C as well (but you might have to do that anyway to avoid paying for whatever impedance mismatch you get between the Python and C code).
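A rough sketch of that split on the Python side (the node layout and the action registry here are invented for illustration; the point is just that the actions run over a finished tree rather than during the parse):

    # "Parse now, act later": run the existing action rules over a finished AST.
    # The Node shape and the ACTIONS registry are invented for this example.
    from collections import namedtuple

    Node = namedtuple("Node", "kind value children")   # what the C side would hand back

    ACTIONS = {}   # your existing action rules, keyed by node kind

    def action(kind):
        def register(fn):
            ACTIONS[kind] = fn
            return fn
        return register

    @action("number")
    def on_number(node, children):
        return int(node.value)

    def apply_actions(node):
        children = [apply_actions(c) for c in node.children]
        handler = ACTIONS.get(node.kind)
        return handler(node, children) if handler else Node(node.kind, node.value, children)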

Answered By: BCS

If you really want performance for large grammars, look no further than SimpleParse (which itself relies on mxTextTools, a C extension). However, be aware that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.

It’s definitely not the more Pythonic route, and you’re going to have to start all over with an EBNF grammar to use SimpleParse.
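For a flavour of what that looks like, here is a minimal sketch (the grammar is deliberately trivial, and the exact EBNF dialect and result format are worth checking against the SimpleParse docs):

    # Minimal SimpleParse sketch: a trivial EBNF grammar for whitespace-separated integers.
    from simpleparse.parser import Parser

    declaration = r"""
    ints := (int, ws?)+
    int  := [0-9]+
    ws   := [ \t]+
    """

    parser = Parser(declaration, "ints")
    success, children, next_char = parser.parse("10 20 30")
    print(success, children, next_char)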

Answered By: jathanism

A bit late to the party, but PLY (Python Lex-Yacc) has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers and yacc-based LR parsers.

I went this route when I hit performance issues with pyparsing.
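For reference, a minimal PLY sketch (a toy grammar for adding integers, not anything from my project): regex rules drive the lexer, and yacc-style functions with the grammar in their docstrings drive the parser.

    # Toy PLY example: sum of integers. Token regexes drive the lexer,
    # docstring grammar rules drive the LALR parser.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("NUMBER", "PLUS")

    t_PLUS = r"\+"
    t_ignore = " \t"

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    precedence = (("left", "PLUS"),)

    def p_expr_plus(p):
        "expr : expr PLUS expr"
        p[0] = p[1] + p[3]

    def p_expr_number(p):
        "expr : NUMBER"
        p[0] = p[1]

    def p_error(p):
        pass

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("1 + 2 + 3"))   # -> 6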

Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.

Answered By: bohrax