Writing a parser for regular expressions

Question:

Even after years of programming, I’m ashamed to say that I’ve never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it’s a technique that I find myself using increasingly often.

So, to teach myself and understand regular expressions properly, I’ve decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I’ll probably abandon as soon as I feel I’ve learnt enough.

To this end, I want to write a regular expression parser in Python. In this case, “learn enough” means that I want to implement a parser that can understand Perl’s extended regex syntax completely. However, it doesn’t have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.

The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.

EDIT: I should clarify that while I’m going to implement the regex parser in Python, I’m not overly fussed about what programming language the examples or articles are written in. As long as it’s not in Brainfuck, I will probably understand enough of it to make it worth my while.

Asked By: Chinmay Kanchi


Answers:

Writing an implementation of a regular expression engine is indeed quite a complex task.

But if you are interested in how to do it, even if you can’t understand enough of the details to actually implement it, I would recommend that you at least look at this article:

Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, …)

It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
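To give a feel for the faster method, here is a minimal sketch that tracks the set of states an automaton could be in, instead of backtracking. It is not the article's C code: the NFA is built by hand for one pathological-looking pattern, n optional a's followed by n mandatory a's, and the function name and state numbering are my own assumptions.

```python
def nfa_match_optional_a(n, text):
    """Match text against (a?){n} followed by a{n} using a set of NFA states.

    States are numbered 0..2n (an assumption of this sketch). For i < n,
    state i may consume 'a' or skip to i + 1; for n <= i < 2n, state i
    must consume 'a'. State 2n is the accepting state.
    """
    def closure(states):
        # Follow the "skip" edges: an optional 'a' may match nothing.
        result = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < n and s + 1 not in result:
                result.add(s + 1)
                stack.append(s + 1)
        return result

    current = closure({0})
    for ch in text:
        if ch != 'a':
            return False
        # Every live state below 2n advances by one on 'a'.
        current = closure({s + 1 for s in current if s < 2 * n})
        if not current:
            return False
    return 2 * n in current


print(nfa_match_optional_a(30, 'a' * 30))  # True
```

The set never holds more than 2n + 1 states, so the work per input character is bounded no matter how many ways the pattern could match.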

Answered By: Mark Byers

There’s an interesting (if slightly short) chapter by Brian Kernighan in the book Beautiful Code, appropriately called “A Regular Expression Matcher”. In it he discusses a simple matcher that can match literal characters and the special symbols . ^ $ and *.
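As a rough idea of how small such a matcher can be, here is a loose Python sketch of a recursive matcher for that same subset (literals plus . ^ $ *). The function names are my own and it is not a transcription of the chapter's code.

```python
def match(regexp, text):
    """Return True if regexp matches anywhere in text."""
    if regexp.startswith('^'):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty suffix.
    return any(match_here(regexp, text[i:]) for i in range(len(text) + 1))


def match_here(regexp, text):
    """Return True if regexp matches at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == '*':
        return match_star(regexp[0], regexp[2:], text)
    if regexp == '$':
        return text == ''
    if text and regexp[0] in ('.', text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(c, regexp, text):
    """Match zero or more copies of c, then the rest of regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and (c == '.' or text[i] == c):
            i += 1
        else:
            return False


print(match('^ab*c$', 'abbbc'))  # True
print(match('^b', 'ab'))         # False
```

Note that this is a backtracking matcher, so it is easy to follow but is exactly the kind of approach that can get slow on awkward patterns.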

Answered By: Richard Fearn

"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it’s been reimplemented in Python at least once.

The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.

Answered By: dhaffey

I’ve already given a +1 to Mark Byers – but as far as I remember the paper doesn’t really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?

I’ll focus on the good approach – creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn’t really too difficult.

What I’ll (very quickly) describe is the approach taken in Modern Compiler Design.

Imagine you have the following regular expression…

a (b c)* d

The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we’ll take as the state where nothing has been matched yet, so the dot goes at the front…

0 : .a (b c)* d

The only possible match is ‘a’, so the next state we derive is…

1 : a.(b c)* d

We now have two possibilities – match the ‘b’ (if there’s at least one repeat of ‘b c’) or match the ‘d’ otherwise. Note – we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we’ll need to queue one of our cases for later consideration, but I’ll ignore that issue from here on. Anyway, we’ve discovered two new states…

2 : a (b.c)* d
3 : a (b c)* d.

State 3 is an end state (there may be more than one). For state 2, we can only match the ‘c’, but we need to be careful with the dot position afterwards. We get “a.(b c)* d” – which is the same as state 1, so we don’t need a new state.
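Written out by hand, the four states above give a small transition table. This is only a sketch: the dictionary layout and the function name are my own choices, and I am assuming the spaces in the pattern are there for readability and are not matched.

```python
# Transition table for a (b c)* d, with states numbered as in the text.
TRANSITIONS = {
    0: {'a': 1},
    1: {'b': 2, 'd': 3},
    2: {'c': 1},          # after 'c' the dot is back where it was in state 1
    3: {},
}
ACCEPTING = {3}


def dfa_match(text, transitions=TRANSITIONS, accepting=ACCEPTING, start=0):
    """Run the table over the whole of text; no backtracking is needed."""
    state = start
    for ch in text:
        nxt = transitions[state].get(ch)
        if nxt is None:
            return False  # no transition on this character
        state = nxt
    return state in accepting


print(dfa_match('abcbcd'))  # True
print(dfa_match('abd'))     # False
```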

IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to…

1 : a.b c (b c)* d
    a.d

That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the ‘.’) as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).

EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.

In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the minimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators – the minimal set of reachable states.

Minimal here refers to the closure process – there may be a smaller equivalent automaton, which is what is normally referred to as minimal.

With this basic idea in mind, it’s not too difficult to say “if I have two state machines representing two sets of strings, how do I derive a third representing the union” (or intersection, or set difference…). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton and perhaps additional details.
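As a minimal sketch of that idea, each state of the combined machine below is a pair holding the current state of each input automaton, and the pairs are discovered with the same kind of closure loop. The (transitions, start, accepting) tuple layout, the use of None as a dead state, and the function names are all assumptions made for this example.

```python
from collections import deque


def product_dfa(dfa_a, dfa_b, alphabet, combine):
    """Combine two DFAs, each given as (transitions, start, accepting).

    transitions maps (state, char) -> state. combine decides acceptance,
    e.g. `lambda x, y: x or y` for union, `lambda x, y: x and y` for
    intersection, `lambda x, y: x and not y` for set difference.
    """
    trans_a, start_a, acc_a = dfa_a
    trans_b, start_b, acc_b = dfa_b
    start = (start_a, start_b)
    transitions, accepting = {}, set()
    seen = {start}
    queue = deque([start])
    while queue:
        a, b = state = queue.popleft()
        if combine(a in acc_a, b in acc_b):
            accepting.add(state)
        for c in alphabet:
            # None stands for the dead state when a transition is missing.
            nxt = (trans_a.get((a, c)), trans_b.get((b, c)))
            transitions[(state, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return transitions, start, accepting


def run(dfa, text):
    transitions, state, accepting = dfa
    for ch in text:
        state = transitions.get((state, ch))
    return state in accepting


# Example inputs: "even number of a's" and "ends in b", over the alphabet ab.
even_a = ({(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}, 0, {0})
ends_b = ({(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, 0, {1})

both = product_dfa(even_a, ends_b, 'ab', lambda x, y: x and y)
either = product_dfa(even_a, ends_b, 'ab', lambda x, y: x or y)
print(run(both, 'aab'), run(both, 'ab'))    # True False
print(run(either, 'ab'), run(either, 'a'))  # True False
```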

If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You start by grouping your states into as few equivalence classes or “blocks” as possible (in practice, end states in one block and everything else in another). Then you repeatedly test whether you need to split blocks (the states aren’t really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.

Hopcroft’s algorithm is an efficient way to handle this basic idea.
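Here is a sketch of the basic block-splitting idea. To be clear, this is the simple repeated-refinement version (usually credited to Moore), not Hopcroft's faster algorithm, and it assumes the starting split is end states versus everything else. The function name and data layout are my own.

```python
def minimise(states, alphabet, transitions, accepting):
    """Return a dict mapping each state to the number of its block.

    transitions maps (state, char) -> state; a missing entry means the
    character cannot be matched in that state.
    """
    # Initial split: end states can never be equivalent to the rest.
    block = {s: int(s in accepting) for s in states}
    while True:
        # A state's signature is its own block plus, for each character,
        # the block it reaches (None if there is no transition).
        signature = {
            s: (block[s],) + tuple(
                block.get(transitions.get((s, c))) for c in alphabet
            )
            for s in states
        }
        numbering = {}
        new_block = {}
        for s in states:
            new_block[s] = numbering.setdefault(signature[s], len(numbering))
        if len(numbering) == len(set(block.values())):
            return new_block  # no block was split, so we are done
        block = new_block


# a (b c)* d again, but with a redundant copy (state 4) of state 1.
transitions = {
    (0, 'a'): 1,
    (1, 'b'): 2, (1, 'd'): 3,
    (2, 'c'): 4,
    (4, 'b'): 2, (4, 'd'): 3,
}
blocks = minimise([0, 1, 2, 3, 4], 'abcd', transitions, accepting={3})
print(blocks[1] == blocks[4])     # True
print(len(set(blocks.values())))  # 4
```

States that end up in the same block can be merged, which here brings the five-state machine back down to the four states found earlier.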

A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form (up to renaming of states). Furthermore, as long as the states are numbered in a consistent order, Hopcroft’s algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a “canonical” representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.

The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them, but with a bit of luck this gives a fair quick introduction to the basic ideas.

BTW – have a look around the rest of Dick Grune’s site – he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you’ll see, there’s a second edition imminent.

Answered By: user180247

I do agree that writing a regex engine will improve understanding, but have you taken a look at ANTLR? It generates parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and running through the AST and parser that it generates. The generated code is really complicated, but working through it will give you a good understanding of how a parser works.

Answered By: A_Var