Sly parsing of multiple statements where all but the last one must be terminated with a Newline

Question:

I have a scripting language that I am implementing where every statement is terminated by a newline, except possibly the last one. It’s also possible to have lines that are just newlines. Here’s one implementation (statement code omitted since it doesn’t really add anything):

@_('statement_set statement', 'statement_set')
def script(self, p):
    if len(p) > 1:
        p.statement_set.append(p.statement)
    
    return p.statement_set

@_('statement_or_NL')
def statement_set(self, p):
    return p.statement_or_NL

@_('statement_set statement_or_NL')
def statement_set(self, p):
    p.statement_set.append(p.statement_or_NL)
    return p.statement_set

@_('statement NEWLINE', 'NEWLINE')
def statement_or_NL(self, p):
    if len(p) > 1:
        return [p.statement]
    else:
        return []

@_('PLUS', 'MINUS')
def statement(self, p):
    return p[0]

That generates the following rules:

Rule 0     S' -> script
Rule 1     script -> statement_set
Rule 2     script -> statement_set statement
Rule 3     statement_set -> statement_set statement_or_NL
Rule 4     statement_set -> statement_or_NL
Rule 5     statement_or_NL -> NEWLINE
Rule 6     statement_or_NL -> statement NEWLINE

If you’d like, I can post the generated states as well. The issue I’m running into is that, even when I eliminate all shift-reduce conflicts, I still almost always fail in one of the three scenarios of code that starts with a newline, code that ends with a newline, and code that has no newline at the end. Something goes awry in the parsing such that it will read the final statement (which ends with an EOF) as a "statement NEWLINE" and then error out because there is no new-line. Worse comes to worse, I have a version that works as long as there’s no final statement without a newline, so I could probably rig the parser to either add it to the initial string with the code, or to insert the token, but that feels clunky.

I got this working in Ply previously, and I thought it was essentially the same code, but it does not seem to work for me.

Asked By: Sean Duggan

||

Answers:

Here’s a fairly general solution for this sort of problem. It basically takes the statement list to be a two-state automaton, alternating between "ends-with-newline" and "ends-with-statement". The final result is a list of all statements (although you could do something more interesting with the newlines as well).

Since all newlines are ignored except for their role in separating statements, you can have any number of consecutive newlines, and newline(s) at the end of the input are optional. The simple transition diagram:

CURRENT STATE          INPUT       RESULT STATE
-------------------  ---------     -------------------
ends-with-newline       NL         ends-with-newline
ends-with-newline    statement     ends-with-statement
ends-with-statement     NL         ends-with-newline
ends-with-statement  statement     *ERROR*

Finite automaton are easily converted into CFGs using the template:

RESULT: CURRENT INPUT

That leads to the following (in which the actual definition of statement is omitted, since the only thing that matters about it is that it cannot derive anything which starts with a statement terminator, which would be unusual to say the least, and that it must not derive empty, which occasionally requires some adjustments):

    @_('script_ends_nl', 'script_ends_stmt')
    def script(self, p):
        return p[0]

    @_('')
    def script_ends_nl(self, p):
        return []

    @_(' script_ends_nl   NL ',
       ' script_ends_stmt NL ')
    def script_ends_nl(self, p):
        return p[0]

    @_('script_ends_nl statement')
    def script_ends_stmt(self, p):
        return p.script_ends_nl + [p.statement]

Or, using Sly’s debugfile, the grammar:

Rule 0     S' -> script
Rule 1     script -> script_ends_stmt
Rule 2     script -> script_ends_nl
Rule 3     script_ends_nl -> script_ends_stmt NL
Rule 4     script_ends_nl -> script_ends_nl NL
Rule 5     script_ends_nl -> <empty>
Rule 6     script_ends_stmt -> script_ends_nl statement
Answered By: rici
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.