ANTLR4 parse issues when changing a lexer rule to a parser rule | python

Question:

I am new to ANTLR and have been having trouble getting my source code to parse properly. This is my grammar:

compilationUnit
    : (assignment | declarationList | definitionList)* EOF
    ;

block
    : LC RC
    ;

assignment: typeSpecifier? IDENTIFIER '=' expression ';';

expression
    : INTEGER
    ;

statementList
    :
    ;

declarationList
    : declaration
    | declarationList declaration
    ;

declaration
    : functionDeclaration SEMICOLON
    ;
functionDeclaration
    : typeSpecifier? functionName functionArgs
    ;

definitionList
    : functionDefinition
    ;
functionDefinition: functionDeclaration block;


functionName: IDENTIFIER;
functionArgs: LP RP;

typeSpecifier: VOID | INT;

TYPE_SPECIFIER
    : VOID
    | INT
    ;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
INTEGER: [1-9][0-9]*;



STRING_LITERAL: '"' ~('"')* '"';

VOID: 'void';
INT: 'int';
STAR: '*';
LP: '(';
RP: ')';
LC: '{';
RC: '}';
LSQRB: '[';
RSQRB: ']';
SEMICOLON: ';';

WS: [ \t\r\n]+ -> skip;

NEWLINE
    :   (   '\r' '\n'?
        |   '\n'
        )
        -> skip
    ;

BLOCK_COMMENT
    :   '/*' .*? '*/'
        -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]*
        -> skip
    ;

The problem is that typeSpecifier does not get matched properly unless I change it to a lexer rule. If I input something like this:

void b();
int a = 1;

It returns:

line 1:0 extraneous input 'void' expecting {<EOF>, IDENTIFIER, 'void', 'int'}
line 2:0 extraneous input 'int' expecting {<EOF>, IDENTIFIER, 'void', 'int'}

But if I rename typeSpecifier to TYPE_SPECIFIER, it parses with no errors. The problem with that is that, for an assignment such as int a = 1;, I cannot distinguish between nodes and terminal nodes, and identifiers have the same issue. It returns:

'int' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'a' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'=' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'1' = <class 'core.CParser.CParser.ExpressionContext'>
';' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>

I would like it to return something more like:

'int' = <class 'antlr4.tree.Tree.TypeSpecifier'>
'a' = <class 'antlr4.tree.Tree.Identifier'>
'=' = <class 'antlr4.tree.Tree.AssignEq'> #or something like that
'1' = <class 'core.CParser.CParser.ExpressionContext'>
';' = <class 'antlr4.tree.Tree.SemiCol'>

This is my Python listener code:

from core.CParser import CParser
from core.CListener import CListener
from io import FileIO
from antlr4.tree.Tree import TerminalNodeImpl


class Listener(CListener):
    def __init__(self, output):
        self.output: FileIO = output

    def add_newline(self):
        self.output.write('\n')

    def enterDeclaration(self, ctx: CParser.DeclarationContext):
        ...

    def enterFunctionDeclaration(self,
                                 ctx: CParser.FunctionDeclarationContext):
        for child in ctx.getChildren():
            if isinstance(child, TerminalNodeImpl):
                self.output.write(child.getText() + ' ')
            if isinstance(child, CParser.FunctionNameContext):
                self.output.write(child.getText())
            if isinstance(child, CParser.FunctionArgsContext):
                self.output.write(child.getText())

        self.output.write(';')
        self.add_newline()

    def enterAssignment(self, ctx: CParser.AssignmentContext):
        for child in ctx.getChildren():
            if isinstance(child, TerminalNodeImpl):
                self.output.write(child.getText() + ' ')
            if isinstance(child, CParser.ExpressionContext):
                self.output.write(child.getText())
        self.add_newline()


    def enterBlock(self, ctx: CParser.BlockContext):
        print(ctx.getText())

Thank you in advance 🙂

Asked By: SGB

||

Answers:

You have both of the following:

typeSpecifier: VOID | INT;

TYPE_SPECIFIER
    : VOID
    | INT
    ;

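Yet you never use the TYPE_SPECIFIER token in any of your parser rules. Because the TYPE_SPECIFIER lexer rule is defined before VOID and INT and matches the same text, the lexer assigns it as the token type; you’ll never see a VOID or INT token with this rule in place. You’re effectively making them fragment rules. That is also why the error reports 'void' as extraneous while listing 'void' among the expected tokens: the token the parser received is a TYPE_SPECIFIER, not a VOID.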

Delete the TYPE_SPECIFIER lexer rule (you had the right idea to begin with).

However, you also need to move your IDENTIFIER rule below all of the keyword rules in your lexer. When two lexer rules match the same longest stretch of input, ANTLR uses the rule defined first, so with IDENTIFIER defined above VOID and INT you’ll never see the keyword tokens: the keywords also match the more generic IDENTIFIER rule.
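The effect of rule order can be illustrated with a small toy model in Python. This is only a sketch of the "longest match wins, ties go to the first rule defined" behaviour, not the ANTLR runtime: the tokenize function, the rule lists, and the regular expressions are stand-ins written for this example, approximating a few of the lexer rules from the question.

```python
import re


def tokenize(rules, text):
    """Toy model of lexer rule selection: longest match wins, ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: on equal length the earlier rule keeps the win.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Hypothetical approximations of the lexer rules in the question.
TYPE_SPECIFIER = ("TYPE_SPECIFIER", r"void|int")
IDENT = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*")
KEYWORDS = [("VOID", r"void"), ("INT", r"int")]
PUNCT = [("LP", r"\("), ("RP", r"\)"), ("SEMICOLON", r";"), ("WS", r"[ \t\r\n]+")]

original_order = [TYPE_SPECIFIER, IDENT] + KEYWORDS + PUNCT  # grammar as posted
identifier_first = [IDENT] + KEYWORDS + PUNCT                # TYPE_SPECIFIER deleted only
keywords_first = KEYWORDS + [IDENT] + PUNCT                  # keywords moved above IDENTIFIER

source = "void b();"
print(tokenize(original_order, source)[0])    # ('TYPE_SPECIFIER', 'void')
print(tokenize(identifier_first, source)[0])  # ('IDENTIFIER', 'void')
print(tokenize(keywords_first, source)[0])    # ('VOID', 'void')
```

Only the last ordering produces a VOID token for the parser rule typeSpecifier to match. A longer name such as voidx still comes out as IDENTIFIER in that ordering, because the longest match is preferred before rule order is considered.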

—-

Also, it’s a really good idea to define a start rule that ends with EOF to ensure all input is parsed (your compilationUnit rule already does this).

Answered By: Mike Cargal