How can I ignore comments in a string, based on compiler design?

Question:

I want to ignore every comment like { comments } and // comments.
I have a pointer named peek that checks my string character by character. I know how to ignore newlines, tabs, and spaces but I don’t know how to ignore comments.

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''

for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring WS's and comments
        if(len(tmp)>0): 
            print(tmp)

        tmp = ''
    
    else:
        tmp += peek

Here is my result:

begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end

As you see spaces are ignored but comments aren’t.

How can I get a result like below?

begin
west
west
north
north
north
west
east
east
south
end
Asked By: sep_The_new_elixiR

||

Answers:

Simply use a global variable skip = False: set it to True when you reach {, set it back to False when you reach }, and run the rest of your if/else only when not skip:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''
skip = False

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:

        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring WS's and comments
            if(len(tmp)>0): 
                print(tmp)
            tmp = ''
        else:
            tmp += peek

Comments may also be nested, { { } }, like this:

{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n

so it is better to use skip as a counter of the currently open { braces:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n end
"""

tokens = []
tmp = ''
skip = 0

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:

        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring WS's and comments
            if(len(tmp)>0): 
                print(tmp)
            tmp = ''
        else:
            tmp += peek
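
The loops above skip only { } comments, so north//comment1 still comes out as a single token. Below is a minimal sketch that extends the same character-by-character idea to // comments as well. The function name tokenize_without_comments and the sample text are made up for illustration, and it assumes that a // comment always runs to the end of its line and that anything inside { } is ignored except nested braces.

```python
def tokenize_without_comments(text):
    """Split text into lowercase words, skipping // and nested { } comments."""
    tokens = []
    tmp = ''
    depth = 0                # number of currently open '{'
    in_line_comment = False  # True while inside a // comment
    text = text.lower()

    def flush():
        # Store the word collected so far, if there is one.
        nonlocal tmp
        if tmp:
            tokens.append(tmp)
            tmp = ''

    i = 0
    while i < len(text):
        peek = text[i]
        if in_line_comment:
            # A // comment ends at the end of the line.
            if peek == '\n':
                in_line_comment = False
        elif depth > 0:
            # Inside a { } comment: only track nesting.
            if peek == '{':
                depth += 1
            elif peek == '}':
                depth -= 1
        elif peek == '{':
            flush()
            depth = 1
        elif text.startswith('//', i):
            flush()
            in_line_comment = True
            i += 1           # skip the second '/'
        elif peek.isspace():
            flush()
        else:
            tmp += peek
        i += 1

    flush()                  # last word, if the text does not end in whitespace
    return tokens


sample = """  beGIn west   WEST north//comment1
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""

for token in tokenize_without_comments(sample):
    print(token)
```

A word is flushed as soon as a comment starts, so north//comment1 yields north and nothing else.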

It might be better to turn everything, comments included, into tokens first and filter out the comment tokens afterwards, but I did not pursue that idea here.
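
For what it is worth, the idea of tokenizing everything first and filtering afterwards could look roughly like the sketch below, which uses only the standard re module. The names TOKEN_RE, lex and words are hypothetical, and the sketch assumes that { } comments are not nested, because a plain regular expression cannot match arbitrarily nested braces.

```python
import re

# One named group per kind of token; earlier alternatives win.
TOKEN_RE = re.compile(r"""
    (?P<LINE_COMMENT>//[^\n]*)          # // up to the end of the line
  | (?P<BLOCK_COMMENT>\{[^{}]*\})       # { ... } without nesting
  | (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<WS>\s+)
  | (?P<OTHER>.)                        # any other single character
""", re.VERBOSE)


def lex(text):
    """Yield (kind, value) pairs for every piece of text, comments included."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


def words(text):
    """Filter step: keep only NAME tokens, lowercased."""
    return [value.lower() for kind, value in lex(text) if kind == 'NAME']


sample = "beGIn west //c1\nnorth { multi\nline } East\n"
print(words(sample))
```

Keeping the comment tokens until the filter step means they stay available, for example for error messages or line counting.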


EDIT:

A version using the Python module sly, which works similarly to the C/C++ tools lex/yacc.

I found the regex for MULTI_LINE_COMMENT in another parser-building tool, lark (see its syntax for multiline comments):

from sly import Lexer

class MyLexer(Lexer):
    # Create it before defining the regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'{(.|\n)*}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    
    text =  """  beGIn west   WEST north//comment1 
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""
    
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)

Result:

NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1 
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
    { comment1 }
    comment2
    { comment3 }
}
NAME : end
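
One caveat about the {(.|\n)*} pattern: * is greedy, so when the input contains two separate { } comments, the match runs from the first { to the last } and swallows everything in between. The sketch below shows this with the standard re module rather than sly; the sample text is made up. The non-greedy *? variant stops at the first }, which fixes the case of separate comments but no longer covers nested ones.

```python
import re

text = "begin { one } west { two } end"

# Greedy: one match from the first '{' to the last '}'.
greedy = re.findall(r'\{(?:.|\n)*\}', text)

# Non-greedy: each match stops at the nearest '}'.
lazy = re.findall(r'\{(?:.|\n)*?\}', text)

# Removing the non-greedy matches leaves only the words.
remaining = re.sub(r'\{(?:.|\n)*?\}', ' ', text).split()

print(greedy)
print(lazy)
print(remaining)
```

With the greedy pattern the word west would be lost as part of the comment.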
Answered By: furas

@furas's answer works, but to make it count newlines properly, use the _ decorator:

@_(r'{(.|\n)*}')
def MULTILINE_COMMENT(self, t):
    self.lineno += t.value.count('\n')
    return t
Answered By: Björn Lindqvist