How can I ignore comments in a string based on compiler design?
Question:
I want to ignore every comment like { comments } and // comments.
I have a pointer named peek that checks my string character by character. I know how to ignore newlines, tabs, and spaces but I don’t know how to ignore comments.
string = """ beGIn west WEST north//comment1 n
north north west East east southn
// comment westn
{n
commentn
}n end
"""
tokens = []
tmp = ''
for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring WS's and comments
        if len(tmp) > 0:
            print(tmp)
        tmp = ''
    else:
        tmp += peek
Here is my result:
begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end
As you can see, spaces are ignored but comments aren't.
How can I get a result like below?
begin
west
west
north
north
north
west
east
east
south
end
Answers:
Simply use a global flag skip = False, set it to True when you get {, set it back to False when you get }, and run the rest of your if/else only inside if not skip:
string = """ beGIn west WEST north//comment1 n
north north west East east southn
// comment westn
{n
commentn
}n end
"""
tokens = []
tmp = ''
skip = False
for i, peek in enumerate(string.lower()):
    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:
        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring WS's and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek
Because you may have nested { { } } comments, like

{\n
{ comment1 }\n
comment2\n
{ comment3 }\n
}\n

it is better to use skip as a counter of { and }:
string = """ beGIn west WEST north//comment1 n
north north west East east southn
// comment westn
{n
{ comment1 }n
comment2n
{ comment3 }n
}n end
"""
tokens = []
tmp = ''
skip = 0
for i, peek in enumerate(string.lower()):
    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:
        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring WS's and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek
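Note that the counter above only removes { } comments, while the expected output also drops // comments (like north//comment1). A minimal sketch combining both, wrapped in a function for clarity (the flag name in_line_comment is my own):

```python
def tokenize(text):
    tokens = []
    tmp = ''
    skip = 0                  # depth of nested { } comments
    in_line_comment = False   # inside a // comment until end of line
    for ch in text.lower():
        if in_line_comment:
            if ch == '\n':
                in_line_comment = False
                if tmp:               # flush the word buffered before //
                    tokens.append(tmp)
                tmp = ''
            continue
        if ch == '{':
            skip += 1
        elif ch == '}':
            skip -= 1
        elif skip == 0:
            if ch == '/' and tmp.endswith('/'):
                tmp = tmp[:-1]        # drop the first '/' already buffered
                in_line_comment = True
            elif ch in (' ', '\n'):
                if tmp:
                    tokens.append(tmp)
                tmp = ''
            else:
                tmp += ch
    if tmp:
        tokens.append(tmp)
    return tokens
```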
But maybe it would be better to collect everything as tokens and filter the tokens later. I skip this idea here.
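For completeness, that skipped idea could look like this regex-based sketch (the pattern names are my own, and the { } pattern does not handle nested comments):

```python
import re

# Match comments and words as separate token kinds, then drop the comments.
# Note: the MULTI pattern stops at the first '}', so nested { } comments
# are not handled correctly.
TOKEN_RE = re.compile(r'(?P<MULTI>\{[^}]*\})|(?P<ONE>//[^\n]*)|(?P<NAME>\w+)')

def tokens_with_filter(text):
    all_tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text.lower())]
    # keep only the NAME tokens, filtering the comment tokens out
    return [value for kind, value in all_tokens if kind == 'NAME']
```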
EDIT:
Here is a version using the Python module sly, which works similarly to the C/C++ tools lex/yacc. The regex for MULTI_LINE_COMMENT I found in another tool for building parsers, lark:
from sly import Lexer

class MyLexer(Lexer):
    # Create it before defining regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'{(.|\n)*}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1
if __name__ == '__main__':
    text = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{
{ comment1 }
comment2
{ comment3 }
}
end
"""
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)
Result:
NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
{ comment1 }
comment2
{ comment3 }
}
NAME : end
@furas' answer works, but to make it count newlines properly, use the _ decorator inside the lexer class (the method name must match the declared token, MULTI_LINE_COMMENT):

@_(r'{(.|\n)*}')
def MULTI_LINE_COMMENT(self, t):
    self.lineno += t.value.count('\n')
    return t