Parsing / reformatting a tokenized list in Python

Question:

I have lists of tokens of the form
"(a OR b) AND c OR d AND c"
or
"(a OR b) AND c OR (d AND c)"

Expressions can be of arbitrary form, like (a or b) and c or (d and c).

I want Python code that reformats these expressions as:
{{or {and {or a b} c} {and d c}}}

I have code that works for some token lists, but not for others:

def parse_expression(tokens):

    if len(tokens) == 1:
        return tokens[0]
    
    # Find the top-level operator (either 'and' or 'or')
    parens = 0
    for i in range(len(tokens) - 1, -1, -1):
        token = tokens[i]
        if token == ')':
            parens += 1
        elif token == '(':
            parens -= 1
        elif parens == 0 and token in {'AND', 'OR'}:
            op = token
            break
    else:
        print('Invalid expression')
    
    # Recursively parse the sub-expressions
    left_tokens = tokens[:i]
    right_tokens = tokens[i+1:]
    print(f"{i} left {left_tokens}")
    print(f"{i} right {right_tokens}")
    if op == 'AND':
        left = parse_expression(left_tokens)
        right = parse_expression(right_tokens)
        return f'(and {left} {right})'
    else:
        left = parse_expression(left_tokens)
        right = parse_expression(right_tokens)
        return f'(or {left} {right})'
        

x=list()
x = ['x', 'AND', 'y', 'AND', 'z', 'AND', '(', '(', 'a', 'AND', 'b', ')', 'OR', '(', 'c', 'AND', 'd', ')', ')']
y = ['x', 'AND', 'y', 'AND', 'z', 'AND', '(', 'w', 'AND', 'q', ')']

It seems to work without parentheses, but not when I use them.

When I try to reformat these with the parser, I keep getting

Traceback (most recent call last):
  File "./prog.py", line 41, in <module>
  File "./prog.py", line 29, in parse_expression
  File "./prog.py", line 27, in parse_expression
UnboundLocalError: local variable 'op' referenced before assignment

What am I doing wrong?

Asked By: elbillaf

Answers:

The problem you have is with the way you evaluate sub-expressions.

Consider the following sub-expression (the rightmost part of x):

['(', '(', 'a', 'AND', 'b', ')', 'OR', '(', 'c', 'AND', 'd', ')', ')']

Your approach is to find the top-level operator, split the tokens around it, and move the operator to the front.
In this example, that operator is "OR". You detect it by checking whether the parenthesis depth is 0 when you reach it. Here, however, the depth at the "OR" is 1, because the whole sub-expression is wrapped in an outer pair of parentheses. The loop therefore finishes without ever assigning op, the else branch only prints a message, and the later check of op raises the error you see. (i is still assigned, but it ends at 0, which is not a meaningful split point.)
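To make the depth problem concrete, here is a minimal sketch that records the parenthesis depth at each operator, scanning right to left the same way the question's loop does. The helper name operator_depths is hypothetical and only for illustration.

```python
def operator_depths(tokens):
    """Return (operator, depth) pairs, scanning right to left like the question's loop."""
    depth = 0
    found = []
    for token in reversed(tokens):
        if token == ')':
            depth += 1
        elif token == '(':
            depth -= 1
        elif token in {'AND', 'OR'}:
            found.append((token, depth))
    return found


sub = ['(', '(', 'a', 'AND', 'b', ')', 'OR', '(', 'c', 'AND', 'd', ')', ')']
print(operator_depths(sub))  # [('AND', 2), ('OR', 1), ('AND', 2)]
```

No operator is ever seen at depth 0, so the loop in parse_expression never reaches its break.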

One way to fix this is to retry with the outer parentheses removed when the first pass finds no top-level operator and the first and last tokens are an opening and a closing parenthesis. Here is a quick fix:

def parse_expression(tokens):
    if len(tokens) == 1:
        return tokens[0]

    # Find the top-level operator (either 'and' or 'or')
    parens = 0
    for i in range(len(tokens) - 1, -1, -1):
        token = tokens[i]

        if parens < 0:
            raise ValueError('Invalid expression: unbalanced parentheses')

        if token == ')':
            parens += 1
        elif token == '(':
            parens -= 1
        elif parens == 0 and token in {'AND', 'OR'}:
            op = token
            break
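    else:
        # No top-level operator: strip one enclosing pair of parentheses and retry
        if tokens[0] == "(" and tokens[-1] == ")":
            return parse_expression(tokens[1:-1])
        # Raise instead of printing, so execution never reaches the unassigned op below
        raise ValueError('Invalid expression')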

    # Recursively parse the sub-expressions on either side of the operator
    left = parse_expression(tokens[:i])
    right = parse_expression(tokens[i+1:])
    return f'({op.lower()} {left} {right})'

There is probably a cleaner algorithm for this, but this version at least works on your examples, so you can continue from here.
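As one possible cleaner approach, here is a sketch of a small recursive-descent parser. It is not the code from the question. It assumes AND binds tighter than OR, which is what the desired output in the question implies. The right-to-left scan above splits at the last top-level operator regardless of its type, so for "(a OR b) AND c OR d AND c" it yields (and (or (and (or a b) c) d) c) instead. The name to_prefix and its inner helpers are hypothetical, and the output uses parentheses like the code above, not the braces shown in the question.

```python
def to_prefix(tokens):
    """Convert infix AND/OR tokens to a prefix string.

    Assumption: AND binds tighter than OR, and both are left-associative.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        # or_expr := and_expr ('OR' and_expr)*
        node = parse_and()
        while peek() == 'OR':
            advance()
            node = f'(or {node} {parse_and()})'
        return node

    def parse_and():
        # and_expr := atom ('AND' atom)*
        node = parse_atom()
        while peek() == 'AND':
            advance()
            node = f'(and {node} {parse_atom()})'
        return node

    def parse_atom():
        # atom := NAME | '(' or_expr ')'
        token = advance()
        if token == '(':
            node = parse_or()
            if advance() != ')':
                raise ValueError('Invalid expression: missing closing parenthesis')
            return node
        if token is None or token in {')', 'AND', 'OR'}:
            raise ValueError(f'Invalid expression: unexpected token {token!r}')
        return token

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f'Invalid expression: unexpected token {tokens[pos]!r}')
    return result


tokens = ['(', 'a', 'OR', 'b', ')', 'AND', 'c', 'OR', 'd', 'AND', 'c']
print(to_prefix(tokens))  # (or (and (or a b) c) (and d c))
```

Each precedence level gets its own function, so nested parentheses need no special retry step: parse_atom simply recurses into parse_or when it sees an opening parenthesis.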

I hope that helps. 🙂

Answered By: Egeau