Parsing left associative operators using RLMeta

Published on 2019-09-07.

The first article about RLMeta has this example grammar implementing a simple calculator:

Calculator {
  expression =
    | additive
  additive =
    | multitive:x '+' additive:y -> add(x y)
    | multitive
  multitive =
    | digit:x '*' multitive:y    -> mul(x y)
    | digit
  digit =
    | '0'-'9':x                  -> int(x)
}

It works, but if we were to add a second case to support subtraction, it would break because subtraction is a left associate operator but the calculator creates a right associative parse. In this article I show how it would brake and how it can be solved.

What is operator associativity?
Right associative calculator
Left associative calculator
Correct left associative calculator
Right associative calculator without recursion
Calculator combining operators
Cleaner handling of precedence
Comparison of operator parsers
Resources
Appendix: Test script

What is operator associativity?

Operator associativity has to do with the order that operators are evaluated. The following expression can be evaluated in two ways:

1-2-3

Either as this (if the operator is left associative):

(1-2)-3

Or as this (if the operator is right associative):

1-(2-3)

Let's see how the calculator evaluates expressions.

Right associative calculator

The following grammar works like the calculator but only supports subtraction:

right.rlmeta

Calculator {
  expr  = digit:x '-' expr:y -> sub(x y)
        | digit
  digit = '0'-'9':x          -> int(x)
}

The sub function is implemented in two ways. The first creates an AST node representing a subtraction:

ast.py

def sub(left, right):
    return ["-", left, right]

The second evaluates the subtraction expression:

eval.py

def sub(left, right):
    return left - right

Let's see how the calculator handles the following two expressions:

1-2
1-2-3

Below are the ASTs and the evaluated results:

['-', 1, 2]
=>
-1

['-',
 1,
 ['-', 2, 3]]
=>
2

For an expression with only two numbers, the calculator works as expected. But with more numbers, the parse is incorrect. The calculator parses 1-2-3 as 1-(2-3) which is incorrect because subtraction is left associative and should thus be parsed as (1-2)-3.

The original calculator does not have this problem because it only supports operators that evaluate the same no matter if parsed as left associative or right associative.

If we strip the semantic actions from the expr rule it is easier to see what creates this right associative parse:

expr = digit '-' expr | digit

The topmost expression will be parsed as a digit followed by an arbitrary complex expression: (digit - (...)).

Let's see if we can get a left associative parse instead.

Left associative calculator

In order to get a left associate parse (that we need for subtraction) we would like to write the expr rule like this instead:

expr = expr '-' digit | digit

The whole grammar would then look like this:

left.rlmeta

Calculator {
  expr  = expr:x '-' digit:y -> sub(x y)
        | digit
  digit = '0'-'9':x          -> int(x)
}

Unfortunately, this will not work with the parsing algorithm that RLMeta uses. In order to parse an expr it first has to parse an expr and so on, and it will get stuck in an infinite loop. The parsing algorithm is based on recursive descent parsing and the first choice in the expr rule will thus be translated into something like this:

def expr():
    x = expr()
    atom("-")
    y = digit()
    return sub(x, y)

We need to handle left associative operators differently.

Correct left associative calculator

The right associative parse that we saw earlier can be rewritten using a repetition:

expr = digit '-' expr | digit

It then becomes this:

expr = digit ('-' digit)*

It parses the same expressions, but it does not create a right associative parse. What does it create? Something that consist of a digit and a list of something. We can turn this into a parse tree using a function that we call leftAssoc. Here is the complete grammar:

left_correct.rlmeta

Calculator {
  expr  = digit:x (op:y digit:z -> [y z])*:xs -> leftAssoc(x xs)
  digit = '0'-'9':x                           -> int(x)
  op    = '-'                                 -> makeSub()
}

In the semantic action for expr, x is bound to a digit, and xs is bound to a list of pairs consisting of an operator and a digit: [[op, digit], [op, digit], ...]. The operator is a function that takes two arguments: the left hand side and the right hand side. (makeSub returns the sub function. It is used because currently global variables can not be referenced from semantic actions. Only global functions.) The function leftAssoc turns this into a left associative tree:

support.py

def leftAssoc(expr, items):
    while items:
        op, rhs = items.pop(0)
        expr = op(expr, rhs)
    return expr

Here is how the expression 1-2-3 is parsed:

leftAssoc is called with expr=1 and items=[[sub, 2], [sub, 3]].
Since there are still items, the loop is entered
- op=sub and rhs=2
- expr=sub(1, 2)
Since there are still items, the loop is entered
- op=sub and rhs=3
- expr=sub(sub(1, 2), 3)
Since there are no more items, sub(sub(1, 2), 3) is returned

Let's verify that the calculator handles the following expressions as intended:

1-2
1-2-3

Below are the ASTs and the evaluated results:

['-', 1, 2]
=>
-1

['-',
 ['-', 1, 2],
 3]
=>
-4

And indeed it does. We have now successfully parsed a left associate operator using RLMeta.

Right associative calculator without recursion

We can handle a right associative parse in RLMeta with a recursive rule as we have seen before:

expr = digit '-' expr | digit

However, we can also obtain a right associative parse by parsing operators as a list and then converting them into an AST with a rightAssoc function. Here is a grammar that supports only exponentiation which is right associative:

right_no_recursion.rlmeta

Calculator {
  expr  = digit:x (op:y digit:z -> [y z])*:xs -> rightAssoc(x xs)
  digit = '0'-'9':x                           -> int(x)
  op    = '^'                                 -> makePow()
}

It has the exact same structure as the previous calculator, only now a different function is called to create the parse tree. The rightAssoc function is also similar in structure to the leftAssoc function, but the loop is replaced with an if statement followed by a recursive call:

support.py

def rightAssoc(expr, items):
    if items:
        op, rhs = items.pop(0)
        expr = op(expr, rightAssoc(rhs, items))
    return expr

Here is how the expression 3^2^2 is parsed:

rightAssoc is called with expr=3 and items=[[pow, 2], [pow, 2]]
Since there are still items, the if statement is entered
- op=pow and rhs=2
- expr=pow(3, rightAssoc(2, [[pow, 2]])
rightAssoc is called with expr=2 and items=[[pow, 2]]
Since there are still items, the if statement is entered
- op=pow and rhs=2
- expr=pow(2, rightAssoc(2, [])
rightAssoc is called with expr=2 and items=[]
Since there are no more items, 2 is returned
The parent call returns pow(2, 2)
The parent call returns pow(3, pow(2, 2))

The two pow functions are defined like this:

ast.py

def pow(left, right):
    return ["^", left, right]

eval.py

def pow(left, right):
    return left ** right

Let's verify that the calculator handles the following expressions as intended:

3^2
3^2^2

Below are the ASTs and the evaluated results:

['^', 3, 2]
=>
9

['^',
 3,
 ['^', 2, 2]]
=>
81

And indeed it does.

Calculator combining operators

Now that we know how to handle both left and right associative operators, we can extend the calculator to handle more operators:

calculator.rlmeta

Calculator {
  expr  = expr1
  expr1 = expr2:x (op1:y expr2:z -> [y z])*:xs -> leftAssoc(x xs)
  expr2 = expr3:x (op2:y expr3:z -> [y z])*:xs -> leftAssoc(x xs)
  expr3 = digit:x (op3:y digit:z -> [y z])*:xs -> rightAssoc(x xs)
  digit = '0'-'9':x                            -> int(x)
  op1 =
    | '+' -> makeAdd()
    | '-' -> makeSub()
  op2 =
    | '*' -> makeMul()
    | '/' -> makeDiv()
  op3 =
    | '^' -> makePow()
}

Different levels of expressions are used to handle operators having different precedence (multiplication is done before subtraction for example).

The additional arithmetic functions are defined like this:

ast.py

def add(left, right):
    return ["+", left, right]

def mul(left, right):
    return ["*", left, right]

def div(left, right):
    return ["/", left, right]

eval.py

def add(left, right):
    return left + right

def mul(left, right):
    return left * right

def div(left, right):
    return left / right

Let's see how the calculator handles the following expression:

1+2-3^2^2-5

Below is the AST and the evaluated result:

['-',
 ['-',
  ['+', 1, 2],
  ['^',
   3,
   ['^', 2, 2]]],
 5]
=>
-83

If we enter the same expression in the Python prompt, but replace ^ with ** as Python uses, we get the same result:

>>> 1+2-3**2**2-5
-83

We can now parse complicated expressions correctly. However, if there are many precedence levels in a grammar, it might become difficult to read. Can we do better?

Cleaner handling of precedence

Here is the calculator grammar from the previous section rewritten in a cleaner way:

calculator_ops.rlmeta

Calculator {
  expr  = digit:x (op:y digit:z -> [y z])*:xs -> parseOps(x xs)
  digit = '0'-'9':x                           -> int(x)
  op    =
    | '+' -> Op(makeAdd() "1" "left")
    | '-' -> Op(makeSub() "1" "left")
    | '*' -> Op(makeMul() "2" "left")
    | '/' -> Op(makeDiv() "2" "left")
    | '^' -> Op(makePow() "3" "right")
}

In this version all operators are parsed as a list and then handed over to the parseOps function. The operators also return an Op object instead of just a function to evaluate the operator. The Op object has information about the operator's precedence (given as a string only because RLMeta can not express integers in semantic actions) and associativity:

support.py

class Op(object):

    def __init__(self, fn, prec, assoc):
        self.fn = fn
        self.prec = int(prec)
        self.assoc = assoc

The parseOps function uses that information to create a parse tree:

support.py

def parseOps(expr, items, min_level=0):
    while items and items[0][0].prec >= min_level:
        op, rhs = items.pop(0)
        if op.assoc == "left":
            next_level = op.prec + 1
        else:
            next_level = op.prec
        expr = op.fn(expr, parseOps(rhs, items, next_level))
    return expr

It combines the functionality of leftAssoc and rightAssoc and additionally also handles precedence. The increment of level if the operator is left associative ensures that only operators of higher precedence are parsed in the recursive call. If the next operator is of the same precedence, the recursive call will return immediately, and parseOps will behave like leftAssoc.

Here is how the expression 1+2+3*4 is parsed:

parseOps is called with expr=1, items=[[OpAdd, 2], [OpAdd, 3], [OpMul, 4]] and min_level=0
Since there are still items and the add operator has prec >= 0, the loop is entered
- op=OpAdd and rhs=2
- Since the add operator is left associative, next_level=2
- expr=add(1, parseOps(2, [[OpAdd, 3], [OpMul, 4]], 2))
parseOps is called with expr=2, items=[[OpAdd, 3], [OpMul, 4]] and min_level=2
Since the add operator has prec < 2, the loop is not entered, and 2 is returned
The outer loop continues with expr=add(1, 2)
- op=OpAdd and rhs=3
- Since the add operator is left associative, next_level=2
- expr=add(add(1, 2), parseOps(3, [[OpMul, 4]], 2))
parseOps is called with expr=3, items=[[OpMul, 4]] and min_level=2
Since there are still items and the mul operator has prec >= 2, the loop is entered
- op=OpMul and rhs=4
- Since the mul operator is left associative, next_level=3
- expr=mul(3, parseOps(4, [], 3))
parseOps is called with expr=4, items=[] and min_level=3
Since there are no more items, 4 is returned
The parent returns mul(3, 4)
The parent returns add(add(1, 2), mul(3, 4))

Let's see how the calculator handles the following expressions:

1+2-3
1+2-3^2^2-5

Below are the ASTs and the evaluated results:

['-',
 ['+', 1, 2],
 3]
=>
0

['-',
 ['-',
  ['+', 1, 2],
  ['^',
   3,
   ['^', 2, 2]]],
 5]
=>
-83

This looks correct. We now have a calculator whose grammar is more cleanly written and only uses one support function instead of two.

Comparison of operator parsers

All three operator parser functions have a similar structure to them. Here they are for comparison:

def leftAssoc(expr, items):
    while items:
        op, rhs = items.pop(0)
        expr = op(expr, rhs)
    return expr

def rightAssoc(expr, items):
    if items:
        op, rhs = items.pop(0)
        expr = op(expr, rightAssoc(rhs, items))
    return expr

def parseOps(expr, items, min_level=0):
    while items and items[0][0].prec >= min_level:
        op, rhs = items.pop(0)
        if op.assoc == "left":
            next_level = op.prec + 1
        else:
            next_level = op.prec
        expr = op.fn(expr, parseOps(rhs, items, next_level))
    return expr

Resources

The following articles helped me understand how to handle left associative operators in recursive descent parsers:

Appendix: Test script

I used the following Bash script to run the examples:

expr.sh

compile() {
    echo "import sys"
    echo "import pprint"
    rlmeta --support
    rlmeta < "$1"
    cat "support.py"
    cat "$2"
    echo "makeAdd = lambda: add"
    echo "makeSub = lambda: sub"
    echo "makeMul = lambda: mul"
    echo "makeDiv = lambda: div"
    echo "makePow = lambda: pow"
    echo "try:"
    echo "    for expr in sys.stdin.read().splitlines():"
    echo "        pprint.pprint(Calculator().run('expr', expr), width=20)"
    echo "except _MatchError as e:"
    echo "    sys.stderr.write(e.describe())"
}

(
    while read -e expr; do
        echo "$expr" | python <(compile "$1" "ast.py")
        echo "=>"
        echo "$expr" | python <(compile "$1" "eval.py")
        echo ""
    done
) | head -n-1

It reads expressions from stdin one per line. Then it evaluates them both using the AST function and the eval function and prints the result. The make* functions are also defined here to just return the respective function.

Site proudly generated by Hakyll.