RLMeta poster 2: the poster that wasn't

Published on 12 February 2022.

Code walk through
Changes from the poster version
The future
Code listings for RLMeta

A while ago I created a poster to showcase RLMeta. The version of RLMeta on the poster is based on the version from the memoizing failures article, but I made it smaller and more beautiful to better fit the poster. To be able to finish the poster, I had to stop making changes and put the source code on the poster. That was difficult because I felt the need for it to be perfect. Eventually I did stop polishing, and left a few items unresolved.

Almost immediately after I finished the poster, I started working on a second version. Initially, my plan was to make a second version of the poster. I started to fix the unresolved items and I was making progress. But somehow imperfections kept creeping in. It felt like a never ending game of chasing perfection. That’s when I decided that a second poster was probably not going to be worth it. But I still liked the new version of RLMeta.

Instead, I decided to attempt to present the new version in the style of a code walk through. In other words, another way to showcase RLMeta that is also a bit more practical. Compared to the poster version, this version could also be more easily improved because rendering the blog post is automatic whereas creating the layout of a poster requires manual work every time the source code changes. I also wanted to experiment with the walk through format because I thought it could be something worth putting into the README of a project.

The rest of this blog post consists of the walk through of the new version of RLMeta and a section on the most important changes from the poster version and motivations for them.

Code walk through

Getting RLMeta

In order to follow along on this walk through, you need to download the version of RLMeta from here: rlmeta-poster-2.zip.

File structure

The zip file consists of the source code for the RLMeta compiler, a make script, and the compiler itself (rlmeta.py):

$ tree --dirsfirst
.
├── src
│   ├── assembler.rlmeta
│   ├── codegenerator.rlmeta
│   ├── main.py
│   ├── parser.rlmeta
│   └── support.py
├── make.py
└── rlmeta.py

1 directory, 7 files

The size of the source code is quite small:

$ wc -l src/*
   39 src/assembler.rlmeta
   57 src/codegenerator.rlmeta
   26 src/main.py
   60 src/parser.rlmeta
  240 src/support.py
  422 total

The compiler can be created from this source code only. We will see how later in this walk through.

Exploring RLMeta

Before we dive into how the RLMeta compiler is created, let’s explore RLMeta by writing a small, but real, program in it.

What types of programs can we write in RLMeta?

In RLMeta, we write grammars. Grammars have rules that specify how to match objects from an input stream and specify what should happen when objects are matched.

Let’s write a grammar that counts the number of objects in an input stream and produces a report:

$ cat object_counter.rlmeta
ObjectCounter {
    count = .*:xs -> { "number of objects = " len(xs) }
}

The main function of the RLMeta compiler is to transform grammars into Python code. If invoked without arguments, the compiler reads a grammar from stdin and writes Python code to stdout:

$ cat object_counter.rlmeta | python rlmeta.py
class ObjectCounter(Grammar):
    rules = {
        'count': 0
    }
    code = [
        PUSH_SCOPE,
        LIST_START,
        BACKTRACK,
        10,
        MATCH,
        'any',
        lambda x: True,
        LIST_APPEND,
        COMMIT,
        2,
        LIST_END,
        BIND,
        'xs',
        ACTION,
        lambda self: join(['number of objects = ', self.lookup('len')(self.lookup('xs'))]),
        POP_SCOPE,
        RETURN
    ]

This is equivalent to using the --compile command with a value of - which stands for stdin:

$ cat object_counter.rlmeta | python rlmeta.py --compile - | head -n3
class ObjectCounter(Grammar):
    rules = {
        'count': 0

And, the file can also be specified directly like this:

$ python rlmeta.py --compile object_counter.rlmeta | head -n3
class ObjectCounter(Grammar):
    rules = {
        'count': 0

Don’t worry about understanding the generated code. We will explore it more later. Just note that the generated class inherits from a class called Grammar and that it uses some constants like PUSH_SCOPE and LIST_START. These things are defined in a support library which can be generated by the RLMeta compiler with the --support command:

$ python rlmeta.py --support | grep '^\(class\|def\)'
class VM:
def PUSH_SCOPE(vm):
def POP_SCOPE(vm):
def BACKTRACK(vm):
def COMMIT(vm):
def CALL(vm):
def CALL_(vm, pc):
def RETURN(vm):
def MATCH(vm):
def MATCH_(vm, fn, message):
def MATCH_CALL_RULE(vm):
def LIST_START(vm):
def LIST_APPEND(vm):
def LIST_END(vm):
def BIND(vm):
def ACTION(vm):
def PUSH_STREAM(vm):
def POP_STREAM(vm):
def FAIL(vm):
def FAIL_(vm, fail_message):
class SemanticAction(object):
class MatchError(Exception):
class Grammar(object):
class Runtime(dict):
class Counter(object):
def splice(depth, item):
def concat(lists):
def join(items, delimiter=""):
def indent(text, prefix="    "):
def compile_chain(grammars, source):

To create a complete program, we also have to write a main function that instantiates the ObjectCounter grammar and invokes its count rule.

Here is an example that passes stdin as the input stream to the count rule and prints the result to stdout:

$ cat object_counter_main.py
if __name__ == "__main__":
    import sys
    sys.stdout.write(ObjectCounter().run("count", sys.stdin.read()))

The --copy command of the RLMeta compiler can be used to copy this main file, as is, to the output.

Combining these pieces into a single compile command, we get this:

$ python rlmeta.py --support --compile object_counter.rlmeta --copy object_counter_main.py > object_counter.py

It will perform all commands in the given order and write all generated code concatenated into a single file.

Note that the support library comes before the grammar so that Grammar is defined by the time ObjectCounter is evaluated.

The object counter source code has now been compiled into a standalone Python program that can be run like this:

$ echo 'hello' | python object_counter.py
number of objects = 6

$ echo 'this is longer' | python object_counter.py
number of objects = 15

So programs in RLMeta are written mainly in grammar files with some support functions written in Python. The RLMeta compiler can process all these files to produce a single Python file which is the compiled program.

Compiling RLMeta itself

Now that we have an understanding of RLMeta, let’s look at the command that compiles the RLMeta compiler itself from the source code:

$ python rlmeta.py --embed SUPPORT src/support.py --support --compile src/parser.rlmeta --compile src/codegenerator.rlmeta --compile src/assembler.rlmeta --copy src/main.py > rlmeta-raw.py

The first command, --embed SUPPORT src/support.py, tells the compiler to generate a Python variable named SUPPORT containing the contents of the file src/support.py. The --embed command is the last command of the compiler that we have not yet seen. (The RLMeta compiler needs the support library in a variable so that it can generate it later with the --support command.)

Next, the --support command tells the compiler to generate the support library that is embedded in it.

The --compile ... commands tell the compiler to compile the given grammar files.

The last command, --copy src/main.py, tells the compiler to copy the main file verbatim. Similar to what we did to the main file in the object counter.

The make script can be called with the --compile command to perform this exact function:

$ ./make.py --compile > rlmeta-compile.py
Compiling rlmeta using rlmeta.py
  O-----------------O
  | RLMeta compiled |
~~|     itself!     |
  O-----------------O

And all these files are exactly the same:

$ md5sum rlmeta.py rlmeta-compile.py rlmeta-raw.py
92396155e85e24fb45cb3e58e160e89e  rlmeta.py
92396155e85e24fb45cb3e58e160e89e  rlmeta-compile.py
92396155e85e24fb45cb3e58e160e89e  rlmeta-raw.py

Thus, the RLMeta compiler reproduced itself exactly from the source code.

A tour of the main function

Let’s now look at how all commands of the RLMeta compiler are implemented. Here is the main function:

if __name__ == "__main__":
    import sys
    def read(path):
        if path == "-":
            return sys.stdin.read()
        with open(path) as f:
            return f.read()
    args = sys.argv[1:] or ["--compile", "-"]
    while args:
        command = args.pop(0)
        if command == "--support":
            sys.stdout.write(SUPPORT)
        elif command == "--copy":
            sys.stdout.write(read(args.pop(0)))
        elif command == "--embed":
            sys.stdout.write("{} = {}\n".format(
                args.pop(0),
                repr(read(args.pop(0)))
            ))
        elif command == "--compile":
            sys.stdout.write(compile_chain(
                [(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],
                read(args.pop(0))
            ))
        else:
            sys.exit("ERROR: Unknown command '{}'".format(command))

It contains command line parsing and handles processing of all commands.

The --compile command is the most complex of them all. It calls the compile_chain function which runs the given grammars/rules in order (in this case the input will first be parsed, then passed to the code generator, and finally passed to the assembler) and prints a pretty error message to stderr upon failure:

def compile_chain(grammars, source):
    import os
    import sys
    import pprint
    for grammar, rule in grammars:
        try:
            source = grammar().run(rule, source)
        except MatchError as e:
            marker = "<ERROR POSITION>"
            if os.isatty(sys.stderr.fileno()):
                marker = f"\033[0;31m{marker}\033[0m"
            if isinstance(e.stream, str):
                stream_string = e.stream[:e.pos] + marker + e.stream[e.pos:]
            else:
                stream_string = pprint.pformat(e.stream)
            sys.exit("ERROR: {}\nPOSITION: {}\nSTREAM:\n{}".format(
                e.message,
                e.pos,
                indent(stream_string)
            ))
    return source

This function might be useful for other RLMeta programs as well. That is why it’s included in the support library and not only in the main file.

Following a compilation

Let’s now follow a compilation of an example grammar to learn more about how a grammar file is turned into Python code. Here it is:

$ cat example.rlmeta
Example {
    main = .
}

And this is what it compiles to:

$ python rlmeta.py --compile example.rlmeta
class Example(Grammar):
    rules = {
        'main': 0
    }
    code = [
        PUSH_SCOPE,
        MATCH,
        'any',
        lambda x: True,
        POP_SCOPE,
        RETURN
    ]

The transformations that the grammar goes through are defined in the main function:

[(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],

So first the grammar file is passed to the file rule of the parser:

file =
  | (space grammar)*:xs space !.            -> xs

It in turn calls the grammar rule to parse all grammars in the file:

grammar =
  | name:x space '{' rule*:ys space '}'     -> ["Grammar" x ~ys]

This rule matches the name, the open curly brace, a set of rules, and the closing curly brace. It will then return an AST that looks like this:

[
    "Grammar",
    "Example",
    ...
]

All grammar AST nodes are handed off to the asts rule in the code generator:

asts          = ast*:xs !.  -> xs

It it turn calls the ast rule to process each AST node:

ast           = [%:x]       -> x

The ast rule treats the first argument in the AST as a rule name, and calls that rule. In this case Grammar:

Grammar       = .:x ast*:ys -> ["Grammar" x ~~ys]

The code generator creates a new AST node representing a grammar. But this AST node is slightly different and meant to be processed by the assembler. The result is this:

[
    "Grammar",
    "Example",
    ... ast nodes for consumption by assembler ...
]

This AST node and all the others that the code generator produces are passed to the asts rule in the assembler:

asts     = ast*:xs !.      -> { xs }

It in turn calls the ast rule:

ast      = [%:x]           -> x

Which does the same trick again, now invoking the Grammar rule (in the assembler) which looks like this:

Grammar  = .:x ast*:ys     -> list():rules
                           -> list():code
                           -> dict():labels
                           -> list():patches
                           -> ys
                           -> run("asts" patches)
                           -> { "class " x "(Grammar):\n" >
                                  "rules = {\n" > join(rules ",\n") < "\n}\n"
                                  "code = [\n" > join(code  ",\n") < "\n]\n"
                                < }

This rule can be read as follows:

Match the grammar name and all AST nodes
Perform the following action
- Define a variable called rules which is a list
- Define a variable called code which is a list
- Define a variable called labels which is a dictionary
- Define a variable called patches which is a list
- Evaluate the AST nodes (with possible side effects recorded in the above variables)
- Treat the value of the patches variable as a list of AST nodes and process them with the asts rule of this grammar
- Return a string which is generated Python code

The generated code from our example looks like this:

class Example(Grammar):
    rules = {
        ...
    }
    code = [
        ...
    ]

To understand how the rule and code sections are generated, we just have to follow a few more transformations.

Let’s look at one more and see how the rule in our example grammar is transformed.

First, the rule is parsed by the rule rule in the parser:

rule =
  | name:x space '=' choice:y               -> ["Rule" x y]

First the name is matched, then the equals sign, and then an expression representing the body of the rule.

It our case, this rule produces this AST node:

[
    "Rule",
    "main",
    ...
]

That node is going to be processed by the Rule rule in the code generator:

Rule          = .:x ast:y   -> [["Rule" x]
                                ~y
                                ["OpCode" "RETURN"]]

Generating an AST node that looks like this:

[
    ["Rule", "main"],
    ...,
    ["OpCode", "RETURN"]
]

Here we can see that the AST from the code generator looks a bit more like assembly code than a representation of the syntax in the grammar.

The first child in this AST node is going to be handled the Rule rule in the assembler:

Rule     = .:x             -> add(rules { repr(x) ": " len(code) })
                           -> set(labels x len(code))

It does two things:

Adds a string value to the rules list
Adds an entry to the labels dictionary to map a label to an index in the code list

At this point, the variables have the following values.

rules = [
    "'main': 0",
]
labels = {
    'main': 0,
}

The second child in the AST node is going to be handled by the OpCode rule in the assembler:

OpCode   = .:x             -> add(code x)

It adds the given op code to the code list, giving it this value:

code = [
    ...,
    "RETURN",
]

When the rules and code variables are expanded, the resulting class looks like this:

class Example(Grammar):
    rules = {
        'main': 0
    }
    code = [
        ...,
        RETURN
    ]

Hopefully you should now be comfortable to follow transformations yourself to understand how a compilation is done.

The purpose of the make script

When the make script is called without arguments, it performs a meta compilation and runs a few tests:

$ ./make.py
Compiling rlmeta using rlmeta.py
Writing rlmeta1.py
Test: Has its own support library
Test: Disallow semantic action in the middle
ERROR: expected }
POSITION: 22
STREAM:
    Grammar { x = . -> [] <ERROR POSITION>. }
Test: Call unknown rule foo
Moving rlmeta1.py -> rlmeta.py
  O-----------------O
  | RLMeta compiled |
~~|     itself!     |
  O-----------------O

The meaning of a meta compilation is to create a new version of RLMeta that can still reproduce itself from the source code.

In the output above, we can see that it compiled RLMeta and wrote the result to rlmeta1.py. In this case, since it is exactly the same as rlmeta.py, the compilation stopped there and a few more tests were run using this compiler. But if we make changes to the source code, rlmeta1.py will most likely not be exactly the same as rlmeta.py, and a few more compilations might be needed. I’ve written about the details of meta compilation in a previous blog post.

So the purpose of the make script is to ease meta compilations and also run a test suit on the newly generated metacompiler before accepting it.

The make script can also be used to perform a single compilation of RLMeta with the --compile argument as we saw earlier.

Changes from the poster version

This section explains the most important changes in this version of RLMeta compared to the poster version.

First of all, I wanted to work on the unresolved items which were the following:

The label counter is incremented at match time, not at semantic action evaluation time.
Compilation depends on Bash.
Assembly code in code generator is hard to read.

In the poster article, I also had a few notes about future versions:

The smaller it is, the easier it is to understand and therefore extend. The more flexible it is to extend the better. If I make another poster version it would therefore focus on being smaller and more flexible. Since all successive version of RLMeta have been faster than the ones before, performance is also important. But small size, clarity, and flexibility come first.

I used these guidelines to decide if certain changes should go into the new version or not.

One interesting thing to note is that the guidelines are sometimes contradicting. Writing clear code might mean more lines of code which makes the code base larger. Perhaps that’s also why I got stuck chasing perfection. I thought I made something easier to read, but it ended up costing 10 extra lines of code. Should I include it?

Generate labels in semantic actions

One thing that I left in the poster version that still annoyed me was that labels were generated at match time, not at semantic action evaluation time. It would not produce incorrect results. At worst, some labels end up not being used because the counter value captured was in a rule that later failed. But dealing with labels at match time does not make sense. It should really happen at semantic action evaluation time.

Here is what the Not rule looks like in the poster version:

Not = ast:x #:a #:b -> { "I('BACKTRACK', " b ")\n"
                         x
                         "I('COMMIT', " a ")\n"
                         "LABEL(" a ")\n"
                         "I('FAIL', 'no match expected')\n"
                         "LABEL(" b ")\n"                   }

Here is what the Not rule looks like after the change:

Not = ast:x -> label():a -> label():b
            -> { "I('BACKTRACK', " b ")\n"
                 x
                 "I('COMMIT', " a ")\n"
                 "LABEL(" a ")\n"
                 "I('FAIL', 'no match expected')\n"
                 "LABEL(" b ")\n"                   }

This change puts label generation where it belongs, in semantic actions, and thus makes the implementation more clear. The VM is no longer concerned with labels. It is only concerned with matching. This change required a bit of rework how semantic actions work. Previously only one expression was allowed:

<match expression> -> <semantic action expression>

Now multiple expressions are allowed:

<match expression> -> <semantic action expression>:x
                   -> <semantic action expression>
                   -> <semantic action expression>

The result of expressions can also be bound to names which subsequent expressions can refer to. label is such a variable that is set internally to a function that generates increasing integers starting at 0.

The implementation of this change also increases the flexibility of RLMeta. For example, it is now possible to write a semantic action that generates code in different sections like this:

ExampleBuffers {
    program  = ast:x  -> []:header
                      -> { "# HEADER\n"
                           header
                           "# BODY\n"
                           x            }
    ast      = [%:x]  -> x
    Program  = ast*
    Function = .:name -> add(header { "def " name "\n" })
                      -> { name "()\n" }
}

The expression []:header creates a list and assigns it to the variable header. When x is evaluated in the next step, the semantic action for the Function rule will be evaluated which can then access the header variable defined earlier. These variables are not lexically scoped, but dynamically scoped. If at runtime, a variable is defined, it will be accessible. It also means that the Function rule can not be run without program being run first, or the header variable will not be defined.

Here is an example AST representing a program:

AST = [
    ['Program',
        ['Function', 'foo'],
        ['Function', 'bar']
    ]
]

When the program rule is run on the example input, the following is output:

$ python example_buffers.py
# HEADER
def foo
def bar
# BODY
foo()
bar()

In summary, this change is as follows:

Label syntax (#) in parser is removed
Actions can have multiple expressions
Expressions can be bound to names
A default label function to generate labels
Names in semantic actions refer to matches or results bound earlier

(The complete initial diff for this change can be found on GitHub.)

The increased clarity and flexibility come with a price. The size increases and the performance drops.

The parser and the code generator are mostly the same. The greatest addition is in the support library. Which is expected when semantic action evaluation becomes more complex. The drop in performance is likely due to more function calls when evaluating semantic actions. Even though size and performance got worse, I believe the clarity and flexibility gain is worth it.

Remove dependency on Bash

To compile the poster version of RLMeta, you ran the following command:

./compile.sh rlmeta.py

In one way, the compiler could not compile itself, but relied on a Bash script for gluing things together. It would call the rlmeta.py compiler for certain tasks and use Bash and Python for other tasks.

As we have already seen, the new version of RLMeta compiles itself like this:

python rlmeta.py \
    --embed SUPPORT src/support.py \
    --support \
    --compile src/parser.rlmeta \
    --compile src/codegenerator.rlmeta \
    --compile src/assembler.rlmeta \
    --copy src/main.py \
    > rlmeta.py

The rlmeta.py compiler now has support (via --embed and --copy) for doing what the Bash script previously did.

This makes the compiler slightly larger, but it feels so much cleaner.

In addition, the extra features are useful when writing programs in RLMeta. Those programs can now also be compiled with a single command, and there is no need to concatenate different pieces together.

(The complete diff for this change can be found on GitHub.)

Extract assembler

The third thing that annoyed me with in the poster version was the readability of the code generator. For example, the Not rule looked like this:

Not = ast:x -> label():a -> label():b
            -> { "I('BACKTRACK', " b ")\n"
                 x
                 "I('COMMIT', " a ")\n"
                 "LABEL(" a ")\n"
                 "I('FAIL', 'no match expected')\n"
                 "LABEL(" b ")\n"                   }

It generates a string which contains Python code that calls functions to create “assembly” code. So part of the compilation is actually happening at runtime here. It is mixed and messy.

The new Not rule looks like this:

Not = ast:x -> label():a -> label():b
            -> [["OpCode" "BACKTRACK"]
                ["Target" b]
                ~x
                ["OpCode" "COMMIT"]
                ["Target" a]
                ["Label" a]
                ["OpCode" "FAIL"]
                ["Value" "no match"]
                ["Label" b]]

Instead of outputting Python code directly, it now generates abstract assembly code. Then a new third pass, the assembler, turns those instructions into Python code as well as resolves label positions. So no more compilation at runtime.

This reads better because the purpose of the code generator is now a bit narrower. It can focus on one thing and leave the rest to the assembler.

Adding another pass also opens up the possibility to do peep-hole optimizations on the abstract assembly code before the assembler turns the instructions into Python code.

Rewrite VM for clarity

In the poster version, the virtual machine was written as a single function with one loop like this:

def vm(instructions, labels, start_rule, stream):
    ...
    while True:
        name, arg1, arg2 = instructions[pc]
        if name == "PUSH_SCOPE":
            scope_stack.append(scope)
            scope = {}
            pc += 1
            continue
        elif name == "BACKTRACK":
            ...
        ...

It was written like that to be as fast as possible. It avoided function calls. It avoided class variables lookup by avoiding classes. All variables used were defined locally in the vm function. Because function calls could not be used, some code was also duplicated.

I decided that I would not consider performance at all, and instead try to write the VM as clear as I could. I ended up with a VM class to hold some state and instruction functions that operate on an instance of a VM:

class VM:

    def __init__(self, code, rules):
        ...

    ...

def PUSH_SCOPE(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = {}

def BACKTRACK(vm):
    ...

...

As I noted earlier, I’m not sure I am happy with this result. I’m not convinced that it reads better. The biggest upside is that since function calls are now allowed, part of the VM can be expressed more clearly without repetition.

Before I ended up with this VM, I experimented with a language for writing virtual machines that compiled to Python code. You could define instructions and the arguments they took and define macros for code re-use. It was basically a small macro language on top of Python. It looked something like this:

def vm(code, rules, start_rule, stream):
    action = SemanticAction(None)
    pc = rules[start_rule]
    call_backtrack_stack = []
    stream, stream_rest = (stream, None)
    pos, pos_rest = (0, tuple())
    scope, scope_rest = (None, None)
    fail_message = None
    latest_fail_message, latest_fail_pos = (None, tuple())
    memo = {}

definstruction PUSH_SCOPE():
    scope_rest = (scope, scope_rest)
    scope = {}

And here is how macros were used:

definstruction FAIL(arg_message):
    fail_message = (arg_message,)
    #FAIL

defmacro FAIL:
    ...

When this was compiled, something similar to the vm function above was generated. A single function that was intended to run as fast as possible. But you could write the VM quite clearly anyway.

I liked the result of that, but it introduced yet another language and made compilation and metacompilation more complicated. For that reason, I decided against it.

Perhaps another approach would be to consider the VM a separate piece, not to be included in compilations and metacompilations. But they are also strongly connected. Say for example that an optimizer decides to output a new VM instruction, then the VM has to change.

I am not entirely clear about the interface here between the VM and the rest of the compiler.

Add ability to run a rule in semantic action

Another feature that was added in this version was the ability to run a grammar rule recursively from a semantic action.

This was initially needed to to implement recursive macros in the VM language mentioned in the previous section, but it made its way into RLMeta to support the patching of assembly instructions.

When a Target instruction is encountered, the patches list is populated with a command:

Target   = .:x             -> add(patches ["Patch" len(code) x])
                           -> add(code "placeholder")

These commands are then evaluated by running the asts rule on the patches list. This starts another parse on the given stream.

-> run("asts" patches)

The new parse has access to all the runtime variables that the semantic action that invokes it has. So that is why a Patch instruction can modify the code array and insert the correct index there instead of the placeholder:

Patch    = .:x .:y         -> set(code x get(labels y))

Misc

Many more small changes were made. Here are a few notes about them.

Various renames to make intention more clear and reformats to improve readability.
Various clean ups in the parser:
- Only allow semantic actions at the very end of a rule.
- Make sure the whole file is parsed so that junk after a grammar results in an error.
Adapt to Python 3.

The future

On the one hand, I’m quite happy with the improvements to RLMeta that I was able to make. The code feels more clear and flexible. Definitely a better version of RLMeta.

On the other hand, this article turned out to have the same problem as the poster. It just kept growing and growing, and at some point I had to stop working on in, leave some issues unresolved, and call the article finished. For example, I am not happy with how the new VM looks. A mix between classes and functions and helpers.

I decided to set up a repo on GitHub for RLMeta where it can continue to be improved.

I plan for it to contain the base version of RLMeta which is the minimal version that is able to compile itself and maintain properties such as flexible, easy to extend, and easy to understand. Then I want to include examples as well to show how RLMeta can be used and how you can extend it in various ways.

Code listings for RLMeta

Here is all the source code and also the make script for this version of RLMeta.

src/parser.rlmeta

Parser {
  file =
    | (space grammar)*:xs space !.            -> xs
  grammar =
    | name:x space '{' rule*:ys space '}'     -> ["Grammar" x ~ys]
  rule =
    | name:x space '=' choice:y               -> ["Rule" x y]
  choice =
    | (space '|')?
      sequence:x (space '|' sequence)*:xs     -> ["Or" x ~xs]
  sequence =
    | expr*:xs maybeAction:ys                 -> ["Scope" ["And" ~xs ~ys]]
  expr =
    | expr1:x space ':' name:y                -> ["Bind" y x]
    | expr1
  expr1 =
    | expr2:x space '*'                       -> ["Star" x]
    | expr2:x space '?'                       -> ["Or" x ["And"]]
    | space '!' expr2:x                       -> ["Not" x]
    | space '%'                               -> ["MatchCallRule"]
    | expr2
  expr2 =
    | name:x !(space '=')                     -> ["MatchRule" x]
    | space char:x '-' char:y                 -> ["MatchObject" ["Range" x y]]
    | space '\'' (!'\'' matchChar)*:xs '\''   -> ["And" ~xs]
    | space '.'                               -> ["MatchObject" ["Any"]]
    | space '(' choice:x space ')'            -> x
    | space '[' expr*:xs space ']'            -> ["MatchList" ["And" ~xs]]
  matchChar =
    | innerChar:x                             -> ["MatchObject" ["Eq" x]]
  maybeAction =
    | actionExpr:x                            -> [["Action" x]]
    |                                         -> []
  actionExpr =
    | space '->' hostExpr:x
      (space ':' name | -> ""):y actionExpr:z -> ["Set" y x z]
    | space '->' hostExpr:x                   -> x
  hostExpr =
    | space string:x                          -> ["String" x]
    | space '[' hostListItem*:xs space ']'    -> ["List" ~xs]
    | space '{' formatExpr*:xs space '}'      -> ["Format" ~xs]
    | var:x space '(' hostExpr*:ys space ')'  -> ["Call" x ~ys]
    | var:x
  hostListItem =
    | space '~'*:ys hostExpr:x                -> ["ListItem" len(ys) x]
  formatExpr =
    | space '>' formatExpr*:xs space '<'      -> ["Indent" ["Format" ~xs]]
    | hostExpr
  var =
    | name:x !(space '=')                     -> ["Lookup" x]
  string    = '"'  (!'"'  innerChar)*:xs '"'  -> { xs }
  char      = '\''  !'\'' innerChar  :x  '\'' -> x
  innerChar = '\\' escape | .
  escape    = '\\' -> "\\" | '\'' -> "'"
            | '"'  -> "\"" | 'n'  -> "\n"
  name      = space nameStart:x nameChar*:xs  -> { x xs }
  nameStart = 'a'-'z' | 'A'-'Z'
  nameChar  = 'a'-'z' | 'A'-'Z' | '0'-'9'
  space     = (' ' | '\n')*
}

src/codegenerator.rlmeta

CodeGenerator {
  Grammar       = .:x ast*:ys -> ["Grammar" x ~~ys]
  Rule          = .:x ast:y   -> [["Rule" x]
                                  ~y
                                  ["OpCode" "RETURN"]]
  Or            =
    | ast:x Or:y              -> label():a -> label():b
                              -> [["OpCode" "BACKTRACK"]
                                  ["Target" a]
                                  ~x
                                  ["OpCode" "COMMIT"]
                                  ["Target" b]
                                  ["Label" a]
                                  ~y
                                  ["Label" b]]
    | ast
  Scope         = ast:x       -> [["OpCode" "PUSH_SCOPE"]
                                  ~x
                                  ["OpCode" "POP_SCOPE"]]
  And           = ast*:xs     -> [~~xs]
  Bind          = .:x ast:y   -> [~y
                                  ["OpCode" "BIND"]
                                  ["Value" x]]
  Star          = ast:x       -> label():a -> label():b
                              -> [["OpCode" "LIST_START"]
                                  ["Label" a]
                                  ["OpCode" "BACKTRACK"]
                                  ["Target" b]
                                  ~x
                                  ["OpCode" "LIST_APPEND"]
                                  ["OpCode" "COMMIT"]
                                  ["Target" a]
                                  ["Label" b]
                                  ["OpCode" "LIST_END"]]
  Not           = ast:x       -> label():a -> label():b
                              -> [["OpCode" "BACKTRACK"]
                                  ["Target" b]
                                  ~x
                                  ["OpCode" "COMMIT"]
                                  ["Target" a]
                                  ["Label" a]
                                  ["OpCode" "FAIL"]
                                  ["Value" "no match"]
                                  ["Label" b]]
  MatchCallRule =             -> [["OpCode" "MATCH_CALL_RULE"]]
  MatchRule     = .:x         -> [["OpCode" "CALL"]
                                  ["Target" x]]
  MatchObject   = .:x         -> [["OpCode" "MATCH"]
                                  x]
  MatchList     = ast:x       -> [["OpCode" "PUSH_STREAM"]
                                  ~x
                                  ["OpCode" "POP_STREAM"]]
  Action        = .:x         -> [["OpCode" "ACTION"]
                                  ["Action" x]]
  asts          = ast*:xs !.  -> xs
  ast           = [%:x]       -> x
}

src/assembler.rlmeta

Assembler {
  Grammar  = .:x ast*:ys     -> list():rules
                             -> list():code
                             -> dict():labels
                             -> list():patches
                             -> ys
                             -> run("asts" patches)
                             -> { "class " x "(Grammar):\n" >
                                    "rules = {\n" > join(rules ",\n") < "\n}\n"
                                    "code = [\n" > join(code  ",\n") < "\n]\n"
                                  < }
  Rule     = .:x             -> add(rules { repr(x) ": " len(code) })
                             -> set(labels x len(code))
  Label    = .:x             -> set(labels x len(code))
  Target   = .:x             -> add(patches ["Patch" len(code) x])
                             -> add(code "placeholder")
  Patch    = .:x .:y         -> set(code x get(labels y))
  OpCode   = .:x             -> add(code x)
  Value    = .:x             -> add(code repr(x))
  Eq       = .:x             -> add(code repr(x))
                             -> add(code { "lambda x: x == " repr(x) })
  Range    = .:x .:y         -> add(code repr({"range " repr(x) "-" repr(y)}))
                             -> add(code { "lambda x: " repr(x) " <= x <= " repr(y) })
  Any      =                 -> add(code repr("any"))
                             -> add(code "lambda x: True")
  Action   = ast:x           -> add(code {"lambda self: " x})
  Set      = .:x ast:y ast:z -> { "self.bind(" repr(x) ", " y ", lambda: " z ")" }
  String   = .:x             -> repr(x)
  List     = astList:x       -> { "concat([" x "])" }
  ListItem = .:x ast:y       -> { "splice(" repr(x) ", " y ")" }
  Format   = astList:x       -> { "join([" x "])" }
  Indent   = ast:x           -> { "indent(" x ", "
                                  "self.lookup('indentprefix'))" }
  Call     = ast:x astList:y -> { x "(" y ")" }
  Lookup   = .:x             -> { "self.lookup(" repr(x) ")" }
  asts     = ast*:xs !.      -> { xs }
  astList  = ast*:xs         -> join(xs ", ")
  ast      = [%:x]           -> x
}

src/support.py

class VM:

    def __init__(self, code, rules):
        self.code = code
        self.rules = rules

    def run(self, start_rule, stream):
        self.action = SemanticAction(None)
        self.pc = self.rules[start_rule]
        self.call_backtrack_stack = []
        self.stream, self.stream_rest = (stream, None)
        self.pos, self.pos_rest = (0, tuple())
        self.scope, self.scope_rest = (None, None)
        self.latest_fail_message, self.latest_fail_pos = (None, tuple())
        self.memo = {}
        while True:
            result = self.pop_arg()(self)
            if result:
                return result

    def pop_arg(self):
        code = self.code[self.pc]
        self.pc += 1
        return code

def PUSH_SCOPE(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = {}

def POP_SCOPE(vm):
    vm.scope, vm.scope_rest = vm.scope_rest

def BACKTRACK(vm):
    vm.call_backtrack_stack.append((
        vm.pop_arg(), vm.stream, vm.stream_rest, vm.pos, vm.pos_rest, vm.scope, vm.scope_rest
    ))

def COMMIT(vm):
    vm.call_backtrack_stack.pop()
    vm.pc = vm.pop_arg()

def CALL(vm):
    CALL_(vm, vm.pop_arg())

def CALL_(vm, pc):
    key = (pc, vm.pos_rest+(vm.pos,))
    if key in vm.memo:
        if vm.memo[key][0] is None:
            FAIL_(vm, vm.memo[key][1])
        else:
            vm.action, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest = vm.memo[key]
    else:
        vm.call_backtrack_stack.append((vm.pc, key))
        vm.pc = pc

def RETURN(vm):
    if not vm.call_backtrack_stack:
        return vm.action
    vm.pc, key = vm.call_backtrack_stack.pop()
    vm.memo[key] = (vm.action, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest)

def MATCH(vm):
    object_description = vm.pop_arg()
    fn = vm.pop_arg()
    MATCH_(vm, fn, ("expected {}", object_description))

def MATCH_(vm, fn, message):
    if vm.pos >= len(vm.stream) or not fn(vm.stream[vm.pos]):
        FAIL_(vm, message)
    else:
        vm.action = SemanticAction(vm.stream[vm.pos])
        vm.pos += 1
        return True

def MATCH_CALL_RULE(vm):
    if MATCH_(vm, lambda x: x in vm.rules, ("expected rule name",)):
        CALL_(vm, vm.rules[vm.action.value])

def LIST_START(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = []

def LIST_APPEND(vm):
    vm.scope.append(vm.action)

def LIST_END(vm):
    vm.action = SemanticAction(vm.scope, lambda self: [x.eval(self.runtime) for x in self.value])
    vm.scope, vm.scope_rest = vm.scope_rest

def BIND(vm):
    vm.scope[vm.pop_arg()] = vm.action

def ACTION(vm):
    vm.action = SemanticAction(vm.scope, vm.pop_arg())

def PUSH_STREAM(vm):
    if vm.pos >= len(vm.stream) or not isinstance(vm.stream[vm.pos], list):
        FAIL_(vm, ("expected list",))
    else:
        vm.stream_rest = (vm.stream, vm.stream_rest)
        vm.pos_rest = vm.pos_rest + (vm.pos,)
        vm.stream = vm.stream[vm.pos]
        vm.pos = 0

def POP_STREAM(vm):
    if vm.pos < len(vm.stream):
        FAIL_(vm, ("expected end of list",))
    else:
        vm.stream, vm.stream_rest = vm.stream_rest
        vm.pos, vm.pos_rest = vm.pos_rest[-1], vm.pos_rest[:-1]
        vm.pos += 1

def FAIL(vm):
    FAIL_(vm, (vm.pop_arg(),))

def FAIL_(vm, fail_message):
    fail_pos = vm.pos_rest+(vm.pos,)
    if fail_pos >= vm.latest_fail_pos:
        vm.latest_fail_message = fail_message
        vm.latest_fail_pos = fail_pos
    call_backtrack_entry = tuple()
    while vm.call_backtrack_stack:
        call_backtrack_entry = vm.call_backtrack_stack.pop()
        if len(call_backtrack_entry) == 7:
            break
        else:
            vm.memo[call_backtrack_entry[1]] = (None, fail_message)
    if len(call_backtrack_entry) != 7:
        raise MatchError(
            vm.latest_fail_message[0].format(*vm.latest_fail_message[1:]),
            vm.latest_fail_pos[-1],
            vm.stream
        )
    (vm.pc, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest, vm.scope, vm.scope_rest) = call_backtrack_entry

class SemanticAction(object):

    def __init__(self, value, fn=lambda self: self.value):
        self.value = value
        self.fn = fn

    def eval(self, runtime):
        self.runtime = runtime
        return self.fn(self)

    def bind(self, name, value, continuation):
        self.runtime = self.runtime.set(name, value)
        return continuation()

    def lookup(self, name):
        if name in self.value:
            return self.value[name].eval(self.runtime)
        else:
            return self.runtime[name]

class MatchError(Exception):

    def __init__(self, message, pos, stream):
        Exception.__init__(self)
        self.message = message
        self.pos = pos
        self.stream = stream

class Grammar(object):

    def run(self, rule, stream, runtime={}):
        return Runtime(self, dict(runtime, **{
            "label": Counter(),
            "indentprefix": "    ",
            "list": list,
            "dict": dict,
            "add": lambda x, y: x.append(y),
            "get": lambda x, y: x[y],
            "set": lambda x, y, z: x.__setitem__(y, z),
            "len": len,
            "repr": repr,
            "join": join,
        })).run(rule, stream)

class Runtime(dict):

    def __init__(self, grammar, values):
        dict.__init__(self, dict(values, run=self.run))
        self.grammar = grammar

    def set(self, key, value):
        return Runtime(self.grammar, dict(self, **{key: value}))

    def run(self, rule, stream):
        return VM(self.grammar.code, self.grammar.rules).run(rule, stream).eval(self)

class Counter(object):

    def __init__(self):
        self.value = 0

    def __call__(self):
        result = self.value
        self.value += 1
        return result

def splice(depth, item):
    if depth == 0:
        return [item]
    else:
        return concat([splice(depth-1, subitem) for subitem in item])

def concat(lists):
    return [x for xs in lists for x in xs]

def join(items, delimiter=""):
    return delimiter.join(
        join(item, delimiter) if isinstance(item, list) else str(item)
        for item in items
    )

def indent(text, prefix="    "):
    return "".join(prefix+line for line in text.splitlines(True))

def compile_chain(grammars, source):
    import os
    import sys
    import pprint
    for grammar, rule in grammars:
        try:
            source = grammar().run(rule, source)
        except MatchError as e:
            marker = "<ERROR POSITION>"
            if os.isatty(sys.stderr.fileno()):
                marker = f"\033[0;31m{marker}\033[0m"
            if isinstance(e.stream, str):
                stream_string = e.stream[:e.pos] + marker + e.stream[e.pos:]
            else:
                stream_string = pprint.pformat(e.stream)
            sys.exit("ERROR: {}\nPOSITION: {}\nSTREAM:\n{}".format(
                e.message,
                e.pos,
                indent(stream_string)
            ))
    return source

src/main.py

if __name__ == "__main__":
    import sys
    def read(path):
        if path == "-":
            return sys.stdin.read()
        with open(path) as f:
            return f.read()
    args = sys.argv[1:] or ["--compile", "-"]
    while args:
        command = args.pop(0)
        if command == "--support":
            sys.stdout.write(SUPPORT)
        elif command == "--copy":
            sys.stdout.write(read(args.pop(0)))
        elif command == "--embed":
            sys.stdout.write("{} = {}\n".format(
                args.pop(0),
                repr(read(args.pop(0)))
            ))
        elif command == "--compile":
            sys.stdout.write(compile_chain(
                [(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],
                read(args.pop(0))
            ))
        else:
            sys.exit("ERROR: Unknown command '{}'".format(command))

make.py

#!/usr/bin/env python

import os
import subprocess
import sys

def make_next_version():
    final_compiler = meta_compile_rlmeta()
    test(final_compiler)
    mv(final_compiler, "rlmeta.py")

def meta_compile_rlmeta():
    compiler = "rlmeta.py"
    content = read(compiler)
    for i in range(4):
        next_compiler = "rlmeta{}.py".format(i+1)
        next_content = compile_rlmeta(compiler)
        log("Writing {}".format(next_compiler))
        write(next_compiler, next_content)
        if next_content == content:
            return next_compiler
        compiler = next_compiler
        content = next_content
    fail("Unable to produce metacompiler.")

def compile_rlmeta(rlmeta):
    log("Compiling rlmeta using {}".format(rlmeta))
    return run_rlmeta(rlmeta, [
        "--embed", "SUPPORT", "src/support.py",
        "--support",
        "--compile", "src/parser.rlmeta",
        "--compile", "src/codegenerator.rlmeta",
        "--compile", "src/assembler.rlmeta",
        "--copy", "src/main.py",
    ])

def test(rlmeta):
    log("Test: Has its own support library")
    assert run_rlmeta(rlmeta, ["--support"]) == read("src/support.py")
    log("Test: Disallow semantic action in the middle")
    run_rlmeta(rlmeta, [], b"Grammar { x = . -> [] . }", expect_failure=True)
    log("Test: Call unknown rule foo")
    assert test_grammar(
        rlmeta,
        b"Grammar { x = % | . }",
        b"print(compile_chain([(Grammar, 'x')], ['foo']))"
    ) == b"foo\n"

def test_grammar(rlmeta, grammar, main_code):
    compiled = run_rlmeta(rlmeta, ["--support", "--compile", "-"], grammar)
    total = compiled + main_code
    process = subprocess.Popen(
        ["python"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE
    )
    stdout, _ = process.communicate(total)
    return stdout

def run_rlmeta(rlmeta, args, stdin=b"", expect_failure=False):
    process = subprocess.Popen(
        ["python", rlmeta]+args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE
    )
    stdout, _ = process.communicate(stdin)
    if expect_failure:
        if process.returncode == 0:
            fail("Expected failure")
    else:
        if process.returncode != 0:
            fail("Expected success")
    return stdout

def mv(src, dest):
    log("Moving {} -> {}".format(src, dest))
    os.remove(dest)
    os.rename(src, dest)

def cleanup():
    for path in [
        "rlmeta1.py",
        "rlmeta2.py",
        "rlmeta3.py",
        "rlmeta4.py",
    ]:
        if os.path.exists(path):
            log("Deleting {}".format(path))
            os.remove(path)

def read(path):
    with open(path, "rb") as f:
        return f.read()

def write(path, content):
    with open(path, "wb") as f:
        return f.write(content)

def log(message):
    sys.stderr.write(color(f"{message}\n", "33"))

def success(message):
    sys.stderr.write(color(f"{message}\n", "32"))

def fail(message):
    sys.exit(color(f"ERROR: {message}", "31"))

def color(message, color):
    if os.isatty(sys.stderr.fileno()):
        return f"\033[0;{color}m{message}\033[0m"
    else:
        return message

if __name__ == "__main__":
    cleanup()
    if sys.argv[1:] == ["--compile"]:
        sys.stdout.buffer.write(compile_rlmeta("rlmeta.py"))
    else:
        make_next_version()
    cleanup()
    success("  O-----------------O")
    success("  | RLMeta compiled |")
    success("~~|     itself!     |")
    success("  O-----------------O")

Site proudly generated by Hakyll.