It all started with me doing Advent of Code for the first time in my life. I hadn't written a line of code for two years, busy as I was writing my [sci-fi novel](https://www.amazon.com/Wohpe-English-Rimmel-Salvatore-Sanfilippo-ebook/dp/B0BQ3HRDPF/). I felt I needed to start coding again, but I was without a project in my hands. The AoC puzzles helped quite a lot, at first, but they tend to become repetitive and a bit futile after some time. Then something interesting happened. After completing day 13, a puzzle about comparing nested lists, I saw many other solutions resorting to `eval`. They are missing the point, I thought. To me, the puzzle seemed a hint at writing parsers for nested objects.
The gentle reader should be aware that I've a soft spot for [little languages](http://oldblog.antirez.com/page/picol.html). However, Picol was too much of a toy, while [Jim](http://jim.tcl.tk/index.html/doc/www/www/index.html) was too big as a coding example. I also like writing small programs that serve as [examples](https://github.com/antirez/kilo) of how you could design bigger programs, while retaining a manageable size. Don't get me wrong: it's not like I believe my code should be taken as an example, it's just that I learned a lot from such small programs, so, from time to time, I like writing new ones and sharing them. This time I wanted to obtain something of roughly the size of the Kilo editor, that is, around 1000 lines of code, showing the real world challenges arising when writing an actual interpreter for a programming language more complex than Picol. Aocla is the result, and it worked for me: after Aocla I started writing more and more code, and now [I've a project, too](https://github.com/antirez/protoview).
This README will first explain the language briefly. Later we will talk extensively about the implementation and its design. Without counting comments, the Aocla implementation is less than 1000 lines of code, and the core itself is around 500 lines (the rest of the code is the library implementation, the REPL, and other accessory parts): I hope you will find the code easy to follow even if you are not used to C and to writing interpreters. I tried to keep everything simple, as I always do when I write code, for myself and the others having the misfortune of modifying it in the future.
Not every feature I desired to have is implemented, and certain data types, like the string type, lack any useful procedure to work with them. This choice was made in order to avoid making the source code more complex than needed, and also, on my side, to avoid writing too much useless code, given that this language will never be used in the real world. Besides, implementing some of the missing parts is a good exercise for the willing reader, assuming she or he is new to this kind of stuff. Even with all these limitations, it is possible to write small working programs with Aocla, and that's all we need.
Aocla is a very simple language, more similar to Joy than to FORTH (higher level). It has a total of six datatypes:
* Lists: `[1 2 3 "foo"]`
* Symbols: `mysymbol`, `==` or `$x`
* Integers: `500`
* Booleans: `#t` or `#f`
* Tuples: `(x y z)`
* Strings: `"Hello World!\n"`
Floating point numbers are not provided for simplicity (writing an implementation should not be too hard, and is a good exercise). Aocla programs are valid Aocla lists, so the language is [homoiconic](https://en.wikipedia.org/wiki/Homoiconicity). While Aocla is a stack-based language, like FORTH, Joy and Factor, it introduces the idea of *local variables capturing*. Because of this construct, Aocla programs look a bit different (and simpler to write and understand in my opinion) compared to other stack-based languages. However locals capturing is optional: any program using locals can be rewritten to avoid using them.
## Our first program
The following is a valid Aocla program, taking 5 and squaring it, to obtain 25.
    [5 dup *]
Since all the programs must be lists, and thus are enclosed between `[` and `]`, both the Aocla CLI (Command Line Interface) and the execution of programs from files are designed to avoid needing the brackets. Aocla will put the program inside `[]` for you, so the above program should be written like this:
    5 dup *
Programs are executed from left to right, *word by word*. If a word is neither a symbol nor a tuple, its execution results in pushing its value on the stack. Symbols will produce a procedure call: the symbol name will be looked up in the table of procedures, and if a procedure with a matching name is found, it gets called. So the above program will perform the following steps:
* `5`: the value 5 is pushed on the stack. The stack will contain `(5)`.
* `dup`: is a symbol. A procedure called `dup` is looked up and executed. What `dup` does is to take the top value on the stack and duplicate it, so now the stack will contain `(5 5)`.
* `*`: is another symbol. The procedure is called. It will take the last two elements on the stack, check if they are integers, multiply them together and push the result on the stack. Now the stack will contain `(25)`.
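The three steps above can be sketched in C as a tiny stack machine. This is only an illustration under simplifying assumptions, not Aocla's actual code: the `ministack` type and the `push`, `pop`, `op_dup`, `op_mul` and `square_five` names are made up for the example, and values are plain integers instead of typed objects.

```c
#include <assert.h>

#define STACK_MAX 16

/* Hypothetical integer-only stack, used just for this illustration. */
typedef struct {
    int items[STACK_MAX];
    int len;
} ministack;

void push(ministack *s, int v) {
    assert(s->len < STACK_MAX);
    s->items[s->len++] = v;
}

int pop(ministack *s) {
    assert(s->len > 0);
    return s->items[--s->len];
}

/* dup: duplicate the element on top of the stack. */
void op_dup(ministack *s) {
    int top = pop(s);
    push(s, top);
    push(s, top);
}

/* *: pop two elements and push their product. */
void op_mul(ministack *s) {
    int b = pop(s);
    int a = pop(s);
    push(s, a * b);
}

/* Run the equivalent of "5 dup *" and return the value left on the stack. */
int square_five(void) {
    ministack s = { .len = 0 };
    push(&s, 5);    /* stack: (5) */
    op_dup(&s);     /* stack: (5 5) */
    op_mul(&s);     /* stack: (25) */
    return pop(&s);
}
```

A real interpreter would look up `dup` and `*` by name in a table of procedures instead of calling the C functions directly, and would check the types of the operands before multiplying.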
If an Aocla word is a tuple, like `(x y)`, its execution has the effect of removing a corresponding number of elements from the stack and binding them to the local variables having the specified names:
    10 20 (x y)
After the above program is executed, the stack will be empty and the local variables x and y will contain 10 and 20.
Finally, if an Aocla word is a symbol starting with the `$` character followed by a single additional character, the object stored at the specified variable is pushed on the stack. So the program to square 5 we wrote earlier can be rewritten as `5 (x) $x $x *`.
The ability to capture stack values into locals makes complex stack manipulation simple, and makes programs more explicit to read and easier to write. Still, locals have the remarkable quality of not making the language semantically more complex (except for a small thing we will cover later -- search `upeval` inside this document if you want to know ASAP, but if you know the Tcl programming language, you already understood from the name). In general, while locals help the handling of the stack in the local context of the procedure, words communicate via the stack, so the main advantages of stack-based languages are untouched.
*Note: why allow locals with just single letter names? The only reason is to make the implementation of the Aocla interpreter simpler to understand. This way, we don't need to make use of any dictionary data structure. If I were designing Aocla to be a real language, I would remove this limitation.*
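To see why single-letter names remove the need for a dictionary, here is a minimal sketch where the variable name itself is used as an index into a fixed-size array. The `frame` type and the `capture` and `lookup` helpers are hypothetical names chosen for this example, and values are plain integers rather than Aocla objects.

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCALS 256

/* Hypothetical frame: one slot for every possible single-byte name. */
typedef struct {
    int value[MAX_LOCALS];
    int bound[MAX_LOCALS];  /* 1 if the variable was captured. */
} frame;

/* Capture the last strlen(names) values of the stack into the variables
 * named by each character of 'names', as the tuple (x y) would do.
 * Returns the new stack length, or -1 if there are not enough values. */
int capture(frame *f, const int *stack, int stacklen, const char *names) {
    assert(names != NULL);
    int n = (int)strlen(names);
    if (stacklen < n) return -1;
    for (int i = 0; i < n; i++) {
        unsigned char name = (unsigned char)names[i];
        f->value[name] = stack[stacklen - n + i];
        f->bound[name] = 1;
    }
    return stacklen - n;
}

/* Fetch a variable, as $x would do. Returns 0 on success, 1 if unbound. */
int lookup(const frame *f, char name, int *out) {
    unsigned char idx = (unsigned char)name;
    if (!f->bound[idx]) return 1;
    *out = f->value[idx];
    return 0;
}
```

With a stack holding `10 20`, calling `capture()` with the names `"xy"` binds `x` to 10 and `y` to 20 and leaves the stack empty, matching the `10 20 (x y)` example. Longer names would require a real lookup structure, which is exactly the complexity the limitation avoids.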
We said that symbols normally trigger a procedure call. But symbols can also be pushed on the stack like any other value. To do so, symbols must be quoted, with the `'` character at the start.
    'Hello printnl
The `printnl` procedure prints the last element in the stack and also prints a newline character, so the above program will just print `Hello` on the screen. For now you may wonder what's the point of quoting symbols: you could just use strings, but later we'll see this is important in order to write Aocla programs that write Aocla programs.
Quoting also works with tuples, so if you want to push the tuple `(a b c)` on the stack, instead of capturing the variables a, b and c, you can write:
    '(a b c) printnl
## Inspecting the stack content
When you start the Aocla interpreter without a file name, it gets executed
in REPL mode (Read Eval Print Loop). You write a code fragment, press enter, the code gets executed and the current state of the stack is shown:
    aocla> 1
    1
    aocla> 2
    1 2
    aocla> ['a 'b "foo"]
    1 2 [a b "foo"]
This way you always know the stack content.
When you execute programs from files, in order to debug their executions you can print the stack content using the `showstack` procedure.
## User defined procedures
Aocla programs are just lists, and Aocla functions are lists bound to a
name. The name is given as a symbol, and the way to bind a list with a
symbol is an Aocla procedure itself, and not special syntax:
    [dup *] 'square def
The `def` procedure will bind the list `[dup *]` to the `square` symbol,
so later we can use the `square` symbol and it will call our procedure:
    aocla> 5 square
    25
Calling a symbol (unquoted symbols are called by default) that is not
bound to any program will produce an error:
    aocla> foobar
    Symbol not bound to procedure: 'foobar' in unknown:0
## Working with lists
Lists are the central data structure of the language: they are used to represent programs and are useful as a general purpose data structure to represent data. So most of the very few built-in procedures that Aocla offers are list manipulation procedures.
Showing examples via the REPL is probably the simplest way to explain how to write Aocla programs. This pushes an empty list on the stack:
    aocla> []
    []
We can add elements to the head or tail of the list, using the `<-` and `->` procedures, respectively:
    aocla> 1 swap ->
    [1]
    aocla> 2 swap ->
    [1 2]
Note that these procedures are designed to insert the penultimate element in the
stack into the list that is the last element in the stack, so,
in this specific case, we have to swap the order of the last two elements
on the stack before calling `->`. It is possible to design these procedures
in a different way, that is: to expect `list, element` on the stack instead
of `element, list`. There is no clear winner: one or the other approach is
better or worse depending on the use case. In Aocla, local variables make
all this less important compared to other stack based languages. It is always
possible to make things more explicit, like in the following example:
    aocla> [1 2 3]
    [1 2 3]
    aocla> (l) 4 $l ->
    [1 2 3 4]
    aocla> (l) 5 $l ->
    [1 2 3 4 5]
Then, to know how many elements there are in the list, we can use the
`len` procedure, which also works for other data types:
    aocla> ['a 'b 1 2]
    [a b 1 2]
    aocla> len
    4
    aocla> "foo"
    4 "foo"
    aocla> len
    4 3
Other useful list operations are the following, that you may find quite
        '$ $v cat swap ->           // Push $<varname> into the stack
        1 swap ->                   // Push 1
        '+ swap ->                  // Call +
        $v [] -> make-tuple swap -> // Capture back value into <varname>
        [] ->                       // Put all into a nested list
        'upeval swap ->             // Call upeval against the program
        $p def // Create the procedure // Bind to the specified proc name
    ] 'create-incrementing-proc def
Basically calling `create-incrementing-proc` will end up generating
a list like this (you can check the intermediate results by adding
`showstack` calls in your programs):
    [[$x 1 + (x)] upeval]
And finally the list is bound to the specified symbol using `def`.
Sometimes programs that write programs can be quite useful. They are a
central feature in many Lisp dialects. However in the specific case of
Aocla different procedures can be composed via the stack, and we also
have `upeval`, so I feel their usefulness is greatly reduced. Also note
that if Aocla were a serious language, it would have a lot more constructs
to make writing programs that write programs a lot simpler than the above. Anyway, as you saw earlier, when we implemented the `repeat` procedure, in Aocla
you can already do interesting stuff without using this programming
paradigm.
Ok, I think that's enough. We saw the basics of stack languages, the specific
stuff Aocla adds and what the language feels like. This isn't a course
on stack languages, nor would I be the best person to talk about the
topic. This is a course on how to write a small interpreter in C, so
There are a few important things to note here. This may look like just an
extension of the original puzzle 13 code, but look at these differences:
1. We now use reference counting. When the object is allocated, it gets a *refcount* of 1. Then the functions `retain()` and `release()` are used to increment the reference count when we store the same object elsewhere, or to decrement it when we want to remove a reference. When the reference count drops to zero, the object gets freed.
2. The object types are now all powers of two. This means we can store, or pass to functions, multiple types at once in a single integer, just by performing a bitwise or. It's useful: there is no need for functions with a variable number of arguments just to pass many types.
3. There is some information about the line number where a given object was defined in the source code. Aocla may be a toy, but a toy that will try to give you some stack trace if there is a runtime error.
Note that in this implementation, freeing deeply nested data structures will produce many recursive calls. This can be avoided using lazy freeing, but that is not needed for something like Aocla.
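To make the three points above concrete, here is a minimal sketch of an object with power-of-two type tags, a line number and reference counting. It is not the actual Aocla definition: the `sobj` structure, the `SOBJ_*` constants, the `live_objects` counter and the helper bodies are assumptions made for this example (only the names `retain()` and `release()` come from the text), and only integers and lists are modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Types are powers of two, so they can be combined with a bitwise or. */
#define SOBJ_INT  (1<<0)
#define SOBJ_LIST (1<<1)
#define SOBJ_BOOL (1<<2)

typedef struct sobj {
    int type;
    int refcount;
    int line;               /* Source line where the object was defined. */
    int i;                  /* Value, for SOBJ_INT and SOBJ_BOOL. */
    struct sobj **ele;      /* Elements, for SOBJ_LIST. */
    size_t len;
} sobj;

int live_objects = 0;       /* Only used to observe frees in this sketch. */

sobj *new_object(int type, int line) {
    sobj *o = calloc(1, sizeof(*o));
    if (o == NULL) abort();
    o->type = type;
    o->refcount = 1;        /* The creator holds the first reference. */
    o->line = line;
    live_objects++;
    return o;
}

void retain(sobj *o) { o->refcount++; }

/* Drop a reference. Freeing a list releases its elements recursively. */
void release(sobj *o) {
    assert(o->refcount > 0);
    if (--o->refcount > 0) return;
    if (o->type == SOBJ_LIST) {
        for (size_t j = 0; j < o->len; j++) release(o->ele[j]);
        free(o->ele);
    }
    free(o);
    live_objects--;
}

/* Append an element: the list takes over the caller's reference. */
void list_push(sobj *l, sobj *e) {
    assert(l->type == SOBJ_LIST);
    sobj **ele = realloc(l->ele, sizeof(sobj*) * (l->len + 1));
    if (ele == NULL) abort();
    l->ele = ele;
    l->ele[l->len++] = e;
}

/* Accept several types at once with a single integer argument. */
int type_matches(const sobj *o, int typemask) {
    return (o->type & typemask) != 0;
}
```

A call such as `type_matches(o, SOBJ_INT|SOBJ_BOOL)` shows the benefit of the power-of-two encoding: one integer argument describes a whole set of accepted types. The recursion inside `release()` is the one mentioned in the note about deeply nested structures.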
So, thanks to our parser, we can take an Aocla program, in the form of a string, parse it and get an Aocla object (`obj*` type) back. Now, in order to run an Aocla program, we have to *execute* this object. Stack based languages are particularly simple to execute: we just go from left to right, and depending on the object type, we do a different action:
* If the object is a symbol (and is not quoted, see the `quoted` field in the object structure), we try to lookup a procedure with that name, and if it exists we execute the procedure. How? By recursively executing the list bound to the symbol.
* If the object is a tuple with single-character elements, we capture values from the stack into the corresponding variables.
* If it's a symbol starting with `$` we push the variable on the stack, or if the variable is not bound we raise an error.
* For any other type of object, we just push it on the stack.
The function responsible for executing the program is called `eval()`, and is so short we can put it fully here, but I'll present the function split in different parts, to explain each one carefully. I will start showing just the first three lines, as they already tell us something.
Here there are three things going on. `eval()` takes a context and a list. The list is our program, and it is scanned left-to-right, as Aocla programs are executed left to right, word by word. So all is obvious but the context: what is an execution context for our program?
The stack frame has a pointer to the previous stack frame. This is useful both in order to implement `upeval` and to show a stack trace when an exception happens and the program is halted.
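The role of the pointer to the previous frame can be illustrated with a small sketch that builds a chain of frames and walks it to produce a trace. This is an assumption-laden example, not the Aocla implementation: the `sframe` type and the `push_frame`, `pop_frame` and `stack_trace` functions are hypothetical, and a frame here only stores a procedure name.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical frame: a procedure name and a link to the caller. */
typedef struct sframe {
    const char *procname;
    struct sframe *prev;    /* Frame of the caller, NULL at top level. */
} sframe;

/* Create the frame for a procedure called from 'caller'. */
sframe *push_frame(sframe *caller, const char *procname) {
    sframe *f = malloc(sizeof(*f));
    if (f == NULL) abort();
    f->procname = procname;
    f->prev = caller;
    return f;
}

/* Free a frame and return the caller's one, which becomes current again. */
sframe *pop_frame(sframe *f) {
    sframe *prev = f->prev;
    free(f);
    return prev;
}

/* Write a trace like "inner <- outer <- toplevel" into buf by following
 * the prev pointers. Returns the number of frames fully written. */
int stack_trace(const sframe *f, char *buf, size_t buflen) {
    int depth = 0;
    assert(buflen > 0);
    buf[0] = '\0';
    while (f != NULL) {
        size_t used = strlen(buf);
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         depth > 0 ? " <- " : "", f->procname);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0';   /* Drop the partially written frame. */
            break;
        }
        f = f->prev;
        depth++;
    }
    return depth;
}
```

The same chain that produces the trace is what lets a procedure reach the frame of its caller: following `prev` once gives access to the caller's locals, which is the kind of access `upeval` needs.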
We can continue looking at eval() now. We stopped at the `for` loop, so now we are inside the iteration doing something with each element of the list:
The essence of the loop is a big `switch` statement doing something different depending on the object type. The object is just the current element of the list. The first case is the tuple. Tuples capture local variables, unless they are quoted like this:
So if the tuple is not quoted, we check if there are enough stack elements
according to the tuple length. Then, element after element, we move objects
from the Aocla stack to the stack frame, into the array representing the locals. Note that there could be already an object bound to a given local, so we `release()` it before the new assignment.
For symbols, as usual, we check if the symbol is quoted, and in such a case we just push it on the stack. Otherwise, we handle two different cases. The above is the one where symbol names start with a `$`. It is, basically, the reverse of
what we saw earlier in tuples capturing local vars. This time the local variable is transferred to the stack. However *we still keep the reference* in the local variable array, as the program may want to push the same variable again and again, so, after pushing the object on the stack, we have to call `retain()` to increment the reference count of the object.
If the symbol does not start with `$`, then it's a procedure call:
    } else { /* Call procedure. */
        proc = lookupProc(ctx,o->str.ptr);
        if (proc == NULL) {
            setError(ctx,o->str.ptr,
                "Symbol not bound to procedure");
            return 1;
        }
        if (proc->cproc) {
            /* Call a procedure implemented in C. */
            aproc *prev = ctx->frame->curproc;
            ctx->frame->curproc = proc;
            int err = proc->cproc(ctx);
            ctx->frame->curproc = prev;
            if (err) return err;
        } else {
            /* Call a procedure implemented in Aocla. */
            stackframe *oldsf = ctx->frame;
            ctx->frame = newStackFrame(ctx);
            ctx->frame->curproc = proc;
            int err = eval(ctx,proc->proc);
            freeStackFrame(ctx->frame);
            ctx->frame = oldsf;
            if (err) return err;
        }
    }
The `lookupProc()` function just scans a linked list of procedures
and returns the matching procedure or, if there is no such procedure defined, NULL.
Now what happens immediately after is much more interesting. Aocla procedures
are just list objects, but it is possible to implement Aocla procedures
directly in C. If the `cproc` is not NULL, then it is a C function pointer
implementing a procedure, otherwise the procedure is *user defined*, written
in Aocla, and we need to evaluate it, with a nested `eval()` call.
As you can see, recursion is crucial in writing interpreters.
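The linked list scan performed by `lookupProc()` can be sketched as follows. This is a simplified illustration and not the real Aocla code: the `sproc` structure and the `lookup_proc`, `add_proc` and `proc_answer` names are hypothetical, and the C procedures here take no context argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical procedure entry in a singly linked list. */
typedef struct sproc {
    const char *name;
    int (*cproc)(void);     /* C implementation, or NULL if user defined. */
    struct sproc *next;
} sproc;

/* Return the procedure with the given name, or NULL if not found. */
sproc *lookup_proc(sproc *head, const char *name) {
    assert(name != NULL);
    for (sproc *p = head; p != NULL; p = p->next) {
        if (strcmp(p->name, name) == 0) return p;
    }
    return NULL;
}

/* Add a procedure at the head of the list and return the new head.
 * If the name exists already, its implementation is replaced. */
sproc *add_proc(sproc *head, const char *name, int (*cproc)(void)) {
    sproc *p = lookup_proc(head, name);
    if (p != NULL) {
        p->cproc = cproc;
        return head;
    }
    p = malloc(sizeof(*p));
    if (p == NULL) abort();
    p->name = name;
    p->cproc = cproc;
    p->next = head;
    return p;
}

/* Example procedure implemented in C. */
int proc_answer(void) { return 42; }
```

A linear scan is slow with many procedures, but for a small language with a handful of built-ins it keeps the code short, and a hash table could replace the list later without changing the callers.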
Another important thing is that each new Aocla procedure has its own set
of local variables. The scope of local variables, in Aocla, is the
lifetime of the procedure call, like in many other languages. So before
calling an Aocla procedure we allocate a new stack frame with `newStackFrame()`, then we call `eval()`, free the stack frame and restore the old one. Procedures implemented in C don't need a stack frame, as they will not make any use of Aocla local variables.
Here we cheat: the code to implement each procedure would be almost the same so we check the name of the procedure called, and bind all the operators to the same function:
So if the object is already unshared (its *refcount* is one), just return it as it is. Otherwise create a copy and remove a reference from the original object. This may look odd, but think about it: the invariant here should be that the caller of this function is the only owner of this object. If we want the caller to be able to completely ignore what happened inside the function, then, if the object was shared and we returned a copy to the caller, the reference the caller had to the old object should be gone. Let's look at the following example:
    obj *o = stackPop(ctx);
    o = getUnsharedObject(o);
    doSomethingThatChanges(o);
    stackPush(ctx,o);
Stack pop and push functions don't change the reference counting of the object,
so if the object is not shared we get it with a single reference, change it,
push it on the stack and the object has still a single reference.
Now imagine that, instead, the object is shared and also lives in a
variable. In this case we pop an object that has two references, call
I'll not show the `deepCopy()` function: it just allocates a new object of the specified type and copies the content. But guess what? It's a recursive function.
That's it, and thanks for reading this far. To know more about interpreters you have only one thing to do: write your own, or radically modify Aocla in some crazy ways. Get your hands dirty, it's super fun and rewarding. I can only promise that what you will learn will be worthwhile, even if you'll never write an interpreter again.
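For readers who want something concrete to experiment with, here is a minimal sketch of how a recursive deep copy and the copy-on-write helper described above could fit together. It is an illustration under stated assumptions, not the real Aocla code: the `cobj` type models only integers and lists, and `new_cobj`, `deep_copy` and `get_unshared` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define COBJ_INT  1
#define COBJ_LIST 2

typedef struct cobj {
    int type;
    int refcount;
    int i;                  /* Value, for COBJ_INT. */
    struct cobj **ele;      /* Elements, for COBJ_LIST. */
    size_t len;
} cobj;

cobj *new_cobj(int type) {
    cobj *o = calloc(1, sizeof(*o));
    if (o == NULL) abort();
    o->type = type;
    o->refcount = 1;
    return o;
}

/* Recursively copy an object. The copy has a single reference and
 * shares nothing with the original. */
cobj *deep_copy(const cobj *o) {
    cobj *c = new_cobj(o->type);
    if (o->type == COBJ_INT) {
        c->i = o->i;
    } else if (o->len > 0) {
        c->ele = malloc(sizeof(cobj*) * o->len);
        if (c->ele == NULL) abort();
        for (size_t j = 0; j < o->len; j++)
            c->ele[j] = deep_copy(o->ele[j]);
        c->len = o->len;
    }
    return c;
}

/* Return an object the caller can modify: the object itself if the
 * caller is the only owner, otherwise a copy, after dropping the
 * caller's reference to the shared original. */
cobj *get_unshared(cobj *o) {
    assert(o->refcount > 0);
    if (o->refcount == 1) return o;
    o->refcount--;          /* Still > 0, so nothing to free here. */
    return deep_copy(o);
}
```

After `get_unshared()` returns a copy, modifying it leaves the original untouched, so any variable still holding the original keeps seeing the old value.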