2. bison compiles `turtle.y` to `turtle_parser.c` and generates `turtle_parser.h`.
3. flex compiles `turtle.lex` to `turtle_lex.c`.
4. gcc compiles `application.c`, `resources.c`, `turtle_parser.c` and `turtle_lex.c` with `turtle.h`, `turtle_lex.h`, `resources.h` and `turtle_parser.h`.
The file consists of three sections, which are separated by "%%" (lines 18 and 56).
They are definitions, rules and user code sections.
### Definitions section
First, look at the definitions section.
- 1-12: Lines between "%top{" and "}" are C source code.
They will be copied to the top of the generated C source file.
- 2-3: The function `strlen`, in line 62, is declared in `string.h`.
The function `atof`, in line 37, is declared in `stdlib.h`.
- 6-8: The current input position is held in `nline` and `ncolumn`.
The function `get_location` (lines 58-63) sets `yylloc` to point to the start and end of `yytext` in the buffer.
This function is declared here so that it can be called before the function is defined.
- 11: A `GSList` is used to keep track of allocated memory.
- 14: This option (`%option noyywrap`) must be specified when the scanner has only a single input source, so that it does not call `yywrap` at the end of the input. Refer to "9 The Generated Scanner" in the flex documentation in your distribution for further information.
(The documentation is not on the internet.)
- 16-17: `REAL_NUMBER` and `IDENTIFIER` are names.
A name begins with a letter or an underscore followed by zero or more letters, digits, underscores (`_`) or dashes (`-`).
Each name is followed by a regular expression, which is its definition.
The names are used in the rules section and expand to their definitions.
You can leave out such definitions and use the regular expressions directly in the rules section.
### Rules section
This section is the most important part.
Rules consist of patterns and actions.
For example, line 37 is a rule.
- `{REAL_NUMBER}` is a pattern.
- `get_location (yytext); yylval.NUM = atof (yytext); return NUM;` is an action.
`{REAL_NUMBER}` is defined in the 16th line, so it expands to `(0|[1-9][0-9]*)(\.[0-9]+)?`.
This regular expression matches numbers like `0`, `12` and `1.5`.
If the input is a number, it matches the pattern in line 37.
Then the matched text is assigned to `yytext` and the corresponding action is executed.
The function `get_location` updates the location variables.
The action then assigns `atof (yytext)`, which is the double-precision number converted from `yytext`, to `yylval.NUM` and returns `NUM`.
`NUM` is an integer defined by `turtle.y`.
The scanner generated by flex and compiled by the C compiler provides the function `yylex`.
If `yylex` is called and the input is "123.4", then it works as follows.
1. The string "123.4" matches `{REAL_NUMBER}`.
2. The location variables `ncolumn` and `yylloc` are updated.
3. `atof` converts the string "123.4" to the double-precision floating point number 123.4.
4. It is assigned to `yylval.NUM`.
5. `yylex` returns `NUM` to the caller.
Then the caller knows the input is `NUM` (number), and its value is 123.4.
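The steps above can be imitated by hand without flex.
The following is only a sketch: `match_real_number`, `scan_number`, `semantic_value` and the token code `TOKEN_NUM` are hypothetical names for illustration, and the real scanner is generated by flex from the rule in line 37.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; the real token codes come from turtle_parser.h. */
enum { TOKEN_NONE = 0, TOKEN_NUM = 258 };
double semantic_value;   /* plays the role of yylval.NUM */

/* Return the length of the prefix of s that matches
   (0|[1-9][0-9]*)(\.[0-9]+)?, or 0 if there is no match. */
size_t
match_real_number (const char *s) {
  size_t i = 0;

  if (s[i] == '0') {
    ++i;
  } else if (s[i] >= '1' && s[i] <= '9') {
    while (isdigit ((unsigned char) s[i]))
      ++i;
  } else {
    return 0;
  }
  /* Optional fraction: the dot must be followed by at least one digit. */
  if (s[i] == '.' && isdigit ((unsigned char) s[i + 1])) {
    ++i;
    while (isdigit ((unsigned char) s[i]))
      ++i;
  }
  return i;
}

/* Imitate one call of yylex on a number: store the value, return the token kind. */
int
scan_number (const char *input) {
  char text[64];   /* plays the role of yytext */
  size_t len = match_real_number (input);

  if (len == 0 || len >= sizeof text)
    return TOKEN_NONE;
  memcpy (text, input, len);
  text[len] = '\0';
  semantic_value = atof (text);
  return TOKEN_NUM;
}
```

Calling `scan_number ("123.4")` returns `TOKEN_NUM` and leaves the converted number in `semantic_value`, which mirrors how the parser receives the token kind and its semantic value.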
- 19-55: Rules section.
- 20: A comment begins with `#` followed by any characters except a newline.
No action happens.
- 21: A white space just increases the variable `ncolumn` by one.
- 22: A tab is assumed to be equal to eight spaces.
- 23: A new line increases the variable `nline` by one and resets `ncolumn`.
- 25-35: Keywords just update the location variables `ncolumn` and `yylloc`, and return the codes of the keywords.
- 37: Real number constant.
- 38: An identifier, which is defined in line 17.
It begins with a letter followed by zero or more letters or digits.
The location variables are updated and the name of the identifier is assigned to `yylval.ID`.
The memory for the name is allocated by the function `g_strdup`.
The memory is registered to the list (a `GSList`).
The memory will be freed after the runtime routine finishes.
The rule returns `ID`.
- 43-54: Symbols just update the location variables and return the codes.
The code is the same as the symbol itself.
- 55: If the input doesn't match any of the patterns above, it is an error.
The rule returns `YYUNDEF`.
### User code section
This section is just copied to C source file.
- 58-63: The function `get_location`.
The location of the input is recorded in `nline` and `ncolumn`.
These two variables are for the scanner.
The variable `yylloc` is shared by the scanner and the parser.
It is a C structure and has four members, `first_line`, `first_column`, `last_line` and `last_column`.
They point to the start and end of the current input text.
- 65: `YY_BUFFER_STATE` is a pointer type that points to the input buffer.
- 67-70: `init_flex` is called by the `run_cb` signal handler, which is called when the `Run` button is clicked.
`run_cb` calls `init_flex` with one argument, which is a copy of the content of the GtkTextBuffer.
`yy_scan_string` sets the input buffer to read from the text.
- 72-75: `finalize_flex` is called after the runtime routine finishes.
It deletes the input buffer.
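The bookkeeping done by `get_location` can be sketched as follows.
This is not the code from `turtle.lex`: `location_t`, `loc` and `update_location` are hypothetical stand-ins for the location type, `yylloc` and `get_location`, and the sketch assumes that lines and columns start at one and that a token never contains a newline.

```c
#include <string.h>

/* Hypothetical stand-in for the location structure with its four members. */
typedef struct {
  int first_line;
  int first_column;
  int last_line;
  int last_column;
} location_t;

location_t loc;    /* plays the role of yylloc */
int nline = 1;     /* current line (assumed to start at 1) */
int ncolumn = 1;   /* column where the next token starts (assumed to start at 1) */

/* Record where text starts and ends, then move ncolumn past it. */
void
update_location (const char *text) {
  loc.first_line = loc.last_line = nline;
  loc.first_column = ncolumn;
  ncolumn += (int) strlen (text);
  loc.last_column = ncolumn - 1;
}
```

For the input `fd distance`, calling `update_location ("fd")` sets the columns to 1 and 2 and moves `ncolumn` to 3, so the scanner and the parser agree on where the token lies.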
## Turtle.y
Turtle.y has more than 800 lines, so it is difficult to explain all of the source code.
Therefore, I will explain the key points and leave out the less important parts.
### What does bison do?
Bison creates a C source file from a bison source file.
A bison source file is a text file.
A parser analyzes program source code according to its grammar.
Suppose we have the following turtle source file.
~~~
fc (1,0,0) # Foreground color is red, rgb = (1,0,0).
pd # Pen down.
distance = 100
angle = 90
fd distance # Go forward by distance (100) pixels.
tr angle # Turn right by angle (90) degrees.
~~~
The parser calls `yylex` to get a token.
The token consists of its type (token kind) and value (semantic value).
So, the parser gets items in the following table whenever it calls `yylex`.
| |token kind|yylval.ID|yylval.NUM|
|:-:|:--------:|:-------:|:--------:|
| 1 | FC | | |
| 2 | ( | | |
| 3 | NUM | | 1.0 |
| 4 | , | | |
| 5 | NUM | | 0.0 |
| 6 | , | | |
| 7 | NUM | | 0.0 |
| 8 | ) | | |
| 9 | PD | | |
|10 | ID |distance | |
|11 | = | | |
|12 | NUM | | 100.0 |
|13 | ID | angle | |
|14 | = | | |
|15 | NUM | | 90.0 |
|16 | FD | | |
|17 | ID |distance | |
|18 | TR | | |
|19 | ID | angle | |
Bison source code specifies the grammar rules of the turtle language.
For example, `fc (1,0,0)` is called a primary procedure.
A procedure is like a void function in C.
It doesn't return a value.
Programmers can define their own procedures.
On the other hand, `fc` is a built-in procedure.
Such procedures are called primary procedures.
They are described in the bison source code like this:
~~~
primary_procedure: FC '(' expression ',' expression ',' expression ')';
expression: ID | NUM;
~~~
This means:
- Primary procedure is FC followed by '(', expression, ',', expression, ',', expression and ')'.
- expression is ID or NUM.
The description above is called BNF (Backus-Naur form).
More precisely, it is similar to BNF.
The first line of the turtle source file above is read as the following tokens:
~~~
FC '(' NUM ',' NUM ',' NUM ')';
~~~
You can easily see that this is a `primary_procedure`.
The parser of the turtle language analyzes the turtle source code in the same way.
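The check you have just done by eye can be written down for the two simplified rules above.
This is only a sketch for illustration: bison generates a table-driven parser and does not produce code like this, and the token codes `FC`, `ID` and `NUM` below are hypothetical values (the real ones are defined in `turtle_parser.h`).

```c
#include <stddef.h>

/* Hypothetical token kinds. Single-character symbols use their character code. */
enum { FC = 258, ID, NUM };

/* expression: ID | NUM; */
int
is_expression (int token) {
  return token == ID || token == NUM;
}

/* primary_procedure: FC '(' expression ',' expression ',' expression ')'; */
int
is_fc_procedure (const int *tokens, size_t n) {
  return n == 8
         && tokens[0] == FC
         && tokens[1] == '('
         && is_expression (tokens[2])
         && tokens[3] == ','
         && is_expression (tokens[4])
         && tokens[5] == ','
         && is_expression (tokens[6])
         && tokens[7] == ')';
}
```

The token sequence of `fc (1,0,0)` passes this check, and so does `fc (r,0,0)` because an `ID` is also an expression.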
The grammar of turtle is described in the [document](../src/turtle/turtle_doc.md).
The following is an extract from the document.
~~~
program:
statement
| program statement
;
statement:
primary_procedure
| procedure_definition
;
primary_procedure:
PU
| PD
| PW expression
| FD expression
| TR expression
| BC '(' expression ',' expression ',' expression ')'
| FC '(' expression ',' expression ',' expression ')'
| ID '=' expression
| IF '(' expression ')' '{' primary_procedure_list '}'
| RT
| RS
| ID '(' ')'
| ID '(' argument_list ')'
;
procedure_definition:
DP ID '(' ')' '{' primary_procedure_list '}'
| DP ID '(' parameter_list ')' '{' primary_procedure_list '}'
;
parameter_list:
ID
| parameter_list ',' ID
;
argument_list:
expression
| argument_list ',' expression
;
primary_procedure_list:
primary_procedure
| primary_procedure_list primary_procedure
;
expression:
expression '=' expression
| expression '>' expression
| expression '<' expression
| expression '+' expression
| expression '-' expression
| expression '*' expression
| expression '/' expression
| '-' expression %prec UMINUS
| '(' expression ')'
| ID
| NUM
;
~~~
The grammar rule defines `program` first.
- A program is a statement or a program followed by a statement.
The definition is recursive.
- `statement` is a program.
- `statement statement` is `program statement`.
Therefore, it is a program.
- `statement statement statement` is `program statement`.
Therefore, it is a program.
In this way, you can see that a list of statements is a program.
`program` and `statement` aren't tokens.
They don't appear in the input.
They are called nonterminal symbols.
On the other hand, tokens are called terminal symbols.
The word "token" is used here in a wide sense: it includes both tokens and symbols which appear in the input.
Nonterminal symbols are often shortened to nterm.
~~~C
node_t *
tree1 (int type, node_t *child1, node_t *child2, node_t *child3) {
  node_t *new_node;

  list = g_slist_prepend (list, g_malloc (sizeof (node_t)));
  new_node = (node_t *) list->data;
  new_node->type = type;
  child1(new_node) = child1;
  child2(new_node) = child2;
  child3(new_node) = child3;
  return new_node;
}

node_t *
tree2 (int type, double value) {
  node_t *new_node;

  list = g_slist_prepend (list, g_malloc (sizeof (node_t)));
  new_node = (node_t *) list->data;
  new_node->type = type;
  value(new_node) = value;
  return new_node;
}

node_t *
tree3 (int type, char *name) {
  node_t *new_node;

  list = g_slist_prepend (list, g_malloc (sizeof (node_t)));
  new_node = (node_t *) list->data;
  new_node->type = type;
  name(new_node) = name;
  return new_node;
}
~~~
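To see how such constructor functions fit together, here is a minimal, self-contained sketch that builds the tree for `distance = 100`.
Everything in it is an assumption made for illustration: the simplified `node` structure, the node types and the function names are hypothetical, and it uses `calloc` and an explicit `free_tree` instead of GLib and the list of allocated memory.

```c
#include <stdlib.h>

/* Hypothetical node types and layout; the real node_t is defined in turtle.y. */
enum { N_NUM, N_ID, N_ASSIGN };

typedef struct node node;
struct node {
  int type;
  node *child1, *child2;   /* used by interior nodes */
  const char *name;        /* used by N_ID */
  double value;            /* used by N_NUM */
};

/* Allocate a zero-filled node of the given type. */
node *
new_node (int type) {
  node *n = calloc (1, sizeof (node));

  if (n == NULL)
    abort ();
  n->type = type;
  return n;
}

/* Build the tree for "name = value": N_ASSIGN with an N_ID and an N_NUM child. */
node *
build_assignment (const char *name, double value) {
  node *id = new_node (N_ID);
  node *num = new_node (N_NUM);
  node *assign = new_node (N_ASSIGN);

  id->name = name;
  num->value = value;
  assign->child1 = id;
  assign->child2 = num;
  return assign;
}

/* Free a tree recursively (the name strings are not owned by the tree). */
void
free_tree (node *n) {
  if (n == NULL)
    return;
  free_tree (n->child1);
  free_tree (n->child2);
  free (n);
}
```

Registering every allocation in one list, as the functions above do, avoids the recursive `free_tree`: the whole list can be freed in one go after the runtime routine finishes.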
#### Symbol table
Variables and user-defined procedures are registered in a symbol table.
This table is a C array.
It should be replaced by a more appropriate data structure with memory allocation in a future version.
- Variables are registered with their names and values.
- Procedures are registered with their names and pointers to the nodes of the procedures.
Therefore the table has the following fields.
- type to identify variable or procedure
- name
- value or pointer to a node
~~~C
#define MAX_TABLE_SIZE 100
enum {
PROC,
VAR
};
typedef union _object_t object_t;
union _object_t {
node_t *node;
double value;
};
struct {
int type;
char *name;
object_t object;
} table[MAX_TABLE_SIZE];
int tp;
void
init_table (void) {
tp = 0;
}
~~~
`init_table` initializes the table.
This must be called before any registration.
There are five functions to access the table:
- `proc_install` installs a procedure.
- `var_install` installs a variable.
- `proc_lookup` looks up a procedure. If the procedure is found, it returns a pointer to the node. Otherwise it returns NULL.
- `var_lookup` looks up a variable. If the variable is found, it returns TRUE and sets the pointer (argument) to point to the value. Otherwise it returns FALSE.
- `var_replace` replaces the value of a variable. If the variable hasn't been registered yet, it installs the variable.
~~~C
int
tbl_lookup (int type, char *name) {
  int i;

  if (tp == 0)
    return -1;
  for (i = 0; i < tp; ++i)
    if (type == table[i].type && strcmp (name, table[i].name) == 0)
      return i;
  return -1;
}

      if (! stack_replace (name, d)) /* First, tries to replace the value in the stack (parameter). */
        var_replace (name, d); /* If the above fails, tries to replace the value in the table. If the variable isn't in the table, installs it. */
      break;
    case N_IF:
      if (eval (child1(node)))
        execute (child2(node));
      break;
    case N_RT:
      ret_level--;
      break;
    case N_RS:
      pen = TRUE;
      angle = 90.0;
      cur_x = 0.0;
      cur_y = 0.0;
      line_width = 2.0;
      fc.red = 0.0; fc.green = 0.0; fc.blue = 0.0;
      /* To change background color, use bc. */
      break;
    case N_procedure_call:
      name = name(child1(node));
      node_t *proc = proc_lookup (name);
      if (! proc)
        runtime_error ("Procedure %s not defined.\n", name);
      if (strcmp (name, name(child1(proc))) != 0)
        runtime_error ("Unexpected error. Procedure %s is called, but invoked procedure is %s.\n", name, name(child1(proc)));
      /* make tuples (parameter (name), argument (value)) and push them to the stack */
      node_t *param_list;
      node_t *arg_list;
      param_list = child2(proc);
      arg_list = child2(node);
      if (param_list == NULL) {
        if (arg_list == NULL) {
          stack_push (NULL, 0.0); /* number of argument == 0 */
        } else
          runtime_error ("Procedure %s has different number of argument and parameter.\n", name);
      } else {
        /* Don't change the stack until finish evaluating the arguments. */
        #define TEMP_STACK_SIZE 20
        char *temp_param[TEMP_STACK_SIZE];
        double temp_arg[TEMP_STACK_SIZE];
        n = 0;
        for (; param_list->type == N_parameter_list; param_list = child1(param_list)) {
          if (arg_list->type != N_argument_list)
            runtime_error ("Procedure %s has different number of argument and parameter.\n", name);
          if (n >= TEMP_STACK_SIZE)
            runtime_error ("Too many parameters. the number must be %d or less.\n", TEMP_STACK_SIZE);
          temp_param[n] = name(child2(param_list));
          temp_arg[n] = eval (child2(arg_list));
          arg_list = child1(arg_list);
          ++n;
        }
        if (param_list->type == N_ID && arg_list->type != N_argument_list) {
          temp_param[n] = name(param_list);
          temp_arg[n] = eval (arg_list);
          if (++n >= TEMP_STACK_SIZE)
            runtime_error ("Too many parameters. the number must be %d or less.\n", TEMP_STACK_SIZE);
          temp_param[n] = NULL;
          temp_arg[n] = (double) n;
          ++n;
        } else
          runtime_error ("Unexpected error.\n");
        for (i = 0; i < n; ++i)
          stack_push (temp_param[i], temp_arg[i]);
      }
      ret_level = ++proc_level;
      execute (child3(proc));
      ret_level = --proc_level;
      stack_return ();
      break;
    case N_procedure_definition:
      name = name(child1(node));
      proc_install (name, node);
      break;
    case N_primary_procedure_list:
      execute (child1(node));
      execute (child2(node));
      break;
    default:
      runtime_error ("Unknown statement.\n");
  }
}
~~~
An `N_procedure_call` node is created by the parser when it finds a call to a user-defined procedure.
The procedure has been defined in an earlier statement.
Suppose the parser reads the following example code.
~~~
dp drawline (angle, distance) {
tr angle
fd distance
}
drawline (90, 100)
drawline (90, 100)
drawline (90, 100)
drawline (90, 100)
~~~
This example draws a square.
When the parser reads lines one to four, it creates nodes like this:
![Nodes of drawline](../image/tree2.png)
The runtime routine just stores the procedure in the symbol table with its name and node.
![Symbol table](../image/table.png)
When the parser reads the fifth line in the example, it creates nodes like this:
![Nodes of procedure call](../image/proc_call.png)
When the runtime routine meets an `N_procedure_call` node, it behaves like this:
1. Searches the symbol table for the procedure by name.
2. Gets pointers to the node of the parameters and the node of the body.
3. Creates a temporary stack.
Makes a tuple of each parameter name and argument value.
Pushes the tuples onto the temporary stack, and finally pushes (NULL, number of parameters).
If no error occurs, copies them from the temporary stack to the parameter stack.
4. Increases `proc_level` by one.
Sets `ret_level` to the same value as `proc_level`.
`proc_level` is zero when the runtime routine runs the main routine.
If it goes into a procedure, `proc_level` increases by one.
Therefore, `proc_level` is the depth of the procedure call.
`ret_level` is the level to return to.
If it is the same as `proc_level`, the runtime routine executes the commands of the procedure in order.
If it is smaller than `proc_level`, the runtime routine doesn't execute commands until the two levels become equal again.
`ret_level` is used to return from the procedure.
5. Executes the node of the body of the procedure.
6. Decreases `proc_level` by one.
Sets `ret_level` to the same value as `proc_level`.
Calls `stack_return`.
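The parameter stack used in steps 3 and 6 can be sketched as follows.
This is an illustration under assumptions, not the code of `turtle.y`: the names `push_tuple`, `lookup_param` and `pop_frame` and the size `STACK_SIZE` are hypothetical. It only shows how the final (NULL, number of parameters) tuple lets the runtime find and remove the parameters of the current call.

```c
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 100   /* hypothetical capacity */

struct {
  const char *name;
  double value;
} stack[STACK_SIZE];
int sp = 0;   /* index of the next free slot */

/* Push one (name, value) tuple. Returns 0 on overflow, 1 otherwise. */
int
push_tuple (const char *name, double value) {
  if (sp >= STACK_SIZE)
    return 0;
  stack[sp].name = name;
  stack[sp].value = value;
  ++sp;
  return 1;
}

/* Look up a parameter of the current call only.
   The top entry is (NULL, n); the n entries below it are the parameters. */
int
lookup_param (const char *name, double *value) {
  int n, i;

  if (sp == 0)
    return 0;
  n = (int) stack[sp - 1].value;
  for (i = sp - 2; i >= sp - 1 - n; --i)
    if (strcmp (name, stack[i].name) == 0) {
      *value = stack[i].value;
      return 1;
    }
  return 0;
}

/* Remove the current call: its n parameters and the (NULL, n) entry. */
void
pop_frame (void) {
  if (sp > 0)
    sp -= (int) stack[sp - 1].value + 1;
}
```

For `drawline (90, 100)`, the stack holds `("angle", 90)`, `("distance", 100)` and `(NULL, 2)` while the body runs, and `pop_frame` removes all three entries when the call returns.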
When the runtime routine meets an `N_RT` node, it decreases `ret_level` by one so that the following commands in the procedure are ignored by the runtime routine.
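The interplay of `proc_level` and `ret_level` can be tried out with a small simulation.
It is a sketch under assumptions: the command codes, the counter `executed` and the functions `run_command` and `call_procedure` are hypothetical, and a flat array of commands stands in for the node tree of a procedure body.

```c
/* Hypothetical command codes for this sketch. */
enum { CMD_DRAW, CMD_RETURN };

int proc_level = 0;   /* depth of the current procedure call */
int ret_level = 0;    /* level at which commands are executed */
int executed = 0;     /* number of commands that actually ran */

/* Run one command unless a return is in progress. */
void
run_command (int cmd) {
  if (ret_level < proc_level)
    return;        /* a return was requested: skip the rest of the body */
  if (cmd == CMD_RETURN)
    --ret_level;
  else
    ++executed;
}

/* Call a procedure whose body is the given list of commands. */
void
call_procedure (const int *body, int n) {
  int i;

  ret_level = ++proc_level;
  for (i = 0; i < n; ++i)
    run_command (body[i]);
  ret_level = --proc_level;
}
```

With the body `{CMD_DRAW, CMD_RETURN, CMD_DRAW, CMD_DRAW}`, only the first command runs, and both levels are back to zero after the call.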
#### Runtime entry and error functions
The function `run` is the entry point of the runtime routine.
The function `runtime_error` reports an error that occurs while the runtime routine runs.
(Errors which occur during parsing are called syntax errors and are reported by `yyerror`.)
After `runtime_error` reports an error, it stops the command execution and goes back to `run` to exit.
The functions `setjmp` and `longjmp` are used for this.
They are declared in `<setjmp.h>`.
`setjmp (buf)` saves state information in `buf` and returns zero.
`longjmp (buf, 1)` restores the state information from `buf`, so that `setjmp` appears to return again, this time with the value `1` (the second argument of `longjmp`).
Because the saved information is the state at the time `setjmp` was called, `longjmp` resumes the execution just after the `setjmp` call.
In the function `run`, `longjmp` resumes at the assignment to the variable `i`.
When `setjmp` is called, 0 is assigned to `i` and `execute(node_top)` is called.
On the other hand, when `longjmp` is called, 1 is assigned to `i` and `execute(node_top)` is not called.
`g_slist_free_full` frees all the allocated memory.
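The control flow described above can be reproduced in a few lines.
The following is a minimal sketch and not the `run` function of `turtle.y`: the names `fail`, `execute_sketch`, `run_sketch` and `commands_run` are hypothetical. It also tests the return value of `setjmp` directly in the `if` condition, which is one of the contexts in which the C standard allows `setjmp` to be called.

```c
#include <setjmp.h>

jmp_buf buf;            /* saved state, shared by the entry and error functions */
int commands_run = 0;   /* number of commands executed in the last run */

/* Report an error by jumping back to the setjmp call in run_sketch. */
void
fail (void) {
  longjmp (buf, 1);
}

/* Pretend to execute two commands; abort after the first one if asked to. */
void
execute_sketch (int should_fail) {
  ++commands_run;
  if (should_fail)
    fail ();
  ++commands_run;
}

/* Returns 0 if the execution finished normally and 1 if it was aborted. */
int
run_sketch (int should_fail) {
  commands_run = 0;
  if (setjmp (buf) == 0) {
    execute_sketch (should_fail);
    return 0;
  }
  /* We get here only through longjmp. Cleanup code would go here. */
  return 1;
}
```

`run_sketch (0)` executes both commands and returns 0, whereas `run_sketch (1)` stops after the first command and returns 1.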
The function `runtime_error` has a variable-length argument list.
~~~C
void runtime_error (char *format, ...)
~~~
This is implemented with the `<stdarg.h>` header file.
The `va_list` type variable `args` will refer to each argument in turn.
The function `va_start` initializes `args`.
The function `va_arg` returns an argument and moves the reference of `args` to the next one.
The function `va_end` cleans up everything necessary at the end.
The function `runtime_error` uses a format similar to that of the standard `printf` function.
But its format supports only `%s`, `%f` and `%d`.
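The use of `va_start`, `va_arg` and `va_end` can be sketched with a small formatter that supports the same three conversions.
It is an illustration under assumptions and not the code of `runtime_error`: the function `format_message` is hypothetical, and it writes into a caller-supplied buffer instead of reporting an error.

```c
#include <stdarg.h>
#include <stdio.h>

/* Format a message into out (size bytes, always terminated when size > 0).
   Only %s, %f and %d are interpreted; any other character is copied as is. */
void
format_message (char *out, size_t size, const char *format, ...) {
  va_list args;
  size_t used = 0;

  if (size == 0)
    return;
  out[0] = '\0';
  va_start (args, format);
  for (; *format != '\0' && used < size - 1; ++format) {
    int n;

    if (format[0] == '%' && format[1] == 's')
      n = snprintf (out + used, size - used, "%s", va_arg (args, char *));
    else if (format[0] == '%' && format[1] == 'f')
      n = snprintf (out + used, size - used, "%f", va_arg (args, double));
    else if (format[0] == '%' && format[1] == 'd')
      n = snprintf (out + used, size - used, "%d", va_arg (args, int));
    else {
      out[used++] = *format;   /* ordinary character */
      out[used] = '\0';
      continue;
    }
    ++format;                  /* skip the conversion character */
    if (n < 0)
      break;
    /* snprintf returns the untruncated length, so clamp to the buffer. */
    used = ((size_t) n >= size - used) ? size - 1 : used + (size_t) n;
  }
  va_end (args);
}
```

For example, `format_message (buf, sizeof buf, "Procedure %s not defined.", "foo")` stores `Procedure foo not defined.` in `buf`.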
The functions declared in `<setjmp.h>` and `<stdarg.h>` are explained in the well-known book "The C Programming Language" by Brian Kernighan and Dennis Ritchie.
I referred to the book to write the program above.
The program `turtle` is unsophisticated and unpolished.
If you want to make your own language, you need to learn much more.
I don't know of any good textbook about compilers and interpreters.
If you know a good book, please let me know.
However, the following resources are very useful (but old).
- Bison documentation
- Flex documentation
- "Software Tools" by Brian W. Kernighan and P. J. Plauger (1976)
- "The Unix Programming Environment" by Brian W. Kernighan and Rob Pike (1984)
- Source code of a language, for example, Ruby.
Lately, a lot of source code is available on the internet.
Reading source code may be the most useful thing for programmers.