How To Write A Tiny Basic Interpreter
This is a quick introduction in how to write a compiler. If you major in
Computer Science in college, you might take Compiler Construction as a
whole semester. More likely, you will have to get an advanced degree. I
mostly taught this course in grad school. You can read my book (The
Art of Compiler Design Prentice Hall, ISBN 0-13-048190-4) and download
the compiler source code.
So why am I talking about "A Tiny Basic Interpreter"?
First of all, why "Tiny"? That should be easy enough: It's so you will
still be here when I finish.
Then, why "Basic"? Fortran was the first real programming language, invented
in the early 1950s at IBM to program their new 704 super-computer. Except
for Lisp and the Europeans, every high-level programming language is an
adaptation of Fortran. C (and with it, Java) was one of those adaptations,
Basic was another. Kemeny
& Kurtz invented it so they could teach programming to new students.
The acronym came much later. The language was small enough that Bill Gates
-- then still in college back east somewhere -- saw he could do an implementation
on the first commercial personal computer, the Altair 8800. MITS (the NewMexico
company selling both the Altair and Basic) wanted $150 for a tiny paper
tape of software to run on a $400 computer. People out in California were
incensed. Dan Sokol carefully examined the whole program and found no copyright
notice in it (which placed it in the public domain under the Copyright
law then in effect), so he gave a copy to anybody in the HomeBrew Computer
Club willing to bring back two more. Gates was furious, but it was legal.
He never made that mistake again. The copyright law was also changed the
next year so that failure to include notice no longer invalidates the copyright.
Actually, all those pirate copies made Bill Gates the richest man in the
world, but that's another story.
So, why "Interpreter"? Basically, a compiler and an interpreter build on
the same mathematical theory, but an interpreter is easier to write. Generating
code to run later is a lot more work than just doing it on the fly. Not
overly difficult, just a lot of work.
Now let's talk about the Tiny Basic language. Every adequate computer programming
language does six things. They are spelled differently in different languages,
but if you can do these six things, you can program anything --
even a compiler or interpreter. I know, because I've done it. It helps
if you have enough memory to hold your program and data.
Here is a fragment of a small TinyBasic program. Notice how these lines
are all numbered. Fortran had optional line numbers, but they are obligatory
in the original Basic, and also in TinyBasic. You notice I numbered these
lines by tens. That way, if I need to insert some additional code, I just
use a line number in between. So let's look at how we do these six things
in TinyBasic...
First is Sequence. These lines are not in numerical order. Program statements
in TinyBasic (like Kemeny & Kurtz's original Basic) are sequenced by
line number. You can write your program in any order, but it executes in
line number order. The line numbers are analogous to the memory addresses
in a real computer. Programmers were still thinking like a computer back
then.
Oh yes, also GoTo statements. You don't see these in programming languages
any more, because they can make your code impossible to understand. GoTos
are still there in languages like Java, but they are somewhat crippled
and spelled "BREAK" (same thing). Structured programming
was invented after we learned how unhealthy GoTos are. But they are really
easy to program, one size fits all. Structured programming isn't that much
more complicated to implement, but it takes harder thinking (meaning: better
quality code, fewer mistakes) to write code that does the same thing without
GoTos.
Next in the list of six is Assignment. You need to be able to compute values
and assign those values to variables. In TinyBasic we do that with the
LET
statement.
TinyBasic is really tiny, there are only 26 variables, all integer,
named with the 26 letters of the alphabet. No declaration is needed, they
already exist. The original Basic had a letter and a digit, 260 variables,
and they took "numbers" (meaning floating point). TinyBasic is tiny. Floating
point did not exist in those early computers.
We have only the four (ahem) basic arithmetic operators, add, subtract,
multiply, and divide, plus parentheses. That's all. You can do a lot with
that.
There is only one conditional in TinyBasic, IF-THEN,
with three comparisons: less, equal, or greater. Nothing more. You might
ask how you do IF-THEN-ELSE?
It's easy...
There, I color-coded it: red is your THEN-part, blue
is the ELSE-part, and green is where they come back
together.You can do anything you want with that. You can even slice your
wrists, it just takes more code, easier to make mistakes, harder to debug.
Iteration is done with -- guess what? -- GoTo! Just jump back to a lower
line number. That's how the computer hardware does it. Enough said.
You can do subroutines with Gotos also, but TinyBasic is a little more
modern than that, probably because subroutines were already invented on
the 704. "Subroutine" is the old name for what we call "method" now, but
it's the same thing. So we have GOSUB and RETURN
to do that. There are no parameters, no function results, you use variables
for that. This is Tiny Basic.
Finally you have your basic I/O, PRINT and INPUT.
INPUT
accepts a single number from the current input line to be assigned to a
single variable, or if there is no input left, prints a question mark so
the user can type in some more numbers. PRINT prints
any number of quoted text strings and numbers on the output line. I won't
waste your time on the mechanics of I/O today. You can go
to my website and read the user manual and more than you ever wanted
to know about it. There are even links to source code in C and other modern
languages, if you care.
Now that you understand the language, we can start talking about the implementation.
OOPS being the technology fad of the day, we will
implement this as a single TBinterpreter class, supported by classes
TBio,
TBprogStore,
TBstack
and TBvarStore to do the obvious things. Well, maybe not so obvious...
The I/O class has a method to accept one whole line of input typed in by
the user. That line might be a numbered program line, or an unnumbered
command line, or else some input for the INPUT statement.
If it's input, we can ask for the value of the next number in the input
line.
If the program is generating output, it does it a little at a time,
so there are separate methods for text, numbers, and tabs between then,
then finally something to print the whole line when it's complete. Or the
class could print on the fly, and just output a line end when it's done.
That's an implementation detail for the convenience of whatever library
calls your interpreter is using.
The program storage class accepts numbered lines and returns the line number,
or else an unnumbered line which it stores as line #0 (and returns that
zero). This way the interpreter can execute command lines using the same
code as line-numbered program code. The interpreter tells it what line
to do next (it could be zero), and then there are methods for extracting
from that line a command number, and then the program code one character
at a time.
The interpreter does not know the line number sequence, so there's one
more method for getting the next line number after any given line number.
If you give it zero, it will return the first line number in the program.
That does not change the current line number, you do that explicitly using
its own method.
The stack class is pretty simple, push or pop a value.
The variable storage class is also pretty simple, you give it a variable
name (a single letter), and then either ask it to store a number there,
or else return whatever number is already stored in that variable.
The main interpreter class is where all the heavy lifting happens. We will
look more closely at some of the methods.
Programmers make errors -- not you of course, but your users -- so we need
to be able to tell them what went wrong and stop the program. We do not
stop the interpreter, we just drop out of running and back into command-line
mode. If the error happens inside some deeply nested expression, it might
appear that there are other errors, but we report only the first. That's
what the variable stopped does for us.
Here's the interpreter. Yup, that's it, you can go now. Naw, let's look
at it more closely...
Basically, it's a WHILE-loop that never gets out. That's in green here.
The two black lines are the interpreter code, set the next line number,
then the statement method does the work.
That's when the current line number is non-zero (red). We ask what the
next line number is, and if there isn't any, something is wrong. The user
should put an END statement at the end of their program.
Otherwise make the new number current and go do it.
In the command mode, we get an input line and pass it directly to the storage
class. If it had a line number, the line is inserted into the program code
in the proper place and that line number is returned, so we just go get
another line. If it had no line number, then we drop into the statement
executor using line number zero as current.
The rest of the interpreter is much easier to understand as a grammar.
Look in the back of your Java or C++ reference books, and they give you
a syntax diagram (sometimes known as "railroad normal form" because the
diagrams look like train tracks)...
Here is the first half of the TinyBasic grammar. I'll try to explain what
it means.
The top line, a name (in lower case) then a right arrow, followed by some
other stuff, and ending in a semicolon is called a "production" or "replacement
rule" because the name can be replaced by everything to the right of the
arrow. The red star means "any number of" whatever is inside the parentheses
(or the single name, if no parentheses are used). A TinyBasic program consists
of any number of lines (iteration), each consisting of a line number and
then a statement (sequence). That's all it means.
The next rule is more complicated. The multiple arrows mean that a statement
can be any one of those lines, but only one of them at a time (conditional).
You can have different statements on successive lines, but an assignment
statement starts with the keyword "LET" and ends with an expression
(whatever that is, we'll get to it in a minute), and that's the end of
the whole statement.
After the keyword there must be exactly one variable name and exactly
one equal sign before the expression.
The next rule is even more complicated, but still very understandable.
If the keyword at the front is the letters "I" and "F", it must be followed
by an expression, then exactly one of the three comparison operators, then
another expression, then the keyword "THEN" followed by -- what's
that? -- another statement. This rule is recursive. You know what that
means: it calls itself (like a subroutine). The other productions on this
page are pretty self-explanatory.
Here's the rest of the TinyBasic grammar. Now you have seen the whole compiler.
The key insight to take away from this talk is that a grammar is a computer
program, just a different programming language, with all six components
of a programming language. My Java compiler is written in a grammar like
this, only a lot longer and more complicated, because it also specifies
exactly what machine code to generate for each production (output). The
compiler is compiled by a compiler compiler, which is also written in a
grammar, "and so ad infinitem."
But we're not going to use the compiler compiler today, we will translate
the grammar directly into Java code by hand.
Remember, it's a program. The rule names are method names. That's the blue
here. Let's look at one of them...
The method name in this case is "factor." The red arrows and the
red vertical bar are alternatives, whichever one works. Each method is
declared boolean, so it can return false if the rule
doesn't fit the code it's compiling. The green parentheses are literal
characters in the program text.
This rule -- I mean method -- will ask the method named "number"
if it sees a number in the program line. If it does, the value of that
number will be pushed on the stack and the method returns true,
and since that's the end of the alternative, the factor method
also returns true with the input scanner positioned just past
the number it accepted. If it's not a number, then the number
method returns false and doesn't touch the source line. This is
called a one-symbol lookahead.
If it's not a number, maybe it's a variable. Same thing. Or maybe there's
a left parenthesis there, followed by an expression and then a
right parenthesis. If none of those happened, and if no input was eaten
in the process, then
factor returns false and the next
rule can be tried.
That's what the grammar means. Now let's see what the same thing looks
like in Java...
Here it is! I kept the same coloring scheme so you can see, the method
names are in blue, the literal character tests are in green, and the alternative
(conditional) testing is in red.
Here is another production, next in the grammar. We're looking for a digit,
followed by any number (possibly zero) more digits. That's what the red
star means. The difference is that we need to do some semantics, some extra
code to push that number onto the stack. In the grammar I write for my
compiler compiler, all that is encoded in the grammar, but it's too complicated
to explain in the time we have left. It's easier to show you hand-written
code that does it...
...Here. If the next character in the program line is not a decimal digit,
leave it there and return false, somebody else's problem, in this
case whatever method called number(). If it is a digit (number),
we capture that first digit, then keep trying to see more digits and include
them in the growing value. When we stop seeing digits, the accumulated
number is pushed onto the value stack and return true.
Now let's go back to the statement rule. Notice that every alternative
starts with a keyword, so I defined a program storage method to return
a statement keyword as a number, which I can use in a switch statement...
...Like this. Once we accept the keyword, anything outside the grammar
rule is an error. Real compilers put a lot of effort into sensible error
messages, but they are not always successful. Neither am I, here.
Now we can do the most complicated statement rule. We need an expression,
then one of three comparison operators, which we save as a signed number,
then the second value and a second keyword -- each time reporting any deviation
as an error.
Here's the code so far. Everything we can know about this conditional statement
has been processed, except whether to actually do the statement on the
rest of the line. If so, it will return true or false, and that is the
result of this statement. I repeated this three times, once for each of
the compare operators. Maybe there's a simpler or more elegant way to do
this, but not that I could think of so close to midnight. If the compare
fails, then it falls through to the next line, which simply returns true
without even looking at the rest of the line.
I think we can leave the rest of this code as an exercise to the student.
If I went too fast for you, here's a link to the entire talk and the slides,
and the source code.
TinyBasic is what is called an "LL(1) Grammar" which means "Left-to-right
scan, Left-recursive, 1-symbol look-ahead." Java and C are LR(1) (right-recursive)
languages, which gives you a little more flexibility in what you can write
in the language, but the parser is so complicated, only computer-generated
compilers like YACC can do it. My TAG
compiler is even more general, but in a somewhat undisciplined way, which
means you can write the whole compiler -- including the code generator
and optimizer, even an entire Turing Machine -- in my grammar. YACC
can't do that.
The first Fortran compiler came out in the early 1950s, but it wasn't
until Chomsky 's theory of universal grammar in the 1960s that we had a
theoretical basis for computer languages and how to understand them in
computers. I think that's where neural nets are today, like compilers in
the 50s: we still do not have a good mathematical theory for how they can
and do work, just a bunch of ad-hockery, "art not science." The psychologists
were wrong about language before Chomsky, but the same thinking is going
into neural nets. It remains to be seen what a modern Chomsky-like figure
might do to them. But that's my opinion only. Maybe one of you will overturn
the cart the way Chomsky did.
Tom Pittman
Rev. 2017 July 10