How To Write A Tiny Basic Interpreter

Title

This is a quick introduction in how to write a compiler. If you major in Computer Science in college, you might take Compiler Construction as a whole semester. More likely, you will have to get an advanced degree. I mostly taught this course in grad school. You can read my book (The Art of Compiler Design Prentice Hall, ISBN 0-13-048190-4) and download the compiler source code.

Why?

So why am I talking about "A Tiny Basic Interpreter"?

Why Tiny?

First of all, why "Tiny"? That should be easy enough: It's so you will still be here when I finish.

Why Basic?

Then, why "Basic"? Fortran was the first real programming language, invented in the early 1950s at IBM to program their new 704 super-computer. Except for Lisp and the Europeans, every high-level programming language is an adaptation of Fortran. C (and with it, Java) was one of those adaptations, Basic was another. Kemeny & Kurtz invented it so they could teach programming to new students.

= Beginner's ..

The acronym came much later. The language was small enough that Bill Gates -- then still in college back east somewhere -- saw he could do an implementation on the first commercial personal computer, the Altair 8800. MITS (the NewMexico company selling both the Altair and Basic) wanted $150 for a tiny paper tape of software to run on a $400 computer. People out in California were incensed. Dan Sokol carefully examined the whole program and found no copyright notice in it (which placed it in the public domain under the Copyright law then in effect), so he gave a copy to anybody in the HomeBrew Computer Club willing to bring back two more. Gates was furious, but it was legal. He never made that mistake again. The copyright law was also changed the next year so that failure to include notice no longer invalidates the copyright. Actually, all those pirate copies made Bill Gates the richest man in the world, but that's another story.

Why Interpreter?

So, why "Interpreter"? Basically, a compiler and an interpreter build on the same mathematical theory, but an interpreter is easier to write. Generating code to run later is a lot more work than just doing it on the fly. Not overly difficult, just a lot of work.

Seq,Asg,Cond,Iter,Subr,I/O

Now let's talk about the Tiny Basic language. Every adequate computer programming language does six things. They are spelled differently in different languages, but if you can do these six things, you can program anything -- even a compiler or interpreter. I know, because I've done it. It helps if you have enough memory to hold your program and data.

With Listing

Here is a fragment of a small TinyBasic program. Notice how these lines are all numbered. Fortran had optional line numbers, but they are obligatory in the original Basic, and also in TinyBasic. You notice I numbered these lines by tens. That way, if I need to insert some additional code, I just use a line number in between. So let's look at how we do these six things in TinyBasic...

Line #s

First is Sequence. These lines are not in numerical order. Program statements in TinyBasic (like Kemeny & Kurtz's original Basic) are sequenced by line number. You can write your program in any order, but it executes in line number order. The line numbers are analogous to the memory addresses in a real computer. Programmers were still thinking like a computer back then.

+ GoTos

Oh yes, also GoTo statements. You don't see these in programming languages any more, because they can make your code impossible to understand. GoTos are still there in languages like Java, but they are somewhat crippled and spelled "BREAK" (same thing). Structured programming was invented after we learned how unhealthy GoTos are. But they are really easy to program, one size fits all. Structured programming isn't that much more complicated to implement, but it takes harder thinking (meaning: better quality code, fewer mistakes) to write code that does the same thing without GoTos.

LET

Next in the list of six is Assignment. You need to be able to compute values and assign those values to variables. In TinyBasic we do that with the LET statement.

TinyBasic is really tiny, there are only 26 variables, all integer, named with the 26 letters of the alphabet. No declaration is needed, they already exist. The original Basic had a letter and a digit, 260 variables, and they took "numbers" (meaning floating point). TinyBasic is tiny. Floating point did not exist in those early computers.

We have only the four (ahem) basic arithmetic operators, add, subtract, multiply, and divide, plus parentheses. That's all. You can do a lot with that.

IF..THEN

There is only one conditional in TinyBasic, IF-THEN, with three comparisons: less, equal, or greater. Nothing more. You might ask how you do IF-THEN-ELSE? It's easy...

`IF..THEN` Color-coded

There, I color-coded it: red is your THEN-part, blue is the ELSE-part, and green is where they come back together.You can do anything you want with that. You can even slice your wrists, it just takes more code, easier to make mistakes, harder to debug.

Jump Back

Iteration is done with -- guess what? -- GoTo! Just jump back to a lower line number. That's how the computer hardware does it. Enough said.

GoSub+Rtn

You can do subroutines with Gotos also, but TinyBasic is a little more modern than that, probably because subroutines were already invented on the 704. "Subroutine" is the old name for what we call "method" now, but it's the same thing. So we have GOSUB and RETURN to do that. There are no parameters, no function results, you use variables for that. This is Tiny Basic.

I/O

Finally you have your basic I/O, PRINT and INPUT. INPUT accepts a single number from the current input line to be assigned to a single variable, or if there is no input left, prints a question mark so the user can type in some more numbers. PRINT prints any number of quoted text strings and numbers on the output line. I won't waste your time on the mechanics of I/O today. You can go to my website and read the user manual and more than you ever wanted to know about it. There are even links to source code in C and other modern languages, if you care.

Class Hierarchy

Now that you understand the language, we can start talking about the implementation. OOPS being the technology fad of the day, we will implement this as a single TBinterpreter class, supported by classes TBio, TBprogStore, TBstack and TBvarStore to do the obvious things. Well, maybe not so obvious...

TBio Methods

The I/O class has a method to accept one whole line of input typed in by the user. That line might be a numbered program line, or an unnumbered command line, or else some input for the INPUT statement.

If it's input, we can ask for the value of the next number in the input line.

If the program is generating output, it does it a little at a time, so there are separate methods for text, numbers, and tabs between then, then finally something to print the whole line when it's complete. Or the class could print on the fly, and just output a line end when it's done. That's an implementation detail for the convenience of whatever library calls your interpreter is using.

TBprogStore Methods

The program storage class accepts numbered lines and returns the line number, or else an unnumbered line which it stores as line #0 (and returns that zero). This way the interpreter can execute command lines using the same code as line-numbered program code. The interpreter tells it what line to do next (it could be zero), and then there are methods for extracting from that line a command number, and then the program code one character at a time.

The interpreter does not know the line number sequence, so there's one more method for getting the next line number after any given line number. If you give it zero, it will return the first line number in the program. That does not change the current line number, you do that explicitly using its own method.

TBstack Methods

The stack class is pretty simple, push or pop a value.

TBvarStore Methods

The variable storage class is also pretty simple, you give it a variable name (a single letter), and then either ask it to store a number there, or else return whatever number is already stored in that variable.

TBinterpreter Methods

The main interpreter class is where all the heavy lifting happens. We will look more closely at some of the methods.

Instance vars + `error()`

Programmers make errors -- not you of course, but your users -- so we need to be able to tell them what went wrong and stop the program. We do not stop the interpreter, we just drop out of running and back into command-line mode. If the error happens inside some deeply nested expression, it might appear that there are other errors, but we report only the first. That's what the variable stopped does for us.

`void interpret()`

Here's the interpreter. Yup, that's it, you can go now. Naw, let's look at it more closely...

Interpreter outer `while`

Basically, it's a WHILE-loop that never gets out. That's in green here. The two black lines are the interpreter code, set the next line number, then the statement method does the work.

Interpreter `getNextLineNum(doing);`

That's when the current line number is non-zero (red). We ask what the next line number is, and if there isn't any, something is wrong. The user should put an END statement at the end of their program. Otherwise make the new number current and go do it.

Load Program Code

In the command mode, we get an input line and pass it directly to the storage class. If it had a line number, the line is inserted into the program code in the proper place and that line number is returned, so we just go get another line. If it had no line number, then we drop into the statement executor using line number zero as current.

The rest of the interpreter is much easier to understand as a grammar. Look in the back of your Java or C++ reference books, and they give you a syntax diagram (sometimes known as "railroad normal form" because the diagrams look like train tracks)...

TB Grammar, 1st half

Here is the first half of the TinyBasic grammar. I'll try to explain what it means.

Grammar Sequence & Iteration

The top line, a name (in lower case) then a right arrow, followed by some other stuff, and ending in a semicolon is called a "production" or "replacement rule" because the name can be replaced by everything to the right of the arrow. The red star means "any number of" whatever is inside the parentheses (or the single name, if no parentheses are used). A TinyBasic program consists of any number of lines (iteration), each consisting of a line number and then a statement (sequence). That's all it means.

Grammar Literals & Conditionals

The next rule is more complicated. The multiple arrows mean that a statement can be any one of those lines, but only one of them at a time (conditional). You can have different statements on successive lines, but an assignment statement starts with the keyword "LET" and ends with an expression (whatever that is, we'll get to it in a minute), and that's the end of the whole statement.

After the keyword there must be exactly one variable name and exactly one equal sign before the expression.

Grammar Recursion

The next rule is even more complicated, but still very understandable. If the keyword at the front is the letters "I" and "F", it must be followed by an expression, then exactly one of the three comparison operators, then another expression, then the keyword "THEN" followed by -- what's that? -- another statement. This rule is recursive. You know what that means: it calls itself (like a subroutine). The other productions on this page are pretty self-explanatory.

TB Grammar 2nd half

Here's the rest of the TinyBasic grammar. Now you have seen the whole compiler.

The key insight to take away from this talk is that a grammar is a computer program, just a different programming language, with all six components of a programming language. My Java compiler is written in a grammar like this, only a lot longer and more complicated, because it also specifies exactly what machine code to generate for each production (output). The compiler is compiled by a compiler compiler, which is also written in a grammar, "and so ad infinitem."

But we're not going to use the compiler compiler today, we will translate the grammar directly into Java code by hand.

Grammar Nonterminal = Method

Remember, it's a program. The rule names are method names. That's the blue here. Let's look at one of them...

Grammar `factor`

The method name in this case is "factor." The red arrows and the red vertical bar are alternatives, whichever one works. Each method is declared boolean, so it can return false if the rule doesn't fit the code it's compiling. The green parentheses are literal characters in the program text.

This rule -- I mean method -- will ask the method named "number" if it sees a number in the program line. If it does, the value of that number will be pushed on the stack and the method returns true, and since that's the end of the alternative, the factor method also returns true with the input scanner positioned just past the number it accepted. If it's not a number, then the number method returns false and doesn't touch the source line. This is called a one-symbol lookahead.

If it's not a number, maybe it's a variable. Same thing. Or maybe there's a left parenthesis there, followed by an expression and then a right parenthesis. If none of those happened, and if no input was eaten in the process, then factor returns false and the next rule can be tried.

That's what the grammar means. Now let's see what the same thing looks like in Java...

Java `factor()`

Here it is! I kept the same coloring scheme so you can see, the method names are in blue, the literal character tests are in green, and the alternative (conditional) testing is in red.

Grammar `number`

Here is another production, next in the grammar. We're looking for a digit, followed by any number (possibly zero) more digits. That's what the red star means. The difference is that we need to do some semantics, some extra code to push that number onto the stack. In the grammar I write for my compiler compiler, all that is encoded in the grammar, but it's too complicated to explain in the time we have left. It's easier to show you hand-written code that does it...

Java `number()`

...Here. If the next character in the program line is not a decimal digit, leave it there and return false, somebody else's problem, in this case whatever method called number(). If it is a digit (number), we capture that first digit, then keep trying to see more digits and include them in the growing value. When we stop seeing digits, the accumulated number is pushed onto the value stack and return true.

Grammar `statement LET`

Now let's go back to the statement rule. Notice that every alternative starts with a keyword, so I defined a program storage method to return a statement keyword as a number, which I can use in a switch statement...

Java `statement() cm_let`

...Like this. Once we accept the keyword, anything outside the grammar rule is an error. Real compilers put a lot of effort into sensible error messages, but they are not always successful. Neither am I, here.

Grammar `statement IF`

Now we can do the most complicated statement rule. We need an expression, then one of three comparison operators, which we save as a signed number, then the second value and a second keyword -- each time reporting any deviation as an error.

Java `statement() cm_if`

Here's the code so far. Everything we can know about this conditional statement has been processed, except whether to actually do the statement on the rest of the line. If so, it will return true or false, and that is the result of this statement. I repeated this three times, once for each of the compare operators. Maybe there's a simpler or more elegant way to do this, but not that I could think of so close to midnight. If the compare fails, then it falls through to the next line, which simply returns true without even looking at the rest of the line.

I think we can leave the rest of this code as an exercise to the student.

Done

If I went too fast for you, here's a link to the entire talk and the slides, and the source code.

TinyBasic is what is called an "LL(1) Grammar" which means "Left-to-right scan, Left-recursive, 1-symbol look-ahead." Java and C are LR(1) (right-recursive) languages, which gives you a little more flexibility in what you can write in the language, but the parser is so complicated, only computer-generated compilers like YACC can do it. My TAG compiler is even more general, but in a somewhat undisciplined way, which means you can write the whole compiler -- including the code generator and optimizer, even an entire Turing Machine -- in my grammar. YACC can't do that.

The first Fortran compiler came out in the early 1950s, but it wasn't until Chomsky 's theory of universal grammar in the 1960s that we had a theoretical basis for computer languages and how to understand them in computers. I think that's where neural nets are today, like compilers in the 50s: we still do not have a good mathematical theory for how they can and do work, just a bunch of ad-hockery, "art not science." The psychologists were wrong about language before Chomsky, but the same thinking is going into neural nets. It remains to be seen what a modern Chomsky-like figure might do to them. But that's my opinion only. Maybe one of you will overturn the cart the way Chomsky did.

Tom Pittman
Rev. 2017 July 10