Introduction to Interpreters & Compilers


If you just want to use m JavaScript interpreter, see Using the JavaScript Debug Interpreter.

For a deep theoretical understanding of how to program interpreters and compilers, you should read my book (with James Peters), The Art of Compiler Design (Prentice Hall, ISBN 0-13-048190-4). There are additional links here. This introduction will touch on a few aspects, as relevant to the JavaScript interpreter written in JavaScript. For line numbers see the file JSdoingJS.html, where the interpreter interprets itself, with line numbers in comments every eight lines.

A compiler is a computer program that translates computer programs from a high-level programming language like Java or C++ to the ones and zeros of machine language, or else to another programming language. The output from a compiler is another computer program that does the same thing. An interpreter is like a compiler in construction, and does many of the same things, but instead of producing a program to do the same thing, it just does it. Consider this simple program, in English-like HyperCard:

put 123 into myVariable
put 7 into ink
repeat 5 times
  add ink to myVariable
  add 1 to ink
end repeat
divide myVariable by 9
We could compile this program snippet into JavaScript (or C) and the result would look like this:
myVariable = 123;
ink = 7;
for (i=0; i<5; i++)
  myVariable += ink++;
myVariable = myVariable/9;
An interpreter, on the other hand would create a variable "myVariable" somewhere in its memory, set that variable to 123, create another variable "ink" with the value 7, then repeatedly add the value of variable "ink" to variable "myVariable" each time incrementing ink, then finally divide myVariable by 9 and put the quotient back into myVariable. There is no translation going on, the interpreter knows how to do each step and does it immediately.

Both interpreters and compilers need to correctly understand what the source program means (was intended to do, assuming there are no bugs), but the compiler then says the same thing in another language, while the interpreter does that action immediately. Compilers often try to produce an efficient output program, as for example using the auto-post-increment on the variable ink above, but interpreters often don't worry much about efficiency. Both interpreters and compilers are able to detect when a source program fails to conform to the rules of its language, and issue error messages in those cases.
 

Grammar

A programming language is defined by a grammar which spells out (like the English grammar) what is correct form and how things can be said in that language. Most programming languages have a grammar that looks somewhat like this (which defines part of the subset of JavaScript my interpreter accepts):
prog -> func*
func -> "function" ident "(" params ")" blok
blok -> "{" stmt* "}"
stmt -> blok
     -> "if" "(" expn ")" stmt ( "else" stmt )?
     -> "while" "(" expn ")" stmt
     -> expn ";"
     -> ";"
expn -> (lexp "=")? cmpr
cmpr -> smpl (("==" | "!=" | ">" | "<") smpl )?
smpl -> term (("+" | "-") term )*
term -> fact (("*" | "/") fact )*
fact -> "(" expn ")"
     -> "-" fact
     -> lexp | num | quote
     -> ident "(" args ")"
lexp -> ident ( "[" expn "]" )?
This looks a little like a programming language itself -- and in my book it is! -- but the formality helps us make fewer mistakes in defining how the compiler or interpreter must work. In this partial grqammar, a program consists of zero or more functions (the star means any number of whatever it follows; the question mark means zero or one, but not more). A function starts with the keyword "function" then a name (identifier, not specified in this grammar fragment), then a left parenthesis, some parameters (again not specified: in a full grammar, everything is spelled out), a right parenthesis, then a block of statements enclosed in braces. Every item in quotes is required exactly as spelled.

What is less obvious in this grammar is that the operator precedence is also fully spelled out. A simple expression is a sequence of one or more terms separated by plusses and minuses; the terms are each a sequence of one or more factors separated by multiply and divide operators. Thus multiplication and division takes precedence over (is done first, before) addition and subtraction.

Furthermore, because a single line in the grammar contains both the left and right parentheses in each case, a correct program cannot mismatch them, they are both required when that grammar line is activated.

With each line of this grammar, a compiler would associate the appropriate output code to generate for that construct. Conversely, with the same grammar, an interpreter would spell out what actions to take. The grammar defines the language, not what is to be done for it. In my book, I extended the grammar to specify the actions also.

The easy way to convert a grammar into a program for compiling or interpreting the language is to make each nonterminal (the left side of an arroe in a grammar rule) into a function. Stars turn into while loops; question marks, multiple arrows, and vertical alternation bars turn into if statements. Quoted items test the source code for that exact string of characters, and fail if it doesn't match. Sometimes failure is OK if there is another alternation to test; otherwise (for example, a missing right parenthesis) results in an error if it fails.

Notice that the grammar is recursive. A block can have any number of statements inside its braces, and any of those statements might be another block. Most programming languages allow functional recursion. JavaScript does too, but not mutual recursion as we need here (there is a way, but it's different from C). I solved that in my interpreter by making one giant "expn" function for all the recursive rules -- and then ran into the recursion limit. Recursion can be simulated using a stack, and JavaScript has explicit stack push and pop methods, so my interpreter now does the recursion that way.
 

Stack

A stack is like an array, except that you are only allowed to push items onto one end, or pop them off the same end. It enforces a last-in-first-out order of processing the way recursion does. Hardware computers do function calling the same way, using a stack. Whenever a function is called, its parameters and local variables are created on the stack. When another function is called from within the first, the previous local variables are pushed and the new function's local variables are created and made available. When that function returns, its local variables are discarded, and the previous function's variables popped off and become current. My interpreter uses two stacks (InfoStak and ParStack) for language processing, and another (ExpStack) for expression evaluation. There are also arrays for variable names and values, which operate in stack mode through function calls and returns.

Let's look at how a stack enforces operator precedence in expression evaluation. Suppose we have an expression to evaluate:

3+4*5-(6+1) // []
Some of these numbers could be variable references or function calls, but that complication does not change the way things are done. We initially have an empty stack, represented here within the brackets. The interpreter calls expn, which calls each nonterminal in turn down to fact, which recognizes a number, and pushes it onto the stack:
3+4*5-(6+1) // [3] -- expn,cmpr,smpl,term,fact
The scanner looking at the source program is now ready to process the plus (in red). Nonterminal fact has nothing else to do so it returns to term, which is hoping for an optional star or slash, but finding neither, returns to smpl, which is hoping for a plus or minus. It accepts and remembers the plus, then immediately calls term a second time which calls fact to accept and push the four:
3+4*5-(6+1) // [4,3] -- expn,cmpr,smpl,term
Now term, which looks again for a star or slash, finds it and immediately turns around and calls fact for another factor:
3+4*5-(6+1) // [5,4,3] -- expn,cmpr,smpl,term
Back in term, the remembered star reminds it to pop off the top two items from the stack, multiply them together, then push the product:
3+4*5-(6+1) // [20,3] -- expn,cmpr,smpl,term
There is nothing else in the input for term to do anything with, so it returns to smpl, which has a remembered plus to work on. It pops off two numbers, adds them, and pushes the sum. Then it loops back around to see and accept the minus, then calls term, which calls fact, which sees and accepts the left parenthesis and calls expn recursively:
3+4*5-(6+1) // [23] -- expn,cmpr,smpl,term,fact,expn
This continues down to fact to accept and push the 6:
3+4*5-(6+1) // [6,23] -- expn,cmpr,smpl,term,fact,expn,cmpr,smpl
Back out to smpl, the next plus is accepted, which sernds us -- you see where this is going -- all the way back down to fact for the one, which is added in smpl. Now smpl does not know anything about right parentheses, so it returns to cmpr, to expn, and finally back to fact, which does want that right parenthesis, and consumes it before returning back out to smpl to subtract the 7 from the 23, leaving 16 on the stack at the final exit.

Yes, that's a lot of work, but computers are fast. In a compiler, all this running up and down the nonterminal chain happens at compile time. Instead of pushing values onto some stack, it generates code to load them into a register, so all we are left with is a few loads and adds and a multiply. But we are doing an interpreter, which is slower. The names are different -- JavaScript has a lot more precedence levels, so I numbered them, p2 through p11. In the most detailed logging mode (=7) you can see them grow and shrink on the InfoStak in the trace.
 

FSM

A finite-state machine (FSM) is another important computer technology (described extensively in my book) for keeping simple what otherwise might be a very complex processing agenda. A FSM is a very simple computer, which has a fixed number of internal states. It looks at a linear string of input (called for historical reasons a "tape") one symbol at a time, then decides to accept ("read") that symbol or not, and chooses another (possibly) state to jump to based on its current state and that input symbol. A traffic light controller is typically a FSM, where the input "tape" consists of sensors under each lane in the intersection (plus an internal timer), and the states consist in just these four:
0 Main Green + Side Red
1 Main Yellow + Side Red
2 Main Red + Side Green
3 Main Red + Side Yellow
The default state is 0. Every second the input is examined for cars waiting on the side street, and if so (and no cars on the main street), it goes to state 1 and starts the 5-second timer. When that time runs out, it goes to state 2. When there are no cars on the side street, it goes to state 3 and starts the timer, then advances to state 0 when it runs out. The timer is often used also in states 0 and 2, but set to 30 seconds (or some such), so that even if cars continue to arrive on the side street sensor, if there are cars on the main street and the time has run out, it goes to the next state anyway, and vice-versa. Each state looks only at its own state number and the three input signals (main & side sensors, timer expired) to decide when or if to change state.

My interpreter has some eighty states, and its input tape is the source code of  the program being interpreted, which is further simplified by identifying tokens consisting of one name or reserved word or operator or literal. If the state is looking for that token, it accepts it (calls internal function GeToken, line 241) and then advances to some other state. The actions to be taken in each state are selected by a giant switch statement (line 674), which is embedded in a while(true) loop (line 649 through 1781). Nonterminal calls and returns are simulated by breaking out of the switch with a new state number in variable GoNext (or zero to return); continuing the loop using the continue command stays in the same nonterminal but a new state.
 

Jumps

One of the advantages of compiled code is that the compiler can know where the end of the loop is, so when your program executes a break command, it just jumps execution directly there. A linear interpreter like mine here cannot do that, because it has not gotten there yet. The JavaScript interpreter in your browser preprocesses the script, so it can know where the end of each loop and function is. I didn't do that.

So instead, I have a "skipping" mode which suppresses execution, but continues to parse for correct syntax. When you execute an if-statement, if the condition is false execution is turned off, and then back on when it finishes the statement. At the else-part, the execution is toggled, unless it was already off before entering the if. Breaking out of a while-loop similarly turns off execution until the end of the loop body. Because interpretation proceeds left-to-right, we can record the front of loops in a local variable Breakable, so continue can jump directly there. Function calls work similarly. When making a function call, the current interpretation position is recorded, so the function exit or return can come back here. The processing stack and the variable names are also checkpointed, then cut back on function exit.

The switch statement is the most complicated of all, and I did not even attempt it until the rest of the interpreter was working. The switch expression is evaluated, then instead of a jump table (which is how hardware computers and other interpreters do it), I run through a loop processing statements with execution turned off, but looking for case (or default) lines. The first match turns execution back on and depends on break to turn it off at the end of the case. Because continue can refer to an outer loop, I also had to add a separate inLoop variable to retain access to both kinds of jump. Note that these variables are pushed onto  the stack whenever a new loop or switch is started, or a function is called, and then restored at exit. Thus break always breaks out of the innermost loop or switch, and is disabled  (reported as an error) if you try to break out of a function called from within a loop.

The interpreter is pretty complicated, and you should not expect to understand all of it the first day. Much of that complexity is there to facilitate debugging -- both for me, trying to make the interpreter work properly, and subsequently you, trying to make your interpreted scripts work. Take your time studying it: you might even find a bug or two (please tell me).

Enjoy!

Tom Pittman
rev. 2010 December 24