The Go Programming Language


All viable computer programming languages are (virtual) Universal Turing Machines (UTM), that is, anything that can be computed on any computer can be programmed in any UTM programming language -- including (but not limited to) interpreting or compiling any program written in any other programming language. The only difference is efficiency and convenience.

So when somebody tells me of this or that new programming language, I want to know what it's supposed to be better at, so I can debunk the claim(s). If the only claim to fame is performance, then it's hogwash. Any programming language that compiles to machine code runs on any machine it compiles to at about the same speed; the differences are generally in library code, which can be rewritten in the language in question to eliminate them (library programmers generally aim for universality rather than raw performance). C and Java both compile to native machine code in most implementations (Java compiles in two steps, the second happening at runtime when the program loads, so it costs a little more when the program is starting up, but not thereafter). On the other hand, Java makes more efficient use of programmer time than C, because it catches more common programmer errors at compile time, so it is generally the faster language in wall-clock time, start to finish, unless you are running thousands of separate production runs.

So here comes this email, some fellow telling me that his son is writing a big complex program in Go and asking if I had heard of it. No, I had not. Three of the four links he offered me were broken, and the fourth was rather weak on useful information. It did claim that Go makes multiple threads feasible because of its "segmented stack." But multiple threads improve performance only up to the number of independent processor cores in the computer; because process context switching is the second most expensive single thing a computer can do (behind I/O), as soon as you have more runnable threads than cores to run them, the whole program slows down in a mode called "thrashing."

The segmented stack is an interesting feature, not of the language, but of the runtime environment supporting the language. Before Apple engineered the joint Apple+IBM+Motorola effort to create the PowerPC CPU, IBM (which invented the architecture and was already using it for the fastest minicomputer then known) used a segmented stack for their subroutine calls. This worked for all languages running on that computer, notably C and Fortran. If they had implemented Java with a JIT ("Just-In-Time," the second step in the 2-step compile of most Java programs today) or native-code compiler (like mine), it probably would have had segmented stacks too.

A segmented stack has (minor) performance problems because memory must be allocated for every subroutine call, then released when it returns. Some of that cost can be mitigated by pre-allocating a bunch of fixed-size stack segments, then taking one from the empty segment list when needed and returning it to the list when done, but that only works for subroutines that do not exceed the fixed size in local storage requirements (larger stack segments can be allocated using the normal memory allocation process, or else multiple segments can be aggregated using indirect pointers, which costs slightly more when they are used). The compiler of any programming language can be adjusted to compile for segmented stacks, so it is not a unique feature of Go. Most computer operating systems pre-allocate a huge stack for every thread, so the cost of subroutine entry and exit (not counting parameter passing) is limited to a dozen instructions or less (five in the computer system I am most familiar with), but if the thread does not need that whole stack, the memory address space is wasted, possibly used up very quickly. Perhaps that is one of the attractive features of the 64-bit address space in all modern desktop machines: you can have millions of million-byte stacks, and lose performance only to swap time when the stack grows beyond current memory usage -- but you have that problem anyway. Segmented stacks basically are a cute idea whose time has passed; indeed, the Go runtime itself abandoned them in Go 1.3 in favor of small contiguous stacks that are copied to larger ones as they grow, precisely because of the cost of repeatedly allocating and releasing segments in hot call paths.
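To make the free-list idea concrete, here is a minimal sketch in Go itself (purely illustrative: a real segmented stack lives inside the compiler and runtime, not in user code, and every name here is my own invention). A buffered channel serves as the empty segment list; taking a segment stands in for subroutine entry, returning it for subroutine exit:

package main

import "fmt"

const segmentSize = 4096 // fixed size of each pre-allocated segment, in bytes

// pool is the "empty segment list": a buffered channel holding segments
// that are not currently in use.
type pool struct {
    free chan []byte
}

// newPool pre-allocates n fixed-size segments up front.
func newPool(n int) *pool {
    p := &pool{free: make(chan []byte, n)}
    for i := 0; i < n; i++ {
        p.free <- make([]byte, segmentSize)
    }
    return p
}

// get takes a segment from the free list, falling back to the normal
// memory allocator when the list is empty (the oversized-frame case
// described above would take a slower path like this too).
func (p *pool) get() []byte {
    select {
    case seg := <-p.free:
        return seg
    default:
        return make([]byte, segmentSize)
    }
}

// put returns a segment to the free list, or abandons it to the garbage
// collector if the list is already full.
func (p *pool) put(seg []byte) {
    select {
    case p.free <- seg:
    default:
    }
}

func main() {
    p := newPool(8)
    seg := p.get() // "subroutine entry": grab a segment
    fmt.Println("using a segment of", len(seg), "bytes")
    p.put(seg) // "subroutine exit": give it back
}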

The other claimed benefit of the Go programming language is the so-called "goroutine," a name that seems to imply that a subroutine can be called as a separate thread, with the implied (but not clearly stated) prevention of deadlock. Deadlock happens when separate processes in a computer are allowed to claim control over multiple shared resources, and two or more processes want control over the same set of resources but request it in a different order. It is easily prevented by denying that kind of control: if each thread is event-driven, there is no need for shared resources at all. Java on my computer does that. I suspect that is also what Go does, but I found no proof of it.
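To make the ordering problem concrete, here is a minimal sketch in Go (my own illustration, not taken from any Go documentation): two goroutines each need the same two mutexes, and the discipline of always taking them in one fixed global order is exactly the kind of denial that makes a circular wait impossible:

package main

import (
    "fmt"
    "sync"
)

var a, b sync.Mutex // two shared resources

// transfer needs both resources. It always locks a before b; if some other
// goroutine locked b before a, each could end up holding one mutex while
// waiting forever for the other -- the classic deadlock.
func transfer(id int, wg *sync.WaitGroup) {
    defer wg.Done()
    a.Lock() // always first...
    defer a.Unlock()
    b.Lock() // ...always second, so no circular wait can form
    defer b.Unlock()
    fmt.Println("goroutine", id, "holds both resources")
}

func main() {
    var wg sync.WaitGroup
    for id := 1; id <= 2; id++ {
        wg.Add(1)
        go transfer(id, &wg)
    }
    wg.Wait()
}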

Instead I found a "tutorial" purporting to explain how goroutines work. It offered this example program and its output, followed by a curious confession:

package main

import (
    "fmt"
    "time"
)

func simulateEvent(name string, timeInSecs int64) {
    // sleep for a while to simulate time consumed by the event
    fmt.Println("Started ", name, ": Should take", timeInSecs, "seconds.")
    time.Sleep(time.Duration(timeInSecs) * time.Second)
    fmt.Println("Finished ", name)
}

func main() {
    go simulateEvent("100m sprint", 10) // start 100m sprint, it should take 10 seconds
    go simulateEvent("Long jump", 6)    // start long jump, it should take 6 seconds
    go simulateEvent("High jump", 3)    // start high jump, it should take 3 seconds

    // so that the program doesn't exit before the goroutines finish, we make it wait here for a while
    time.Sleep(12 * time.Second)
}

Started Long jump : Should take 6 seconds.
Started 100m sprint : Should take 10 seconds.
Started High jump : Should take 3 seconds.
Finished High jump
Finished Long jump
Finished 100m sprint

There is one weird result in the above output. We have invoked the 100m sprint first, but the print out first starts with the long jump. I ran it repeatedly and got the same result. I don't exactly know why it is so...

It doesn't take much hard thought to figure out what happened here. The author of this blog ran his program on a two-core computer, where the main program started running on the first core, and the second core was available for the first goroutine call to simulateEvent, which immediately calls the library routine Println to display its startup line. Println does its formatting into a single text string, which it passes -- almost certainly by a message sent to an existing goroutine -- to a process that controls the console output. That is a separate process, but it has been blocked (waiting for another line to print) and must be awakened -- apparently a 2-step situation -- so the main thread is blocked to free up a core for it. By then (context switching is very slow) the second goroutine call has been started (displacing the first goroutine), and it also has its text line ready to send to the console output process (which is now ready to accept whatever comes), so that is what gets printed first. The second goroutine then blocks in its delay while the console process is still working on its output (which, as I said, is even slower than context switching), so the first goroutine, now re-awakened in that second core, queues its first print line and then blocks in its own delay. The main process resumes in the second core to start the third goroutine (and then to do its own delay block), after which there are at most two active processes to run at a time (whichever goroutine just woke up, plus the console output from time to time).
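The mechanism I am hypothesizing here -- each Println formatting its text and handing it off to a single goroutine that owns the console -- would look something like the following sketch. This is my guess at the structure, not the actual implementation of Go's fmt package, and all the names are invented; but it does show how the print order is decided by whichever goroutine reaches the channel first, not by which one was started first:

package main

import (
    "fmt"
    "sync"
)

// console owns the output device: it prints whatever arrives on the
// channel, one complete line at a time, so the slow device I/O all
// happens in exactly one place.
func console(lines <-chan string, done chan<- struct{}) {
    for line := range lines {
        fmt.Println(line)
    }
    done <- struct{}{}
}

func main() {
    lines := make(chan string)
    done := make(chan struct{})
    go console(lines, done)

    var wg sync.WaitGroup
    for _, name := range []string{"100m sprint", "Long jump", "High jump"} {
        wg.Add(1)
        go func(n string) {
            defer wg.Done()
            // Whichever goroutine reaches the channel first gets printed
            // first, regardless of which one was started first.
            lines <- "Started " + n
        }(name)
    }
    wg.Wait()
    close(lines)
    <-done // wait for the console goroutine to drain and exit
}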

The point is, it is the cost of thread startup, plus the cost of all these unnecessary forced context switches, that gives rise to this "weird" output sequence. These are penalties of the way Go handles its goroutines. The time might be less than the cost of the same kinds of startup and context switching in Java or C, but (as in this example) Go encourages the prodigal use of more separate processes than there are cores, with the (false) claim that this is easier and faster than in conventional languages like C and Java. If you don't thoughtfully match the processes to the available processor cores, you will have that same problem in any language you care to use, but the conventional language vendors are not deceiving you about what the overrun costs you.
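For contrast, here is the discipline I am recommending, expressed in Go's own terms: a fixed pool of workers sized to the core count, fed work items over a channel, instead of one goroutine per item (renderChunk is a hypothetical stand-in for any CPU-bound unit of work):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// renderChunk is a hypothetical stand-in for any CPU-bound unit of work.
func renderChunk(id int) int {
    sum := 0
    for i := 0; i < 10000000; i++ {
        sum += i ^ id
    }
    return sum
}

func main() {
    workers := runtime.NumCPU() // one worker per core, and no more
    jobs := make(chan int)
    var wg sync.WaitGroup

    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for id := range jobs {
                renderChunk(id) // each core stays busy without thrashing
            }
        }()
    }

    // Thousands of work items, but never more than 'workers' of them
    // are running at any one time.
    for id := 0; id < 1000; id++ {
        jobs <- id
    }
    close(jobs)
    wg.Wait()
    fmt.Println("rendered 1000 chunks on", workers, "cores")
}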

My friend's son is said to be doing a "3D rendering engine." There are already tools for converting C and Java into code suitable for a graphics processor (GPU) which has a zillion separate processing cores, but they require you to use a GPU-ready subset of the language, so for Go to be interesting, he must be doing this to run on his main computer, which probably has no more than eight (or on the biggest desktop tower, at most 32) cores. 3D rendering is an interesting project for Go only if he is sending every object and facet of an object off to its own goroutine to be rendered, which in a moderately complex image could be thousands of separate threads competing for those eight cores, and using up large amounts of time to switch off between them. I think -- and I suggested -- that he should do that, then rewrite the same algorithm in Java, taking care to spin off no more separate threads than he has cores, and taking care to avoid unnecessary object creation and destruction (thereby avoiding the cost of memory allocation and especially the cost of triggering garbage collection, both limitations in Java speed), and I claim that he will find the resulting Java program actually runs faster than his Go version. But somebody needs to go through the exercise to know for sure. A well-written paper debunking the claims made for Go -- especially as compared to Java -- would certainly earn a place in some peer-reviewed journal and look good on the kid's college application. If he wants to.

Tom Pittman
Revised 2019 February 5