A Turing Test to Defeat ChatGPT


This paper describes a scientific hypothesis and a testable experiment to distinguish the "general intelligence" that the creators of ChatGPT are explicitly aiming for from what ChatGPT can actually produce. Only an expert in the relevant domain can construct such a test. I happen to be a pretty good computer programmer, good enough that I can tell when my students worked together on their assignment -- even when they changed the names of the variables and re-arranged the text. That is something human programmers can do, and something human teachers can detect. I can write a program to do (or detect) that cheat, but that program is very particular, not at all what the developers of ChatGPT tell us they did. I propose to turn that cheat on its head.
 

OpenAI's ChatGPT is getting a lot of ink in the trade press lately, mostly accepting the developers' goal of AGI (Artificial General Intelligence) as credible and imminent, and (often) worrying about it. I am not the only one looking at this dubiously, but I do not yet see anybody else trying a grey-box criticism, that is, attempting to test whether the machine is intelligent based at least partially on some kind of understanding of how it works. Most of them accept the notion that neural nets (NNs) "think" the way people think because the "neurons" of NNs are fundamentally the same as biological (human) neurons. That may or may not be true, but they are not wired up the same, nor do they get the same training.

So when the NNs "create" poetry -- not modern stuff that intentionally has no meaning, but Shakespearean sonnets, which often carry multiple meanings in the same line -- we can ask people who are expert in Shakespearean sonnets whether the machine succeeded. It did not. The developers knew that, so they asked people who were not familiar with the art form, who quickly figured out that all of Shakespeare is on the internet and answered accordingly. Denied that shortcut, they basically guessed randomly. That proves nothing.

Similarly, when the NNs "write" computer programs, are we getting an original creation, or a copy of something in the training data? The developers cannot (or refuse to) tell us. I know enough about the process of creating new software that, given a piece of code, I can tell you a lot about the person who wrote it. All the ChatGPT code I have seen was written by a person.

We have maybe a century (give or take a couple of decades) of computational theory and practice, and based on that experience we can do some analytical thinking. The creators of ChatGPT are not telling us the whole story about how their program works. The OpenAI hiring policy is that only True Believers will even be considered for employment there. What if the whole model is broken? How would they know? They cannot know.

They fed their engine a billion lines of Reddit comments, and then -- Behold! -- it produced credible conversation. They fed it a comparable amount of open-source code (GitHub and the like) and -- Behold! -- it produced credible code. Or did it?

What they tell us about how it works is incomplete. They say that for any string of characters, it picks the most probable next letter. That may produce passwords, but not readable English.

I've been watching and collating the various reports, and it appears to be somewhat more complicated than that.

First they train a recognizer on those billion lines of text, so if you give it a non-word it can tell you "Nope, not a word in my training data." If you give it a sequence of words -- and if its buffer is long enough, they never actually tell us how long it can get -- it can say "Nope, not any sequence of words I ever saw."

Then they train a second NN to generate random sequences of letters, but disallow the non-words and non-sequences by letting the first engine provide the feedback. Pretty soon it only generates sequences that are credible, that is, sequences that were in the training data of the first machine. It also puts a humongous amount of carbon in the air (if you care), because it takes a lot of energy to run a lot of computers to do all that training. OK, I can write a simple algorithmic program to do that same thing on far less carbon.
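To make that claim concrete, here is a minimal algorithmic sketch of the same idea (my illustration, not OpenAI's actual pipeline): record which word was ever seen following which word in some training text, then "generate" by walking only those observed transitions. By construction, everything it emits is stitched together from sequences that occurred in its training data.

import java.util.*;

// Sketch: generate only word sequences whose adjacent pairs appeared in the training text.
public class SeenOnlyGenerator {
  public static void main(String[] args) {
    String training = "we hold these truths to be self evident that all men are created equal";
    String[] words = training.split(" ");

    // Recognizer: for each word, remember every word ever seen following it.
    Map<String, List<String>> follows = new HashMap<>();
    for (int i = 0; i + 1 < words.length; i++)
      follows.computeIfAbsent(words[i], k -> new ArrayList<>()).add(words[i + 1]);

    // Generator: start at the first word and take only transitions the recognizer allows.
    Random rng = new Random();
    String current = words[0];
    StringBuilder out = new StringBuilder(current);
    for (int i = 0; i < 10 && follows.containsKey(current); i++) {
      List<String> options = follows.get(current);
      current = options.get(rng.nextInt(options.size()));
      out.append(' ').append(current);
    }
    System.out.println(out);
  }
}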

But random letters and words don't make working programs. So they add another machine (a simple compiler) to tell the generator whether the program is legal code. So all the random code is disallowed, and the only thing that actually makes it out for the user to see is a sequence of letters and symbols that actually occurred in the first machine's training data -- in other words, a whole working program. Or maybe a buggy program (this actually happened), because all the open-source code on the internet was written by people, and people make mistakes (and don't always catch all of them). We tolerate mistakes like that in English prose; we call it "creative." Taken to the nth degree, you get modern poetry that makes no sense at all.
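For concreteness, here is a sketch of that gatekeeper step (my illustration of the idea, not how ChatGPT is actually wired): write each candidate program to a file, hand it to the system Java compiler, and keep only the candidates that compile cleanly.

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.*;

// Sketch: accept a candidate program only if the real compiler says it is legal code.
public class CompileFilter {
  static boolean compiles(String source) throws Exception {
    Path file = Files.createTempDirectory("cand").resolve("Candidate.java");
    Files.writeString(file, source);
    JavaCompiler javac = ToolProvider.getSystemJavaCompiler(); // needs a JDK, not just a JRE
    return javac.run(null, null, null, file.toString()) == 0;  // exit code 0 means no errors
  }

  public static void main(String[] args) throws Exception {
    System.out.println(compiles("public class Candidate { int x = 1; }"));  // true
    System.out.println(compiles("public class Candidate { int x = ; }"));   // false
  }
}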

This works for any topic you want to test ChatGPT on. Reddit is full of comments on every conceivable subject, and it's all now in the engine. It works for any program you might ask it to write: a dozen different people already wrote that program and uploaded their code to the internet, ChatGPT trained on it, and so ChatGPT knows all those programs. A billion -- I think I read somewhere it was a trillion -- lines of code is beyond human comprehension. But it is not beyond what you can train the engine on, if you are willing to pay for the electricity.

How do we know that the computer didn't actually abstract out the principles of coding and create a new program from scratch? I offer the words of Dr. Bill McKeeman when I tried to propose something as silly as that: "Tom, it's a theorem." We have a mathematical proof that things cannot work that way.

The generator they described to us, one letter at a time, is a Finite-State Machine (FSM), which is the bottom of the four levels of computational complexity (the Chomsky hierarchy). It can generate only code that can be described by Regular Expressions. All modern programming languages use parentheses to group mathematical operations, and parentheses come in pairs that must match and be nested properly. An FSM cannot do that; it requires a Context-Free grammar and a stack machine, the next level up from an FSM. Maybe they have one, but they are not saying so. I don't think they even know they need it.
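To see why a stack matters, here is the textbook way (my sketch, not anything OpenAI has described) to check that parentheses, brackets, and braces match and nest properly: push each opener, pop it when its closer arrives. The stack can grow without bound for deeper and deeper nesting, which is exactly what a fixed, finite set of states cannot simulate.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: a pushdown check that (), [], {} are matched and properly nested.
public class BalanceCheck {
  static boolean balanced(String code) {
    Deque<Character> stack = new ArrayDeque<>();
    for (char c : code.toCharArray()) {
      if (c == '(' || c == '[' || c == '{') stack.push(c);
      else if (c == ')' || c == ']' || c == '}') {
        if (stack.isEmpty()) return false;              // closer with nothing open
        char open = stack.pop();
        if ((c == ')' && open != '(') || (c == ']' && open != '[')
            || (c == '}' && open != '{')) return false; // wrong kind of closer
      }
    }
    return stack.isEmpty();                             // anything left open is an error
  }

  public static void main(String[] args) {
    System.out.println(balanced("if (a[i] > a[j]) { swap(a, i, j); }")); // true
    System.out.println(balanced("for (i=0; i<n; i++) {{ }"));            // false
  }
}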

It gets worse. All useful programs need variables to hold values used in different parts of the code. That requires a Context-Sensitive grammar (the third level up, above Context-Free) and a machine we do not even know how to build. That's why type-checking is done ad hoc in compilers, and not built into the grammar. I know about this stuff; I have a PhD in it. The top level is a Turing Machine (TM), which can do anything that is computable. Computers and most programming languages are TMs, and people probably think that way too. But what they described for us is not a TM. NNs as used in current state-of-the-art AI technology are not TMs.
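Here is a toy illustration (mine, not a real compiler pass) of what "ad hoc" means in practice: the grammar cannot express "every variable must be declared before it is used," so the compiler keeps a symbol table on the side and checks each use against it.

import java.util.*;

// Sketch: a declared-before-use check done outside the grammar, with a symbol table.
public class DeclCheck {
  public static void main(String[] args) {
    Set<String> symbolTable = new HashSet<>();
    // Each entry is {kind, name}: "decl" declares a variable, "use" references one.
    String[][] statements = { {"decl", "hold"}, {"use", "hold"}, {"use", "truths"} };
    for (String[] stmt : statements) {
      if (stmt[0].equals("decl")) symbolTable.add(stmt[1]);
      else if (!symbolTable.contains(stmt[1]))
        System.out.println("error: '" + stmt[1] + "' used but never declared");
    }
  }
}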

What they described for us is an FSM, and an FSM can "generate" correct programs by reproducing exactly (some small portion of) its training data. If the engine is such that you give it a description of the program you want, and it finds that description in the comments linked to some code somewhere in its training data, then it can reproduce that code, and it looks like creativity. But there is no abstract thinking; there is not even parenthesis matching, just ignorant copying. Sooo...
 

The Test

If you ask ChatGPT to write you a Bubble Sort program, it has probably seen hundreds of them in its training data, no problem. If you ask a first- or second-year student programmer, they have probably never heard of it. So,
 

Part 0 -- Write a Bubble Sort

I give you this one free. Here's a minimal bubble sort, 9 lines of Java:
void BubbleSort() {
  int kx, nx, zx;
  int[] data = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
  for (nx=10; nx>0; nx--)
    for (kx=1; kx<=nx; kx++)
      if (data[kx-1]>data[kx]) {
        zx = data[kx];
        data[kx] = data[kx-1];
        data[kx-1] = zx;}}
For small data like this, a bubble sort can be (and in this case, is) faster than the nominally optimal N-log-N sorts. Partly sorted initial data (like this case) can run faster still with the addition of a boolean flag that stops the sort when a pass makes no swaps, but with small sorts (why else would you use a bubble sort?) the difference is unimportant.
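If you want to verify for yourself that it compiles and runs, here is one way to wrap it (my harness, not part of the test): put the method in a class, have it return its array, and print the result.

import java.util.Arrays;

// Harness: run the bubble sort above and print the sorted data.
public class SortDemo {
  static int[] bubbleSort() {
    int kx, nx, zx;
    int[] data = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
    for (nx=10; nx>0; nx--)
      for (kx=1; kx<=nx; kx++)
        if (data[kx-1]>data[kx]) {
          zx = data[kx];
          data[kx] = data[kx-1];
          data[kx-1] = zx;}
    return data;
  }
  public static void main(String[] args) {
    System.out.println(Arrays.toString(bubbleSort()));  // [1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 9]
  }
}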
 

Part 1 -- Use Only Variable Names in the Declaration of Independence

Here's a well-known line: "We hold these truths to be self-evident, that all men are created equal." You can use these words as variable names. Make sure the program still compiles and runs the same.

Nobody talks about replacing variable names with some other words, certainly not words from an irrelevant document like the DoI, so this is almost certainly not in the ChatGPT training data. If it is, well, replace it with chapter 18 of Genesis in the KJV Bible. Or Ecclesiastes 3. Or just a random word list. You can write an FSM to do that, but it doesn't do it by choosing random probable next letters. It must be intentional.
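For concreteness, here is one sketch of what a Part 1 answer could look like, using the same words (we, hold, these, truths) that appear in my Part 2 listing below; any words from the quoted line would do, as long as the program still compiles and runs the same:

void BubbleSort() {
  int we, hold, these;
  int[] truths = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
  for (hold=10; hold>0; hold--)
    for (we=1; we<=hold; we++)
      if (truths[we-1]>truths[we]) {
        these = truths[we];
        truths[we] = truths[we-1];
        truths[we-1] = these;}}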
 

Part 2 -- Add 12 More Braces and Parentheses

My subroutine above has three each left and right braces, and four each left and right parentheses. To add more braces and parentheses without changing how the subroutine runs requires knowing the semantics of braces and parentheses, and how to match them up. An FSM cannot do that. Nothing in the open-source training data will propose doing that. Like the variable names, it's a silly thing to do, so nobody will waste time uploading code that does that. But any half-competent first- or second-year student programmer knows exactly how to do it, with great freedom and variety.

Here's my version, with three each left and right braces and left and right parentheses added (parentheses around the 9, the 10, and one array reference; braces around the body of each for loop and around one assignment), and it runs the same:

void BubbleSort() {
  int we, hold, these;
  int[] truths = {3, 1, 4, 1, 5, (9), 2, 6, 5, 3, 5};
  for (hold=(10); hold>0; hold--) {
    for (we=1; we<=hold; we++) {
      if (truths[we-1]>truths[we]) {
        these = truths[we];
        {truths[we] = (truths[we-1]);}
        truths[we-1] = these;}}}}
Tom Pittman
2023 September 28