Every Test is a Turing Test
In January 2023, economics professor Bryan Caplan had ChatGPT (then running on GPT-3.5) “write” his Fall 2022 Labor Economics midterm. It failed, and Caplan bet Matthew Barnett, an AI researcher, “that no AI would be able to get A’s on 5 out of 6 of my exams by January of 2029”. In a blog post from March 21, 2023, Caplan announced that ChatGPT, now running on GPT-4, had retaken the exam and not only passed but did very well (scoring 73/100, an A), leading Caplan to tweet that “AI enthusiasts have cried wolf for decades. GPT-4 is the wolf. I’ve seen it with my own eyes.” Scott Aaronson cited Caplan’s road-to-Damascus moment as the way AI adoption is likely to happen from now on: “one user at a time having the ‘holy shit’ reaction about an AI’s performance on a task that they personally designed and care about—leaving, in the end, only a tiny core of hardened ideologues to explain to the rest of us why it’s all just a parrot trick and none of it counts or matters.”
Aaronson also quotes Bill Gates on the subject of GPT-4 passing tests:
In September, when I met with [OpenAI] again, I watched in awe as they asked GPT, their AI model, 60 multiple-choice questions from the AP Bio exam—and it got 59 of them right. Then it wrote outstanding answers to six open-ended questions from the exam. We had an outside expert score the test, and GPT got a 5—the highest possible score, and the equivalent to getting an A or A+ in a college-level biology course.
Despite the fad for Bayesianism among AI researchers, they seem very unwilling to interrogate their “priors”. Alongside the assumption that intelligence is quantifiable and rankable sits the presumption that passing a test is an indicator of intelligence, at least of intelligence within the domain of the test.
I’ve always been critical of tests. In Grade 7 I wrote an essay arguing for the abolition of exams in favour of speaking with students individually and gauging how much they had learned that way. My argument has always been that being able to pass a test shows that a student can pass a test, but no more. Put another way, the ability to generate correct answers to tests bears no relation to a student’s understanding or comprehension. Now, obviously teachers and professors don’t want to look at this idea too closely, but a lot of the developments in disability advocacy in recent years have pointed out the same thing: the ability to pass a test is just that and no more, and people who can’t pass the test may still have understanding and comprehension of the subject in question.
People like Caplan, Aaronson, and Gates are assuming that tests measure what they think they measure: intelligence and understanding. On that premise, the ability of a GPT to pass a test at the same level as a human student must therefore indicate intelligence and comprehension. (Note that this is really just the Turing test in a more up-to-date, pedagogical guise.)
In my view, however, the reverse is true. If tests don’t give any indication of a student’s intelligence and comprehension, then GPT-4’s ability to generate correct test answers can’t be evidence of intelligence in the GPT. For both human students and the GPT, generating correct answers to a test is evidence only of being able to generate correct answers to a test.
But to see things this way would require teachers and professors to see that tests were never doing what they thought they were doing. Ideology is sticky. One benefit of the advent of ChatGPT could be in making us move on from outdated methods of evaluation which stand as proxies for student understanding and comprehension. With the advent of the calculator, the ability to do calculation by hand became irrelevant to learning high-school mathematics, and the math curricula moved on (or should have, but that’s another story) to other ways of demonstrating understanding and comprehension. With the advent of ChatGPT, perhaps we will realize that “text generation” can’t stand as a proxy for other things either, and we will need to find new and better ways to engage with and try to understand our fellow human beings (including students!).
[ Side note, this is connected with Labor Economics in a very direct way: one reason we need proxies for evaluation is because of the proletarianization of academic instruction and the cost-savings involved in larger and larger classes. Proxies for evaluation are made necessary by the fact of having too few (precarious) instructors for the number of students (which has inflated since the advent of neoliberalism to mask the dreadful employment situation in the “post-industrial” economies). Automated testing/scoring/surveillance are all part of the panopticonized automation of the learning factory. ]
All academic written tests, in the view I have put forward, can be thought of as forms of the Turing test, or imitation game: a student passes a test if they convince their instructor that they are an intelligent being which comprehends the subject matter. As with the Turing test, whether they are such a being or not is beyond the capacity of the test to evaluate: as long as they sufficiently imitate one, that is enough.
The 1950 paper in which Turing describes “the imitation game”, “Computing Machinery and Intelligence”, is still highly relevant reading today. Not only does he describe how “learning machines” function in much the same terms as we do today (though Turing is still thinking of symbolic rather than connectionist - e.g. neural network - AI), but the criticisms he counters are by and large still the criticisms made of ChatGPT (with the exception of the Extra-Sensory Perception criticism, which must have been a very 1940s thing). I want to dig into one of the criticisms that Turing challenges, which he ascribes to the neurosurgeon Geoffrey Jefferson and which he calls “the argument from consciousness”. Turing quotes Jefferson as arguing that:
“Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain—that is, not only write it but know that it had written it. No mechanism could feel (and not merely artificially signal, an easy contrivance) pleasure at its successes, grief when its valves fuse, be warmed by flattery, be made miserable by its mistakes, be charmed by sex, be angry or depressed when it cannot get what it wants.”
Now, this is by and large my view. The only reason to divorce intelligence from such feelings is that in rationalist/capitalist Western society we tend to think (going back at least to Hobbes and Leibniz) that intelligence is procedural and maybe even algorithmic, and in any event is distinct from questions of emotions and feelings. I suspect that a culture which did not draw such a hard distinction between intellect and feelings would not be fooled at all by GPT “success”.
Turing challenges this view as follows:
According to the most extreme form of this view the only way by which one could be sure that a machine thinks is to be the machine and to feel oneself thinking. One could then describe these feelings to the world, but of course no one would be justified in taking any notice. Likewise according to this view the only way to know that a man thinks is to be that particular man. It is in fact the solipsist point of view. It may be the most logical view to hold but it makes communication of ideas difficult. A is liable to believe ‘A thinks but B does not’ whilst B believes ‘B thinks but A does not’. Instead of arguing continually over this point it is usual to have the polite convention that everyone thinks.
Turing’s point here, I think, is that of course we don’t doubt that other people think when we come across them in daily life. “Skepticism of other minds” is the result of an extreme Cartesian skepticism that concerns philosophers but is not an obstacle to social interaction in the everyday world. From a behaviourist point of view, there are all kinds of social things we do that convince others (or rather, give them no grounds to doubt) that we are intelligent, understanding, experiencing beings: we converse, we make eye contact, we change our facial expressions, and so on. Doing all of these things - which we learn socially - indicates to our interlocutors (as long as they have been brought up in the same social context) that we too are members of society and think and feel as they do.
We also say “ouch” when we feel a minor pain and writhe around in agony when we feel a major pain, both of which demonstrate to other people the fact of our inner experience (being in pain).
In many ways, then, we are all playing the imitation game. To the extent that we imitate the people around us (beginning with our parents when we are very small), we are taken to be human beings, possessing intelligence and understanding. It might seem, then, that I am making the case for machine intelligence: if we are all playing the imitation game, GPT-4 included, then what stops us from recognizing GPT-4 as having human intelligence?
Whether we accept a being as sharing some or all human characteristics is a social question. We have not always accepted all human beings in this way, and there are movements afoot to recognize some animals as sharing certain family resemblances with human beings.
But GPT-4 is not doing anything that we would recognize as indicating human intelligence. Passing tests, as I have argued, doesn’t count, because tests don’t in fact indicate human intelligence even for human beings. And GPT-4 cannot participate in the behavioural processes that make us appear human to other humans. It generates text based on a textual index using a word-selection algorithm. The texts it generates are no more an indication of shared intelligence than the numbers that appear on the screen of a calculator.
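To make concrete what I mean by a “word-selection algorithm”, here is a deliberately crude sketch in Python. It is a toy bigram chain, not GPT-4’s actual mechanism (a real LLM replaces the lookup table with a learned neural network over tokens, and its outputs are far more fluent), so take it only as an illustration of the shape of the generation loop: select a next word from statistics over prior text, append it, repeat, with no representation of meaning anywhere in the process. All names in the sketch are mine, for illustration.

```python
import random
from collections import defaultdict

def build_index(corpus: str) -> dict[str, list[str]]:
    """Index each word to the words observed to follow it."""
    words = corpus.split()
    index: dict[str, list[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        index[current].append(following)
    return index

def generate(index: dict[str, list[str]], start: str, max_words: int) -> str:
    """Generate text by repeatedly selecting an observed next word.

    Nothing here models meaning: each step is just a draw from
    the continuations recorded in the index.
    """
    output = [start]
    for _ in range(max_words - 1):
        continuations = index.get(output[-1])
        if not continuations:
            break  # dead end: this word was never followed by anything
        output.append(random.choice(continuations))
    return " ".join(output)

corpus = (
    "the student passes the test and the test shows "
    "that the student can pass the test"
)
index = build_index(corpus)
print(generate(index, "the", 10))
```

The output can look superficially sentence-like, which is exactly the point: fluency of selection, at whatever scale, is not evidence of comprehension.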
The one benefit of this kind of AI hype, in my opinion, is that it will help us move away from outdated modes of “evaluating human intelligence”, etc., towards other mechanisms of recognizing shared humanity: broader social ways of seeing the human being rather than the “meat-sack”. It is these older ways of searching for an “essence” of what is human that lead to war (the quintessential dehumanizing operation). Only by rejecting the idea of an essence and opting for a social recognition of family resemblances can we move forward, but this can’t mean mistaking the unconscious and unintelligent operation of a search index for the behaviour of any kind of intelligent being.