A better benchmark
The search continues for more-sophisticated and evidence-based approaches for evaluating AI. One solution is to go broad and test as many parameters as possible — similar to the Microsoft team’s approach with GPT-4, but in a more systematic and reproducible fashion.
A team of Google researchers spearheaded one such effort in 2022 with its Beyond the Imitation Game benchmark (BIG-Bench) initiative⁸, which brought together scientists from around the world to assemble a battery of around 200 tests grounded in disciplines such as mathematics, linguistics and psychology.
The idea is that a more diverse approach to benchmarking against human cognition will lead to a richer and more meaningful indicator of whether an AI can reason or understand at least in some areas, even if it falls short in others. Google’s PaLM algorithm, however, was already able to beat humans at nearly two-thirds of the BIG-Bench tests at the time of the framework’s release.
The approach taken by BIG-Bench could be confounded by a number of issues. One is data-set pollution. With an LLM that has been potentially exposed to the full universe of scientific and medical knowledge on the Internet, it becomes exceedingly difficult to ensure that the AI has not been ‘pre-trained’ to solve a given test or even just something resembling it. Hernández-Orallo, who collaborated with the BIG-Bench team, points out that for many of the most advanced AI systems — including GPT-4 — the research community has no clear sense of what data were included or excluded from the training process.
This is problematic because the most robust and well-validated assessment tools, developed in fields such as cognitive science and developmental psychology, are thoroughly documented in the literature, and therefore would probably have been available to the AI. No person could hope to consistently defeat even a stochastic parrot armed with vast knowledge of the tests. “You have to be super-creative and come up with tests that look unlike anything on the Internet,” says Bowman. And even then, he adds, it’s wise to “take everything with a grain of salt”.
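Data-set pollution of this kind is often screened for with simple string-matching heuristics. The sketch below, a hypothetical illustration rather than any group's actual tooling, flags a test item as potentially contaminated if a long word n-gram from it appears verbatim in a training corpus:

```python
# Illustrative contamination check: flag a test item whose word n-grams
# already occur verbatim in the training corpus. A hedged sketch only --
# real decontamination pipelines are far more elaborate.

def ngrams(text, n=8):
    """Return the set of n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item, training_docs, n=8):
    """True if any n-gram of the test item appears in any training doc."""
    item_grams = ngrams(test_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
leaked = "a quiz: the quick brown fox jumps over the lazy dog near the river"
print(is_contaminated(leaked, corpus))  # True: an 8-gram overlaps
```

A check like this can only catch verbatim leakage; paraphrased versions of a test, which would still give a model an advantage, slip through, which is part of why Bowman urges tests "that look unlike anything on the Internet".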
Lucy Cheke, a comparative psychologist who studies AI at the University of Cambridge, UK, is also concerned that many of these test batteries are not able to properly assess intelligence. Tests that are designed to evaluate reasoning and cognition, she explains, are generally designed for the assessment of human adults, and might not be well suited for evaluating a broader range of signatures of intelligent behaviour. “I’d be looking to the psycholinguistics literature, at what sorts of tests we use for language development in children, linguistic command understanding in dogs and parrots, or people with different kinds of brain damage that affects language.”
Cheke is now drawing on her expertise in studying animal behaviour and developmental psychology to develop animal-inspired tests in collaboration with Hernández-Orallo, as part of the RECOG-AI study funded by the US Defense Advanced Research Projects Agency. These go well beyond language to assess intelligence-associated common-sense principles such as object permanence — the recognition that something continues to exist even if it disappears from view.
Tests designed to evaluate animal behaviour could be used to assess AI systems. In this video, AI agents and various animal species attempt to retrieve food from inside a transparent cylinder. Credit: AI videos, Matthew Crosby; animal videos, MacLean, E. L. et al. Proc. Natl Acad. Sci. USA 111, E2140-E2148 (2014).
As an alternative to conventional benchmarks, Pavlick is taking a process-oriented approach that allows her team to essentially check an algorithm’s homework and understand how it arrived at its answer, rather than evaluating the answer in isolation. This can be especially helpful when researchers lack a clear view of the detailed inner workings of an AI algorithm. “Having transparency about what happened under the hood is important,” says Pavlick.
When transparency is lacking, as is the case with today’s corporate-developed LLMs, efforts to assess the capabilities of an AI system are made more difficult. For example, some researchers report that current iterations of GPT-4 differ considerably in their performance from previous versions — including those described in the literature — making apples-to-apples comparison almost impossible. “I think that the current corporate practice of large language models is a disaster for science,” says Marcus.
But there are workarounds that make it possible to establish more rigorously controlled exam conditions for existing tests. For example, some researchers are generating simpler, ‘mini-me’ versions of GPT-4 that replicate its computational architecture but with smaller, carefully defined training data sets. If researchers have a specific battery of tests lined up to assess their AI, they can selectively curate and exclude training data that might give the algorithm a cheat sheet and confound testing. “It might be that once we can spell out how something is happening on a small model, you can start to imagine how the bigger models are working,” says Pavlick.
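The 'selectively curate and exclude' step described above can be sketched as a filter run before training. This is a hypothetical, minimal illustration of the idea, not the pipeline used by the researchers mentioned here: any training document that shares a long word n-gram with the planned test battery is dropped from the training set.

```python
# Hedged sketch of pre-training decontamination: remove any training
# document that overlaps a held-out test battery on a long word n-gram,
# so the small model cannot have seen a 'cheat sheet' for the tests.

def ngrams(text, n=8):
    """Return the set of n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(training_docs, test_items, n=8):
    """Keep only training docs sharing no n-gram with any test item."""
    banned = set()
    for item in test_items:
        banned |= ngrams(item, n)
    return [doc for doc in training_docs if not (ngrams(doc, n) & banned)]

docs = [
    "object permanence means an object continues to exist even when it is out of view entirely",
    "totally different content about cooking pasta with tomato sauce and fresh basil leaves",
]
battery = ["quiz: object permanence means an object continues to exist even when hidden"]
print(decontaminate(docs, battery))  # only the cooking document survives
```

The pay-off, as Pavlick notes, is interpretive: with the training data fully specified, any success on the test battery cannot be explained away as memorization.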