Limitations of traditional testing
If AI companies have been slow to respond to mounting benchmark failures, it's partly because the test-scoring approach has worked so well for so long.
One of the greatest early successes of modern AI was the ImageNet Challenge, a kind of precursor to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.
Crucially, the test was completely agnostic about methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet's convolutional neural nets would be the key to unlocking image recognition, but after it scored well, no one dared dispute it. (Ilya Sutskever, one of AlexNet's developers, would go on to cofound OpenAI.)
A big part of what made the challenge so effective was that there was little practical difference between ImageNet's object-classification task and the actual process of asking a computer to recognize an image. Even when methods were disputed, no one doubted that the highest-scoring model would have an advantage when deployed in a real image-recognition system.
But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it hard to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
Where things break
Anka Reuel, a doctoral student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced that the evaluation problem is the result of this push toward generality. "We've moved from task-specific models to general-purpose models," says Reuel. "Evaluation becomes more difficult because it's no longer a single task but an entire bundle of tasks."
Like Jacobs at the University of Michigan, Reuel believes the benchmark problem runs deeper than any particular test's implementation. For tasks as complex as coding, for example, it is nearly impossible to fold every possible scenario into a problem set. That makes it hard to tell whether a model scores better because it is genuinely more skilled at coding or because it has more effectively gamed the problem set. And with developers under intense pressure to post record scores, such shortcuts are hard to resist.
The hope for developers is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. "There are so many more knobs you can turn," says Sayash Kapoor, a Princeton computer scientist and a prominent critic of sloppy practices in the AI industry. "When it comes to agents, they've given up on best practices for evaluation."