It’s getting harder to measure just how good AI is getting

‘2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it’s not a good way to measure further progress.

On the Math Olympiad qualifier, too, the models now perform among top humans. A benchmark called the MMLU was meant to measure language understanding with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was meant to be really, really difficult and measure general humanlike intelligence — but o3 (when tuned for the task) achieves a bombshell 88 percent on it.

We can always create more benchmarks. (We are doing so — ARC-AGI-2 will be announced soon, and is supposed to be much harder.) But at the rate AIs are progressing, each new benchmark only lasts a few years, at best. And perhaps more importantly for those of us who aren’t machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn’t do themselves in order to describe what the models are and aren’t capable of.

Yes, AIs still make stupid and annoying mistakes. But if it’s been six months since you were paying attention, or if you’ve mostly only been playing around with the free versions of language models available online, which are well behind the frontier, you are overestimating how many stupid and annoying mistakes they make, and underestimating how capable they are on hard, intellectually demanding tasks…’ (Kelsey Piper via Vox)