For my paper Measuring all the noises of LLM evals, the most useful message seemed to me to be decomposing and actually measuring all the noise types, but here I want to share the story behind it and pose some remaining questions. It started as a small analysis for CRUXEval, done as power analysis and quality assurance, but noise has been an obsession of mine since my first paper Baselines and Bigrams in 2012, where in Table 1 we calculated a noise upper bound depending on which eval (test set) was used. At the time I did not understand paired comparisons well, but I had an appreciation for the bootstrap, probably common for an ML background.

There is no doubt that guessable classification evals such as MNIST/MMLU need to be large enough, and they are (~10k questions), since any correct answer could just be a random guess (~10 choices). However, this is less clear with hard generative evals, initially represented by HumanEval, which only has 164 problems. Real human evals such as the IMO, IOI, and Putnam are similarly small. For our benchmark work CRUXEval, we did a power analysis showing that 1k questions are needed to reliably distinguish the scaling-law effect of doubling the model size, but then we figured that being 5x larger than HumanEval, the leading generative eval of the time, must be enough, so we settled at 800. Around the same time, I also attempted to increase the signal by modelling the difficulty of the questions, similar to Item Response Theory, and it failed across the board, as also reported by Madaan et al., 2024. Projects such as EvalPlus improved the quality of the tests and audited the questions for errors. While this is intuitively very appealing to me, I saw no improvement in the noise levels or the signal-to-noise ratio. SWEBench-Verified is another project where some human labour was used to improve the subjective data quality while reducing the data size. Another piece of evidence is that MBPP, a subjectively poor-quality eval where a random sample likely contains errors, still gave measurably higher signal-to-noise than HumanEval, a subjectively high-quality eval where most of the supposed errors reflect realistic cases that you’d want to evaluate on rather than errors that give misleading signals. As an eval creator, I wondered whether we should focus on getting more questions or on getting better-quality questions. The noise measurements suggest we should just have more questions, but intuitively it seems hard to trust a larger eval that is full of errors. Despite having these more impressive generative evals, we rightly trusted the guessable MMLU more for model development, since it is larger and has high signal-to-noise. Some practitioners do not trust evals at all, and would rather trust their own interactive testing.
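
For the power analysis mentioned above, here is a minimal back-of-the-envelope sketch of the kind of calculation involved (my own illustration, not the exact CRUXEval analysis; the effect size and noise level below are made-up numbers):

    from scipy.stats import norm

    def questions_needed(delta, sigma_d, alpha=0.05, power=0.8):
        # number of questions n so that a paired test reliably detects a mean
        # score gap `delta`, where sigma_d is the standard deviation of the
        # per-question paired score difference
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return (z * sigma_d / delta) ** 2

    # e.g. a hypothetical 5-point gap from doubling model size, with a paired
    # per-question difference std of ~0.55, needs on the order of 1k questions
    print(questions_needed(delta=0.05, sigma_d=0.55))  # ~950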

As LLM evals started to capture even more interesting, not-guessable problems like LiveCodeBench and SWEBench, and as I contributed to a few more generative evals, I was confused by how small and noisy our evals have become in the pursuit of difficulty and agents, and by how to reconcile that with the fact that answering one hard, not-guessable question could remove all doubt. While we are rightly impressed by these hard evals, they are quite noisy, so we partly cope with the following methods: “look at the variance of the training curve”, “let’s average over multiple seeds”, and “let’s average over multiple predictions”, which we know to be nonsense in theory because these methods completely ignore the data noise. In the unpaired analysis, since question difficulties are roughly bimodal beta distributed, the data noise is sure to be significant. Even for the paired case, I had some initial empirical evidence from CRUXEval, where for results sampled at T=0.2 the paired data noise is a bit higher than the prediction noise, but this turned out to be more of an exception. With CoT, reasoning models, and sampling at more natural temperatures, prediction noise is typically higher. At the time, practitioners often drew multiple predictions per question (HumanEval, CRUXEval) to measure pass@k, and perhaps they were also sensing the prediction noise as such metrics were developed. With multiple predictions per question on CRUXEval, I tried to measure the data noise vs. prediction noise by resampling different objects with the bootstrap, which is correct, but it was not clear (to me) how each variance relates to the total. The traditional hypothesis testing framework also provides some answers, which are consistent with the bootstrap but still left it unclear (to me) how to treat the prediction noise. Then I read Miller, 2024 more carefully, and their variance decomposition really clarified that these are meaningful components that can be measured. So I reanalyzed all the data using this method instead of the bootstrap and the sign test. While I still appreciate the bootstrap, when the answer is var(A) or p(1-p) or var(mean(A, axis=1)), “let’s use the bootstrap” is pretty underwhelming. More importantly, we still need to be clear about what exactly we are resampling: the data, the predictions, or both; with averaging or not; paired or unpaired. So the typical “we computed the 95% confidence intervals with the bootstrap” can be meaningless.
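
To make the decomposition concrete, here is a minimal sketch in the spirit of that analysis (my own paraphrase, not the exact estimator from the paper or from Miller, 2024), assuming A and B are hypothetical score matrices of shape [questions, samples] for two models evaluated on the same questions:

    import numpy as np

    def paired_noise_components(A, B):
        # A, B: [n questions x k samples] score matrices on the same questions
        n, k = A.shape
        dq = A.mean(axis=1) - B.mean(axis=1)             # per-question paired gap
        total_var = dq.var(ddof=1) / n                   # variance of the mean gap
        pred_var = (A.var(axis=1, ddof=1).mean()
                    + B.var(axis=1, ddof=1).mean()) / (n * k)  # shrinks as k grows
        data_var = total_var - pred_var                  # remaining paired data noise
        return data_var, pred_var, total_var

The unpaired version would use the per-question means of each model separately, where the bimodal spread of question difficulties makes the data term much larger.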

So I took the opportunity to carefully measure all the noise types on all the eval data I could get, with clearly separated contributions of prediction noise vs. data noise, paired vs. unpaired, averaged vs. not. From these measurements, a lot of my confusion can be explained by the fact that prediction noise usually dominates the paired data noise:

  • Methods that attempt to model the questions, like Item Response Theory, won’t work if most of the noise comes from the model predictions instead of the questions, so we must first reduce the prediction noise before trying to model question difficulty.

  • Answering one hard question can only be very significant if it’s not guessable by the baseline model. While hard questions are not guessable by picking at random, they do seem to be guessable by weaker LLMs. So far we do not see these 0-to-1 improvements: most questions that can be answered by stronger models are also guessable by weaker models, just with lower probability. We can only observe this using paired comparisons with multiple prediction samples per question.

  • Looking at the training curve, sampling more predictions, and using multiple seeds are actually sound methods when the prediction noise dominates, so this is an example of not dismissing practical wisdom on the basis of theory.

  • The bonus is the main result of the paper: because LLMs are still very similar to each other, they also produce predictable total noise levels, so these measurements are actually more general than the specific pairs measured and can be used to assess noise in context. Furthermore, we can actually get a lot more signal from current evals if we reduce the prediction noise by averaging multiple samples, detecting effect sizes 2-6x smaller, which is pretty nice (a rough sketch of this calculation follows the list). The paper goes into more detail on these main findings.
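
As a rough illustration of that last point, with made-up noise numbers rather than the paper’s measurements: if sigma_data is the per-question standard deviation of the paired gap and sigma_pred is the within-question prediction standard deviation, the smallest detectable gap shrinks as we average more samples per question, up to the limit set by the data noise.

    import numpy as np
    from scipy.stats import norm

    def min_detectable_gap(sigma_data, sigma_pred, n, k, alpha=0.05, power=0.8):
        # smallest paired mean-score gap detectable with n questions, k samples each
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        se = np.sqrt(sigma_data**2 / n + sigma_pred**2 / (n * k))  # paired standard error
        return z * se

    # If prediction noise dominates, averaging k=16 samples instead of 1 shrinks
    # the detectable gap by ~3x with these made-up numbers; the limit as k grows
    # is sqrt(1 + sigma_pred**2 / sigma_data**2), ~5x here.
    print(min_detectable_gap(0.1, 0.5, n=800, k=1) / min_detectable_gap(0.1, 0.5, n=800, k=16))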

A few related big-picture questions:

  • I see that LLMs have high prediction noise, but compared to what? Do humans have similarly high prediction noise? Groups of humans also seem to have pretty high prediction noise, judging from the question-level IMO and IOI results across different humans. While I don’t know how to measure this for individual humans, I think an individual human has less prediction noise when they have a grounded understanding of the material and enough time, since they tend to know whether they really answered a question.

  • How do we get more bits of feedback from the trajectory of solving a hard problem? This will be important for both training and evals. 99% of my learning does not come with a label, and when the label/reward does come, it’s usually noisy and unreliable anyway.

  • Can we define guessable, and relatedly pass@inf, more precisely? Everything is guessable with a random string, and pass@inf is always 1. So somehow there needs to be a tier of generations that is better than random, maybe something like the typical set of the model.
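
A tiny numeric illustration of the pass@inf point: under independent sampling, any nonzero per-sample success probability p gives pass@k = 1 - (1 - p)^k, which goes to 1 as k grows, so the raw metric cannot separate understanding a question from eventually stumbling onto an accepted answer.

    # pass@k -> 1 for any p > 0, however small, as k grows
    for p in (0.5, 0.01, 1e-6):
        print(p, [round(1 - (1 - p) ** k, 5) for k in (1, 100, 10_000_000)])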

A few related technical questions, roughly in increasing vagueness:

  • Can we still reduce the prediction noise by lowering the temperature if we also have some noisy model changes, like a small number of additional training steps? My feeling is that we cannot reduce the prediction noise that way, just like we cannot robustly reduce it by fixing the random seed. This can easily be tested, but I have not done it.

  • After reducing the prediction noise, is it worth revisiting IRT and the like? There may still be other obstacles, or maybe it turns out to be a big success.

  • Is there something obvious causing the bimodal Beta distribution of question difficulties? If not, does it say something about what LLMs are?

  • Recommending averaging still causes me some discomfort. The paper has a caveats section elaborating on this, where I tried but failed to come up with something that really defeats averaging. Even with the non-fatal flaws discussed there, averaging seems better than living with large prediction noise. But maybe there is something worse.

Comments and discussion about this work are welcome. Email or GitHub issues are both good formats.