Predicting when LLMs fail

An overview of different paradigms to do so and connections with related areas

Intro

Large Language Models (LLMs) are ML systems. As for any ML system, we may want to predict whether they will fail or succeed on specific task instances (e.g., a specific question from a QA dataset).

There are several interrelated paradigms to do so, differing in whether they are specific to a model and whether they rely on the model's outputs or internals. The aim of this post is to frame how these different approaches relate to one another. Below, I give an overview of three main paradigms and point towards closely related research fields that tackle different questions.

Disclaimer: I do not claim that this overview covers all existing research strands related to predicting LLM failures, or that it comprehensively captures all works within each of the strands considered. I welcome any feedback.

TL;DR: I identified three paradigms for predicting where LLMs fail; the third is likely very hard to apply to LLMs in practice. The table below summarises their properties:

| Category | Training | Model-specific | Passing input through the model |
|---|---|---|---|
| Intrinsic UQ | In some cases (e.g., white-box methods) | Yes | Yes |
| Assessor | Yes | Yes (they can be trained to work for a set of models, based on model features alongside instance features) | No |
| Model-agnostic rejector | Yes, but once per data distribution | No | No |

Paradigms to predict where LLMs fail

Intrinsic uncertainty quantification

Uncertainty quantification (UQ) has been extensively researched in “traditional” deep learning. It involves a suite of techniques aimed at extracting an uncertainty quantification from a trained model (such as reporting the logits of a trained multi-class classifier network) or training those models to embed a form of uncertainty (such as Bayesian Neural Networks, or BNNs). Most methods falling under the term “uncertainty quantification” in the traditional deep learning literature involve producing an output with the model and accompanying it with an uncertainty estimate obtained in some way.
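
To make this concrete, here is a minimal sketch (toy architecture, random weights, purely illustrative) of two such techniques: reading off the softmax confidence of a classifier, and Monte Carlo dropout, often viewed as a cheap approximation to a BNN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-class classifier with dropout; weights are random here, so the
# numbers are meaningless. In practice this would be a trained model.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(64, 5)
)
x = torch.randn(1, 16)  # a single input instance

with torch.no_grad():
    # (a) Softmax confidence: probability assigned to the predicted class.
    model.eval()
    probs = F.softmax(model(x), dim=-1)
    confidence = probs.max().item()

    # (b) Monte Carlo dropout: keep dropout active at inference, average the
    # predictions of several stochastic forward passes, and use their spread
    # as a rough uncertainty estimate.
    model.train()  # keeps the dropout layer stochastic
    samples = torch.stack([F.softmax(model(x), dim=-1) for _ in range(30)])
    mean_probs, std_probs = samples.mean(dim=0), samples.std(dim=0)

pred = mean_probs.argmax().item()
print(f"softmax confidence: {confidence:.3f}")
print(f"MC-dropout class {pred}: p = {mean_probs[0, pred].item():.3f} "
      f"+/- {std_probs[0, pred].item():.3f}")
```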

Some of these methods have been adapted or evolved into approaches applicable to LLMs; other traditional methods, however, are hardly applicable to them. Indeed, compared with traditional deep learning models, LLMs have two main differences that limit the applicability of existing methods and have spurred the creation of novel ones: their outputs are open-ended token sequences rather than a fixed set of classes, so there is no single set of class probabilities that directly reflects answer correctness; and their scale makes retraining or ensembling them (as required by, for instance, BNNs) prohibitively expensive.

Shorinwa et al., 2024, provide a nice overview of existing UQ methods for LLMs and their connection with classical ones. Beyond token-level uncertainty quantification, a second class of methods involves having the model verbalise its uncertainty, either before or after an answer has been produced. A third category comprises "semantic similarity" approaches; for instance, Kuhn et al., 2023, generate different answers, cluster them according to meaning, and consider the uncertainty over clusters. Finally, a set of "white-box" works rely on training additional modules that, starting from the model's activations at a given internal layer, predict the probability of a produced answer being correct.
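
As an illustration of the token-level flavour, a crude sequence-level confidence can be obtained by averaging the log-probabilities of the generated tokens; a minimal sketch with Hugging Face transformers (gpt2 is just a small placeholder model) could look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; gpt2 is a small, convenient placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )

# Log-probability of each generated token under the model.
transition_scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
# Geometric mean of the token probabilities as a crude sequence confidence.
confidence = transition_scores[0].mean().exp().item()

answer = tokenizer.decode(out.sequences[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer, confidence)
```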

As a tangential note, a recent work has shown that the model confidence obtained with these different methods is not necessarily consistent: indeed, it finds that token-level probabilities and verbalised confidence are very poorly correlated.

The approaches above differ in the level of access they require: from access to activations (white-box methods), to log-probabilities (token-level approaches), to multiple generations (as for some semantic-similarity approaches), to a single generation for verbalised UQ. Across all of these categories, some works involve finetuning the model to improve its UQ ability.

Nevertheless, all UQ approaches for LLMs require producing model outputs (sometimes several of them), or at least passing the input through the LLM to obtain its activations.
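
The semantic-similarity category is the clearest example of the "multiple generations" end of this spectrum. To give a flavour of it without committing to a specific implementation, the sketch below samples several answers, groups them, and computes the entropy over groups; `sample_answer` is a hypothetical stand-in for an LLM call, and the exact-match grouping is a much cruder proxy than the NLI-based clustering of Kuhn et al.:

```python
import math
from collections import Counter
from typing import Callable

def semantic_entropy(prompt: str, sample_answer: Callable[[str], str],
                     n_samples: int = 10) -> float:
    """Entropy over clusters of sampled answers.

    `sample_answer` is any function that draws one answer from the LLM at
    temperature > 0 (a hypothetical stand-in for an actual API call).
    """
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    # Crude clustering: normalise strings and group exact matches.
    # Kuhn et al. instead cluster answers by bidirectional entailment with an
    # NLI model, which captures meaning rather than surface form.
    clusters = Counter(a.strip().lower() for a in answers)
    probs = [count / n_samples for count in clusters.values()]
    # High entropy over clusters means the sampled answers disagree.
    return -sum(p * math.log(p) for p in probs)
```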

Assessors: anticipating LLM performance

An alternative paradigm which has received less attention is that of "assessors": additional modules that are trained to predict the correctness of a main model on specific instances without relying on that model's output or activations. (The original nomenclature also included assessors that rely on the LLM's output, under which the "white-box" uncertainty quantification methods above would also fall; for ease of understanding, I here consider only assessors that do not rely on LLM outputs.) In practice, assessors are classifiers that predict the success of the main model from a set of features of the input, such as embeddings. This makes the assessor model-specific. Alternatively, a single assessor can be trained to predict the success of multiple models by relying on specific features to identify them, thus predicting the success of a model on a specific instance from a pair (instance features, model features).

In both cases, the main point is that the main model does not need to see the input at test time for the assessor to predict its performance. This has two advantages: the prediction comes cheaply, as there is no need to run the (typically much larger) main model; and it is available before generation, so one can decide upfront whether to query the model at all, route the instance elsewhere, or escalate it to a human.

The concept of assessors is closely related to that of routers, which are modules that predict which of a set of LLMs is more likely to succeed at a specific task instance.
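
To make the assessor idea concrete, here is a deliberately simple sketch of a single-model assessor: embed past instances and train an off-the-shelf classifier on whether the main LLM answered them correctly. The embedding model and the data below are arbitrary placeholders:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Historical evaluation records for the main LLM: instances it was run on and
# whether its answer was judged correct (1) or not (0). Placeholders here.
questions = ["What is 2 + 2?", "Prove the Riemann hypothesis.", "Name a prime number."]
successes = [1, 0, 1]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model
X = embedder.encode(questions)

assessor = LogisticRegression(max_iter=1000).fit(X, successes)

# At test time, the assessor predicts success WITHOUT querying the main LLM.
new_question = "What is the capital of Australia?"
p_success = assessor.predict_proba(embedder.encode([new_question]))[0, 1]
print(p_success)

# A crude router: train one assessor per candidate LLM and send the instance
# to the model with the highest predicted probability of success.
```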

Model-agnostic rejectors

Model-agnostic rejectors operate on the assumption that a model is likely to fail when confronted with a data point significantly different from the training distribution. This concept is closely aligned with anomaly detection. To implement this, a predictor is trained on the training data to identify inputs that diverge notably from that distribution.
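
A minimal sketch of this idea, using an off-the-shelf outlier detector over input features (the random features below stand in for real embeddings of the training data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Features of the training inputs (e.g., embeddings); random data here stands
# in for the real training distribution.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 32))

# Fit a detector on the training distribution only: no model is involved.
rejector = IsolationForest(random_state=0).fit(X_train)

# At test time, reject inputs that look out-of-distribution.
x_new = rng.normal(size=(1, 32)) + 5.0     # a clearly shifted input
score = rejector.decision_function(x_new)  # lower means more anomalous
reject = rejector.predict(x_new)[0] == -1  # -1 flags an outlier
print(score, reject)
```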

I am unaware of any work applying this idea to LLMs; I believe the main reason is that LLMs' training datasets are vast and often unknown, so it is essentially impossible to train a rejector using information about them. Even more importantly, as LLMs can tackle a wide range of tasks, it would not make much sense to train a rejector on the pre-training dataset anyway.

Therefore, the practical usefulness of this approach is likely limited to cases where LLMs are fine-tuned for specific use cases in niche domains. Even there, however, it may be better to rely on assessors or uncertainty quantification methods, which, by exploiting information about the model, are likely to perform better. In fact, the main advantage of model-agnostic rejectors is that they can be applied to any model trained on a given dataset; however, it is rare to have multiple LLMs trained (or even finetuned) on the same dataset, due to their large training cost.

Bonus: Modifying the LLM Itself

A recent study introduced an innovative approach by integrating an “I don’t know” token into the LLM’s vocabulary during the pre-training phase. This addition improved the model’s calibration by enabling it to explicitly express uncertainty.
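
This is not the study's training recipe, but the basic mechanics of such a token can be sketched with Hugging Face transformers: add the token to the vocabulary, resize the embedding matrix, and (after training) read off the probability mass assigned to it. The model, prompt, and token name below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add an explicit "I don't know" token and make room for it in the embedding
# matrix; it would then need to be learned during (pre-)training.
tokenizer.add_special_tokens({"additional_special_tokens": ["<IDK>"]})
model.resize_token_embeddings(len(tokenizer))
idk_id = tokenizer.convert_tokens_to_ids("<IDK>")

# After training, the probability mass placed on <IDK> at the first answer
# position can be read off as an explicit uncertainty signal.
inputs = tokenizer("Q: Who won the 2042 World Cup?\nA:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
p_idk = torch.softmax(logits, dim=-1)[idk_id].item()
print(p_idk)
```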

Such advancements echo techniques in uncertainty quantification (UQ) for traditional deep learning, where models are modified or trained in bespoke ways to better capture uncertainty.

Tangential directions

Extracting other information from model internals

There are other works investigating the use of model internals to extract additional insights. For instance, some studies focus on determining whether a model is engaging in deceptive behavior (commonly referred to as “lying”).

This approach bears similarities to extracting uncertainty quantification for LLMs but instead targets different "functions" of a response beyond mere correctness. It could also prove valuable in identifying instances of "sandbagging", where models strategically underperform under specific conditions.
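
A generic sketch of the underlying technique, namely a linear probe trained on hidden states; the texts, labels, and layer choice below are placeholders, not taken from any specific paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def layer_activation(text: str, layer: int = 6):
    """Mean hidden state of one internal layer for a given text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Placeholder statements labelled with the property of interest
# (e.g., 1 = deceptive / false, 0 = truthful); real work uses curated
# datasets of model-generated statements.
texts = ["The Earth orbits the Sun.", "The Sun orbits the Earth."]
labels = [0, 1]

X = [layer_activation(t) for t in texts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
```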

Using LLMs for forecasting

Rather than focusing on quantifying uncertainty in LLM-generated answers, this approach leverages LLMs to provide uncertainty quantification for external world events. This is conceptually akin to applying intrinsic model uncertainty to simple QA quizzes, but extended to inherently uncertain questions: instead of questions with definitive answers, this method deals with questions characterised by intrinsic unpredictability. Several notable efforts exist in this domain.
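
A minimal sketch of how a verbalised forecast could be elicited and parsed; `ask_llm` is a hypothetical stand-in for an actual API call, and the prompt and regex are purely illustrative:

```python
import re
from typing import Callable, Optional

def forecast_probability(event: str,
                         ask_llm: Callable[[str], str]) -> Optional[float]:
    """Elicit a verbalised probability for a future event.

    `ask_llm` is a hypothetical stand-in for a call to an LLM API.
    """
    prompt = (
        f"Question: {event}\n"
        "Give your best-guess probability that this happens, "
        "as a single number between 0 and 1."
    )
    reply = ask_llm(prompt)
    # Extract the first number between 0 and 1 in the reply, if any.
    match = re.search(r"0?\.\d+|[01](?:\.0+)?", reply)
    return float(match.group()) if match else None
```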

Reward models

Reward models can be considered somewhat related to uncertainty quantification: they are typically trained (for instance, from human feedback) to evaluate a completion against a validity specification. However, this validity specification usually covers the absence of toxic content or alignment with user intentions rather than the correctness of an answer, even if the latter could plausibly be used. Reward models are then used as the reward function to finetune LLMs via RL. However, they can also be used independently to evaluate completions produced by a model; some works also show that they can evaluate partial completions, thus stopping generation and restarting it if a generation is unlikely to satisfy the final specification.
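
A hedged sketch of how a reward model could be used to score and early-stop completions; the model name is a placeholder, `generate_chunk` is a hypothetical function producing the next few tokens of a completion, and the threshold is arbitrary:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: any sequence-classification reward model trained to score
# (prompt, completion) pairs would do here.
RM_NAME = "some-org/some-reward-model"
rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)

def reward(prompt: str, completion: str) -> float:
    """Scalar reward for a (possibly partial) completion."""
    inputs = rm_tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

def generate_with_early_stop(prompt, generate_chunk, threshold=-1.0, max_chunks=10):
    """Score the partial completion after every chunk; abandon it if the reward is low."""
    completion = ""
    for _ in range(max_chunks):
        completion += generate_chunk(prompt, completion)  # hypothetical LLM call
        if reward(prompt, completion) < threshold:
            return None  # restart or resample rather than finishing a bad generation
    return completion
```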

However, in contrast to intrinsic uncertainty quantification and assessors, reward models are generally not specific to a single LLM. As such, they rely on generic features of the output that can be used to assess agreement with the considered validity specification. Assessors and uncertainty quantification approaches, instead, are trained for a specific model (or set of models) and explicitly or implicitly rely on its features.

How humans interpret LLM uncertainty

Human perception and prediction of LLM performance represent another important research direction. Studies have revealed significant limitations in human ability to anticipate LLM behavior. A recent experiment has shown that humans perform only marginally better than random chance when predicting GPT-4's performance. Moreover, humans tend to exhibit overconfidence in predicting LLM performance on high-stakes tasks, likely due to anchoring on past interactions with these systems. A particularly concerning finding is that LLM success on challenging tasks within a domain does not necessarily translate to reliable performance on simpler tasks in the same domain. This counterintuitive behavior challenges our natural assumptions about machine learning systems and highlights the importance of systematic evaluation approaches.

Robustness of LLMs to input perturbation

The sensitivity of LLMs to minor prompt modifications has been well documented in the literature. Research has demonstrated that even subtle changes to input prompts can result in substantially different outputs, raising questions about the reliability and consistency of these systems. This sensitivity to input perturbations suggests a potential avenue for uncertainty estimation: by measuring a model's response variability under small input modifications, we could identify cases where the model is highly uncertain and its output should be rejected. While this approach has been theoretically proposed in recent literature reviews, particularly Hendrickx et al.'s 2024 review, its practical application to LLMs remains unexplored.
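
A minimal sketch of what such a perturbation-based check could look like; `generate` is a hypothetical stand-in for an LLM call, and the character-dropping perturbation is a crude placeholder for paraphrasing or reformatting:

```python
import random
from typing import Callable

def perturb(prompt: str, rng: random.Random) -> str:
    """Tiny surface-level perturbation: randomly drop one character.
    Real studies use paraphrases, reordering, or formatting changes."""
    if len(prompt) < 2:
        return prompt
    i = rng.randrange(len(prompt))
    return prompt[:i] + prompt[i + 1:]

def disagreement_rate(prompt: str, generate: Callable[[str], str],
                      n_perturbations: int = 10, seed: int = 0) -> float:
    """Fraction of perturbed prompts whose answer differs from the original one.
    A high rate could be used as a signal to reject the model's answer."""
    rng = random.Random(seed)
    reference = generate(prompt).strip().lower()
    answers = [generate(perturb(prompt, rng)).strip().lower()
               for _ in range(n_perturbations)]
    return sum(a != reference for a in answers) / n_perturbations
```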

Further reading