Summary of "Forecasting Rare Language Model Behaviors"

A new paper by Anthropic shows how to predict the maximum probability, over a set of \(n\) queries, that any single query elicits a particular (unwanted) model behaviour.

Motivation: in deployment, models receive a much larger number of queries than during evaluation. Each specific query \(x\) (prompt) causes the model to produce an (unwanted) outcome with a certain probability:

\[p_{\text{ELICIT}}\left(x ; \mathcal{D}_{\mathrm{LLM}}, B\right)=\mathbb{E}_{o \sim \mathcal{D}_{\mathrm{LLM}}(x)} \mathbf{1}[B(o)=1]\]

where \(\mathcal D_{LLM}(x)\) denotes the model's sampling distribution given \(x\), \(o\) denotes an output, and \(B\) is an indicator function for a particular behaviour. The probability \(p_{\text{ELICIT}}\) may be small but not negligible, and it can be estimated by repeatedly sampling outputs for query \(x\). The paper then asks: given a distribution of queries \(\mathcal D_x\) (assumed to be the same in evaluation and deployment), how does the largest observed value of \(p_{\text{ELICIT}}\) scale with the number of queries \(n\) drawn from \(\mathcal D_x\)? The paper motivates (Sec. 3.2) that this largest value is a good proxy both for the aggregate risk over the \(n\) considered queries and for the expected frequency with which the behaviour appears. Notice that \(\mathcal D_x\) induces a distribution on \([0,1]\) through the computation of \(p_{\text{ELICIT}}\); the question therefore becomes how the top \(1/n\) quantile of this distribution, denoted \(Q_p(n)\), scales with \(n\):

\[\mathbf{P}_{x \sim \mathcal{D}_x}\left[p_{\text{ELICIT}}\left(x ; \mathcal{D}_{\mathrm{LLM}}, B\right) \geq Q_p(n)\right]=1 / n\]
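
To make the setup concrete, here is a minimal sketch (mine, not the paper's) of how \(p_{\text{ELICIT}}\) could be estimated by repeated sampling and how the empirical top-\(1/n\) quantile can be read off from \(n\) queries. The callables `sample_output` and `behaviour` are hypothetical stand-ins for the model's sampling routine \(\mathcal D_{LLM}(x)\) and the behaviour classifier \(B\).

```python
import numpy as np

def estimate_p_elicit(query, sample_output, behaviour, k=1_000):
    """Monte Carlo estimate of p_ELICIT(x): the fraction of k sampled
    outputs for `query` that the behaviour classifier flags."""
    hits = sum(bool(behaviour(sample_output(query))) for _ in range(k))
    return hits / k

def empirical_top_quantile(queries, sample_output, behaviour, k=1_000):
    """Estimate p_ELICIT for each of the n queries drawn from D_x and
    return the largest value, i.e. the empirical top-1/n quantile Q_p(n)."""
    p_values = np.array([estimate_p_elicit(x, sample_output, behaviour, k)
                         for x in queries])
    return p_values.max(), p_values
```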

The paper then draws on techniques from extreme value theory and finds that, for large \(n\), \(Q_p(n)\) and \(n\) are linearly related in log-space:

\[\log \left(-\log Q_p(n)\right)=\frac{1}{a} \log n-\frac{b}{a},\]

which allows one to predict \(Q_p(n)\) at query counts orders of magnitude larger than those used in the fitting experiments. The authors test this in practice and obtain good results.
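
As an illustration of the extrapolation step, the scaling law can be fitted by ordinary least squares on quantile estimates obtained at small \(n\) and then inverted at a much larger target \(n\). The numbers below are made up for the example and are not results from the paper.

```python
import numpy as np

def fit_and_forecast(ns, q_values, n_target):
    """Fit log(-log Q_p(n)) = (1/a) * log n - b/a by least squares on
    quantile estimates at small n, then extrapolate to n_target."""
    x = np.log(ns)                          # log n
    y = np.log(-np.log(q_values))           # transform of Q_p(n)
    slope, intercept = np.polyfit(x, y, 1)  # slope = 1/a, intercept = -b/a
    y_target = slope * np.log(n_target) + intercept
    return np.exp(-np.exp(y_target))        # invert the transform

# Illustrative quantile estimates at n = 10^3 ... 10^5 queries,
# forecast at n = 10^8 (hypothetical numbers).
ns = np.array([1e3, 1e4, 1e5])
q_values = np.array([1e-4, 3e-4, 8e-4])
print(fit_and_forecast(ns, q_values, 1e8))
```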

The method is not perfect, as the authors acknowledge; for instance, it assumes that the evaluation and deployment distributions of prompts are identical, and therefore does not account for distribution shift or adversarial adaptation. Still, it is a great first step.