publications
My academic publications
2024
- [arXiv] Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers. Lorenzo Pacchiardi, Marko Tesic, Lucy G. Cheke, and José Hernández-Orallo. arXiv preprint arXiv:2410.11672, 2024.
The integrity of AI benchmarks is fundamental to accurately assessing the capabilities of AI systems. The internal validity of these benchmarks - i.e., making sure they are free from confounding factors - is crucial for ensuring that they are measuring what they are designed to measure. In this paper, we explore a key issue related to internal validity: the possibility that AI systems can solve benchmarks in unintended ways, bypassing the capability being tested. This phenomenon, widely known in human and animal experiments, is often referred to as the ’Clever Hans’ effect, where tasks are solved using spurious cues, often involving much simpler processes than those putatively assessed. Previous research suggests that language models can exhibit this behaviour as well. In several older Natural Language Processing (NLP) benchmarks, individual n-grams like "not" have been found to be highly predictive of the correct labels, and supervised NLP models have been shown to exploit these patterns. In this work, we investigate the extent to which simple n-grams extracted from benchmark instances can be combined to predict labels in modern multiple-choice benchmarks designed for LLMs, and whether LLMs might be using such n-gram patterns to solve these benchmarks. We show how simple classifiers trained on these n-grams can achieve high scores on several benchmarks, despite lacking the capabilities being tested. Additionally, we provide evidence that modern LLMs might be using these superficial patterns to solve benchmarks. This suggests that the internal validity of these benchmarks may be compromised and caution should be exercised when interpreting LLM performance results on them.
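To make the n-gram idea in the abstract above concrete, here is a minimal sketch of how one might extract unigrams and bigrams from benchmark instances and tabulate which labels co-occur with each n-gram. It is not the paper's pipeline: the function names `extract_ngrams` and `label_counts`, the whitespace tokenisation, and the four toy instances are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def extract_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n (whitespace tokenisation is an assumption)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def label_counts(instances, max_n=2):
    """For each n-gram, count the labels of the instances that contain it."""
    counts = defaultdict(Counter)
    for text, label in instances:
        # set() so that an n-gram repeated within one instance is counted once
        for gram in set(extract_ngrams(text, max_n)):
            counts[gram][label] += 1
    return counts


# Synthetic toy instances, invented purely for illustration.
toy_instances = [
    ("the cat is not here", "B"),
    ("it is not true", "B"),
    ("the cat is here", "A"),
    ("it is true", "A"),
]

counts = label_counts(toy_instances)
print(counts["not"])  # only label "B" co-occurs with "not" in the toy data
print(counts["is"])   # "is" appears equally often with both labels
```

In this toy data, "not" appears only with label "B", which is the kind of spurious cue a simple classifier could exploit without any of the capability a benchmark is meant to test.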
- [arXiv] 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances. Lorenzo Pacchiardi, Lucy G. Cheke, and José Hernández-Orallo. arXiv preprint arXiv:2409.03563, 2024.
Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. One possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the LLM's performance on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models up to the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. Out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
- [ICLR 2024] How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. Lorenzo Pacchiardi, Alex J Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y Pan, Yarin Gal, Owain Evans, and Jan Brauner. The Twelfth International Conference on Learning Representations, 2024.
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting – prompting GPT-3.5 to lie about factual questions – the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
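The detector described in the abstract above feeds yes/no answers into a logistic regression classifier. The following is a rough sketch of that final step only, under stated assumptions: answers are encoded as 1 (yes) and 0 (no), the three follow-up questions and the six training rows are synthetic, and the classifier is a small hand-written gradient-descent logistic regression rather than whatever implementation the authors used. The names `fit_logistic` and `lie_probability` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def lie_probability(w, b, answers):
    """Predicted probability that the preceding statement was a lie."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, answers)))


# Synthetic data: each row holds the yes(1)/no(0) answers to three
# hypothetical follow-up questions; label 1 = the model had just lied.
X = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
print(lie_probability(w, b, [1, 0, 1]))  # above 0.5 on this toy data
print(lie_probability(w, b, [0, 1, 0]))  # below 0.5 on this toy data
```

In the toy rows the first answer alone separates the two classes, so the example only shows the mechanics of the classifier; it says nothing about which questions are informative for real LLMs.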
- [JMLR] Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization. Lorenzo Pacchiardi, Rilwan Adewoyin, Peter Dueben, and Ritabrata Dutta. Journal of Machine Learning Research, 2024.
Probabilistic forecasting relies on past observations to provide a probability distribution for a future outcome, which is often evaluated against the realization using a scoring rule. Here, we perform probabilistic forecasting with generative neural networks, which parametrize distributions on high-dimensional spaces by transforming draws from a latent variable. Generative networks are typically trained in an adversarial framework. In contrast, we propose to train generative networks to minimize a predictive-sequential (or prequential) scoring rule on a recorded temporal sequence of the phenomenon of interest, which is appealing as it corresponds to the way forecasting systems are routinely evaluated. Adversarial-free minimization is possible for some scoring rules; hence, our framework avoids the cumbersome hyperparameter tuning and uncertainty underestimation due to unstable adversarial training, thus unlocking reliable use of generative networks in probabilistic forecasting. Further, we prove consistency of the minimizer of our objective with dependent data, while adversarial training assumes independence. We perform simulation studies on two chaotic dynamical models and a benchmark data set of global weather observations; for this last example, we define scoring rules for spatial data by drawing from the relevant literature. Our method outperforms state-of-the-art adversarial approaches, especially in probabilistic calibration, while requiring less hyperparameter tuning.
- [EJS] Generalized Bayesian likelihood-free inference. Lorenzo Pacchiardi, Sherman Khoo, and Ritabrata Dutta. Electronic Journal of Statistics, 2024.
Generalized Bayesian inference replaces the likelihood in the Bayesian posterior with the exponential of a loss function connecting parameter values and observations. As a loss function, it is possible to use Scoring Rules (SRs), which evaluate the match between the observation and the probabilistic model for given parameter values. In this work, we leverage this Scoring Rule posterior for Bayesian Likelihood-Free Inference (LFI). In LFI, we can sample from the model but not evaluate the likelihood; hence, we use the Energy and Kernel SRs in the SR posterior, as they admit unbiased empirical estimates. While traditional Pseudo-Marginal (PM) Markov Chain Monte Carlo (MCMC) can be applied to the SR posterior, it mixes poorly for concentrated targets, such as those obtained with many observations. As such, we propose to use Stochastic Gradient (SG) MCMC, which improves performance over PM-MCMC and scales to higher-dimensional setups as it is rejection-free. SG-MCMC requires differentiating the simulator model; we achieve this effortlessly by implementing the simulator models using automatic differentiation libraries. We compare SG-MCMC sampling for the SR posterior with related LFI approaches and find that the former scales to larger sample sizes and works well on the raw data, while other methods require determining suitable summary statistics. On a chaotic dynamical system from meteorology, our method even allows inferring the parameters of a neural network used to parametrize a part of the update equations.
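The abstract above notes that the Energy score admits an unbiased empirical estimate from model simulations. A minimal sketch of such an estimate follows, assuming the convention S(P, y) = 2 E‖X − y‖^β − E‖X − X′‖^β with X, X′ independent draws from P and β in (0, 2); other texts scale this by a constant factor. The function names are hypothetical and the code is not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def energy_score(samples, y, beta=1.0):
    """Unbiased estimate of the energy score from m >= 2 model simulations.

    samples: list of simulated outcomes (each a sequence of floats)
    y:       the observation (a sequence of floats)
    Lower values indicate a better match between simulations and observation.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two simulations")
    # Average distance between simulations and the observation.
    fit = sum(euclidean(x, y) ** beta for x in samples) / m
    # Average distance between distinct pairs of simulations (j != k),
    # which is what makes the estimate unbiased.
    spread = sum(
        euclidean(samples[j], samples[k]) ** beta
        for j in range(m)
        for k in range(m)
        if j != k
    ) / (m * (m - 1))
    return 2.0 * fit - spread


print(energy_score([[0.0], [2.0]], [1.0]))  # 0.0
print(energy_score([[0.0], [0.0]], [3.0]))  # 6.0
```

In a scoring-rule posterior this quantity would play the role of the loss connecting parameter values and observations, with the simulations drawn from the model at a given parameter value.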
2022
- [arXiv] Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization. Lorenzo Pacchiardi and Ritabrata Dutta. arXiv preprint arXiv:2205.15784, 2022.
Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] approximated instead the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follow the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performance with shorter training time than the adversarial framework.
- [JMLR] Score Matched Neural Exponential Families for Likelihood-Free Inference. Lorenzo Pacchiardi and Ritabrata Dutta. Journal of Machine Learning Research, 2022.
Bayesian Likelihood-Free Inference (LFI) approaches make it possible to obtain posterior distributions for stochastic models with intractable likelihood, by relying on model simulations. In Approximate Bayesian Computation (ABC), a popular LFI method, summary statistics are used to reduce data dimensionality. ABC algorithms adaptively tailor simulations to the observation in order to sample from an approximate posterior, whose form depends on the chosen statistics. In this work, we introduce a new way to learn ABC statistics: we first generate parameter-simulation pairs from the model independently of the observation; then, we use Score Matching to train a neural conditional exponential family to approximate the likelihood. The exponential family is the largest class of distributions with fixed-size sufficient statistics; thus, we use these statistics in ABC, which is intuitively appealing and has state-of-the-art performance. In parallel, we insert our likelihood approximation in an MCMC for doubly intractable distributions to draw posterior samples. We can repeat this for any number of observations with no additional model simulations, with performance comparable to related approaches. We validate our methods on toy models with known likelihood and a large-dimensional time-series model.
2021
- [PLOS Comp. Biol.] Using Mobility Data in the Design of Optimal Lockdown Strategies for the COVID-19 Pandemic. Ritabrata Dutta, Susana Gomes, Dante Kalise, and Lorenzo Pacchiardi. PLOS Computational Biology, 2021.
A mathematical model for the COVID-19 pandemic spread, which integrates age-structured Susceptible-Exposed-Infected-Recovered-Deceased dynamics with real mobile phone data accounting for the population mobility, is presented. The dynamical model adjustment is performed via Approximate Bayesian Computation. Optimal lockdown and exit strategies are determined based on nonlinear model predictive control, constrained to public-health and socio-economic factors. Through an extensive computational validation of the methodology, it is shown that it is possible to compute robust exit strategies with realistic reduced mobility values to inform public policy making, and we exemplify the applicability of the methodology using datasets from England and France.
- [JSS] ABCpy: A High-Performance Computing Perspective to Approximate Bayesian Computation. Ritabrata Dutta, Marcel Schoengens, Lorenzo Pacchiardi, Avinash Ummadisingu, Nicole Widmer, Pierre Künzli, Jukka-Pekka Onnela, and Antonietta Mira. Journal of Statistical Software, 2021.
ABCpy is a highly modular scientific library for approximate Bayesian computation (ABC) written in Python. The main contribution of this paper is to document a software engineering effort that enables domain scientists to easily apply ABC to their research without being ABC experts; using ABCpy they can easily run large parallel simulations without much knowledge about parallelization. Further, ABCpy enables ABC experts to easily develop new inference schemes and evaluate them in a standardized environment and to extend the library with new algorithms. These benefits come mainly from the modularity of ABCpy. We give an overview of the design of ABCpy and provide a performance evaluation concentrating on parallelization. This points us towards the inherent imbalance in some of the ABC algorithms. We develop a dynamic scheduling MPI implementation to mitigate this issue and evaluate the various ABC algorithms according to their adaptability towards high-performance computing.
2020
- [Sankhya B] Distance-Learning for Approximate Bayesian Computation to Model a Volcanic Eruption. Lorenzo Pacchiardi, Pierre Künzli, Marcel Schoengens, Bastien Chopard, and Ritabrata Dutta. Sankhya B, 2020.
Approximate Bayesian computation (ABC) provides us with a way to infer, from an observation, the parameters of models for which the likelihood function is not available. Using ABC, which depends on many simulations from the considered model, we develop an inferential framework to learn parameters of a stochastic numerical simulator of volcanic eruption. Moreover, the model itself is parallelized using Message Passing Interface (MPI). Thus, we develop a nested-parallelized MPI communicator to handle the expensive numerical model with ABC algorithms. ABC usually relies on summary statistics of the data in order to measure the discrepancy between model output and observation. However, informative summary statistics cannot be found for the considered model. We therefore develop a technique to learn a distance between model outputs based on deep metric-learning. We use this framework to learn the plume characteristics (e.g., initial plume velocity) of the volcanic eruption from the tephra deposits collected by fieldwork associated with the 2450 BP Pululagua (Ecuador) volcanic eruption.
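For readers unfamiliar with ABC, the basic summary-statistic version mentioned in the abstract above can be sketched as plain rejection sampling. This is a generic textbook illustration, not the method of the paper (which learns a distance instead of using summary statistics) and it does not use ABCpy. The Gaussian toy simulator, the uniform prior, the tolerance `eps`, and the observed summary value are all assumptions chosen for the example.

```python
import random
import statistics


def simulate(theta, rng, n=20):
    """Toy simulator (an assumption): n Gaussian draws with mean theta, sd 1."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def rejection_abc(observed_summary, n_draws=5000, eps=0.2, seed=0):
    """Keep prior draws whose simulated summary lies within eps of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)                    # draw from the prior
        summary = statistics.fmean(simulate(theta, rng))  # summary statistic: sample mean
        if abs(summary - observed_summary) < eps:         # discrepancy check
            accepted.append(theta)
    return accepted


# Hypothetical observed summary statistic.
posterior_draws = rejection_abc(observed_summary=1.0)
print(len(posterior_draws), statistics.fmean(posterior_draws))
```

The accepted values of `theta` approximate the posterior; shrinking `eps` tightens the approximation at the cost of more rejected simulations, which is one reason expensive simulators call for parallel implementations.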