Lorenzo Pacchiardi

Research Associate, University of Cambridge
I am a Research Associate at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge. I lead a research project (funded by Open Philanthropy) on developing a benchmark for measuring the ability of LLMs to perform data science tasks. I am more broadly interested in AI evaluation, particularly in predictability and cognitive evaluation, and I closely collaborate with Prof José Hernández-Orallo and Prof Lucy Cheke. I contribute to the AI evaluation newsletter and advise RAND on AI evaluation.
I am deeply familiar with EU AI policy (having been involved in several initiatives), and am one of the co-founders of the Italian AI policy think tank CePTE. I also collaborate with The Unjournal to make impactful research more rigorous, and I co-founded AcademicJobsItaly.com to make the Italian academic job market more accessible.
I previously worked on detecting lying in large language models with Dr Owain Evans (through the MATS programme) and on technical standards for AI under the EU AI Act at the Future of Life Institute.
I obtained a PhD in Statistics and Machine Learning at Oxford, during which I worked on Bayesian simulation-based inference, generative models and probabilistic forecasting (with applications to meteorology). My supervisors were Prof Ritabrata Dutta (University of Warwick) and Prof Geoff Nicholls (University of Oxford).
Before my PhD studies, I obtained a Bachelor’s degree in Physical Engineering from Politecnico di Torino (Italy) and an MSc in Physics of Complex Systems from Politecnico di Torino and Université Paris-Sud (France). I did my MSc thesis at LightOn, a machine learning startup in Paris.
news
May 16, 2025 | Our survey on AI evaluation was accepted at the IJCAI 2025 survey track, and our PredictaBoard benchmark was accepted at ACL 2025 Findings. |
Mar 11, 2025 | Our new preprint shows how to extract the most predictive and explanatory power from AI benchmarks by automatically annotating the demands posed by each question. Check it out! |
Feb 21, 2025 | Two new arXiv preprints: one surveying AI evaluation and identifying six main paradigms, the other introducing a benchmark for jointly evaluating the performance of LLMs and its predictability on individual instances. |
Oct 15, 2024 | We have two new preprints on arXiv! One on predicting the performance of LLMs on individual instances, the other on predicting the answers of LLM benchmarks from simple features. |
Oct 01, 2024 | I have obtained a grant from Open Philanthropy to build a benchmark for measuring the ability of LLMs to perform data science tasks! 🤓 📊 |