A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others.
Journal:
Scientific Reports
Published Date:
Jul 1, 2025
Abstract
Large language models (LLMs) hold enormous potential to assist humans in decision-making processes, from everyday to high-stakes scenarios. However, as many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they can capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. This benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, together with a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark. We find that none of these chatbots meets the benchmark. In particular, only GPT-4 and GPT-4o (but not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT.