Prompt architecture induces methodological artifacts in large language models.

Journal: PLOS ONE
PMID:

Abstract

We examine how the seemingly arbitrary way a prompt is posed, which we term "prompt architecture," influences responses provided by large language models (LLMs). Five large-scale, full-factorial experiments performing standard (zero-shot) similarity evaluation tasks using GPT-3, GPT-4, and Llama 3.1 document how several features of prompt architecture (order, label, framing, and justification) interact to produce methodological artifacts, a form of statistical bias. We find robust evidence that these four elements unduly affect responses across all models, and although we observe differences between GPT-3 and GPT-4, the changes are not necessarily for the better. Specifically, LLMs demonstrate both response-order bias and label bias, and framing and justification moderate these biases. We then test different strategies intended to reduce methodological artifacts. Specifying to the LLM that the order and labels of items have been randomized does not alleviate either response-order or label bias, and the use of uncommon labels reduces (but does not eliminate) label bias but exacerbates response-order bias in GPT-4 (and does not reduce either bias in Llama 3.1). By contrast, aggregating across prompts generated using a full factorial design eliminates response-order and label bias. Overall, these findings highlight the inherent fallibility of any individual prompt when using LLMs, as any prompt contains characteristics that may subtly interact with a multitude of hidden associations embedded in rich language data.
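The passage above describes a mitigation strategy: generate prompts under a full factorial design over prompt-architecture features (order, label, framing) and aggregate the responses. The following is a minimal sketch of that idea, not the authors' code; the two-item similarity-rating setup, the prompt wording and 1-10 scale, and the placeholder query_llm function are illustrative assumptions that would need to be replaced with a real model call and the study's actual materials.

    # Sketch of full-factorial prompt generation and aggregation.
    # All names and prompt wording are illustrative placeholders.

    from itertools import permutations, product
    from statistics import mean

    def build_prompt(items, order, labels, framing):
        """Compose one prompt variant from a specific feature combination."""
        a, b = (items[i] for i in order)   # response order
        la, lb = labels                    # label assignment
        verb = "similar" if framing == "similarity" else "different"
        return (
            f"On a scale from 1 to 10, how {verb} are the following two items?\n"
            f"{la}: {a}\n{lb}: {b}\n"
            "Answer with a single number."
        )

    def query_llm(prompt):
        """Placeholder for a real LLM API call returning a numeric rating."""
        raise NotImplementedError

    def factorial_similarity(items, labels=("A", "B"),
                             framings=("similarity", "difference")):
        """Average ratings over all order x label x framing combinations."""
        ratings = []
        for order, lab, framing in product(permutations(range(2)),
                                           permutations(labels), framings):
            score = float(query_llm(build_prompt(items, order, lab, framing)))
            if framing == "difference":    # put both framings on a common 1-10 scale (assumed reversal)
                score = 11 - score
            ratings.append(score)
        return mean(ratings)

Because every level of each feature appears equally often across the generated prompts, averaging over the full factorial cancels out response-order and label effects rather than leaving them confounded with the quantity of interest.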

Authors

  • Melanie Brucks
    Columbia Business School, New York, NY, United States of America.
  • Olivier Toubia
    Columbia Business School, New York, NY, United States of America.