Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Journal: arXiv
Published Date: Feb 22, 2025
Abstract
Current medical language model (LM) benchmarks often oversimplify the
complexities of day-to-day clinical practice and instead evaluate LMs on
multiple-choice board exam questions. Thus, we present an
expert-created and annotated dataset spanning five critical domains of
decision-making in mental healthcare: treatment, diagnosis, documentation,
monitoring, and triage. This dataset, created without any LM assistance, is
designed to capture the nuanced clinical reasoning and daily ambiguities mental
health practitioners encounter, reflecting the inherent complexities of care
delivery that are missing from existing datasets. The dataset comprises 203
base questions with five answer options each. In almost all of them, the
decision-irrelevant patient demographic information has been replaced with
variables (e.g., AGE), and the questions are available in male-, female-, and
non-binary-coded versions. For question
categories dealing with ambiguity and multiple valid answer options, we create
a preference dataset with uncertainties from the expert annotations. We outline
a series of intended use cases and demonstrate the usability of our dataset by
evaluating eleven off-the-shelf and four mental health fine-tuned LMs on
category-specific task accuracy, on the impact of patient demographic
information on decision-making, and on how consistently free-form responses
deviate from human-annotated samples.
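
The variable-based question format described above can be illustrated with a minimal sketch. Only the AGE variable is named in the abstract; the PRONOUN_POSS placeholder, the pronoun table, the example question text, and the fill_template helper are all hypothetical and are not taken from the dataset itself.

```python
import re

# Hypothetical possessive pronouns per patient coding. The abstract only
# states that male, female, and non-binary-coded versions exist.
POSSESSIVE = {
    "male": "his",
    "female": "her",
    "non-binary": "their",
}


def fill_template(template: str, age: int, coding: str) -> str:
    """Substitute demographic placeholders in a question template.

    AGE comes from the abstract; PRONOUN_POSS is an assumed placeholder name.
    """
    values = {"AGE": str(age), "PRONOUN_POSS": POSSESSIVE[coding]}
    # Match only the known placeholder names as whole words.
    pattern = r"\b(" + "|".join(re.escape(k) for k in values) + r")\b"
    return re.sub(pattern, lambda m: values[m.group(1)], template)


# Illustrative template, not an actual dataset question.
template = "A AGE-year-old patient reports that PRONOUN_POSS sleep has worsened."
variants = {c: fill_template(template, 34, c) for c in POSSESSIVE}
```

Generating every coding from one template keeps the clinical content fixed, so any difference in a model's answers across variants can be attributed to the demographic wording alone.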