MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Journal: arXiv
Published Date:

Abstract

We introduce MedAgentGym, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGym, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

Authors

  • Ran Xu
  • Yuchen Zhuang
  • Yishan Zhong
  • Yue Yu
  • Xiangru Tang
  • Hang Wu
  • May D. Wang
  • Peifeng Ruan
  • Donghan Yang
  • Tao Wang
  • Guanghua Xiao
  • Carl Yang
  • Yang Xie
  • Wenqi Shi