Large Language Models' Performances regarding logical observation identifiers names and codes mapping in laboratory medicine: A comparative analysis of ChatGPT-4.0, Gemini, and Perplexity.
Journal:
International journal of medical informatics
Published Date:
Jan 6, 2026
Abstract
OBJECTIVES: This study aimed to assess the feasibility and practical utility of using large language models (LLMs) for Logical Observation Identifiers Names and Codes (LOINC) mapping to standardise healthcare data in laboratory medicine. We evaluated the accuracy and applicability of three LLMs, ChatGPT-4.0 (OpenAI), Gemini 1.5 (Google DeepMind), and Perplexity AI (Perplexity.ai), in mapping laboratory test items, a task that typically requires considerable institutional-level standardisation effort.
METHODS: A total of 75 representative laboratory test items commonly used in clinical practice were selected: 55 clinical chemistry tests and 20 hematology tests. Six board-certified clinical pathologists independently mapped each test item to its appropriate LOINC code. The experts then established a consensus mapping, which served as the gold standard. Each LLM's output was compared with this consensus and categorised as complete match (CM), partial match (PM), or mismatch (MM) according to its agreement with the reference.
RESULTS: Overall paired ordinal analyses demonstrated a significant difference in LOINC code-mapping performance among the three models. Gemini performed significantly worse than both ChatGPT-4.0 and Perplexity AI, and there was no significant difference between ChatGPT-4.0 and Perplexity AI. ChatGPT-4.0 achieved the highest CM rate in clinical chemistry (58.2%), whereas Perplexity AI performed best in hematology (55.0%). Gemini showed the highest MM rates, particularly in hematology (80.0%). Partial matches were largely attributable to method-related discrepancies rather than fully incorrect mappings.
CONCLUSION: Structured inputs, localisation to domestic laboratory practices, and expert oversight are critical to improving the reliability of LLM-generated LOINC mappings. While LLMs can reduce workload by generating candidate mappings, human validation remains essential to ensure clinical accuracy. Future improvements should focus on algorithmic refinement, error feedback integration, and adaptation to diverse laboratory settings to enhance accuracy and generalisability in real-world use.
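The CM/PM/MM categorisation described in the methods can be illustrated with a minimal Python sketch. The abstract does not state the exact criteria for a partial match, so the rule used here is an assumption: a prediction counts as PM when it agrees with the reference on every LOINC axis except the method, which is in line with the reported observation that partial matches were mostly method-related. The `LoincParts` class, the `classify` and `match_rates` helpers, and all codes and records below are hypothetical placeholders, not data or tooling from the study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LoincParts:
    """One LOINC term split into its six axes (hypothetical container)."""
    code: str        # placeholder identifier, not a real LOINC code
    component: str
    prop: str
    time: str
    system: str
    scale: str
    method: str = ""


def classify(pred: LoincParts, ref: LoincParts) -> str:
    """Return 'CM', 'PM', or 'MM' for one prediction against the reference.

    Assumed rule (not taken from the paper): identical code is a complete
    match; agreement on all axes except method is a partial match.
    """
    if pred.code == ref.code:
        return "CM"
    pred_core = (pred.component, pred.prop, pred.time, pred.system, pred.scale)
    ref_core = (ref.component, ref.prop, ref.time, ref.system, ref.scale)
    return "PM" if pred_core == ref_core else "MM"


def match_rates(pairs):
    """Percentage of CM, PM, and MM over (prediction, reference) pairs."""
    counts = Counter(classify(pred, ref) for pred, ref in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"CM": 0.0, "PM": 0.0, "MM": 0.0}
    return {k: round(100 * counts[k] / total, 1) for k in ("CM", "PM", "MM")}


# Hypothetical demo records; the codes are placeholders.
ref = LoincParts("REF-1", "Glucose", "MCnc", "Pt", "Ser/Plas", "Qn")
same = LoincParts("REF-1", "Glucose", "MCnc", "Pt", "Ser/Plas", "Qn")
method_only = LoincParts("ALT-1", "Glucose", "MCnc", "Pt", "Ser/Plas", "Qn", "Hexokinase")
wrong_system = LoincParts("ALT-2", "Glucose", "MCnc", "Pt", "Urine", "Qn")

demo_rates = match_rates(
    [(same, ref), (same, ref), (method_only, ref), (wrong_system, ref)]
)
```

Applied per model and per test category, a function like `match_rates` would yield figures of the kind reported above. As a check on scale, a CM rate of 58.2% over 55 items is consistent with 32 complete matches (32/55 ≈ 58.2%), although the abstract itself does not report counts.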