Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models
Journal: arXiv
Published Date: Jul 4, 2025
Abstract
Social determinants of health (SDoH) significantly influence health outcomes,
shaping disease progression, treatment adherence, and health disparities.
However, their documentation in structured electronic health records (EHRs) is
often incomplete or missing. This study presents an approach based on large
language models (LLMs) for extracting 13 SDoH categories from French clinical
notes. We trained Flan-T5-Large on annotated social history sections from
clinical notes at Nantes University Hospital, France. We evaluated the model at
two levels: (i) identification of SDoH categories and associated values, and
(ii) extraction of detailed SDoH with associated temporal and quantitative
information. Model performance was assessed across four datasets, including
two that we publicly release as open resources. The model achieved strong
performance for identifying well-documented categories such as living
condition, marital status, descendants, job, tobacco, and alcohol use (F1 score
> 0.80). Performance was lower for categories with limited training data or
highly variable expressions, such as employment status, housing, physical
activity, income, and education. Our model identified 95.8% of patients with at
least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our
error analysis showed that performance limitations were linked to annotation
inconsistencies, reliance on an English-centric tokenizer, and reduced
generalizability due to the model being trained on social history sections
only. These results demonstrate the effectiveness of NLP in improving the
completeness of real-world SDoH data in a non-English EHR system.
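
The two kinds of figures reported above, per-category F1 and the share of patients with at least one identified SDoH, can be illustrated with a minimal sketch. It assumes each note's annotations are represented as a set of (category, value) pairs and that a prediction counts as correct only on an exact match. The abstract does not state the study's actual matching criteria, so this is an illustration rather than the authors' evaluation code. The function names `per_category_f1` and `coverage` and the toy data are hypothetical.

```python
from collections import defaultdict


def per_category_f1(gold, pred):
    """Compute F1 per SDoH category.

    gold, pred: lists with one entry per note, each a set of
    (category, value) pairs. Exact-match scoring is an assumption.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for cat, _ in g & p:   # predicted and annotated
            tp[cat] += 1
        for cat, _ in p - g:   # predicted but not annotated
            fp[cat] += 1
        for cat, _ in g - p:   # annotated but missed
            fn[cat] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores


def coverage(pred):
    """Fraction of notes (or patients) with at least one extracted SDoH."""
    return sum(1 for p in pred if p) / len(pred)


# Toy data for illustration only; not taken from the study.
gold = [
    {("tobacco", "current")},
    {("tobacco", "former"), ("alcohol", "yes")},
    set(),
]
pred = [
    {("tobacco", "current")},
    {("tobacco", "current"), ("alcohol", "yes")},
    set(),
]

scores = per_category_f1(gold, pred)
share = coverage(pred)
```

In this toy example the wrong tobacco value in the second note counts as both a false positive and a false negative, which is how exact-match scoring penalizes a correct category paired with an incorrect value.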