Electronic health records (EHRs) are systematic, longitudinal, digital records of a person’s health information, created during routine care. According to the European Commission for Public Health, these patient repositories are created and processed for the purpose of providing healthcare.1 Stored in EHR systems, the records consist of heterogeneous data, including multimedia data (such as imaging or waveforms), structured data (codes, numeric values), and unstructured free text. While the first two categories are relatively easy to extract and handle, unstructured text remains challenging to process for research purposes.
The goal of a recent study by van der Loo et al. was to address this challenge by evaluating the feasibility and performance of large language models (LLMs) in extracting relevant structured data from clinical reports.2 Using echocardiographic (EC) and invasive coronary angiography (ICA) reports from 1000 patients, the authors investigated model performance in extracting six specific features from the ICA reports (including occlusion, absence of CAD, graft presence, treatment strategy, and identification of culprit vessels) and three features from the EC reports (LV function, and the type and grade of each valve dysfunction). For this purpose, they evaluated a commercially available cloud-based LLM (GPT-4o via Azure OpenAI) and several open-source LLMs spanning general-purpose, medical domain-specific, and multilingual models. These models were optimized by prompt engineering or by on-site fine-tuning, the latter requiring a high-performance computing (HPC) cluster.
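To make the prompt-engineering approach concrete: the study’s actual prompts, label schema, and deployment details are not reproduced in this commentary, but a minimal sketch of what such an extraction step might look like, assuming the standard Azure OpenAI Python client and illustrative label names, is shown below.

```python
# Minimal sketch of prompt-engineered label extraction from a clinical report.
# Deployment name, label names, and prompt wording are illustrative assumptions,
# not the study's actual pipeline.
import json
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    api_key="...",  # credentials and endpoint omitted
    api_version="2024-06-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

SYSTEM_PROMPT = (
    "You extract structured labels from invasive coronary angiography reports. "
    "Return JSON with keys: occlusion (yes/no), no_cad (yes/no), graft (yes/no), "
    "treatment_strategy (conservative/PCI/CABG), culprit_vessel (string or null). "
    "Answer from the report text only; do not guess."
)

def extract_labels(report_text: str) -> dict:
    """Send one report to the model and parse its JSON answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
        response_format={"type": "json_object"},  # constrain output to JSON
        temperature=0,  # deterministic extraction
    )
    return json.loads(response.choices[0].message.content)
```

The extracted JSON can then be compared field by field against physician annotations, which is essentially how per-label performance is quantified in studies of this kind.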
Compared with the gold standard of physician annotation, the prompt-engineered LLMs performed reasonably well for all labels except the more complex identification of the culprit lesion. For this task, models fine-tuned on an HPC cluster performed significantly better than prompt-engineered models and achieved performance comparable to physician-derived extraction.
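The study’s fine-tuning setup is likewise not detailed here. As one plausible instantiation, a supervised fine-tune of an open-source Dutch encoder for a single label could be sketched as follows; the model choice, data format, and hyperparameters are assumptions, and the authors may instead have fine-tuned generative LLMs.

```python
# Minimal sketch of on-site supervised fine-tuning for one label
# (e.g., culprit vessel class). All names and settings are illustrative.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "GroNLP/bert-base-dutch-cased"  # any suitable open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# Physician-annotated reports: free text plus an integer class per report.
train = Dataset.from_dict({
    "text": ["<ICA report 1>", "<ICA report 2>"],
    "label": [0, 2],
}).map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()  # in practice run on an HPC cluster with GPUs
```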
This study nicely illustrates both the potential and the complexity of LLM-based data extraction from EHRs. While the use of a commercially available LLM on semi-structured reports created by board-certified cardiologists yielded promising results within this single-centre setting, the modest performance of the selected open-source models highlights a significant technical hurdle. Despite the authors’ optimistic conclusion regarding the reliability of LLMs in classifying structured cardiology reports, the following critical questions remain central to future implementation:
- Is the future “structured by AI” or “structured at source”? While LLMs offer a bridge for legacy data, we must consider whether the high computational costs and tuning requirements of AI justify bypassing the implementation of universal, standardized reporting guidelines across European centres.
- Even though acceptable performance was reported for reports from the cardiology department of a single hospital in the Netherlands, transferability and implementation across the linguistic and technical diversity of European healthcare remain uncertain.
- Finally, should efforts focus on developing a multi-institutional foundation model capable of handling diverse clinical dialects, or on establishing a standardized validation pipeline for individual models?
In conclusion, future research should now focus on multicentre validation, interoperability across vendors and languages, and standardized evaluation frameworks to ensure that these promising technologies can translate into safe, equitable, and clinically meaningful applications in cardiovascular care.
As the field moves beyond proof-of-concept studies, fostering a collaborative ecosystem where clinical expertise, robust data governance, and advanced AI can work in synergy will be essential to unlock the full potential of real-world cardiovascular data.