Another wave of enthusiasm hit the media in recent weeks, proclaiming endless opportunities for artificial intelligence (AI) in the service of medicine. This time the energetic and simple message was that a "new AI neural network approach detected heart failure from a single heartbeat with 100% diagnostic accuracy", and it was shared and reshared across social media and beyond. Lay readers fueled the rolling snowball with comments and views suggesting that the Goliath of medicine had been successfully defeated by the magic capabilities of AI. But are we really done with heart failure?
The paper that gave grounds for these discussions was published online in Biomedical Signal Processing and Control by Mihaela Porumb et al (1). The authors have implemented, in a very elegant manner, a new approach to analysis of the electrocardiogram (ECG) using hierarchical neural networks that mimic the human visual system, called convolutional neural networks (CNNs or ConvNets) (2). This method, a class of deep neural networks, allows for image recognition and classification and is widely used for object or face recognition. In the work by Porumb et al., the face patterns were replaced by ECG traces, and the system was trained to classify beats into either a "normal" or a "chronic heart failure (CHF)" category. In total, 490,505 beats were used to train (50%), validate (25%) and test (25%) the model. The outcomes showed a low rate of misclassification, reaching 1% for false positive and 3% for false negative results. The diagnostic accuracy was impressively high at 97.8±0.2%, with an area under the curve of 0.98±0.01. Interestingly, nearly 72% of all misclassified CHF heartbeats belonged to the same subject. Furthermore, when the single-beat analysis was replaced by analysis of a 5-minute ECG segment, the accuracy increased further to 99% and the misclassification dropped to 0.1%. Just great!
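For readers less familiar with the technique, the general shape of such a model can be illustrated with a short sketch. This is a minimal, hypothetical 1D convolutional classifier written in Keras, not the architecture published by Porumb et al.; the beat length, layer sizes and training settings are assumptions chosen only to show how a single segmented heartbeat could be mapped to a "normal" versus "CHF" label.

```python
# Minimal sketch of a 1D CNN beat classifier (hypothetical architecture,
# not the exact model described by Porumb et al.).
import numpy as np
from tensorflow.keras import layers, models

BEAT_LEN = 256  # assumed number of samples per segmented heartbeat

model = models.Sequential([
    layers.Input(shape=(BEAT_LEN, 1)),        # one ECG beat, single channel
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # P(beat comes from a CHF subject)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Illustrative training call on placeholder arrays of segmented beats;
# labels are 0 for "normal" and 1 for "CHF".
beats = np.random.randn(1000, BEAT_LEN, 1).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))
model.fit(beats, labels, validation_split=0.25, epochs=5, batch_size=128)
```

The convolutional layers learn local waveform features directly from the raw trace, so no hand-crafted ECG measurements are needed; this is the core appeal of the approach.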
Without understating the impressive achievement of the researchers behind this work, there are some limitations that restrict the applicability of this research to clinical practice. Firstly, the presented model cannot be extrapolated to the entire heart failure (HF) population in its present form. If we look at the type of ECG signals used to train the CNN model, we find that the authors used only severe and very severe cases of CHF (NYHA III/IV), in whom the ECG is very abnormal (Figure 1). In clinical practice, physicians struggle mostly with identifying mild and moderate forms of CHF at an early stage of the disease, when delay in diagnosis and treatment initiation results in devastating progression. Therefore, what is needed is to extend AI analysis to ECGs from less severe CHF cases. This has not been done so far. The second concern, which emerges after reading the study methods, is the very small sample size of only 15 CHF patients. The multifactorial etiology of CHF makes it impossible to cover all forms of CHF with just a handful of subjects, which means that the model was trained on limited representations of CHF ECGs and likely missed rarer presentations. Last but not least, a weak clinical design, including lack of randomization, no patient-level analysis, lack of a prospective design, and no details on patient demography, concomitant diseases, biomarkers or etiology, makes it a no-go for influencing clinical practice guidelines. Thus, this paper should be perceived as basic science research with hypothesis-generating potential rather than a clinical confirmatory study.
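The concern about patient-level analysis is worth making concrete: when data are split at the level of individual beats, beats from the same subject can end up in both the training and the test set, and within-subject correlation inflates the apparent performance. A minimal sketch using scikit-learn's GroupKFold (the data and variable names below are placeholders, not the authors' pipeline) shows how a subject-wise split keeps each patient entirely on one side of the evaluation.

```python
# Sketch of subject-wise (patient-level) evaluation on hypothetical data.
import numpy as np
from sklearn.model_selection import GroupKFold

n_beats = 1000
X = np.random.randn(n_beats, 256)            # placeholder beat features
y = np.random.randint(0, 2, size=n_beats)    # 0: normal, 1: CHF
subject_id = np.arange(n_beats) % 15         # 15 subjects, mirroring the small cohort

# GroupKFold guarantees that all beats from one subject fall entirely into
# either the training fold or the test fold, never both.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subject_id):
    assert set(subject_id[train_idx]).isdisjoint(subject_id[test_idx])
    # ...fit the beat classifier on X[train_idx], evaluate on X[test_idx]...
```

With only 15 CHF patients, such a subject-wise estimate would be noisy, but it is the estimate that reflects how the model would behave on a patient it has never seen.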
Watching the news, the enthusiasm around deep data analysis resembles the early days of genetic engineering. Then too, the general public was presented with groundbreaking outcomes in animal models that were supposedly soon to be implemented in everyday clinical practice. This has never happened; the road from bench to bedside is complex and time-consuming, and so it will be for AI. What we can do to accelerate it is to apply rigorous scientific methodology when testing AI tools in clinical practice, defining the indications and aiming for improvements in hard clinical endpoints.