Akhil Vaid, Joy Jiang, Ashwin Sawant, Stamatios Lerakis, Edgar Argulian, Yuri Ahuja, Joshua Lampert, Alexander Charney, Hayit Greenspan, Jagat Narula, Benjamin Glicksberg & Girish N Nadkarni
npj Digital Medicine volume 6, Article number: 108 (2023)
Link to article: A foundational vision transformer improves diagnostic performance for electrocardiograms | npj Digital Medicine (nature.com)
The electrocardiogram (ECG) is a ubiquitous diagnostic modality. Convolutional neural networks (CNNs) applied to ECG analysis require large sample sizes, and transfer learning approaches for biomedical problems may result in suboptimal performance when pre-training is done on natural images. We leveraged masked image modeling to create a vision-based transformer model, HeartBEiT, for electrocardiogram waveform analysis. We pre-trained this model on 8.5 million ECGs and then compared its performance against standard CNN architectures for diagnosis of hypertrophic cardiomyopathy, low left ventricular ejection fraction, and ST elevation myocardial infarction using differing training sample sizes and independent validation datasets. We find that HeartBEiT has significantly higher performance at lower sample sizes compared to other models. We also find that HeartBEiT improves explainability of diagnosis by highlighting biologically relevant regions of the ECG compared to standard CNNs. Domain-specific pre-trained transformer models may exceed the classification performance of models trained on natural images, especially in very low data regimes. The combination of the architecture and such pre-training allows for more accurate, granular explainability of model predictions.
The electrocardiogram (ECG) is a body surface-level recording of electrical activity within the heart. Owing to its low cost, non-invasiveness, and wide applicability to cardiac disease, the ECG is a ubiquitous investigation: over 100 million ECGs are performed each year in the United States alone1 across various healthcare settings. However, the ECG is limited in scope, since physicians cannot consistently identify patterns representative of disease, especially for conditions that lack established diagnostic criteria, or when such patterns are too subtle or chaotic for human interpretation.
Deep learning has been applied to ECG data for several diagnostic and prognostic use cases2,3,4,5,6. The vast majority of this work has been built upon Convolutional Neural Networks (CNNs)7. Like other neural networks, CNNs are high-variance constructs8 and require large amounts of data to prevent overfitting9. CNNs must also be purpose-built to accommodate the dimensionality of incoming data, and they have been used for interpreting ECGs both as 1D waveforms and as 2D images10.
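To make the dimensionality point concrete, the sketch below contrasts a 1D convolution over an ECG treated as a waveform with a 2D convolution over an ECG rasterized as an image. All array sizes and the random data are illustrative assumptions, not the dimensions used by any model in this work; note that, as in CNN libraries, "convolution" here is implemented without kernel flipping.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single ECG lead as a 1-D waveform: 500 samples (hypothetical length).
waveform = rng.standard_normal(500)

# The same kind of signal rasterized as a 2-D image (hypothetical 64 x 500 pixels).
image = rng.standard_normal((64, 500))

def conv1d(x, kernel):
    """Valid-mode 1-D convolution, as a 1-D CNN layer would apply it."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def conv2d(x, kernel):
    """Valid-mode 2-D convolution, as a 2-D CNN layer would apply it."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

feat_1d = conv1d(waveform, rng.standard_normal(7))    # kernel spans time only
feat_2d = conv2d(image, rng.standard_normal((3, 3)))  # kernel spans both axes
print(feat_1d.shape, feat_2d.shape)                   # (494,) (62, 498)
```

The two kernels are not interchangeable: an architecture built for one input dimensionality must be purpose-built or adapted for the other.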
In this context, interpreting ECGs as 2D images presents an advantage due to widely available pre-trained models, which often serve as starting points for modeling tasks on smaller datasets11. This technique is described as transfer learning, wherein a model trained on a larger, possibly unrelated dataset is fine-tuned on a smaller dataset relevant to the problem at hand12. Transfer learning is especially useful in healthcare, since datasets are limited in size due to limited patient cohorts, rarity of outcomes of interest, and costs associated with generating useful labels. As a result, vision models first trained in a supervised manner on natural images13 often form the basis of models used in healthcare settings. Unfortunately, transfer learning with such natural images is not a universal solution, and it is known to produce suboptimal results when there are substantial differences between the pre-training and fine-tuning datasets14.
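A minimal sketch of the transfer-learning recipe described above, using NumPy and synthetic data: a "pre-trained" feature extractor is kept frozen while only a small classification head is fitted on a limited labeled dataset. The random backbone weights, feature width, and dataset are stand-ins for illustration; in practice the backbone would come from training on a large corpus such as natural images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained backbone (hypothetical random weights;
# in practice these would come from large-scale supervised pre-training).
W_pretrained = rng.standard_normal((64, 16))

def extract_features(x):
    """Frozen backbone: its weights are NOT updated during fine-tuning."""
    return np.tanh(x @ W_pretrained)

# Small labeled dataset (synthetic): 100 samples, 64 inputs, binary outcome.
X = rng.standard_normal((100, 64))
y = (X[:, 0] > 0).astype(float)

# Fine-tune only a lightweight logistic-regression head on frozen features.
w, b = np.zeros(16), 0.0
feats = extract_features(X)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
    grad = p - y                                # gradient of logistic loss
    w -= 0.1 * feats.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Because only the 17 head parameters are trained, the procedure remains usable at sample sizes where training the full backbone would overfit, which is precisely the appeal of transfer learning in healthcare; the downside noted in the text is that a backbone pre-trained on a mismatched domain may supply poor features.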
Transformer-based neural networks utilize the attention mechanism15 to establish and define relationships between discrete units of input data known as tokens16. A significant benefit of transformers is that they can learn from large corpora of unlabeled data in an unsupervised manner, capturing relationships between tokens that can then be utilized for downstream tasks16. Because unstructured text is easily broken down into tokens, transformers have been tremendously successful at Natural Language Processing (NLP) tasks17,18. Recent work has extended the functionality of such models to vision-based tasks, leading to the advent of the vision transformer16,19.
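The attention mechanism at the core of these models can be sketched in a few lines. The following is a minimal, single-head scaled dot-product attention over a toy sequence of tokens (all dimensions and weights are illustrative assumptions); note that every token attends to every other token, which is what gives transformers their global view of the input.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each token's output is a weighted
    mix of all value vectors, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token-token affinities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 8))   # 10 tokens, 8-dim embeddings (toy)
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = attention(tokens @ Wq, tokens @ Wk, tokens @ Wv)
print(out.shape)                        # (10, 8)
```

Real transformers stack many such heads with learned projections, residual connections, and feed-forward layers, but the token-to-token weighting shown here is the mechanism the text refers to.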
The first vision transformers were pre-trained on immense labeled datasets and then fine-tuned on smaller datasets, demonstrating better performance than CNNs at natural image classification20. More recently, the Bidirectional Encoder representation from Image Transformers (BEiT) approach has allowed large unlabeled datasets to be leveraged for pre-training transformer neural networks21. This approach converts parts of an input image into discrete tokens, or patches. Such tokens may be considered analogous to the words within a sentence and used to pre-train a transformer in much the same way as a language model (Fig. 1). Since transformers consider global dependencies22 between all features of the provided inputs, such pre-training may be especially advantageous for ECGs: certain pathological patterns, such as the S1Q3T3, occur in different parts of a recording23, and a model that considers only contiguous regions may miss them entirely.
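The patch-tokenization and masking steps behind this style of pre-training can be sketched as follows. This is a simplified stand-in: BEiT proper maps each patch to a discrete visual token via a separately learned tokenizer, whereas here raw pixel patches play the role of tokens, and the image, patch size, and 40% mask ratio are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch):
    """Split an image into non-overlapping patches, one 'visual token' each."""
    h, w = image.shape
    return np.array([
        image[i:i + patch, j:j + patch].ravel()
        for i in range(0, h, patch)
        for j in range(0, w, patch)
    ])

rng = np.random.default_rng(0)
ecg_image = rng.standard_normal((32, 64))  # toy stand-in for a plotted ECG
tokens = patchify(ecg_image, 8)            # (32/8) * (64/8) = 32 patches
print(tokens.shape)                        # (32, 64)

# Masked image modeling: hide a random subset of tokens; the pre-training
# objective is to predict the hidden tokens from the visible ones, so no
# diagnostic labels are required.
mask = rng.random(len(tokens)) < 0.4
visible, hidden = tokens[~mask], tokens[mask]
print(mask.sum(), "of", len(tokens), "tokens masked")
```

Because the transformer must reconstruct a masked patch from all visible patches anywhere in the image, the pre-training signal naturally spans non-contiguous regions of the recording.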
We used electronic medical records from 5 hospitals and identified ECGs from adults with documented premature ventricular complexes (PVCs). Internal training and testing were performed at one hospital; external validation was performed at the others. The primary outcome was a first diagnosis of left ventricular ejection fraction (LVEF) ≤40% within 6 months. The dataset included 383,514 ECGs, of which 14,241 remained for analysis. We analyzed areas under the receiver operating characteristic curves and explainability plots for representative patients, and assessed algorithm prediction, PVC burden, and demographics in a multivariable Cox model to identify independent predictors of cardiomyopathy.
Among the 14,241-patient cohort (age 67.6 ± 14.8 years; female 43.8%; White 29.5%, Black 8.6%, Hispanic 6.5%, Asian 2.2%), 22.9% experienced reductions in LVEF to ≤40% within 6 months. The model predicted reductions in LVEF to ≤40% with an area under the receiver operating characteristic curve of 0.79 (95% CI: 0.77-0.81). The gradient-weighted class activation map explainability framework highlighted the sinus rhythm QRS complex-ST segment. In patients who underwent successful PVC ablation, there was a post-ablation improvement in LVEF, with resolution of cardiomyopathy in most (89%) patients.
Deep learning on the 12-lead ECG alone can accurately predict new-onset cardiomyopathy in patients with PVCs, independent of PVC burden. The model performed well across sex and race, relying on the QRS complex/ST segment in sinus rhythm rather than on PVC morphology.