Identifying the severity of diabetic retinopathy by visual function measures using both traditional statistical methods and interpretable machine learning: a cross-sectional study

Data collection

Most participants (n=881) were part of a prospective population-based epidemiological study of people over 50 years old (Northern Ireland Cohort for the Longitudinal Study of Ageing – NICOLA) [13]. The overall NICOLA study is representative of the Northern Ireland population, but among those who attended the clinical health assessment and were eligible for this study, a greater proportion were in younger age categories, male, retired, and had higher levels of education and self-reported health [14]. Participation was higher among those within urban and less-deprived areas. Participants from the following categories were recalled for an additional study visit: those with diabetes mellitus (either self-reported or HbA1c ≥ 48 mmol/mol (6.5%)), those with age-related macular degeneration, and those with no retinal diseases and no diabetes mellitus. Additional participants with confirmed diagnoses of diabetes were recruited from diabetes clinics (n=150), together with healthy volunteers aged under 50 with no history of eye disease (n=91). These controls were a convenience sample with a similar age distribution to the younger clinical participants, and comprised university employees, their friends or family, or participants recruited via advertising. These groups constitute the Northern Ireland Sensory Ageing (NISA) study (https://clinicaltrials.gov/ct2/show/NCT02788695), comprising 2244 eyes measured across 1122 participants recruited between 2014 and 2018. Participants gave informed consent before taking part, and the study was approved by the School of Medicine, Dentistry and Biomedical Sciences Ethics Committee, Queen’s University Belfast, UK (Ref: 12/23, Ref 16.37v2).

We concentrated on diabetic eye disease, excluding 343 eyes with other conditions affecting visual function (including 242 with intermediate/late-stage age-related macular degeneration). Thus, the analysis cohort comprised 1901 eyes from 1032 individuals; 61% of the eyes were from women. The median age of the cohort was 64 years. Half of the eyes from diabetes patients were from NICOLA, and half were from diabetes clinics (see electronic supplementary material [ESM] Table 1). Participants underwent retinal imaging comprising a colour fundus photograph (obtained using a CX-1 digital fundus camera; Canon, USA), ultra-wide-field colour images centred on the fovea (obtained using an Optomap Panoramic 200Tx scanning laser ophthalmoscope; Optos, UK), and a macular volume scan (obtained using a Spectralis spectral domain optical coherence tomograph [SD-OCT]; Heidelberg Engineering, Germany). The volume scan comprised 61 horizontal B-scans (automatic real time [ART] 9) covering a 30×25 degree rectangle. Imaging and perimetry were performed after pharmacological dilation using 1% tropicamide.

Nine visual function tests were performed (after full refraction) by an experienced optometrist. For perimetry-based tests, the eye with better best corrected DVA was selected for the study, choosing at random if both were eligible [15]. Tests were chosen to cover the breadth of functional deficits previously reported in diabetes, with an emphasis on methodologies that could be easily applied in a clinical setting if found to be predictive.

Distance visual acuity

Monocular DVA was evaluated using Early Treatment Diabetic Retinopathy Study (ETDRS) charts in a light box (Precision Vision, USA) at 4 m, and the total number of letters read was recorded. Best corrected DVA was determined at a 4 m distance with all room lights switched off.

Near visual acuity

Near visual acuity (NVA) was measured monocularly using Bailey–Lovie near word reading charts at 25 cm with the appropriate reading addition worn over the protocol refraction at 4 m [16] with room lights on. A log minimum angle of resolution (logMAR) score for the smallest line with three or more consecutive words read correctly was recorded.

Reading

Reading speed was assessed after the threshold NVA was established. It was tested monocularly using two sets of modified Bailey–Lovie reading speed charts presented as transparencies with black text and corresponding reading speed score sheets [17]. The chart was selected to exhibit text of a print size that was two logarithmic steps larger than the participant’s threshold NVA in the tested eye. The reading index obtained is the reading speed (words per min) divided by the size of print read, thus providing an adjustment for the range of print sizes used [18].

Distance low-luminance visual acuity

To measure distance low-luminance visual acuity at 4 m, a 2.0 log neutral-density trial lens was inserted over the final distance refraction result [19, 20]. The low-luminance deficit at distance was the difference in number of letters read between high- and low-luminance best corrected DVA.

Near low-luminance visual acuity

To measure near low-luminance visual acuity at 25 cm, the Smith-Kettlewell Institute low-luminance (SKILL) card was used, which measures spatial vision under reduced contrast and luminance [21]. The SKILL card consists of two near acuity charts mounted back-to-back. One has black letters on a dark grey background, simulating reduced contrast and luminance. The other is a high-contrast, black-on-white letter chart. The low-luminance deficit at near is the difference in the number of letters read between the light and dark sides. The number of letters read from each card (held approximately 40 cm from the patient’s eye, with the appropriate reading addition worn) was recorded, with both sides of the chart presented to the participant under normal room lighting.

Contrast sensitivity

Contrast sensitivity was measured monocularly using Pelli–Robson charts (Clement Clarke International, UK) viewed at 1 m [22]. A +1.00 dioptre addition trial lens was added to the participant’s distance refractive correction. For a triplet of letters to be scored as ‘seen’, two out of three letters had to be correctly identified. Care was taken to ensure uniform illumination of the chart (645–1292 lux) and to keep the chart concealed from view until the test was performed.

Moorfields acuity

The Moorfields acuity chart is a distance letter chart comprising vanishing optotypes with a pseudo-high-pass design [23]. The mean luminance of the optotypes is the same as that of the background, so the letters appear to ‘vanish’ soon after the resolution threshold is reached. Such charts are thought to be more robust to optical defocus than traditional letter charts [24]. Visual acuity was measured with the chart under full room illumination (353.8 lux). For examination of the right eye we used Moorfields acuity chart 1, and for the left eye Moorfields acuity chart 2.

Frequency doubling technology perimetry

The central visual field was assessed on a frequency doubling technology (FDT) Matrix perimeter (Carl Zeiss Meditec, USA) using the 24–2 threshold test. Participants completed the supra-threshold test first to familiarise themselves with the task before undertaking the full threshold test.

Microperimetry

Macular integrity assessment was performed using an MAIA macular integrity assessment system (CenterVue, Italy) with a central red circular fixation target and Goldmann III stimuli against a background of 1.27 cd/m2, using a 4–2 threshold strategy. The maximum stimulus luminance was 318 cd/m2, giving a dynamic range of 36 dB. A 45-point customised stimulus grid covering the central 18 degrees was used [25], designed for relatively regular sampling throughout the region but with slightly increased density towards the fovea.

Visual function tests were performed in the same sequence for all participants, in two batches with a refreshment break (30 min, including a hot drink and biscuits) in between to reduce the risk of participant fatigue (Fig. 1). Some tests generated multiple variables (e.g. mean deviation and pattern standard deviation from FDT perimetry), giving a total of 12 visual function variables in the analysis dataset, in addition to age and sex. Fundus colour pictures were double graded (by authors RD and UC) to identify signs of age-related macular degeneration (Beckman classification) and diabetic retinopathy. Disc and macula colour images and ultra-wide-field images were assessed for features of diabetic retinopathy in the central and peripheral retina, and then staged using the national diabetic retinopathy screening system for England and Wales into four levels: none (R0), background (R1), pre-proliferative (R2) and proliferative (R3) [26]. Participants who were not recruited from the diabetic retinopathy clinic were identified as having diabetes if they self-reported a diabetes diagnosis. Diabetes duration was recorded when provided. Participants over 50 years of age and all patients with diabetes were invited for blood sampling to measure plasma HbA1c. Participants with no record of diabetes were classified as having diabetes if their HbA1c was ≥ 48 mmol/mol (6.5%).

Data preparation

Some visual function variables had a high proportion of missing values (up to 69% for FDT perimetry measures). Only 16.8% of participants had complete information for all variables, so imputation was essential to make full use of the dataset. Missing values were imputed using chained equations [27], in which missing values for one variable are imputed and then used to impute missing values for the next variable, cycling until the dataset is complete. This approach is frequently used to produce multiple imputations (versions) of the input dataset, which are then analysed separately, with the resulting statistical estimates combined using Rubin’s rules. However, this is computationally expensive and, for this analysis, where the emphasis is on classification rather than estimation, it is unclear how classifications from each version should be combined. Therefore, multiple imputations (six imputed versions for 20 iterations each) were produced solely to assess whether the imputation algorithm had converged on an acceptable solution. A single imputed version was input to the ML pipeline. Numeric variables were normalised prior to model fitting.
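The chained-equations cycle can be illustrated with a minimal single-imputation sketch in Python (the study used an R implementation). For simplicity this hypothetical sketch uses deterministic least-squares fills rather than draws from predictive distributions, which proper multiple imputation would require:

```python
import numpy as np

def chained_imputation(X, n_iter=20):
    """Single-imputation sketch of chained equations: each variable with
    missing values is regressed on the other variables, and its missing
    entries are replaced by the fitted predictions, cycling n_iter times."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # Initialise missing entries with column means
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue  # column fully observed
            obs = ~miss[:, j]
            # Regress column j on all other columns (plus intercept)
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X
```

Observed entries are never altered; only the missing cells are refined on each pass.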

Classification using visual function

One approach to classification is to define ‘normal limits’ (e.g. 2.5% and 97.5% percentiles in a healthy population) and classify measurements outside these limits as abnormal (an approach that is used in perimetry). Previous studies have indicated subtle differences in mean values for some visual function measures when comparing diabetic retinopathy patients with other groups [10, 11]. However, visual function measures have high inter-individual variability, so a classifier constructed using closely overlapping frequency distributions of diabetic retinopathy/DMO and ‘normal’ eyes is likely to perform poorly. Combining multiple variables assessing various aspects of visual function is more powerful, but there is no consensus as to how variables should be combined for any of the three classification tasks. In particular, the ideal functional forms of the relationships between visual function variables and diabetes mellitus/diabetic retinopathy/DMO status are unknown. For example, does the probability of an eye having DMO increase linearly with a decrease in NVA, or is the relationship curvilinear or stepped, perhaps complicated further by interaction with another variable?
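For illustration, the percentile-based ‘normal limits’ approach described above amounts to a few lines of code (a hypothetical numpy sketch, using the 2.5% and 97.5% cut-offs mentioned):

```python
import numpy as np

def normal_limits_classifier(healthy_values, lower=2.5, upper=97.5):
    """Return a function flagging a measurement as abnormal when it lies
    outside the chosen percentile band of a healthy reference sample."""
    lo, hi = np.percentile(healthy_values, [lower, upper])
    return lambda x: (x < lo) | (x > hi)
```

By construction roughly 5% of healthy eyes are flagged, which is exactly why such univariate limits perform poorly when the diseased and healthy distributions overlap heavily.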

We selected a flexible ML approach to capture these complexities, with classifications generated by an ensemble of statistical models (learners) fitted to the same data and combined to produce the best possible classification for each eye in each classification task. The library of learners comprised simple intercept-only models, regression-based models for correlated variables (ridge regression), variable selection (least absolute shrinkage and selection operator [LASSO]) and curvilinear relationships (polynomial splines). In addition, the approach included single-layer neural networks and tree-based methods (XGBoost, random forest and Bayesian additive regression trees) that are capable of modelling threshold-type associations and interactions among multiple variables. For each task, each learner was fitted to 90% of the dataset and used to infer classifications for the remaining 10% (validation set). To avoid overfitting, this was repeated ten times, combining validation sets to give a complete set of classifications (i.e. 10-fold cross-validation [28]). We used individuals as the units of assignment for cross-validation, preventing the data leakage that would occur if one eye of a participant were used to train and the fellow eye to validate. A final classification for each eye was produced from a linear combination of the learner predictions, fitted using a second stage of ML (a meta-learner). The weight estimated for each learner in the linear combination indicated its relative contribution to the final prediction. The overall algorithm (SuperLearner) achieves the best possible classification provided that one of the learners in the library approximates the true data-generating mechanism [28, 29]. For comparison with a more conventional approach, we also fitted a simpler, main-effects logistic regression model for each task.
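The stacking scheme can be sketched as follows. This is a hypothetical scikit-learn sketch, not the study's pipeline (which used the sl3 R package): the three base learners are illustrative stand-ins for the library described above, and the meta-learner here is a plain logistic regression rather than sl3's constrained combination. Grouped folds keep both eyes of a participant on the same side of each split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

def superlearner_probs(X, y, groups, learners, n_splits=10):
    """Stacking sketch: out-of-fold probability predictions from each base
    learner are combined by a logistic meta-learner. Folds are grouped by
    participant so fellow eyes never straddle the train/validation split."""
    cv = GroupKFold(n_splits=n_splits)
    # One column per learner: cross-validated P(comparison class) per eye
    Z = np.column_stack([
        cross_val_predict(est, X, y, groups=groups, cv=cv,
                          method="predict_proba")[:, 1]
        for est in learners
    ])
    meta = LogisticRegression().fit(Z, y)
    return meta.predict_proba(Z)[:, 1], meta.coef_.ravel()

# Illustrative stand-ins for the library of learners
learners = [
    LogisticRegression(penalty="l2"),                      # ridge-style
    LogisticRegression(penalty="l1", solver="liblinear"),  # LASSO-style
    RandomForestClassifier(n_estimators=100, random_state=0),
]
```

The returned meta-learner coefficients play the role of the learner weights described above: a larger weight means that learner contributed more to the final prediction.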

Input variables for each task comprised the 12 visual function variables together with age and sex. The output of both the ensemble and logistic models for each eye was the predicted probability of membership of the comparison class for that task (i.e. P(DM no DR), P(DR no DMO) and P(DR with DMO) for tasks A–C, respectively). Eyes with probability >0.5 were labelled into the comparison class; others were labelled into the reference class (no DM, DM no DR and DR no DMO, respectively). Model accuracy was assessed as the proportion of eyes correctly classified. This was compared with the proportion of eyes that would have been labelled correctly if all eyes had been labelled as belonging to the modal class (baseline accuracy). AUC, the area under the receiver operating characteristic (ROC) curve, was calculated as an alternative measure of model performance, which is appropriate for the mild to moderate levels of class imbalance in this dataset (minority class: task A, 17%; task B, 44%; task C, 29%).
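The three evaluation quantities above can be computed directly from the predicted probabilities; a minimal numpy sketch (AUC via the Mann–Whitney rank statistic, which equals the area under the empirical ROC curve):

```python
import numpy as np

def evaluate(probs, labels, threshold=0.5):
    """Accuracy at the 0.5 cut-off, the modal-class baseline accuracy,
    and AUC from the Mann-Whitney rank statistic (assumes untied probs)."""
    pred = (probs > threshold).astype(int)
    accuracy = (pred == labels).mean()
    # Labelling every eye with the modal class gives the baseline accuracy
    baseline = max(labels.mean(), 1 - labels.mean())
    ranks = probs.argsort().argsort() + 1  # 1-based ranks of probabilities
    n_pos, n_neg = labels.sum(), (labels == 0).sum()
    auc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return accuracy, baseline, auc
```

A useful model should beat the baseline accuracy, and an AUC of 0.5 corresponds to chance-level ranking.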

Interpretable ML

The ensemble algorithm predicted the probability for each eye, but these probability predictions cannot be directly decomposed: there are too many estimated parameters to examine manually, and some of the learners are fitted stochastically (e.g. random forests). We used an interpretable ML technique, generating SHapley Additive exPlanation (SHAP) values, to evaluate the contribution of each input variable towards each model prediction [30]. The SHAP technique uses concepts from game theory to define a ‘Shapley value’ for a feature, which measures its influence on the underlying model’s prediction. Broadly, this value is calculated for a feature by averaging its marginal contribution to the prediction over all possible combinations of the other features for the instance under consideration. SHAP values may be used to examine the statistical (but not necessarily causal) reasons behind a model prediction, including assessing the importance of different variables and highlighting where predictions depend on statistical artefacts such as missing data. SHAP values are calculated at the level of the prediction (i.e. the eye), so both global and local measures of variable importance may be calculated. Variables associated with greater global variation in SHAP values contribute more to probability predictions across the entire dataset. We inspected the profiles of SHAP values for individual eyes to better understand the patterns of model predictions. We also performed clustering by SHAP values to identify clusters of eyes in which predictions were made for similar reasons (see ESM section on interpretable machine learning and ESM Figs 1 and 2). Ensemble models were fitted using the sl3 package, and SHAP values were calculated using fastshap in R version 4.2.1 [31,32,33].
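The marginal-contribution averaging behind SHAP can be sketched as a simple Monte Carlo estimator. This hypothetical Python/numpy sketch is for a single prediction and samples feature orderings rather than enumerating all coalitions (the study itself used the fastshap R package):

```python
import numpy as np

def shapley_values(predict, x, background, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate for one instance x: for random feature
    orderings, each feature's contribution is the change in the prediction
    when it is switched from a background value to x's value, averaged
    over sampled orderings and background rows."""
    rng = np.random.default_rng(seed)
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_samples):
        order = rng.permutation(d)
        z = background[rng.integers(len(background))].copy()  # reference row
        prev = predict(z)
        for j in order:
            z[j] = x[j]          # switch feature j on
            cur = predict(z)
            phi[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return phi / n_samples
```

By construction the contributions for each sampled ordering telescope, so the estimated SHAP values sum to the prediction for x minus the average background prediction, which is the additivity property the method relies on.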
