Modern medical research is increasingly built on modeling of high-dimensional data. Sparse regression methods, such as the Lasso (Tibshirani, 1996), Generalized Lasso (Tibshirani et al., 2011), Grouped Lasso (Yuan & Lin, 2006), adaptive Lasso (Zou, 2006), and Elastic Net (Zou & Hastie, 2005), have been widely applied to perform estimation and variable selection at the same time. However, high-dimensional data sets often contain less precise measurements of phenotypes than those that might be available in smaller studies. For example, large biobanks often use billing codes from electronic health care records as proxy measures for a physician-made diagnosis. It is well known that applying naïve regression methods to predictor variables that are measured with error can lead to attenuation of effect estimates (Chesher, 1991; Rosenbaum et al., 2010). Analogously, questionnaire data from large cohorts often contain many missing values (Obermeyer & Emanuel, 2016). Removing subjects who are missing at least one measurement can easily lead to removal of most subjects when data are high dimensional.
Many error-in-variables solutions have been proposed. In addition to simple complete case analysis and pairwise deletion, more rigorous methods, such as expectation-maximization algorithms (Dempster, 1977; Schafer, 1997), multiple imputation methods (Buuren, 2011), and full information maximum likelihood estimation (Enders, 2001; Friedman et al., 2010), have been developed, but these computationally expensive methods cannot be easily extended to high-dimensional settings. In contrast, Loh and Wainwright (2011) developed a penalized method for error-in-variables regression. Within a properly chosen constraint radius, a projected gradient descent algorithm will converge to a small neighborhood of the set of all global minimizers, and is promising for variable selection in a high-dimensional setting (Loh & Wainwright, 2011). Nevertheless, proper choice of this constraint radius depends on knowledge of the parameters yet to be estimated (Datta et al., 2017). Hence, Datta and Zou (2017) developed the Convex Conditioned Lasso (CoCoLasso) that does not require prior knowledge of the unknown parameters. The CoCoLasso algorithm is able to correct for both additive measurement error and missing data, and showed a substantial increase in estimation accuracy and stability compared with the naïve Lasso.
However, when the data are only partially corrupted (i.e., some features are free of measurement error), the CoCoLasso still performs estimation for all features in an undifferentiated manner, limiting the implementation of the approach for large feature sets due to the intensive matrix computations required. Such circumstances of partial corruption are common in genetic epidemiology studies based on large genotyped cohorts, where the genotypes are accurately measured by highly reliable high-throughput sequencing or microarrays, but lifestyle and clinical risk factors (except for age and sex) are measured with various types of error. For instance, in the UK Biobank, one of the largest health registries to date, hundreds of thousands of single nucleotide polymorphisms (SNPs) were accurately measured for each participant with little missing data, but most covariates based on questionnaires or health care records contained missing values (Bycroft et al., 2018). Samples with such corrupted covariates are usually discarded, potentially leading to underuse of information. Therefore, inspired by the CoCoLasso, we propose here a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm that makes it possible to perform higher-dimensional error-in-variables regressions by separately optimizing the parameter estimates for uncorrupted and corrupted features in an iterative manner. Our proposal requires a carefully calibrated cross-validation strategy. Furthermore, we build the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) into the new algorithm. In simulations, we confirm that our algorithm provides results equivalent to the CoCoLasso and demonstrates better performance than the naïve Lasso, with increasing benefit as the dimension increases.
Although this approach will still encounter computational limitations for many corrupted features, we substantially enlarge the magnitude of problems that can be analyzed with an error-in-variables approach. We demonstrate the potential practical utility of the BDCoCoLasso by deriving covariate-adjusted genetic risk scores to predict body mass index, bone mineral density, and lifespan in a subset of the UK Biobank (Bycroft et al., 2018).
The rest of the manuscript is organized as follows. In Section 2, we briefly review the CoCoLasso method, and then we describe our new version that allows blocks of features with different corruption states—BDCoCoLasso. We describe simulation settings and results in Section 3. Section 4 illustrates the performance of our algorithm on the UK Biobank data.
2 METHODS

In this section, we first review the principles of the CoCoLasso. We then seek to improve its computational efficiency and stability when the covariate matrix is partially corrupted or when different types of measurement error exist simultaneously, by implementing a block coordinate descent algorithm (Rosenbaum et al., 2013). We also implement a SCAD penalty (Fan & Li, 2001) to avoid overshrinkage when some features have strong effects.
2.1 The CoCoLasso

Suppose a true covariate matrix $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features, is measured as a corrupted covariate matrix $Z$, where measurement error can be:

1. Additive error: $Z = X + A$, where $A$ represents additive error;
2. Missing data: $Z_{ij} = X_{ij}\,\eta_{ij}$, where $\eta_{ij} = 1$ (observed) or $\eta_{ij} = 0$ (missing).

Naïvely substituting $Z$ for $X$ in the Lasso objective function
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \frac{1}{2n}\left\|y - Z\beta\right\|_2^2 + \lambda\|\beta\|_1 \quad (1)$$
can lead to biased estimation of $\beta$ (Datta et al., 2017; Loh & Wainwright, 2011), where $y \in \mathbb{R}^n$ is the continuous response.
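The bias induced by naïve regression on error-prone covariates can be seen in a minimal one-dimensional sketch (the sample size, effect size, and error variance below are illustrative choices of ours, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0
sigma_x, sigma_a = 1.0, 1.0             # true-signal and error SDs

x = rng.normal(0.0, sigma_x, n)          # true covariate X
y = beta * x + rng.normal(0.0, 1.0, n)   # response
z = x + rng.normal(0.0, sigma_a, n)      # corrupted measurement Z = X + A

# The naive slope of y on z is attenuated by the reliability ratio
naive_slope = np.cov(z, y)[0, 1] / np.var(z)
reliability = sigma_x**2 / (sigma_x**2 + sigma_a**2)
print(naive_slope)   # close to beta * reliability = 1.0, not to beta = 2.0
```

With equal signal and error variances the reliability ratio is 1/2, so the naive estimate recovers only about half of the true effect.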
Alternatively, this objective function can be reformulated as
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \frac{1}{2}\beta^{\top}\hat{\Sigma}\beta - \hat{\rho}^{\top}\beta + \lambda\|\beta\|_1, \quad (2)$$
where $\hat{\Sigma} = Z^{\top}Z/n$ and $\hat{\rho} = Z^{\top}y/n$. Loh and Wainwright (2011) proposed that $\hat{\Sigma}$ and $\hat{\rho}$ could be replaced by their unbiased estimators $\tilde{\Sigma}$ and $\tilde{\rho}$ such that $\mathbb{E}(\tilde{\Sigma}) = X^{\top}X/n$ and $\mathbb{E}(\tilde{\rho}) = X^{\top}y/n$. However, since the new covariance matrix $\tilde{\Sigma}$ can have negative eigenvalues, particularly when the covariate matrix is high dimensional ($p > n$), the new optimization problem with the objective function
$$L(\beta) = \frac{1}{2}\beta^{\top}\tilde{\Sigma}\beta - \tilde{\rho}^{\top}\beta + \lambda\|\beta\|_1 \quad (3)$$
is not necessarily convex. Loh and Wainwright (2011) showed that by setting certain constraints on $\|\beta\|_1$, the problem could become convex, yet it is necessary to have prior knowledge of the unknown $\beta$ to find a suitable constraint.

Datta and Zou (2017) therefore proposed the CoCoLasso that adopts the adapted objective function but finds a nearest positive semidefinite matrix for $\tilde{\Sigma}$:
$$(\tilde{\Sigma})_{+} = \operatorname*{arg\,min}_{\Sigma \succeq 0} \left\|\Sigma - \tilde{\Sigma}\right\|_{\max}, \quad (4)$$
where $\Sigma \succeq 0$ denotes positive semidefiniteness. Here, the elementwise maximum norm for matrix $A$ is defined as $\|A\|_{\max} = \max_{i,j}|A_{ij}|$. This nearest positive semidefinite matrix can then be solved for by an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011).
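The need for this repair step can be illustrated with a simplified projection. The sketch below clips negative eigenvalues, which is the Frobenius-norm (not the max-norm) projection; the CoCoLasso itself uses an ADMM solver for Equation (4), and all sizes and names here are illustrative:

```python
import numpy as np

def nearest_psd(sigma_hat: np.ndarray) -> np.ndarray:
    """Project a symmetric matrix onto the PSD cone by zeroing negative
    eigenvalues (Frobenius-norm projection; a simplified stand-in for the
    elementwise max-norm ADMM projection of Equation (4))."""
    w, v = np.linalg.eigh((sigma_hat + sigma_hat.T) / 2.0)
    return (v * np.clip(w, 0.0, None)) @ v.T

# An additive-error surrogate Z'Z/n - Sigma_A can be indefinite:
rng = np.random.default_rng(1)
n, p = 50, 20
z = rng.normal(size=(n, p))
sigma_a = 0.5 * np.eye(p)                # assumed-known error covariance
sigma_hat = z.T @ z / n - sigma_a
print(np.linalg.eigvalsh(sigma_hat).min())    # typically negative here

sigma_tilde = nearest_psd(sigma_hat)
print(np.linalg.eigvalsh(sigma_tilde).min())  # >= 0 up to round-off
```

Once the surrogate covariance is repaired, the quadratic objective in (3) becomes convex and standard Lasso solvers apply.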
2.2 Two-block coordinate descent for partially corrupted covariate matrix
The CoCoLasso enables error-in-variables regression in general, but when the feature set is large, the required matrix calculations are demanding. Implementing a block coordinate descent can substantially improve computational efficiency when the covariate matrix is only partially corrupted. Specifically, projection of the covariance matrix onto the positive semidefinite cone, that is, Equation (4), within the CoCoLasso, requires multiple operations on matrices of dimension $p \times p$, which are of order $O(p^3)$. In contrast, our BDCoCoLasso requires these operations only on the corrupted subblocks of the covariance matrix, which are anticipated to be much smaller.

Suppose the true covariate matrix $X = [X_1, X_2]$ is now measured as $Z = [Z_1, Z_2]$, where $Z_1 \in \mathbb{R}^{n \times p_1}$ is measured without error, and $Z_2 \in \mathbb{R}^{n \times p_2}$ is measured with error. We then need to estimate $\beta = (\beta_1, \beta_2)$, where $\beta_1$ is a coefficient vector for the noncorrupted covariates, and $\beta_2$ is a coefficient vector for the corrupted covariates. We derive the objective function as
$$(\hat{\beta}_1, \hat{\beta}_2) = \operatorname*{arg\,min}_{\beta_1,\,\beta_2} \frac{1}{2}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}^{\top}\tilde{\Sigma}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} - \tilde{\rho}^{\top}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} + \lambda\left(\|\beta_1\|_1 + \|\beta_2\|_1\right). \quad (5)$$
We conceive a two-step block coordinate descent algorithm based on (2)–(4):

1. We first consider $\beta_2$ fixed, and we solve
$$\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1} \frac{1}{2}\beta_1^{\top}\hat{\Sigma}_{11}\beta_1 - \tilde{\rho}_1^{\top}\beta_1 + \lambda\|\beta_1\|_1, \quad (6)$$
where $\hat{\Sigma}_{11} = Z_1^{\top}Z_1/n$ and $\tilde{\rho}_1 = Z_1^{\top}y/n - \hat{\Sigma}_{12}\beta_2$. $\hat{\Sigma}_{12}$ is defined as follows:
(a) in the additive error setting, $\hat{\Sigma}_{12} = Z_1^{\top}Z_2/n$;
(b) in the missing-error setting, specifically, we define a ratio matrix $M$ indicating the presence or absence of data as $M_{jk} = n_{jk}/n$, where $n_{jk}$ is the number of samples for which both the $j$th and the $k$th features are measured and $n_j$ is the number of samples for which the $j$th feature is measured. Note that $M$ is used to correct for measurement error in the corrupted covariates. We then have $\hat{\Sigma}_{12} = (Z_1^{\top}Z_2/n) \oslash M$, that is, $(\hat{\Sigma}_{12})_{jk} = (Z_1^{\top}Z_2)_{jk}/n_{jk}$ for $1 \le j \le p_1$ and $1 \le k \le p_2$.
2. We next consider $\beta_1$ fixed, with a value optimized in the previous step, and we solve
$$\hat{\beta}_2 = \operatorname*{arg\,min}_{\beta_2} \frac{1}{2}\beta_2^{\top}(\hat{\Sigma}_{22})_{+}\beta_2 - \tilde{\rho}_2^{\top}\beta_2 + \lambda\|\beta_2\|_1, \quad (7)$$
where $\tilde{\rho}_2$ is an unbiased surrogate of $X_2^{\top}(y - X_1\beta_1)/n$ and $(\hat{\Sigma}_{22})_{+}$ is the nearest positive semidefinite matrix of $\hat{\Sigma}_{22}$. For $\hat{\Sigma}_{22}$ and $\tilde{\rho}_2$,
(a) in the additive error setting, $\hat{\Sigma}_{22} = Z_2^{\top}Z_2/n - \Sigma_A$ and $\tilde{\rho}_2 = Z_2^{\top}(y - Z_1\beta_1)/n$, where $\Sigma_A$ is a known variance–covariance matrix for features measured with additive error;
(b) in the missing error setting, $\hat{\Sigma}_{22} = (Z_2^{\top}Z_2/n) \oslash M_{22}$, the subblock of $M$ for the corrupted features, and $\tilde{\rho}_2 = \left(Z_2^{\top}(y - Z_1\beta_1)/n\right) \oslash m$ with $m_j = n_j/n$. Here, $\oslash$ represents elementwise division.
We then alternate between the two steps until convergence. Following similar arguments as in Datta et al. (2017), we can ensure that both problems are equivalent to a Lasso problem. The complete optimization procedure is described in Algorithm 1.
Of note, the estimation problem can be defined as finding the global solution for $(\beta_1, \beta_2)$ jointly, and our two-step procedure can be seen as equivalent to replacing $\hat{\Sigma}_{22}$ by its nearest positive semidefinite matrix, $(\hat{\Sigma}_{22})_{+}$, in (5). Use of this substitution might not lead to a jointly convex problem. However, since both marginal problems (6) and (7) are convex, and both have suitable properties (i.e., both are strongly convex and smooth), our generalized alternating minimization algorithm can guarantee global minimization (Jain & Kar, 2017; Kelley, 1999).
Algorithm 1 BDCoCoLasso
Input: $Z_1$, $Z_2$, $y$, $\lambda$, error type (additive or missing)
Initialize $\beta_1^{(0)}$, $\beta_2^{(0)}$;
while not converged do
    compute $\hat{\Sigma}_{12}$ and $\tilde{\rho}_1$ according to the error type, and update $\beta_1$ by solving (6);
    compute $(\hat{\Sigma}_{22})_{+}$ and $\tilde{\rho}_2$ according to the error type, and update $\beta_2$ by solving (7);
end while
Output $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2)$
Cross-validation to choose the penalization parameter, $\lambda$, must be appropriately implemented for the block implementation. Therefore, extending the concept in CoCoLasso (Datta et al., 2017), a $K$-fold cross-validated $\hat{\lambda}$ can be obtained by minimizing the total cross-validation error while correcting for the two blocks separately,
$$\hat{\lambda} = \operatorname*{arg\,min}_{\lambda} \sum_{k=1}^{K}\left[\frac{1}{2}\hat{\beta}^{-k}(\lambda)^{\top}\tilde{\Sigma}^{(k)}\hat{\beta}^{-k}(\lambda) - \tilde{\rho}^{(k)\top}\hat{\beta}^{-k}(\lambda)\right]. \quad (8)$$
Here, $\hat{\beta}_1^{-k}(\lambda)$ and $\hat{\beta}_2^{-k}(\lambda)$, the blocks of $\hat{\beta}^{-k}(\lambda)$, are estimated as described above for $\beta_1$ and $\beta_2$ based on data not in the $k$th fold; $\tilde{\Sigma}^{(k)}$ and $\tilde{\rho}^{(k)}$ are derived as described above for $\tilde{\Sigma}$ and $\tilde{\rho}$ based on data in the $k$th fold. $\tilde{\rho}^{(k)}$ is an unbiased surrogate of $X^{(k)\top}y^{(k)}/n_k$, where $X^{(k)}$ and $y^{(k)}$ are the data in the $k$th fold. More specifically,

1. in the additive error setting, where the additive error is centered to have zero mean, $\tilde{\rho}^{(k)} = Z^{(k)\top}y^{(k)}/n_k$;
2. in the missing error setting, $\tilde{\rho}^{(k)} = \left(Z^{(k)\top}y^{(k)}/n_k\right) \oslash m^{(k)}$, where $m_j^{(k)} = n_j^{(k)}/n_k$.
Although either an additive error setting or a missing error setting can be approached in the aforementioned two-step manner, data often contain variables subject to both types of errors. Therefore, we further propose a generalized algorithm that copes with a mixed error setting, described in Supporting Information.
2.3 Implementation of a SCAD penalty

For potential application in scenarios where the causal variables are few but have large effect sizes, using the Lasso penalty may lead to overshrinkage (Fan & Li, 2001). To resolve this potential issue, we have also implemented a nonconcave SCAD penalty (Fan & Li, 2001). The SCAD penalty is given by
$$p_{\lambda}(\beta) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda,\\[4pt] \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda,\\[4pt] \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases} \quad (9)$$
and its first derivative with respect to $|\beta|$ is given by
$$p_{\lambda}'(\beta) = \lambda\left\{I\left(|\beta| \le \lambda\right) + \frac{(a\lambda - |\beta|)_{+}}{(a-1)\lambda}\,I\left(|\beta| > \lambda\right)\right\}. \quad (10)$$
Substituting the regular $\ell_1$ penalty used in the Lasso by the SCAD penalty can retain large coefficients while shrinking smaller coefficients to zero. Thus, the SCAD penalty is able to produce a sparse solution and more accurate estimation for large coefficients.
Following Zou and Li (2008), we implement a local linear approximation of the penalization function:
$$p_{\lambda}(|\beta_j|) \approx p_{\lambda}\left(|\beta_j^{(k)}|\right) + p_{\lambda}'\left(|\beta_j^{(k)}|\right)\left(|\beta_j| - |\beta_j^{(k)}|\right), \quad (11)$$
where $p_{\lambda}$ and $p_{\lambda}'$ are given by Equations (9) and (10), respectively, and $\beta_j^{(k)}$ is the estimate obtained from the previous iteration. Equivalently, up to an additive constant, the penalty term becomes
$$\lambda \sum_{j} w_j |\beta_j|, \qquad w_j = \frac{p_{\lambda}'\left(|\beta_j^{(k)}|\right)}{\lambda}, \quad (12)$$
where a weight $w_j$ specific to the $j$th feature is introduced to the regular $\ell_1$ penalty and is updated after each iteration. This implementation enables an adaptive BDCoCoLasso.
In principle, the hyperparameter $a$ in the SCAD penalty should be estimated through cross-validation. However, the resulting two-dimensional cross-validation would be computationally expensive. Fan and Li (2001) proposed that $a = 3.7$ should be suitable for many problems, and that algorithm performance does not improve significantly when $a$ is selected by data-driven approaches. We therefore set $a = 3.7$ in all simulations described below.
In addition to the SCAD penalty, other weighting schemes, such as the minimax concave penalty (Zhang, 2010), could be implemented in the future for improved generalizability.
3 SIMULATION STUDY

Simulations were designed to explore the performance of BDCoCoLasso as a function of the number and proportion of corrupted features. Furthermore, we wanted to ensure that our results matched the CoCoLasso when both methods could be implemented, that is, for fairly modest $p$ and a single type of error.

We generated the true covariate matrix $X$ from a multivariate normal distribution with $n$ observations, zero mean, and a predefined correlation structure across $p$ features. We explored a lower-dimensional setting and a higher-dimensional setting, in combination with two common covariance matrix designs to introduce correlation between features.
We then generated the response as
$$y = X\beta + \varepsilon. \quad (13)$$
To ensure a realistic signal-to-noise ratio, we set the residual variance accordingly. When assessing the performance of the CoCoLasso algorithm, Datta and Zou used a sparse coefficient vector to generate strong signals from only a few features. Likewise, to start with a simulation that was similar to theirs, we set $\beta$ such that three of the features measured without error and three of the features measured with error were assigned to be causal with relatively large effect sizes.
Since we anticipate that this algorithm will be useful in large cohorts where the sample size exceeds the number of features, and anticipating multiple associated features with small effect sizes, we simulated additional scenarios with larger $n$ and $p$. We assigned different fractions of features to be causal, and created higher dimensionality, while sampling the nonzero entries of $\beta$ from a standardized normal distribution.
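A data-generating sketch of this simulation design is shown below; the sizes, correlation structure, and effect values are illustrative placeholders of ours, not the settings used in the study:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 50                     # illustrative sizes only
rho = 0.3                          # assumed feature-feature correlation

# multivariate normal X with zero mean and AR(1)-style correlation
idx = np.arange(p)
cov = rho ** np.abs(np.subtract.outer(idx, idx))
x = rng.multivariate_normal(np.zeros(p), cov, size=n)

# sparse beta: a few causal features with relatively large effects
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, 1.5, 2.0]

y = x @ beta + rng.normal(0.0, 1.0, n)   # response per Equation (13)
```

Corruption (additive error or missingness) would then be applied to a chosen subset of the columns of `x` to form the observed matrix.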
For the additive error setting, the corrupted design matrix was generated as $Z = X + A$, where the rows of $A$ were drawn from a zero-mean normal distribution with known covariance $\Sigma_A$. We explored different error-variance parameters in combination with different fractions (at least