Modern medical research is increasingly built on modeling of high-dimensional data. Sparse regression methods, such as the Lasso (Tibshirani, 1996), Generalized Lasso (Tibshirani et al., 2011), Grouped Lasso (Yuan & Lin, 2006), adaptive Lasso (Zou, 2006), and Elastic Net (Zou & Hastie, 2005), have been widely applied to perform estimation and variable selection at the same time. However, high-dimensional data sets often contain less precise measurements of phenotypes than those that might be available in smaller studies. For example, large biobanks often use billing codes from electronic health care records as proxy measures for a physician-made diagnosis. It is well known that applying naïve regression methods to predictor variables that are measured with error can lead to attenuation of effect estimates (Chesher, 1991; Rosenbaum et al., 2010). Analogously, questionnaire data from large cohorts often contain many missing values (Obermeyer & Emanuel, 2016). Removing subjects who are missing at least one measurement can easily lead to removal of most subjects when data are high dimensional.
Many error-in-variables solutions have been proposed. In addition to simple complete case analysis and pairwise deletion, more rigorous methods, such as expectation-maximization algorithms (Dempster, 1977; Schafer, 1997), multiple imputation methods (Buuren, 2011), and full information maximum likelihood estimation (Enders, 2001; Friedman et al., 2010), have been developed, but these computationally expensive methods cannot be easily extended to high-dimensional settings. In contrast, Loh and Wainwright (2011) developed a penalized method for error-in-variables regression. Within a properly chosen constraint radius, a projected gradient descent algorithm will converge to a small neighborhood of the set of all global minimizers, and is promising for variable selection in a high-dimensional setting (Loh & Wainwright, 2011). Nevertheless, proper choice of this constraint radius depends on knowledge of the parameters yet to be estimated (Datta et al., 2017). Hence, Datta and Zou (2017) developed the Convex Conditioned Lasso (CoCoLasso) that does not require prior knowledge of the unknown parameters. The CoCoLasso algorithm is able to correct for both additive measurement error and missing data, and showed a substantial increase in estimation accuracy and stability compared with the naïve Lasso.
However, when the data are only partially corrupted (i.e., some features are free of measurement error), the CoCoLasso still performs estimation for all features in an undifferentiated manner, limiting the implementation of the approach for large feature sets due to the intensive matrix computations required. Such circumstances of partial corruption are common in genetic epidemiology studies based on large genotyped cohorts, where the genotypes are accurately measured by highly reliable high-throughput sequencing or microarrays, but lifestyle and clinical risk factors (except for age and sex) are measured with various types of error. For instance, in the UK Biobank, one of the largest health registries to date, hundreds of thousands of single nucleotide polymorphisms (SNPs) were accurately measured for each participant with little missing data, but most covariates based on questionnaires or health care records contained missing values (Bycroft et al., 2018). Samples with such corrupted covariates are usually discarded, potentially leading to underuse of information. Therefore, inspired by the CoCoLasso, we propose here a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm that makes it possible to perform higher-dimensional error-in-variables regressions by separately optimizing the parameter estimates for uncorrupted and corrupted features in an iterative manner. Our proposal requires a carefully calibrated cross-validation strategy. Furthermore, we build the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) into the new algorithm. In simulations, we confirm that our algorithm provides results equivalent to the CoCoLasso and demonstrates better performance than the naïve Lasso, with increasing benefit as the dimension increases.
Although this approach will still encounter computational limitations for many corrupted features, we substantially enlarge the magnitude of problems that can be analyzed with an error-in-variables approach. We demonstrate the potential practical utility of the BDCoCoLasso by deriving covariate-adjusted genetic risk scores to predict body mass index, bone mineral density, and lifespan in a subset of the UK Biobank (Bycroft et al., 2018).
The rest of the manuscript is organized as follows. In Section 2, we briefly review the CoCoLasso method, and then we describe our new version that allows blocks of features with different corruption states—BDCoCoLasso. We describe simulation settings and results in Section 3. Section 4 illustrates the performance of our algorithm on the UK Biobank data.
2 METHODS

In this section, we first review the principles of the CoCoLasso. We then seek to improve its computational efficiency and stability when the covariate matrix is partially corrupted or when different types of measurement error exist simultaneously, by implementing a block coordinate descent algorithm (Rosenbaum et al., 2013). We also implement a SCAD penalty (Fan & Li, 2001) to avoid overshrinkage when some features have strong effects.
2.1 The CoCoLasso

Suppose a true covariate matrix $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features, is measured as a corrupted covariate matrix $Z$, where measurement error can be:

1. Additive error: $Z = X + A$, where $A$ represents additive error;
2. Missing data: $Z_{ij} = X_{ij}\,\eta_{ij}$, where $\eta_{ij} = 1$ (observed) or $\eta_{ij} = 0$ (missing).

Naïvely substituting $Z$ for $X$ in the Lasso objective function
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \frac{1}{2n}\left\|y - Z\beta\right\|_2^2 + \lambda\|\beta\|_1 \quad (1)$$
can lead to biased estimation of $\beta$ (Datta et al., 2017; Loh & Wainwright, 2011), where $y \in \mathbb{R}^n$ is the continuous response.
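The bias induced by naïve regression on error-prone covariates can be seen in a minimal one-dimensional sketch (the sample size, effect size, and error variance below are illustrative choices of ours, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0
sigma_x, sigma_a = 1.0, 1.0             # true-signal and error SDs

x = rng.normal(0.0, sigma_x, n)          # true covariate X
y = beta * x + rng.normal(0.0, 1.0, n)   # response
z = x + rng.normal(0.0, sigma_a, n)      # corrupted measurement Z = X + A

# The naive slope of y on z is attenuated by the reliability ratio
naive_slope = np.cov(z, y)[0, 1] / np.var(z)
reliability = sigma_x**2 / (sigma_x**2 + sigma_a**2)
print(naive_slope)   # close to beta * reliability = 1.0, not to beta = 2.0
```

With equal signal and error variances the reliability ratio is 1/2, so the naive estimate recovers only about half of the true effect.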
Alternatively, this objective function can be reformulated as
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \frac{1}{2}\beta^{\top}\hat{\Sigma}\beta - \hat{\rho}^{\top}\beta + \lambda\|\beta\|_1, \quad (2)$$
where $\hat{\Sigma} = Z^{\top}Z/n$ and $\hat{\rho} = Z^{\top}y/n$. Loh and Wainwright (2011) proposed that $\hat{\Sigma}$ and $\hat{\rho}$ could be replaced by their unbiased estimators $\tilde{\Sigma}$ and $\tilde{\rho}$ such that $\mathbb{E}(\tilde{\Sigma}) = X^{\top}X/n$ and $\mathbb{E}(\tilde{\rho}) = X^{\top}y/n$. However, since the new covariance matrix $\tilde{\Sigma}$ can have negative eigenvalues, particularly when the covariate matrix is high dimensional ($p > n$), the new optimization problem with the objective function
$$L(\beta) = \frac{1}{2}\beta^{\top}\tilde{\Sigma}\beta - \tilde{\rho}^{\top}\beta + \lambda\|\beta\|_1 \quad (3)$$
is not necessarily convex. Loh and Wainwright (2011) showed that by setting certain constraints on $\|\beta\|_1$, the problem could become convex, yet it is necessary to have prior knowledge of the unknown $\beta$ to find a suitable constraint.

Datta and Zou (2017) therefore proposed the CoCoLasso that adopts the adapted objective function but finds a nearest positive semidefinite matrix for $\tilde{\Sigma}$:
$$(\tilde{\Sigma})_{+} = \operatorname*{arg\,min}_{\Sigma \succeq 0} \left\|\Sigma - \tilde{\Sigma}\right\|_{\max}, \quad (4)$$
where $\Sigma \succeq 0$ denotes positive semidefiniteness. Here, the elementwise maximum norm for matrix $A$ is defined as $\|A\|_{\max} = \max_{i,j}|A_{ij}|$. This nearest positive semidefinite matrix can then be solved for by an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011).
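The need for this repair step can be illustrated with a simplified projection. The sketch below clips negative eigenvalues, which is the Frobenius-norm (not the max-norm) projection; the CoCoLasso itself uses an ADMM solver for Equation (4), and all sizes and names here are illustrative:

```python
import numpy as np

def nearest_psd(sigma_hat: np.ndarray) -> np.ndarray:
    """Project a symmetric matrix onto the PSD cone by zeroing negative
    eigenvalues (Frobenius-norm projection; a simplified stand-in for the
    elementwise max-norm ADMM projection of Equation (4))."""
    w, v = np.linalg.eigh((sigma_hat + sigma_hat.T) / 2.0)
    return (v * np.clip(w, 0.0, None)) @ v.T

# An additive-error surrogate Z'Z/n - Sigma_A can be indefinite:
rng = np.random.default_rng(1)
n, p = 50, 20
z = rng.normal(size=(n, p))
sigma_a = 0.5 * np.eye(p)                # assumed-known error covariance
sigma_hat = z.T @ z / n - sigma_a
print(np.linalg.eigvalsh(sigma_hat).min())    # typically negative here

sigma_tilde = nearest_psd(sigma_hat)
print(np.linalg.eigvalsh(sigma_tilde).min())  # >= 0 up to round-off
```

Once the surrogate covariance is repaired, the quadratic objective in (3) becomes convex and standard Lasso solvers apply.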
2.2 Two-block coordinate descent for partially corrupted covariate matrix
The CoCoLasso enables error-in-variables regression in general, but when the feature set is large, the required matrix calculations are demanding. Implementing a block coordinate descent can substantially improve computational efficiency when the covariate matrix is only partially corrupted. Specifically, projection of the covariance matrix onto the positive semidefinite cone, that is, Equation (4), within the CoCoLasso, requires multiple operations on matrices of dimension $p \times p$, which are of order $O(p^3)$. In contrast, our BDCoCoLasso requires these operations only on the corrupted subblocks of the covariance matrix, which are anticipated to be much smaller.

Suppose the true covariate matrix $X = [X_1, X_2]$ is now measured as $Z = [Z_1, Z_2]$, where $Z_1 \in \mathbb{R}^{n \times p_1}$ is measured without error, and $Z_2 \in \mathbb{R}^{n \times p_2}$ is measured with error. We then need to estimate $\beta = (\beta_1, \beta_2)$, where $\beta_1$ is a coefficient vector for the noncorrupted covariates, and $\beta_2$ is a coefficient vector for the corrupted covariates. We derive the objective function as
$$(\hat{\beta}_1, \hat{\beta}_2) = \operatorname*{arg\,min}_{\beta_1,\,\beta_2} \frac{1}{2}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}^{\top}\tilde{\Sigma}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} - \tilde{\rho}^{\top}\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} + \lambda\left(\|\beta_1\|_1 + \|\beta_2\|_1\right). \quad (5)$$
We conceive a two-step block coordinate descent algorithm based on (2)–(4):

1. We first consider $\beta_2$ fixed, and we solve
$$\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1} \frac{1}{2}\beta_1^{\top}\hat{\Sigma}_{11}\beta_1 - \tilde{\rho}_1^{\top}\beta_1 + \lambda\|\beta_1\|_1, \quad (6)$$
where $\hat{\Sigma}_{11} = Z_1^{\top}Z_1/n$ and $\tilde{\rho}_1 = Z_1^{\top}y/n - \hat{\Sigma}_{12}\beta_2$. $\hat{\Sigma}_{12}$ is defined as follows:
(a) in the additive error setting, $\hat{\Sigma}_{12} = Z_1^{\top}Z_2/n$;
(b) in the missing-error setting, specifically, we define a ratio matrix $M$ indicating the presence or absence of data as $M_{jk} = n_{jk}/n$, where $n_{jk}$ is the number of samples for which both the $j$th and the $k$th features are measured and $n_j$ is the number of samples for which the $j$th feature is measured. Note that $M$ is used to correct for measurement error in the corrupted covariates. We then have $\hat{\Sigma}_{12} = (Z_1^{\top}Z_2/n) \oslash M$, that is, $(\hat{\Sigma}_{12})_{jk} = (Z_1^{\top}Z_2)_{jk}/n_{jk}$ for $1 \le j \le p_1$ and $1 \le k \le p_2$.
2. We next consider $\beta_1$ fixed, with a value optimized in the previous step, and we solve
$$\hat{\beta}_2 = \operatorname*{arg\,min}_{\beta_2} \frac{1}{2}\beta_2^{\top}(\hat{\Sigma}_{22})_{+}\beta_2 - \tilde{\rho}_2^{\top}\beta_2 + \lambda\|\beta_2\|_1, \quad (7)$$
where $\tilde{\rho}_2$ is an unbiased surrogate of $X_2^{\top}(y - X_1\beta_1)/n$ and $(\hat{\Sigma}_{22})_{+}$ is the nearest positive semidefinite matrix of $\hat{\Sigma}_{22}$. For $\hat{\Sigma}_{22}$ and $\tilde{\rho}_2$,
(a) in the additive error setting, $\hat{\Sigma}_{22} = Z_2^{\top}Z_2/n - \Sigma_A$ and $\tilde{\rho}_2 = Z_2^{\top}(y - Z_1\beta_1)/n$, where $\Sigma_A$ is a known variance–covariance matrix for features measured with additive error;
(b) in the missing error setting, $\hat{\Sigma}_{22} = (Z_2^{\top}Z_2/n) \oslash M_{22}$, the subblock of $M$ for the corrupted features, and $\tilde{\rho}_2 = \left(Z_2^{\top}(y - Z_1\beta_1)/n\right) \oslash m$ with $m_j = n_j/n$. Here, $\oslash$ represents elementwise division.
We then alternate between the two steps until convergence. Following similar arguments as in Datta et al. (2017), we can ensure that both problems are equivalent to a Lasso problem. The complete optimization procedure is described in Algorithm 1.
Of note, the estimation problem can be defined as finding the global solution for $(\beta_1, \beta_2)$ jointly, and our two-step procedure can be seen as equivalent to replacing $\hat{\Sigma}_{22}$ by its nearest positive semidefinite matrix, $(\hat{\Sigma}_{22})_{+}$, in (5). Use of this substitution might not lead to a jointly convex problem. However, since both marginal problems (6) and (7) are convex, and both have suitable properties (i.e., both are strongly convex and smooth), our generalized alternating minimization algorithm can guarantee global minimization (Jain & Kar, 2017; Kelley, 1999).
Algorithm 1 BDCoCoLasso
Input: $Z_1$, $Z_2$, $y$, $\lambda$, error type (additive or missing)
Initialize $\beta_1^{(0)}$, $\beta_2^{(0)}$;
while not converged do
    compute $\hat{\Sigma}_{12}$ and $\tilde{\rho}_1$ according to the error type, and update $\beta_1$ by solving (6);
    compute $(\hat{\Sigma}_{22})_{+}$ and $\tilde{\rho}_2$ according to the error type, and update $\beta_2$ by solving (7);
end while
Output $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2)$
Cross-validation to choose the penalization parameter, $\lambda$, must be appropriately implemented for the block implementation. Therefore, extending the concept in CoCoLasso (Datta et al., 2017), a $K$-fold cross-validated $\hat{\lambda}$ can be obtained by minimizing the total cross-validation error while correcting for the two blocks separately,
$$\hat{\lambda} = \operatorname*{arg\,min}_{\lambda} \sum_{k=1}^{K}\left[\frac{1}{2}\hat{\beta}^{-k}(\lambda)^{\top}\tilde{\Sigma}^{(k)}\hat{\beta}^{-k}(\lambda) - \tilde{\rho}^{(k)\top}\hat{\beta}^{-k}(\lambda)\right]. \quad (8)$$
Here, $\hat{\beta}_1^{-k}(\lambda)$ and $\hat{\beta}_2^{-k}(\lambda)$, the blocks of $\hat{\beta}^{-k}(\lambda)$, are estimated as described above for $\beta_1$ and $\beta_2$ based on data not in the $k$th fold; $\tilde{\Sigma}^{(k)}$ and $\tilde{\rho}^{(k)}$ are derived as described above for $\tilde{\Sigma}$ and $\tilde{\rho}$ based on data in the $k$th fold. $\tilde{\rho}^{(k)}$ is an unbiased surrogate of $X^{(k)\top}y^{(k)}/n_k$, where $X^{(k)}$ and $y^{(k)}$ are the data in the $k$th fold. More specifically,

1. in the additive error setting, where the additive error is centered to have zero mean, $\tilde{\rho}^{(k)} = Z^{(k)\top}y^{(k)}/n_k$;
2. in the missing error setting, $\tilde{\rho}^{(k)} = \left(Z^{(k)\top}y^{(k)}/n_k\right) \oslash m^{(k)}$, where $m_j^{(k)} = n_j^{(k)}/n_k$.
Although either an additive error setting or a missing error setting can be approached in the aforementioned two-step manner, data often contain variables subject to both types of errors. Therefore, we further propose a generalized algorithm that copes with a mixed error setting, described in Supporting Information.
2.3 Implementation of a SCAD penalty

For potential application in scenarios where the causal variables are few but have large effect sizes, using the Lasso penalty may lead to overshrinkage (Fan & Li, 2001). To resolve this potential issue, we have also implemented a nonconcave SCAD penalty (Fan & Li, 2001). The SCAD penalty is given by
$$p_{\lambda}(\beta) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda,\\[4pt] \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda,\\[4pt] \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases} \quad (9)$$
and its first derivative with respect to $|\beta|$ is given by
$$p_{\lambda}'(\beta) = \lambda\left\{I\left(|\beta| \le \lambda\right) + \frac{(a\lambda - |\beta|)_{+}}{(a-1)\lambda}\,I\left(|\beta| > \lambda\right)\right\}. \quad (10)$$
Substituting the regular $\ell_1$ penalty used in the Lasso by the SCAD penalty can retain large coefficients while shrinking smaller coefficients to zero. Thus, the SCAD penalty is able to produce a sparse solution and more accurate estimation for large coefficients.
Following Zou and Li (2008), we implement a local linear approximation of the penalization function:
$$p_{\lambda}(|\beta_j|) \approx p_{\lambda}\left(|\beta_j^{(k)}|\right) + p_{\lambda}'\left(|\beta_j^{(k)}|\right)\left(|\beta_j| - |\beta_j^{(k)}|\right), \quad (11)$$
where $p_{\lambda}$ and $p_{\lambda}'$ are given by Equations (9) and (10), respectively, and $\beta_j^{(k)}$ is the estimate obtained from the previous iteration. Equivalently, up to an additive constant, the penalty term becomes
$$\lambda \sum_{j} w_j |\beta_j|, \qquad w_j = \frac{p_{\lambda}'\left(|\beta_j^{(k)}|\right)}{\lambda}, \quad (12)$$
where a weight $w_j$ specific to the $j$th feature is introduced to the regular $\ell_1$ penalty and is updated after each iteration. This implementation enables an adaptive BDCoCoLasso.
In principle, the hyperparameter $a$ in the SCAD penalty should be estimated through cross-validation. However, the resulting two-dimensional cross-validation would be computationally expensive. Fan and Li (2001) proposed that $a = 3.7$ should be suitable for many problems, and that algorithm performance does not improve significantly when $a$ is selected by data-driven approaches. We therefore set $a = 3.7$ in all simulations described below.
In addition to the SCAD penalty, other weighting schemes, such as the minimax concave penalty (Zhang, 2010), could be implemented in the future for improved generalizability.
3 SIMULATION STUDY

Simulations were designed to explore the performance of BDCoCoLasso as a function of the number and proportion of corrupted features. Furthermore, we wanted to ensure that our results matched the CoCoLasso when both methods could be implemented, that is, for fairly modest $p$ and a single type of error.

We generated the true covariate matrix $X$ from a multivariate normal distribution with $n$ observations, zero mean, and a predefined correlation structure across $p$ features. We explored a lower-dimensional setting and a higher-dimensional setting, in combination with two common covariance matrix designs to introduce correlation between features.
We then generated the response as
$$y = X\beta + \varepsilon. \quad (13)$$
To ensure a realistic signal-to-noise ratio, we set the residual variance accordingly. When assessing the performance of the CoCoLasso algorithm, Datta and Zou used a sparse coefficient vector to generate strong signals from only a few features. Likewise, to start with a simulation that was similar to theirs, we set $\beta$ such that three of the features measured without error and three of the features measured with error were assigned to be causal with relatively large effect sizes.
Since we anticipate that this algorithm will be useful in large cohorts where the sample size exceeds the number of features, and anticipating multiple associated features with small effect sizes, we simulated additional scenarios with larger $n$ and $p$. We assigned different fractions of features to be causal, and created higher dimensionality, while sampling the nonzero entries of $\beta$ from a standardized normal distribution.
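A data-generating sketch of this simulation design is shown below; the sizes, correlation structure, and effect values are illustrative placeholders of ours, not the settings used in the study:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 50                     # illustrative sizes only
rho = 0.3                          # assumed feature-feature correlation

# multivariate normal X with zero mean and AR(1)-style correlation
idx = np.arange(p)
cov = rho ** np.abs(np.subtract.outer(idx, idx))
x = rng.multivariate_normal(np.zeros(p), cov, size=n)

# sparse beta: a few causal features with relatively large effects
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, 1.5, 2.0]

y = x @ beta + rng.normal(0.0, 1.0, n)   # response per Equation (13)
```

Corruption (additive error or missingness) would then be applied to a chosen subset of the columns of `x` to form the observed matrix.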
For the additive error setting, the corrupted design matrix was generated as $Z = X + A$, where the rows of $A$ were drawn from a zero-mean normal distribution with known covariance $\Sigma_A$. We explored different error-variance parameters in combination with different fractions (at least