SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies

Chen, Y (corresponding author), Univ Penn, Dept Biostat Epidemiol & Informat, Sch Med, 423 Guardian Dr, Philadelphia, PA 19104 USA.
2022-4-13
Objectives: Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies relying solely on potentially error-prone EHR-derived phenotypes (i.e., surrogates) are subject to bias. Analyses of low-prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to address both issues simultaneously by developing new sampling methods that select an optimal subsample in which to collect gold-standard phenotypes, improving the accuracy of association estimation.
Materials and Methods: We develop a surrogate-assisted two-wave (SAT) sampling method, in which a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated by the A-optimality criterion (OSMAC) are employed sequentially to select a subsample for outcome validation through manual chart review, subject to budget constraints. A model is then fitted to the subsample using the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT.
Results: We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resulting estimator of the association.
Conclusions: The proposed approach can handle the problems posed by the rarity of cases and the misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.
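The case-boosting idea behind the first wave can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows a generic surrogate-stratified draw in which surrogate-positive records are oversampled, and it does not cover the second (OSMAC-motivated) wave. The function name `surrogate_guided_sample`, the `pos_fraction` parameter, and the simulated prevalence, sensitivity, and specificity values are all assumptions made for illustration.

```python
import random


def surrogate_guided_sample(surrogates, n_select, pos_fraction=0.5, seed=0):
    """Select record indices for chart review, oversampling surrogate positives.

    Hypothetical helper (not from the paper). Returns the chosen indices and
    a dict of inverse-probability weights keyed by index.
    """
    rng = random.Random(seed)
    pos = [i for i, s in enumerate(surrogates) if s == 1]
    neg = [i for i, s in enumerate(surrogates) if s == 0]

    # Split the review budget between the two surrogate strata.
    n_pos = min(len(pos), round(n_select * pos_fraction))
    n_neg = min(len(neg), n_select - n_pos)

    # Simple random sampling within each stratum; the weight of a record is
    # the inverse of its stratum sampling fraction.
    weights = {}
    for i in rng.sample(pos, n_pos):
        weights[i] = len(pos) / n_pos
    for i in rng.sample(neg, n_neg):
        weights[i] = len(neg) / n_neg
    return sorted(weights), weights


# Simulated cohort (assumed values): 2% true prevalence, and a surrogate
# with 80% sensitivity and 95% specificity.
sim = random.Random(42)
N = 10_000
truth = [1 if sim.random() < 0.02 else 0 for _ in range(N)]
surrogate = [
    (1 if sim.random() < 0.80 else 0) if y == 1
    else (1 if sim.random() < 0.05 else 0)
    for y in truth
]

chosen, weights = surrogate_guided_sample(surrogate, n_select=200)
overall_prev = sum(truth) / N
boosted_prev = sum(truth[i] for i in chosen) / len(chosen)
print(f"prevalence in cohort:    {overall_prev:.3f}")
print(f"prevalence in subsample: {boosted_prev:.3f}")
```

Because the subsample is deliberately enriched for likely cases, any model fitted to it would need to account for the unequal sampling probabilities, for example through the inverse-probability weights returned above.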
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
Volume: 29 | Issue: 5 | Pages: 918-927
ISSN: 1067-5027 | Indexed in: SCIE
Language
English
Source institutions
University of Pennsylvania; Kaiser Permanente; University of Washington; University of Washington Seattle
Funding information
Research reported in this publication was supported in part by the National Institutes of Health (NIH) grants 1R01AI130460, 1R01AG073435, 1R56AG074604, 1R01LM013519, 1R56AG069880 (to XL and YC), R21CA227613 (to RH, YC, and JC), and R21CA143242 (JC). Data collection was funded in part by the NIH grants R01CA093772, R01CA120562, and U01CA063731. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was supported partially through Patient-Centered Outcomes Research Institute (PCORI) Project Program Awards (ME-2019C3-18315 and ME-2018C314899). All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee.
Times cited (WOS)
0
Times cited (other)
0
Usage count (last 180 days)
0
Usage count (since 2013)
0
EISSN
1527-974X
Publication year
2022-4-13
DOI
10.1093/jamia/ocab267
Subject area
Evidence-based public health
Keywords
association study; electronic health records; error in phenotype; rare disease; sampling strategy
Funding agencies
National Institutes of Health (NIH) (United States Department of Health & Human Services; National Institutes of Health (NIH) - USA); NIH (United States Department of Health & Human Services; National Institutes of Health (NIH) - USA); Patient-Centered Outcomes Research Institute (PCORI) Project Program Awards
WOS subject categories
Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Health Care Sciences & Services; Information Science & Library Science; Medical Informatics