Clinical Epidemiology, Volume 12

External Validation of an Algorithm to Identify Patients with High Data-Completeness in Electronic Health Records for Comparative Effectiveness Research

Authors Lin KJ, Rosenthal GE, Murphy SN, Mandl KD, Jin Y, Glynn RJ, Schneeweiss S

Received 26 September 2019

Accepted for publication 6 December 2019

Published 4 February 2020 Volume 2020:12 Pages 133–141


Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 2

Editor who approved publication: Professor Henrik Toft Sørensen

Kueiyu Joshua Lin,1,2 Gary E Rosenthal,3 Shawn N Murphy,4,5 Kenneth D Mandl,6 Yinzhu Jin,1 Robert J Glynn,1 Sebastian Schneeweiss1

1Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA; 2Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; 3Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA; 4Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; 5Research Information Science and Computing, Partners Healthcare, Somerville, MA, USA; 6Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA

Correspondence: Kueiyu Joshua Lin
Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 1620 Tremont St. Suite 3030, Boston, MA 02120, USA
Tel +1 617 278-0930
Fax +1 617 232-8602

Objective: Electronic health record (EHR) data-discontinuity, i.e., receiving care outside of a particular EHR system, may cause misclassification of study variables. We aimed to validate an algorithm that identifies patients with high EHR data-continuity in order to reduce such bias.
Materials and Methods: We analyzed data from two EHR systems linked with Medicare claims data from 2007 through 2014, one in Massachusetts (MA, n=80,588) and the other in North Carolina (NC, n=33,207). We quantified EHR data-continuity by Mean Proportion of Encounters Captured (MPEC) by the EHR system when compared to complete recording in claims data. The prediction model for MPEC was developed in MA and validated in NC. Stratified by predicted EHR data-continuity, we quantified misclassification of 40 key variables by Mean Standardized Differences (MSD) between the proportions of these variables based on EHR alone vs the linked claims-EHR data.
Results: The mean MPEC was 27% in the MA system and 26% in the NC system. Predicted and observed EHR data-continuity were highly correlated (Spearman correlation = 0.78 and 0.73 in the MA and NC systems, respectively). The misclassification (MSD) of 40 variables in patients of the predicted EHR data-continuity cohort was significantly smaller (44%, 95% CI: 40–48%) than that in the remaining population.
Discussion: The comorbidity profiles were similar in patients with high vs low EHR data-continuity. Therefore, restricting an analysis to patients with high EHR data-continuity may reduce information bias while preserving the representativeness of the study cohort.
Conclusion: We have successfully validated an algorithm that can identify a high EHR data-continuity cohort representative of the source population.
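
As an illustration of the two metrics named in the Methods, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give exact formulas, so two assumptions are made: MPEC for a patient is taken as the average, across encounter types, of the share of claims-recorded encounters that also appear in the EHR; and the standardized difference between two proportions uses the common pooled-variance form, with MSD as the mean of the absolute values across variables. The function names and the input layout are hypothetical.

```python
from math import inf, sqrt
from statistics import mean


def patient_mpec(counts):
    """Mean proportion of encounters captured for one patient.

    `counts` maps an encounter type to (ehr_count, claims_count).
    Assumption: types with no claims encounters are skipped.
    """
    proportions = [
        min(ehr / claims, 1.0)  # cap at 1 if the EHR holds extra records
        for ehr, claims in counts.values()
        if claims > 0
    ]
    if not proportions:
        raise ValueError("patient has no claims encounters")
    return mean(proportions)


def standardized_difference(p_ehr, p_linked):
    """Standardized difference between two proportions (assumed form)."""
    pooled_var = (p_ehr * (1 - p_ehr) + p_linked * (1 - p_linked)) / 2
    if pooled_var == 0:
        # Both proportions are 0 or 1; undefined unless they coincide.
        return 0.0 if p_ehr == p_linked else inf
    return (p_ehr - p_linked) / sqrt(pooled_var)


def mean_standardized_difference(pairs):
    """Average absolute standardized difference over (p_ehr, p_linked) pairs."""
    return mean(abs(standardized_difference(a, b)) for a, b in pairs)


# Hypothetical patient: 1 of 2 inpatient and 3 of 12 outpatient encounters captured.
example_patient = {"inpatient": (1, 2), "outpatient": (3, 12)}
print(patient_mpec(example_patient))

# Hypothetical prevalence of two variables: EHR alone vs linked claims-EHR data.
print(mean_standardized_difference([(0.2, 0.3), (0.5, 0.5)]))
```

Under these assumptions, a lower MSD within a predicted-continuity stratum would indicate that variable prevalences measured from the EHR alone lie closer to those from the linked data.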

Keywords: electronic medical records, data linkage, comparative effectiveness research, information bias, continuity, external validation

Creative Commons License: This work is published and licensed by Dove Medical Press Limited. The full terms of this license incorporate the Creative Commons Attribution - Non Commercial (unported, v3.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.
