Clinical Epidemiology, Volume 10

Combining distributed regression and propensity scores: a doubly privacy-protecting analytic method for multicenter research

Authors Toh S, Wellman R, Coley RY, Horgan C, Sturtevant J, Moyneur E, Janning C, Pardee R, Coleman KJ, Arterburn D, McTigue K, Anau J, Cook AJ

Received 25 June 2018

Accepted for publication 17 October 2018

Published 27 November 2018, Volume 2018:10, Pages 1773–1786


Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 5

Editor who approved publication: Professor Vera Ehrenstein

Sengwee Toh,¹ Robert Wellman,² R Yates Coley,² Casie Horgan,¹ Jessica Sturtevant,¹ Erick Moyneur,³ Cheri Janning,⁴ Roy Pardee,² Karen J Coleman,⁵ David Arterburn,² Kathleen McTigue,⁶ Jane Anau,² Andrea J Cook²

On behalf of the PCORnet Bariatric Study Collaborative

¹Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA; ²Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA; ³StatLog Econometrics, Inc., Montreal, QC, Canada; ⁴Duke Clinical and Translational Science Institute, Durham, NC, USA; ⁵Kaiser Permanente Southern California, Pasadena, CA, USA; ⁶Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA

Purpose: Sharing of detailed individual-level data continues to pose challenges in multicenter studies. This issue can be addressed in part by using analytic methods that require only summary-level information to perform the desired multivariable-adjusted analysis. We examined the feasibility and empirical validity of 1) conducting multivariable-adjusted distributed linear regression and 2) combining distributed linear regression with propensity scores, in a large distributed data network.
Patients and methods: We compared percent total weight loss at 1 year after surgery between Roux-en-Y gastric bypass and sleeve gastrectomy among 43,110 patients from 36 health systems in the National Patient-Centered Clinical Research Network. We adjusted for baseline demographic and clinical variables as individual covariates, as deciles of propensity scores, or as both, in three separate outcome regression models. We fit the three ordinary least squares linear regression models using distributed linear regression, a method that requires only summary-level information from sites (specifically, the sums of squares and cross-products matrix). A comparison set of analyses that used pooled deidentified individual-level data from sites served as the reference.
Results: Distributed linear regression produced results identical to those from the corresponding pooled individual-level data analysis for all variables in all three models. Across the three models, the maximum numerical difference in any parameter estimate or standard error was 3×10⁻¹¹.
Conclusion: Distributed linear regression analysis is a feasible and valid analytic method in multicenter studies for one-time continuous outcomes. Combining distributed regression with propensity scores via modeling offers more privacy protection and analytic flexibility.
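
The idea behind distributed linear regression can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two toy "sites", and the variables (a treatment indicator, age, and an outcome) are all hypothetical, and a small pure-Python Gauss-Jordan solver is used only to keep the example self-contained. Each site computes the sums of squares and cross-products (SSCP) matrix of its own augmented rows [1, covariates, outcome]; because these matrices add across sites, their sum is enough to recover the ordinary least squares estimates and standard errors without sharing individual-level rows.

```python
import math

def sscp(rows):
    """SSCP matrix of augmented rows [1, x1, ..., xk, y] for one site."""
    p = len(rows[0]) + 1  # intercept + covariates + outcome
    m = [[0.0] * p for _ in range(p)]
    for row in rows:
        z = [1.0] + list(row)
        for i in range(p):
            for j in range(p):
                m[i][j] += z[i] * z[j]
    return m

def add(a, b):
    """Element-wise sum of two SSCP matrices (the only cross-site step)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def invert(a):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(r) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, r in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [r[n:] for r in aug]

def ols_from_sscp(m):
    """OLS coefficients and standard errors from an augmented SSCP matrix."""
    k = len(m) - 1                      # coefficients, including intercept
    n = m[0][0]                         # 1'1 = number of observations
    xtx = [r[:k] for r in m[:k]]        # X'X
    xty = [r[k] for r in m[:k]]         # X'y
    yty = m[k][k]                       # y'y
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = yty - sum(b * v for b, v in zip(beta, xty))
    sigma2 = rss / (n - k)
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se

# Hypothetical toy data: (treatment, age, outcome) per patient.
site_a = [(1, 40, 30.5), (0, 52, 22.1), (1, 35, 33.0), (0, 47, 24.8)]
site_b = [(1, 60, 27.2), (0, 44, 25.9), (1, 51, 29.4), (0, 38, 27.5),
          (1, 45, 31.1)]

# Distributed fit: only the per-site SSCP matrices leave the sites.
distributed = ols_from_sscp(add(sscp(site_a), sscp(site_b)))
# Reference fit: pooled individual-level rows.
pooled = ols_from_sscp(sscp(site_a + site_b))
```

In this sketch, `distributed` and `pooled` should agree up to floating-point rounding, which mirrors the kind of comparison reported in the Results. Covariates such as propensity score decile indicators would simply enter as additional columns of each site's rows.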

Keywords: distributed regression, propensity score, distributed data networks, privacy-protecting methods

Creative Commons License This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at and incorporate the Creative Commons Attribution - Non Commercial (unported, v3.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.
