Validity of Acute Cardiovascular Outcome Diagnoses Recorded in European Electronic Health Records: A Systematic Review

Background: Electronic health records are widely used in cardiovascular disease research. We appraised the validity of stroke, acute coronary syndrome and heart failure diagnoses in studies conducted using European electronic health records. Methods: Using a prespecified strategy, we systematically searched seven databases from dates of inception to April 2019. Two reviewers independently completed study selection, followed by partial parallel data extraction and risk of bias assessment. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value estimates were narratively synthesized, and heterogeneity between sensitivity and PPV estimates was assessed using I². Results: We identified 81 studies, of which 20 validated heart failure diagnoses, 31 validated acute coronary syndrome diagnoses with 29 specifically recording estimates for myocardial infarction, and 41 validated stroke diagnoses. Few studies reported specificity or negative predictive value estimates. Sensitivity was ≤66% in all but one heart failure study, ≥80% for 91% of myocardial infarction studies, and ≥70% for 73% of stroke studies. PPV was ≥80% in 74% of heart failure, 88% of myocardial infarction, and 70% of stroke studies. PPV by stroke subtype was variable, at ≥80% for 80% of ischaemic stroke but only 44% of haemorrhagic stroke. There was considerable heterogeneity (I² >75%) between sensitivity and PPV estimates for all diagnoses. Conclusion: Overall, European electronic health record stroke, acute coronary syndrome and heart failure diagnoses are accurate for use in research, although validity estimates for heart failure and individual stroke subtypes were lower. Where possible, researchers should validate data before use or carefully interpret the results of previous validation studies for their own study purposes.


Introduction
Ischaemic heart disease and cerebrovascular disease have been the leading causes of death globally for more than 15 years. 1 In Europe, cardiovascular disease (CVD) deaths and prevalence have decreased but remain substantial; in 2015 an estimated 85 million people had CVD including 11.3 million with new diagnoses. 2 Research into CVD determinants and outcomes increasingly utilizes electronic health records (EHRs). EHRs contain comprehensive longitudinal health data, extracted from primary and secondary care clinical systems, for large patient populations, providing cost-effective data for research. EHR data is mostly "structured" with diagnoses coded using, for example, the International Classification of Diseases (ICD) but can also be "unstructured" with anonymized free-text notes. 3 EHR-based research predominantly uses structured data. As the primary purpose of EHR data collection is clinical, it is essential to consider the validity of the data's use in research.
EHR use is widespread in Europe, where many countries have national healthcare systems, and several systematic reviews have previously explored the quality of specific European EHRs. [4][5][6][7] Other systematic reviews [8][9][10][11][12] have investigated the validity of CVD diagnoses in computerized health-related records, which included EHRs but mainly drew results from disparate claims-based systems. The previous reviews did not separate results for EHR and claims data, the quality of which may differ due to the differences in setup and collection rationale.
In our systematic review, we provide an up-to-date assessment of the validity of acute CVD diagnoses recorded in European EHRs. We defined acute CVD as heart failure (HF), acute coronary syndrome (ACS), and stroke. These high-burden conditions are key diagnoses commonly included in the composite endpoint of major adverse cardiovascular events (MACE) which is increasingly employed in both clinical trials and observational research studies. 13 We investigated whether the validity of these diagnoses differed by subtype, definition, data source, reference standard, and study population.

Protocol and Registration
Our protocol was published in October 2019 14 following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocol guidelines (PROSPERO registration number CRD42019123898).

Eligibility Criteria
We included articles that validated diagnoses in patients aged ≥16 years captured in any European primary or secondary care EHR. We excluded claims-based databases, disease registries, vital registration systems, or locally held databases. Articles needed to validate clinical codes for the diagnoses of HF, ACS, or stroke (Table 1) against a suitable internal or external reference standard. HF is most frequently a chronic condition which can deteriorate with acute exacerbations. HF may also have an acute onset, for example after an MI. The European Society of Cardiology (ESC) defines acute HF as rapid onset or worsening of symptoms and/or signs of existing HF. 15 ACS encompasses different clinical forms of myocardial ischaemia which includes myocardial infarction (MI) and unstable angina. The specific diagnosis of MI or unstable angina depends on symptoms, signs, biomarkers, and ECG and/or autopsy findings, with the definitions refined over time. 16 The diagnosis of stroke includes subtypes ischaemic stroke, intracerebral haemorrhage (ICH), and subarachnoid haemorrhage (SAH). 17 At least one validation estimate (Figure 1) or the raw data to calculate it was required.
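As an illustration of the four validation estimates required for inclusion, the sketch below derives them from a 2×2 cross-tabulation of the EHR code against the reference standard. The class and function names and the counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidationCounts:
    """Hypothetical 2x2 table of EHR coding against a reference standard."""
    tp: int  # EHR code present, reference standard confirms the diagnosis
    fp: int  # EHR code present, reference standard does not confirm it
    fn: int  # no EHR code, reference standard confirms the diagnosis
    tn: int  # no EHR code, reference standard does not confirm it


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None rather than dividing by zero when a margin is empty.
    return numerator / denominator if denominator else None


def validation_estimates(c: ValidationCounts) -> dict:
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Illustrative counts only.
counts = ValidationCounts(tp=80, fp=20, fn=40, tn=860)
estimates = validation_estimates(counts)
```

A study that samples only patients with the code recorded in the EHR observes the tp and fp cells alone, so PPV is the only estimate it can report.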

Information Sources
We searched for eligible articles in five databases (Medline, Embase, Scopus, Web of Science, and Cochrane Library), two grey literature sources (OpenGrey and Ethos), and, where available, the bibliographies of EHR databases from the date of inception to April 2019 in any language.

Search Strategy
We searched medical subject heading terms and free-text (in the title and abstract) for the concepts of (1) CVD

Study Selection and Data Collection
Two reviewers (J.A.D. and R.M.) independently screened the titles and abstracts of all retrieved articles, followed by the full-text of articles deemed eligible in the first stage. Our published protocol details the full data collection process. 14 Briefly, we extracted data using a pre-defined template (S2 Appendix) which we piloted using dual extraction for three studies, followed by further parallel extraction for 20% of studies, and completed by a single reviewer (J.A.D.) for the remaining studies.

Risk of Bias in Individual Studies
We used a modified version of the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) 18 tool to assess bias (S3 Appendix). As with our data extraction, two authors (J.A.D. and R.M.) piloted the tool for three studies, then independently assessed risk in a further 10% of studies, with the process completed by a single reviewer (J.A.D.).

Synthesis of Results
We synthesized results with a narrative approach, grouping studies by acute CVD diagnosis (HF, ACS or stroke) and, where possible, subgroups of interest. Subgroups were: diagnosis type, definition, data source (including diagnostic position and coding system), reference standard, and study population (including time period, age and sex). For studies that reported validation estimates without confidence intervals (CIs), but included raw data, we calculated 95% CIs using the Wilson method for binomial proportions. We used the I² statistic to assess heterogeneity between the sensitivity and positive predictive value (PPV) estimates, following the Cochrane thresholds. 19 Heterogeneity assessment did not include specificity or negative predictive value (NPV), as few studies reported these measures.
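For readers unfamiliar with the Wilson method, a minimal sketch of the score interval for a binomial proportion follows. The function name and the example counts (80 confirmed diagnoses among 100 coded cases) are illustrative assumptions, not data from the review.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.959964 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the bounds.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical PPV of 80/100.
lower, upper = wilson_ci(80, 100)
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 to 1 and behaves sensibly for proportions close to either bound, which matters for the high PPVs reported by many validation studies.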
To investigate sources of heterogeneity, we compared I² before and after removing studies at a high risk of bias and by the previously mentioned subgroups. We used the Stata metaprop command 20 to calculate I². Metaprop uses raw data rather than precalculated estimates; studies that reported sensitivity or PPV but not the data used to calculate them were excluded from heterogeneity assessment.
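The general idea behind I² can be sketched from raw counts as below. This is not a reimplementation of metaprop: the logit transformation, the 0.5 continuity correction applied to every study, and the inverse-variance weights are assumptions made for illustration, and the function name and example counts are hypothetical.

```python
import math


def i_squared(events, totals):
    """Cochran's Q and I^2 (%) for proportions on the logit scale."""
    if len(events) != len(totals) or len(events) < 2:
        raise ValueError("need matching lists with at least two studies")
    effects, weights = [], []
    for x, n in zip(events, totals):
        # Adding 0.5 to each cell keeps the logit finite when x == 0 or x == n.
        a, b = x + 0.5, n - x + 0.5
        effects.append(math.log(a / b))
        weights.append(1.0 / (1.0 / a + 1.0 / b))  # inverse of logit variance
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of total variation beyond chance, floored at zero.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# Three hypothetical studies with PPVs of 50%, 90% and 70%.
q_stat, i2 = i_squared([50, 90, 70], [100, 100, 100])
```

Because the calculation needs the numerator and denominator of each proportion, a study reporting only a percentage cannot contribute, which is why such studies were excluded from the heterogeneity assessment.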

Risk of Bias Across Studies
We used the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) tool for diagnostic accuracy systematic reviews 21 to summarise cross-study quality. Evidence was categorised as "high", "moderate", "low" or "very low" quality. See S4 Appendix for the reasons we rated quality down or up.

Studies Included
We identified 4595 studies, of which 218 were included in full-text review and 81 met eligibility criteria (Figure 2). Study characteristics are summarized in S1

Diagnosis Type
In the three studies that reported results for first diagnosis, the PPV range was 76-88%. 28,29,77 One study compared the PPV for all diagnoses (84%) to first diagnosis (80%), 28 and another study found the same PPV for first diagnosis and recurrent diagnosis (both 76%). 29

Definition
In seven of the eight studies 24

Diagnostic Position
Six studies 29,33,43,54,77,83 reported HF recorded in any diagnostic position (PPV 76-96%) and two studies 30,88 only included primary position (PPV 87% and 100%). Three studies, 33,77,83 which validated any position, also included breakdowns by primary (PPV 88-96%) and secondary (PPV 66-84%) positions.

46,52 Four studies 22,37,68,76 presented overall ACS results, of which one study 68 included an additional breakdown for MI and two studies 37,76 included unstable angina and MI, one of which also included cardiac arrest. 37 A further two studies 29,65 did not report results for ACS overall but did include both unstable angina and MI. The remaining 25 studies solely validated MI diagnoses. 23
Four studies 29,32,37,84 reported the PPV for first MI, with estimates of 75-97%, and one study 29 also included recurrent MI with a PPV of 88% compared to 97% for first MI.

Definition
Varying MI definitions were used (S6 Appendix). Most frequently (nine studies) 26,27,50,70,75,81,84,99,100 the World Health Organization (WHO) Monitoring trends and determinants in cardiovascular disease (MONICA) definition 106 was used, with variable PPV estimates of 53-96% obtained. Two studies compared MONICA to another MI definition; one 75 showed MONICA-defined definite MI had a substantially lower PPV than AHA/ESC-defined 16 definite MI (53% vs 86%), while the other 84 also showed a lower PPV for MONICA compared to "normal clinically defined MI" but with a smaller difference (81% vs 89%). One further study used the AHA/ESC definition 37 (PPV 82%). The universal definition 107 was used in a study 23 which included EHR data from three countries, with PPVs of 75-100%. Three studies used the third universal definition, 108 one 76 of which combined it with the earlier universal definition (PPV 85%). In another, 53 PPVs of 92% were obtained for the primary and secondary care EHRs validated. The third 34 validated MI diagnoses recorded for patients with drug-eluting coronary stents; the PPV was 42% for all admissions and 73% for acute admissions.

Heterogeneity
We were able to assess the heterogeneity between the main PPV reported in: 14 studies with 16 estimates of HF (I² = 97.0%), 18 studies with 26 estimates of MI (I² = 98.5%), and 19 studies with 20 estimates of stroke (I² = 97.9%) diagnoses. Additionally, we assessed heterogeneity between the main sensitivity for: six studies of HF (I² = 98.6%), four of MI (I² = 74.3%), and 11 of stroke (I² = 98.8%) diagnoses. Heterogeneity between the estimates was considerable, at >95% in all cases other than sensitivity estimates for MI. Furthermore, heterogeneity remained considerable after removal of studies at a high risk of bias.

Overall Strength of Evidence
GRADE showed that cross-study quality was very low for all HF outcomes (sensitivity and PPV in secondary care EHRs and PPV in primary care EHRs). For MI, quality was low for sensitivity and PPV in secondary care EHRs and moderate for PPV in primary care EHRs. For stroke, quality was very low for sensitivity in secondary care EHRs and for PPV in primary care EHRs, and moderate for PPV in secondary care EHRs.

Summary of Findings
Our systematic review suggests that the sensitivity of coded data in European EHRs for HF diagnoses is low at ≤66% in all but one study. There was also wide variation in stroke sensitivity estimates, with only half of studies ≥80%, although three-quarters were ≥70%. The sensitivity of ACS was higher at ≥80% in the vast majority of studies. The majority of studies which validated ACS diagnosis did so specifically for MI.
The PPV of all diagnoses was ≥80% in the majority of studies: two-thirds of HF studies (nearly three-quarters for secondary care EHRs), nearly three-quarters of MI studies, and 70% of stroke validation studies. Where subtypes were validated, PPV was ≥80% for four-fifths of ischaemic stroke diagnoses but only 44% of ICH and SAH diagnoses.
The specificity and NPV were also high where available (three HF studies, three MI studies and five stroke studies). However, as most studies only included patients with the diagnosis of interest recorded in the EHR and reference standard, the results presented were mostly limited to sensitivity and PPV.
Both PPV and NPV are impacted by disease prevalence, with lower PPV estimates for rare conditions. 111 Our systematic review focused on Europe, drawing studies from 11 countries. Age-standardized prevalence of CVD in these countries is between 5000 and 6500 per 100,000, other than the Czech Republic (~8700 per 100,000) which only contributed one study. 2 Therefore, prevalence differences should have limited impact on our comparison of validity estimates between geographies. The prevalence of CVD increases with age, but we did not find any systematic difference in results between studies with younger or older populations.
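The dependence of predictive values on prevalence follows from Bayes' theorem and can be sketched as below. The sensitivity (0.80) and specificity (0.98) are hypothetical values chosen for illustration; the three prevalences correspond to the per-100,000 figures quoted above, and the function name is an assumption.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence              # true positives per person
    fp = (1 - specificity) * (1 - prevalence)  # false positives per person
    fn = (1 - sensitivity) * prevalence        # false negatives per person
    tn = specificity * (1 - prevalence)        # true negatives per person
    return tp / (tp + fp), tn / (tn + fn)


# 5000, 6500 and 8700 per 100,000 expressed as proportions.
results = {
    prevalence: predictive_values(0.80, 0.98, prevalence)
    for prevalence in (0.05, 0.065, 0.087)
}
for prevalence, (ppv, npv) in results.items():
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, PPV rises and NPV falls as prevalence increases. In practice the prevalence that matters is that of the population sampled for validation, which may differ from these age-standardized national figures.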
The low sensitivity of HF diagnoses we identified is consistent with a previous systematic review validating HF diagnoses in administrative data, which identified three European studies. 11 Twelve more studies have since been published and included in our review. These more recent findings, however, do not suggest any improvement in the quality of data over time. This is perhaps unsurprising given the range of clinical aetiology and presentation. The high proportion of studies we found to have a PPV of <80% for stroke diagnoses appeared more substantial than in previous systematic reviews. 9,12 We identified 15 new studies which were not included in these previous reviews. 25,32,45,51,56,57,[61][62][63]74,78,89,91,92,98 Our results for sensitivity and PPV of MI diagnoses are consistent with previous reviews, 8,10 and we identified five 29,32,34,76,98 new MI validation studies with variable results.
There was substantial heterogeneity between the sensitivity and PPV estimates for all three acute CVD diagnoses. Heterogeneity was likely because studies differed in multiple ways; for example, even among studies which used medical record review as the reference standard, differences in study time period impacted upon the ICD version used. The heterogeneity caused by variable methods was highlighted in previous systematic reviews of

Defining Diagnosis in the EHR
We were most interested in the results of ICD-10 validation, as this is the latest ICD coding system which is widely used in Europe and elsewhere. In McCormick et al's 10 review of MI diagnoses in administrative data, the authors noted a lack of ICD-10 validation with only three studies identified, whereas our review identified 10. Nevertheless, even within ICD-10, combinations of codes used, and therefore their validity, differed, which highlights the importance of tailoring codes to each research question. Codes are arguably even more important when using other, more complex coding systems such as Read codes, which are used in UK primary care data and can generate vast numbers of codes for every clinical condition.

Defining Diagnosis in the Reference Standard
There is no single recommended gold standard to determine the validity of EHR data. 114 Nearly three-quarters (74%) of studies used medical records; more frequently for HF diagnoses (85%) than ACS (71%) or stroke (68%). This difference may be due to availability of MI and stroke registries, used in 26% and 22% of studies, respectively. No differences in the performance of the reference standard methods were discernable, probably due to heterogeneity. Criteria to define CVD, especially MI, have been refined over time, driven by the development of more sensitive and specific biomarkers, and more precise imaging techniques. 100 However, we did not identify any temporal trends in the accuracy of MI recording, again likely due to overall study heterogeneity.
When validating HF, which can vary in clinical aetiology and presentation, clarity on the criteria used to define it, with explicit classification of acute and chronic HF along with ejection fraction, would benefit understanding of results.

Comparing and Combining Data Sources
Only 14 (17%) studies validated primary care systems, more than half of which were in the UK. Using primary care EHRs may be beneficial for research into conditions such as HF which are frequently managed in primary care; in our study, 30% of HF EHR validation studies used primary care data, compared to 16% for ACS and 7% for stroke studies. For acute severe conditions resulting in hospitalization, secondary care records should be the most reliable data source. Where possible, the use of linked data to increase the ascertainment of acute CVD events should be considered.

Implications for Future Research
EHR-based research is a growing field, widely used in observational analyses and increasingly employed in trials. 115 Researchers should consider the level of validity necessary for their own CVD outcome definition. When a composite outcome, such as MACE, is used, researchers may need to address differing sensitivity in the individual components of the outcome. In studies which investigate CVD incidence, a sensitive definition is particularly important. For example, EHR data are being used for rapid COVID-19 pandemic analyses, such as the impact of the virus in those with CVD, CVD as an outcome after infection with the virus, and excess death estimates. 116 It is important that these rapid analyses consider the validity of the data and definitions used. Conversely, in recruitment for a pragmatic trial, a specific definition is likely more important than a sensitive one.

Strengths and Limitations
Our systematic review provides a comprehensive and up-to-date evaluation of the validity of acute CVD diagnoses in European EHRs, conducted without language or time restrictions using a broad search strategy. Two independent reviewers performed our study selection, and native-speaking collaborators translated foreign language articles. Similar to other systematic reviews of validation studies, we repurposed the QUADAS-2 risk of bias tool developed for diagnostic test accuracy. Additionally, we followed the diagnostic test accuracy GRADE methodology to assess the overall evidence base.
Our work is not without limitations. Firstly, only one reviewer completed full data extraction and risk of bias assessment due to resource constraints, although a sample of 20% of studies had data dual extracted. Secondly, we limited our study to Europe, so theoretically our results are only generalizable to European countries. All previous systematic reviews 8-12 on the validity of acute CVD diagnoses included both EHRs and claims-based systems, while most studies included in each of these reviews were from North America. From these existing reviews, it was unclear if the validity of EHRs differed from that of claims-based datasets, which reflect payments related to medical care given. Despite this, we obtained similar results to the previous reviews. Thirdly, our review focused on acute CVD events so excluded results from studies that validated broader diagnoses of ischaemic heart disease or cerebrovascular disease, which again limits generalizability to these specific conditions.

Recommendations
For ACS and stroke diagnoses, most sensitivity and PPV results were reasonably high, providing confidence in the use of European EHR data for research into these conditions. However, there was considerable heterogeneity between studies. Sensitivity for HF diagnoses was low, and our GRADE assessment found very low quality for all HF outcomes. For studies of HF, we strongly recommend either validating the definition or referring to existing validation studies to develop the case definition. New validation studies of HF diagnoses should report whether the diagnoses validated are for acute or chronic presentation and HF with reduced ejection fraction or preserved ejection fraction. These principles are also applicable to future ACS and stroke validation studies. Identifying specific stroke subtypes can be difficult; analysis of all stroke subtypes combined is preferable.

Conclusions
Our review on the accuracy of HF, ACS and stroke diagnoses in European EHRs should guide researchers in their selection of data sources and CVD definitions for epidemiological studies. Generally, the data assessed was of reasonable quality. However, it is difficult to summarize validity given the heterogeneity between studies. Where possible, researchers should validate data before use or carefully interpret the results of previous validation studies to consider the impact validity has on research findings. Additionally, the use of linked data will bolster quality.