Discussion
Almost 30 years ago, EBM23 was introduced to a wide
medical audience and was subsequently judged to be one of the
most important medical milestones of the last 160 years, in the same
category as innovations such as antibiotics and
anesthesia.24 At the heart of EBM is the notion that “not
all evidence is created equal”: some evidence is more credible than
other evidence, and the higher the quality of the evidence, the more accurate and
trustworthy are our estimates of the true effects of health
interventions.1 Surprisingly, however, the
relationship between CoE and estimates of treatment effects has not been
empirically evaluated.
Here, we provide the first empirical support for the foundational EBM
principle that evidence of low CoE changes more often than evidence of high CoE
(Fig 2). However, we found no difference in effect sizes between studies
appraised as very low vs high CoE [or very low/low vs moderate/high CoE
(Fig 3)]. This implies that effects assessed as less
trustworthy or potentially unreliable (as when CoE is low) cannot be
distinguished from those presumed to be more
trustworthy and accurate (as when CoE is high). If the magnitude of
treatment effects cannot be meaningfully distinguished between evidence
appraised as high quality and evidence appraised as low quality, then this core
principle of EBM is challenged.
Our “negative” results should not be construed as a challenge to
sound, normative EBM epistemological principles, which hold that the optimal
practice of medicine requires explicit and conscientious attention to
the nature of medical evidence.1,25,26 Rather, in
assessing the relationship between CoE and the “true” effects of health
interventions, the more salient question is whether current appraisal
methods capture CoE as intended by EBM principles. Critical
appraisal of CoE is an integral part of the conduct of systematic reviews
and of guideline development, and it is widely taught in the curricula of most
medical and allied professional schools across the world. Over the
years, many critical appraisal methods have been
developed,1 eventually culminating in the
GRADE methodology, which has been endorsed by more than 110 professional
organizations.7 However, as we demonstrate here,
despite GRADE’s capacity to distinguish among CoE categories, it could
not reliably discern the influence of CoE on the estimates of
treatment effects, and we suspect that none of the other appraisal methods
that GRADE has replaced could do so either. Our results agree with those of
Gartlehner et al, who, based on a cumulative meta-analysis of 37 Cochrane
reviews, found27 that GRADE had limited value in predicting the stability of
the strength of evidence as new studies emerged.
The finding that the magnitude of
effect size is not reflected in a change of CoE is surprising, as
previous meta-epidemiological studies have shown that various study
limitations that affect CoE significantly influence estimates of
treatment effects28 (although not always
consistently16). For example, as measured by the ROR,
inadequate or unclear (vs adequate) random-sequence generation,
inadequate or unclear (vs adequate) allocation concealment, and lack of
or unclear double-blinding (vs double-blinding) led to statistically
significant exaggerations of treatment effects of 11%, 7%, and 13%,
respectively.28
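As a worked illustration of how an ROR translates into a percentage exaggeration (under the common meta-epidemiological convention that outcomes are coded so that an odds ratio below 1 favours the intervention, in which case an ROR below 1 indicates exaggeration of the apparent benefit; the 0.89 value shown here is purely illustrative):

\[
\mathrm{ROR}=\frac{\mathrm{OR}_{\text{trials with inadequate methods}}}{\mathrm{OR}_{\text{trials with adequate methods}}},\qquad
\text{e.g., } \mathrm{ROR}=0.89 \;\Rightarrow\; (1-0.89)\times 100\% = 11\%\ \text{exaggeration of the treatment effect.}
\]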
These study limitations are taken into
account when rating CoE with the GRADE method,6 so one
would expect effect sizes to differ between low and high CoE in
GRADE assessments. On further examination, however, we observe that
GRADE combines study limitations such as the adequacy of allocation
concealment and blinding (risk of bias) with assessments of
inconsistency, imprecision, indirectness, and publication bias to assign
the final CoE rating (from very low to high quality) in an additive
fashion.12,29 It appears that additively combining factors that may
shift the treatment effect in opposite directions could unhelpfully
neutralize their influence and introduce imprecision into the overall
estimate. Thus, one can have the same estimate of treatment effect but
completely different GRADE ratings. This is problematic, however,
because a central assumption of GRADE is that estimates
underpinned by high CoE are unlikely to change, whereas very low/low
CoE estimates are more likely to change.
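To make the additive logic concrete, the following minimal sketch (our own simplification, not the official GRADE procedure, which also involves domain-specific judgement and possible upgrading) shows how serious concerns are summed into a single rating irrespective of the direction in which each concern may shift the effect estimate:

# Schematic, simplified additive downgrading in the spirit of GRADE.
LEVELS = ["very low", "low", "moderate", "high"]

def rate_coe(risk_of_bias=0, inconsistency=0, imprecision=0,
             indirectness=0, publication_bias=0):
    # Randomized evidence starts at "high"; each serious concern (1 = serious,
    # 2 = very serious) lowers the rating by one or two levels.
    downgrades = (risk_of_bias + inconsistency + imprecision
                  + indirectness + publication_bias)
    return LEVELS[max(0, len(LEVELS) - 1 - downgrades)]

print(rate_coe())                               # "high"
print(rate_coe(risk_of_bias=1, imprecision=1))  # "low"

In such a scheme, two bodies of evidence with identical pooled effect estimates can receive very different ratings, and concerns that inflate the effect contribute to the downgrade in exactly the same way as concerns that attenuate it.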
A potential limitation of our study is that we did not collect data
on the individual factors that drove the assessment of CoE (e.g., study
limitations/risk of bias vs inconsistency, imprecision, or indirectness).
However, the present empirical report targets, for the first
time, the end-stage assessment of CoE, according to GRADE
specifications, which is how CoE is used in practice to aid the
interpretation of evidence and inform the development of clinical
guidelines.
We also detected imprecision in the estimates of effect sizes and
relatively wide ROR confidence intervals, particularly in the subgroup
of meta-analyses describing treatment effects in reviews whose CoE
changed from moderate/high to low/very low. It may be argued that
the current methods of CoE appraisal are simply not sensitive enough and
that, with a much larger sample of SRs/MAs, we would be able to
differentiate effect sizes across categories of CoE. This point
was made by Howick and colleagues,30 who showed no
change in CoE between original and updated reviews in the set of
48 trials they examined, although they made no attempt to identify changes
in effect sizes. However, obtaining a larger sample is unrealistic
given that we reviewed almost all SRs in the Cochrane database published since the
GRADE assessment of CoE was mandated (up to May 2021). Finally, few of the
Cochrane Reviews we analyzed included observational studies. It is
possible that GRADE does not differentiate the quality of randomized
evidence well but performs better when randomized evidence is compared
with observational evidence. Cochrane Reviews,
however, are typically based on randomized trials. Therefore,
categorization of CoE based on the currently mandated critical appraisal
system using GRADE in Cochrane Reviews does not meaningfully
separate effect sizes across the existing gradations of CoE (although the
capacity of GRADE to distinguish the magnitude of effect size between
randomized and observational studies outside the purview of Cochrane
Reviews remains a worthwhile question for further empirical research).
Given that studies can be well conducted, and can estimate treatment
effects correctly, yet be poorly reported,31,32 it is also
possible that we could not detect an influence of CoE on the estimates of
treatment effects because current critical appraisal methods depend on
the quality of reporting of the trials selected for
meta-analysis. However, if we believe that the quality of reporting does not
matter, then the entire critical appraisal effort can be considered
misplaced to begin with.