SPORTSCIENCE · sportsci.org
Perspectives / Research Resources
Magnitude-Based Decisions as Hypothesis Tests
Will G Hopkins
Sportscience 24, 1-16, 2020 (sportsci.org/2020/MBDtests.htm)
Institute for Health and Sport, Victoria University, Melbourne, Australia. Email.

Magnitude-based decisions (MBD) is an inferential method that avoids the problems of null-hypothesis significance testing by interpreting the frequentist compatibility interval and sampling distribution in terms of uncertainty in the true value of an effect. This reference-Bayesian interpretation is defensible for the usual small (and any larger) sample sizes in sport and exercise research, if the statistical model is accurate, the measures are valid, the sample is representative, and the researcher proffers an uninformative or weakly informative prior belief in the magnitude of the effect. For statisticians who dispute the Bayesian interpretation and favor Popperian hypothesis testing, MBD can be formulated as interval hypothesis tests. In the clinical version of MBD, an effect is clear, has acceptable uncertainty, and is considered potentially publishable and implementable in a clinical or practical setting, when the true effect is most unlikely harmful and at least possibly beneficial. This requirement is equivalent to strong rejection of the hypothesis of a harmful effect (pH<0.005) and failure to reject the hypothesis of a beneficial effect (pB>0.25). In non-clinical MBD, an effect is clear and has acceptable uncertainty and publishability, when the true effect is very unlikely to be substantial of one or other sign. This requirement is equivalent to moderate rejection of one or other hypothesis of substantial magnitude (p+<0.05 or p–<0.05). […]
Hopkins: MBD as Hypothesis Tests                                                                                               Page 2
           Clinical MBD and the Hypothesis of Benefit .....................................................6
           Hypotheses for Non-Clinical MBD....................................................................7
           Combining the Hypotheses ..............................................................................7
           Sample-size Estimation ...................................................................................8
           New Terminology ............................................................................................9
           Type-1 Errors in MBD ......................................................................................9
           Lower P-value Thresholds? ........................................................................... 10
           A Practical Application of MBD ...................................................................... 11
           Conclusion ....................................................................................................11
           References....................................................................................................12
           Appendix: Reporting MBD in Journals ........................................................... 13

Associate editor's note. This article is now open for post-publication peer review. I invite you to write comments in the template and email to me, Ross Neville. You may also comment on the In-brief item on Moving Forward with Magnitude-Based Decisions. The original version with tracked changes resulting in the current version is available as a docx here.

Introduction
   When researchers study a sample, they obtain an estimate of the magnitude of an effect statistic, such as a change in a mean measure of health or performance following an intervention. With the usual but crucial assumptions about representativeness of the sample, validity of the measures, and accuracy of the statistical model, a sample of sufficiently large size yields an accurate estimate of the true or population magnitude, because repeated samples would yield practically identical estimates. Sample sizes are seldom this large, so an assertion about the true magnitude of an effect should account for the fact that the sample value is only an approximate estimate of the true value.
   In this article I explain how to make an assertion about the true magnitude via the usual approach of the null-hypothesis significance test (NHST) and via magnitude-based inference (MBI), an approach that has been severely criticized recently (Sainani, 2018; Sainani et al., 2019; Welsh & Knight, 2015). Although the criticisms have been addressed (Hopkins, 2019a; Hopkins & Batterham, 2016; Hopkins & Batterham, 2018), it has been suggested that MBI might be more acceptable to at least some members of the statistical community, if it were presented as hypothesis tests, along with a name change to magnitude-based decisions (MBD; Hopkins, 2019a). This article is my response to that suggestion. I include an appendix with guidelines for presenting MBD as hypothesis tests in a manuscript.
   For the benefit of practitioners with little formal training in statistics (myself included), I have written the article with little recourse to statistical jargon. An article more suitable for an audience of statisticians has been submitted for publication by others (Aisbett et al., 2020); for the background to that article, see my posting and other postings each side of it in the datamethods forum, updated in the In-brief item on MBD in this issue. An updated version of a slideshow first presented at the German Sport University in July 2019 is also a succinct summary of this article and the In-brief item.

Inferential Methods
   The null-hypothesis significance test is the traditional approach to making an assertion about the true magnitude of an effect. In NHST, the data are converted into a sampling probability distribution for the effect (or a transformation of it), representing expected variation in the effect with repeated sampling (Figure 1). This distribution is well-defined, usually a t or z. Effect values spanning the central region of the distribution represent values that are most compatible with the sample data and the model, and the interval spanning 95% of the values is known as the 95% compatibility interval (95%CI). If the 95%CI includes zero, then the data and model are compatible with a true effect of zero, so the null hypothesis H0 cannot be rejected, as shown in Figure 1a. Figure 1b shows the same data as Figure 1a, but the region of the distribution to the left of the zero and the matching area on the other tail are shaded red. The total red area defines a probability (p) value, representing evidence against the hypothesis: the smaller the p value, the better the evidence against it. With a sufficiently small p value, you reject the hypothesis. The threshold p value is called the alpha level of the test, 0.05 for a 95%CI. The p value here is 0.04 + 0.04 = 0.08, which is >0.05, so the data and model do not support rejection of H0, and the effect is declared non-significant. Not shown in Figure 1 is the limiting case, when one limit of the interval touches zero, and p = 0.05. Figure 1c
shows another study of an effect where the data and model are not compatible with an effect of zero: the 95%CI does not include zero; equivalently the p value is <0.05. […] it will often be wide enough to include substantial values of the effect statistic, implying that substantial values are compatible with the data and model […]

   Figure 1. […] (b) […] H0 is not rejected (p>0.05). (c) The distribution of values and p value (0.02) for a different effect; 0 is not included in the 95%CI, therefore H0 is rejected (p<0.05).
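The NHST mechanics just described can be sketched in a few lines of Python. This is my illustration, not the article's: it assumes a z (normal) sampling distribution, whereas the article's examples may be based on a t distribution, so exact numbers can differ slightly.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nhst(effect: float, se: float):
    """Two-sided test of H0: true effect = 0, with the matching 95%
    compatibility interval, assuming a z sampling distribution."""
    p = 2.0 * (1.0 - norm_cdf(abs(effect) / se))  # total area in the two tails
    z95 = 1.960                                   # 97.5th percentile of z
    ci95 = (effect - z95 * se, effect + z95 * se)
    return p, ci95

# An observed effect of about 1.75 standard errors reproduces the Figure 1b
# scenario: p is about 0.04 + 0.04 = 0.08 (>0.05) and the 95%CI includes
# zero, so H0 is not rejected.
p, ci = nhst(1.75, 1.0)
```

Note that the 95%CI excludes zero exactly when p < 0.05: the interval and the test are two views of the same sampling distribution.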
uniform over the range of non-negligible likelihood, and thus only weakly informative relative to the likelihood function (S. Greenland, personal communication). When sample sizes are small, researchers should check that a realistic weakly informative prior does indeed make no practical difference to the compatibility interval, as noted in the article (Hopkins, 2019b) on Bayesian analysis with a spreadsheet. If a weakly informative prior results in noticeable "shrinkage" of either of the compatibility limits of an effect, or if the magnitude-based decision is modified by the prior, the researcher should justify such a prior and report the modified decision.
   Other researchers have proposed a Bayesian interpretation of the compatibility interval or the sampling distribution similar to those of MBI (Albers et al., 2018; Burton, 1994; Shakespeare et al., 2001). Some researchers interpret the compatibility interval as if it represents precision or uncertainty in the estimation of the value of an effect, but they stop short of identifying their approach as Bayesian (Cumming, 2014; Rothman, 2012). MBI differs from all these approaches by providing specific guidance on what constitutes adequate precision or acceptable uncertainty when making a decision about the magnitude of the effect. Partly for this reason, and partly because of concerns expressed about use of the word inference (Greenland, 2019), MBI is now known as a method for making magnitude-based decisions (Hopkins, 2019a).
   MBD nevertheless finds itself in a precarious position, pilloried by Bayesians who insist on informative priors and by frequentists who believe that science advances only in Popperian fashion by testing and rejecting hypotheses. Sander Greenland's suggestion to reframe MBD in terms of hypothesis testing was motivated by what he sees as the need for the well-defined control of error rates that underlie hypothesis testing. I have been opposed to hypothesis testing, for reasons espoused by many others, apparently as far back as Francis Bacon and Isaac Newton (e.g., Glass, 2010). My preference has been instead for estimation of the magnitude of effects and their uncertainty to make decisions about the true magnitude based on the Bayesian interpretation, but at the same time accounting for decision errors. I will now show that such decisions are equivalent to rejecting or failing to reject several hypotheses, and that the errors arising from making wrong decisions are the same as the errors with hypothesis testing.

Clinical MBD and the Hypothesis of Harm
   In a clinical or practical setting, it is important that the outcome of a study does not result in implementation of an intervention that on average could harm the population of interest. The hypothesis that the true effect is harmful is therefore a more relevant hypothesis to reject than the hypothesis that the true effect is zero (the standard null). Figure 2, adapted from Lakens et al. (2018), shows distributions, compatibility intervals and one-sided p values associated with the test of the hypothesis H0 that the true effect is harmful, for three different outcomes that could occur with samples. Formally, the test of the harmful hypothesis is a non-inferiority test, in which rejection of the hypothesis of inferiority (harm) implies the effect is non-inferior (e.g., Castelloe & Watts, 2015).

   Figure 2. Three examples of testing an hypothesis H0 that an effect (e.g., of a treatment) is
   harmful. All harmful values fall in the purple region to the left of the smallest harmful value.
   The 95% compatibility interval (95%CI) in (a) includes harmful values, so harmful values are
   compatible with the sample and model, and H0 is not rejected. H0 is only just rejected in (b),
   and in (c) it is easily rejected with a 95%CI and only just rejected with a 99%CI. P values for
   the test, pH, are evaluated for only one tail of the probability distribution.

   Harmful values are anything to one side of the smallest harmful value, and since this value is negative in the figure, harmful values are anything to the left of this value. Positive harmful values (e.g., a threshold blood pressure for hypertension) would fall to the right of a smallest harmful positive value. The test of the hypothesis that the effect is any harmful value belongs to the class of one-sided interval hypothesis tests. The important point about such tests is that harmful values are compatible with the sample data and model, when the compatibility interval includes harmful values […]
   […] researcher fails to discover that the effect is harmful.
   When a compatibility interval overlaps harmful values, the true effect could be harmful, to use the Bayesian interpretation of the interval. Furthermore, the p value for the test of harm and the probability that the true effect is harmful in MBD are defined by the sampling distribution in exactly the same way. In the clinical version of MBD, a clinically relevant effect is considered for implementation only when the probability that the true effect is harmful is <0.005 […]
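The one-sided test of harm can be sketched as follows. This is my illustration rather than the article's spreadsheet: it assumes a z sampling distribution and harm on the negative side of a smallest harmful value of −1 unit.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_harm(effect: float, se: float, smallest_harmful: float = -1.0) -> float:
    """One-sided p value for H0: the true effect is harmful (at or beyond
    the smallest harmful value). Under the flat-prior Bayesian reading of
    the sampling distribution, this is also the chance that the true
    effect is harmful."""
    return norm_cdf((smallest_harmful - effect) / se)

# An observed effect 3 standard errors above the smallest harmful value
# gives pH of about 0.0013 (<0.005), so the hypothesis of harm is
# strongly rejected.
```

The same number serves as both the frequentist p value for the non-inferiority test and the MBD risk of harm, which is the equivalence the text is describing.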
   MBD is also effectively more conservative about avoiding harm than Ken Rothman's approach to precision of estimation. In his introductory text on clinical epidemiology (Rothman, 2012), he refers to 90% confidence intervals three times more often than 95% intervals, and he does not refer to 99% intervals at all.

Clinical MBD and the Hypothesis of Benefit
   Making a decision about implementation of a treatment is more than just using a powerful test for rejecting the hypothesis of harm (and thereby ensuring a low risk of implementing a harmful effect, to use the Bayesian interpretation). There also needs to be consideration that the treatment could be beneficial. Once again, a one-sided test is involved, formally a non-superiority test, in which rejection of the hypothesis of superiority (benefit) implies the effect is non-superior (Castelloe & Watts, 2015). See Figure 4.
   Figure 4. Three examples of testing an hypothesis H0 that an effect is beneficial. All beneficial
   values fall in the orange region to the right of the smallest beneficial value. The compatibility
   interval in (a) excludes beneficial values, so H0 is rejected. H0 fails to be rejected in (b) and (c).
   Additionally, in (c) the hypothesis of non-benefit is rejected (i.e., H0 is accepted). P values for
   the test of benefit, pB, are shown in (a) and (b), and for the test of non-benefit, 1–pB, in (c). The
   level of the compatibility interval is discussed below.
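Combining the test of harm (Figure 2) with the test of benefit (Figure 4) gives the clinical decision rule, with the thresholds stated in the abstract: strong rejection of harm (pH<0.005) together with failure to reject benefit at the 0.25 level (pB>0.25). A minimal sketch; the function name and outcome labels are mine, not the article's:

```python
def clinical_decision(pH: float, pB: float) -> str:
    """Clinical MBD as two one-sided interval tests.
    pH: one-sided p value for the test of harm (non-inferiority).
    pB: one-sided p value for the test of benefit (non-superiority)."""
    if pH < 0.005 and pB > 0.25:
        # most unlikely harmful AND at least possibly beneficial
        return "potentially implementable"
    if pH >= 0.005:
        # harm not strongly rejected: risk of harm is too high
        return "do not implement: could be harmful"
    # harm rejected, but benefit also rejected at the 0.25 level
    return "do not implement: benefit rejected"
```

With flat priors, pH and pB double as the chances that the true effect is harmful and beneficial, which is why the same rule reads either as two hypothesis tests or as a Bayesian risk-benefit decision.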

   With the same reasoning as for the hypothesis of harm, a compatibility interval that falls short of the smallest beneficial value implies that no beneficial values are consistent with the data and model (or in the Bayesian interpretation, the chance that the true effect is beneficial is too low), so you would not implement it (Figure 4a). A compatibility interval that overlaps beneficial values allows for beneficial values to be compatible with the sample and model (or for the possibility that the true effect is beneficial, to give the Bayesian interpretation), so you cannot reject the hypothesis that the effect is beneficial, so it could be worth implementing (Figure 4b and 4c). Figures 4a and 4b show the p value for the test of benefit, pB. Figure 4c shows the p value for the test of the hypothesis that the effect is not beneficial, 1 – pB: in this example, the compatibility interval falls entirely in the beneficial region, so the hypothesis of non-benefit is rejected.
   What level should the researcher choose for the compatibility interval and the associated threshold pB value in the test of the beneficial hypothesis? The level 50% was chosen via the Bayesian interpretation of a compatibility interval that just touches the beneficial region: the area of the tail overlapping into beneficial values is the probability that the true effect is beneficial, and a 50% compatibility interval has a one-sided 25% tail, which was considered a lower threshold for possibly beneficial. In other words, one should consider an effect to be possibly or potentially beneficial and implementable, provided there is a sufficiently low risk of harm.
   A threshold or alpha pB value of 0.25 implies an error rate of 25% for failing to discover the smallest beneficial effect (which can be shown with a figure similar to Figure 3). While this error rate may seem high, it is comparable with the 20% Type-2 error rate that underlies the calculation of sample size in conventional null-hypothesis testing. The calculation provides a sample size that would give statistical significance for 80% of studies, or a failed-discovery rate of 20%, when the true effect is the minimum clinically important difference (the smallest beneficial value).
   Of course, researchers are free to specify a lower threshold pB to reduce the rate of failing to discover benefit (here, failing to reject benefit), but a lower error rate comes with a cost. For example, if the threshold pB is 0.05, equivalent to
one tail of a 90% compatibility interval, a tail overlapping into the beneficial region with an area of only 6% is regarded as failure to reject the beneficial hypothesis, or from the Bayesian perspective, an observed quite trivial effect with a chance of benefit of only 6% is potentially implementable. Furthermore, a trivial true effect on the margin of smallest important would have a 95% rate of failure to reject the beneficial hypothesis, a very high false-discovery or Type-1 error rate. […]
   A consideration of sampling distributions similar to those in Figure 3 shows that the minimum desirable sample size for MBD would give a failed-discovery rate of 50% for a true effect that is 2× the smallest important, but the rate is only 5% for a true effect that is 3× the smallest important, which is borderline small-moderate in all my magnitude scales (Hopkins, 2010). The minimum desirable sample size in MBD should therefore satisfy those who promote sample-size estimation for MET. In any case, and regardless of sample size, the hypothesis test underlying MET is automatically available in MBD, as explained below.

Hypotheses for Non-Clinical MBD
   I have presented the clinical version of MBD above as two one-sided hypothesis tests. The tests for harm and benefit have to differ in their respective threshold p values, because it is ethically […] The tests of the two substantial hypotheses in non-clinical MBD can also be recast as one-sided tests, but now the threshold p values are the same, because rejecting the hypothesis of a substantial negative effect is just as important as rejecting the hypothesis of a substantial positive effect. A p value of 0.05, corresponding to one tail of a 90% compatibility interval, was chosen originally for its Bayesian interpretation: rejection of the hypothesis of one of the substantial magnitudes corresponds to a chance of […]

Combining the Hypotheses
   […] could be harmful means pH>0.005, so the risk of harm is at least very unlikely; could be beneficial means pB>0.25, so the chance of benefit is at least possibly; could be substantial means p+>0.05 and p–>0.05, or chances of both magnitudes are at least unlikely.
   When the true value of an effect is substantial of a given sign, the outcome consistent with this effect is failure to reject the hypothesis of that sign and rejection of the hypothesis of opposite sign. A feature of MBD is the level of evidence it conveys for the hypothesis that could not be rejected. For example, if the hypothesis of benefit is not rejected (and harm is rejected), the effect is reported with the probabilistic terms possibly, likely, very likely or most likely preceding beneficial. Each of these Bayesian terms has an equivalent p-value threshold for testing an hypothesis that the effect is not beneficial: […]
resulting in rejection of the non-beneficial hypothesis (pNB […]) […] of side effects.
   For clear effects that are possibly trivial and possibly substantial (including beneficial or harmful), I suggest presenting the effect as possibly substantial, regardless of which probability is greater, although stating that the effect is also possibly trivial would emphasize the uncertainty. Effects with adequate precision that are at least likely trivial can be presented as such in tables of results, without mention of the fact that one of the substantial magnitudes is unlikely while the other is at least very unlikely.
   Rejection of both substantial hypotheses implies a decisively trivial effect, which occurs when the compatibility interval is contained entirely within the trivial range of values. In non-clinical MBD with a 90% interval, this scenario represents an equivalence test, with the non-equivalence hypothesis rejected at the 0.05 level. Thus MBD also includes equivalence testing. A minor point here is that a decisively trivial effect can sometimes be likely trivial; for example, a 90%CI falling entirely in the trivial region, with p– = 0.03 and p+ = 0.04, implies pT = 1 – (p– + p+) = 0.93, which is likely trivial. Very likely trivial effects are, of course, always decisively trivial. In clinical MBD rejection of the beneficial hypothesis (pB […]
   […] publishable quantum, regardless of the true magnitude. […] 0.25); for a marginally substantial negative true effect, the error rate is 5% for deciding that the true effect could be substantially positive (failure to reject the substantial positive hypothesis, with p+>0.05).
   Aisbett et al. (2020) have suggested sample-size estimation based on minimal-effects (superiority) testing (MET) or equivalence testing (ET). I have already shown above that the MBD sample size is consistent with that of MET for the reasonable expectation of a marginally small-moderate effect, so there is no need to replace the MBD sample size with a MET sample size. In ET, the researcher needs a sample size that will deliver a decisively trivial outcome (by rejection of the non-trivial hypothesis), if the true effect is trivial. As in MET, the researcher posits an expected value of the true effect, but now the value has to be somewhat less than the smallest important. Unfortunately, the resulting ET sample size turns out to be impractical: if the researcher posits (unrealistically) an expected true effect of exactly zero, a simple consideration of sampling distributions similar to those in Figure 3 shows that the sample size needs to be 4× that of MBD to deliver a decisively trivial outcome. A more realistic expected trivial effect midway between zero and the smallest important […]
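The bookkeeping behind decisively trivial effects, including the identity pT = 1 – (p– + p+) used in the example above, can be sketched as follows. This is my illustration with a z sampling distribution; the article's spreadsheets may use t, so numbers will differ slightly.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def magnitude_chances(effect: float, se: float, smallest_important: float):
    """One-sided p values for the two substantial hypotheses (which, with a
    flat prior, are also the chances of substantial negative and positive
    true effects) and the implied chance of a trivial true effect."""
    p_minus = norm_cdf((-smallest_important - effect) / se)
    p_plus = 1.0 - norm_cdf((smallest_important - effect) / se)
    p_trivial = 1.0 - (p_minus + p_plus)
    return p_minus, p_trivial, p_plus

# A 90%CI lying entirely inside the trivial range rejects both substantial
# hypotheses (p_minus < 0.05 and p_plus < 0.05): a decisively trivial
# effect, equivalent to an equivalence test at the 0.05 level.
p_minus, p_trivial, p_plus = magnitude_chances(0.0, 0.5, 1.0)
```

Note that pT can be below the 0.95 threshold for very likely trivial even when the effect is decisively trivial, which is the minor point made in the text.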
the smallest important. I therefore see no need for a new method of sample-size estimation for MBD, but I have updated my article (Hopkins, 2020) and spreadsheet for sample-size estimation to include MET and ET.

New Terminology
   For researchers who dispute or wish to avoid the Bayesian interpretation of evidence for or against magnitudes in MBD, frequentist terms have been suggested, corresponding to p-value thresholds for rejection of the one-sided hypothesis tests: most unlikely, very unlikely, unlikely, and possibly correspond to rejection of an hypothesis with p<0.005, p<0.05, p<0.25, and p<0.75 […]
   […] However, use of the term clear to describe such effects may be responsible in part for misuse of MBI, whereby researchers omit the probabilistic term describing the magnitude and present it as if it is definitive (Lohse et al., 2020; Sainani et al., 2019). An effect that is clear and only possibly substantial is obviously not clearly substantial. Researchers must therefore be careful to distinguish between clear effects and clear magnitudes: they should refer to a clear effect as being clearly substantial or clearly trivial, when the effect is very likely or most likely substantial or trivial (moderately or strongly compatible with […]
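The correspondence between probabilistic terms and p-value thresholds can be collected into a small lookup. This sketch is mine: the lower thresholds (0.005, 0.05, 0.25) are quoted in the text, and I have assumed the upper thresholds (0.75, 0.95, 0.995) as their complements, in line with the MBD probability scale.

```python
def probabilistic_term(p: float) -> str:
    """MBD term for the chance that the true effect has the magnitude being
    tested, given the one-sided p value of the test (flat prior, so the
    p value equals the chance of that magnitude)."""
    if p < 0.005:
        return "most unlikely"   # strong rejection of the hypothesis
    if p < 0.05:
        return "very unlikely"   # moderate rejection
    if p < 0.25:
        return "unlikely"        # weak rejection
    if p < 0.75:
        return "possibly"
    if p < 0.95:
        return "likely"
    if p < 0.995:
        return "very likely"
    return "most likely"
```

Terms at or above possibly describe hypotheses that were not rejected, which is why the text insists the probabilistic qualifier must never be dropped when reporting a clear effect.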
the effect should be in no doubt about the modest level of evidence for the effect being beneficial (Hopkins & Batterham, 2016).
   The Type-1 error rates are even higher in the less conservative odds-ratio version of clinical MBD, according to which an unclear effect is declared potentially implementable, if there is a sufficiently high chance of benefit compared with the risk of harm (an odds ratio of benefit to harm greater than a threshold value of 66, derived from the odds for marginally beneficial and marginally harmful). Again, the errors occur mainly as possibly or likely beneficial (Hopkins & Batterham, 2016), but the practitioner needs to take into consideration loss of control of the error in rejecting the harmful hypothesis and therefore the increased risk of implementing a harmful effect.
   In non-clinical MBD, a Type-1 error occurs when the true effect is trivial and the 90% compatibility interval falls entirely outside trivial values (Hopkins & Batterham, 2016): a clearly substantial effect in the new terminology. The equivalent frequentist interpretation of this disposition of the compatibility interval is rejection of the hypothesis that the effect is not substantially positive (say), with p […]
   Similarly, in non-clinical MBD a Type-1 error occurs if the true effect is trivial and the trivial hypothesis is rejected. Rejection of the trivial hypothesis occurs when the compatibility interval covers only substantial values, the corresponding Bayesian interpretation being that the true effect is very likely substantial. Therefore possibly substantial or likely substantial outcomes, be they publishable or unclear, represent failure to reject the trivial hypothesis and therefore do not incur a Type-1 error, a crucial point that the detractors of MBI have not acknowledged (Sainani, 2018, 2019; Sainani et al., 2019; Welsh & Knight, 2015).
   Janet Aisbett (personal communication) suggested that "an error of sorts also occurs when you fail to reject a hypothesis that you should have." In other words, the outcome with a truly substantial effect should be rejection of the non-substantial hypothesis, and if you fail to reject that hypothesis, you have made an error. A similar error occurs with failure to reject the non-trivial hypothesis, when the true effect is trivial. Janet's suggestion is just another way of justifying sample size with MET or ET, which I have already dealt with.
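The threshold of 66 in the odds-ratio version mentioned above can be checked in a couple of lines: it is the odds of marginally possibly beneficial (chance 0.25) divided by the odds of marginally most unlikely harmful (risk 0.005). The function name is mine, for illustration.

```python
def odds_ratio_benefit_harm(chance_benefit: float, risk_harm: float) -> float:
    """Odds of benefit divided by odds of harm, the quantity compared with
    a threshold of ~66 in the odds-ratio version of clinical MBD."""
    odds_benefit = chance_benefit / (1.0 - chance_benefit)
    odds_harm = risk_harm / (1.0 - risk_harm)
    return odds_benefit / odds_harm

# The threshold itself: odds(0.25) / odds(0.005) is approximately 66.3.
threshold = odds_ratio_benefit_harm(0.25, 0.005)
```

An otherwise unclear effect whose benefit-harm odds ratio exceeds this value is declared potentially implementable in that version, at the cost of looser control of the error of implementing a harmful effect, as the text cautions.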
clinical effects (Hopkins, 2006). If this threshold were revised downward, non-clinical sample size would be greater, which seems unreasonable, so the probability thresholds in clinical MBD would also need revising downwards. For example, if all the p-value thresholds were halved, their S values would move up by one coin toss. For non-clinical MBI very unlikely (0.05 or 5%) would become 0.025 or 2.5% (S = 5.3). For clinical MBI most unlikely (0.005 or 0.5%) would become 0.0025 or 0.25%, and possibly (0.25 or 25%) would become 0.125 or 12.5% (S = 8.6 and 3.0). Sample sizes for clinical and non-clinical MBD would still be practically the same, but they would rise from the existing approximately one-third to about one-half those of NHST for 5% significance and 80% power. A lower p-value threshold for non-clinical effects would reduce the Type-1 error rates for such effects, but a lower pB would increase the Type-1 rates for deciding that trivial true effects could be beneficial. Lower p values require bigger effects for a given sample size, so there could be a risk of substantial bias for publishable effects with lower threshold p values, when the sample size is small. Unclear effects would also be more common and therefore less publishable with some of the unavoidably small sample sizes in sport and exercise science. On balance, I recommend keeping the existing probability thresholds.

A Practical Application of MBD

A colleague who is skeptical about claims of performance enhancement with the substances banned by the International Olympic Committee recently asked me to evaluate an article reporting the results of a placebo-controlled trial of the effects of injections of recombinant human erythropoietin (rHuEPO) on performance of cyclists (Heuberger et al., 2017). The authors concluded that "although rHuEPO treatment improved a laboratory test of maximal exercise, the more clinically relevant submaximal exercise test performance and road race performance were not affected." The net effect of rHuEPO on mean power in the submaximal test (a 45-min trial) was presented as 5.9 W (95%CI -0.9 to 12.7 W, p=0.086), so their conclusion about performance in this test was based presumably on what appeared to be a negligible increase in mean power and, of course, statistical non-significance, along with the claim that their study was "adequately powered".

Effects on endurance performance are best expressed in percent units, and for elite cyclists the smallest important change in mean power in time-trial races (defined as winning an extra medal in every 10 races on average) is 1.0% (Malcata & Hopkins, 2014). I calculated the net effect on power in the submaximal time trial as 1.1%. When I inserted these values into the frequentist and Bayesian versions of the spreadsheet for converting a p value to MBD (Hopkins, 2007), the 90%CI was 0.1 to 2.1%, and the non-clinical decision was a small, possibly (or ambiguously) positive effect (p+ = 0.54, p– = 0.001). The effect was also potentially implementable (possible benefit with really low risk of harm), but a clinical decision would be relevant only for someone considering implementation for an advantage in competitions, which is not an issue here.

The road-race performance was measured only once for each cyclist, after the period of administration of rHuEPO. The authors presented the difference in the mean race time of the two groups in percent units, but without time in a pre-intervention race for comparison, the uncertainty accommodates huge negative and positive effects (0.3%, 95%CI -8.3 to 9.6%).

The conclusion that the submaximal test and road-race performance were not affected by injections of rHuEPO is obviously not tenable. The researchers assumed that non-significance implies no real effect, which is a reasonable assumption with the right sample size. Unfortunately their approach to estimating sample size left the study underpowered for the submaximal test (as shown by non-significance for an observed substantial effect) and grossly underpowered for road-race performance (as shown by huge compatibility limits). Use of MBD leads to a realistic conclusion about the uncertainty in the magnitude of the effects.

Conclusion

If researchers heed the recent call to retire statistical significance (Amrhein et al., 2019), they will need some other hypothesis-based inferential method to make decisions about effects, especially in clinical or practical settings. I have shown that the reference-Bayesian probability thresholds in the magnitude-based decision method are p-value thresholds for rejecting hypotheses about substantial magnitudes, which are assuredly more relevant to real-world outcomes than the null hypothesis. Researchers can therefore make magnitude-based decisions about
effects in samples, confident that the decisions have a sound frequentist theoretical basis and acceptable error rates. I recommend continued use of the probabilistic terms possibly, likely, and so on to describe magnitudes of clear, decisive or conclusive effects (those with acceptable uncertainty), since these terms can be justified with either reference-Bayesian analyses or hypothesis tests, and they convey uncertainty in an accessible manner. The terms clearly, decisively or conclusively should be reserved for magnitudes that are very likely or most likely trivial or substantial: those with moderate or strong compatibility with the magnitude.

Acknowledgements: I thank Janet Aisbett for important corrections and suggestions for inclusion of additional material. Alan Batterham and Daniel Lakens helped me understand one-sided interval hypothesis tests.

References

Aisbett J, Lakens D, Sainani KL. (2020). Magnitude based inference in relation to one-sided hypotheses testing procedures. SportRxiv, https://osf.io/preprints/sportrxiv/pn9s3/.
Albers CJ, Kiers HA, van Ravenzwaaij D. (2018). Credible confidence: a pragmatic view on the frequentist vs Bayesian debate. Collabra: Psychology 4, 31.
Amrhein V, Greenland S, McShane B. (2019). Retire statistical significance. Nature 567, 305-307.
Burton PR. (1994). Helping doctors to draw appropriate inferences from the analysis of medical studies. Statistics in Medicine 13, 1699-1713.
Castelloe J, Watts D. (2015). Equivalence and noninferiority testing using SAS/STAT® software. Paper SAS1911-2015, 1-23 (https://support.sas.com/resources/papers/proceedings15/SAS1911-2015.pdf).
Cumming G. (2014). The new statistics: why and how. Psychological Science 25, 7-29.
Gelman A, Greenland S. (2019). Are confidence intervals better termed "uncertainty intervals"? BMJ 366, l5381.
Glass DJ. (2010). A critique of the hypothesis, and a defense of the question, as a framework for experimentation. Clinical Chemistry 56, 1080-1085.
Greenland S. (2006). Bayesian perspectives for epidemiological research: I. Foundations and basic methods. International Journal of Epidemiology 35, 765-775.
Greenland S. (2019). Valid P-values behave exactly as they should: some misleading criticisms of P-values and their resolution with S-values. The American Statistician 73, 106-114.
Heuberger JA, Rotmans JI, Gal P, Stuurman FE, van't Westende J, Post TE, Daniels JM, Moerland M, van Veldhoven PL, et al. (2017). Effects of erythropoietin on cycling performance of well trained cyclists: a double-blind, randomised, placebo-controlled trial. The Lancet Haematology 4, e374-e386.
Hopkins WG. (2006). Estimating sample size for magnitude-based inferences. Sportscience 10, 63-70.
Hopkins WG. (2007). A spreadsheet for deriving a confidence interval, mechanistic inference and clinical inference from a p value. Sportscience 11, 16-20.
Hopkins WG. (2010). Linear models and effect magnitudes for research, clinical and practical applications. Sportscience 14, 49-58.
Hopkins WG. (2018). Design and analysis for studies of individual responses. Sportscience 22, 39-51.
Hopkins WG. (2019a). Magnitude-based decisions. Sportscience 23, i-iii.
Hopkins WG. (2019b). A spreadsheet for Bayesian posterior compatibility intervals and magnitude-based decisions. Sportscience 23, 5-7.
Hopkins WG. (2020). Sample-size estimation for various inferential methods. Sportscience 24, 17-27.
Hopkins WG, Batterham AM. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine 46, 1563-1573.
Hopkins WG, Batterham AM. (2018). The vindication of magnitude-based inference. Sportscience 22, 19-29.
Hopkins WG, Marshall SW, Batterham AM, Hanin J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise 41, 3-12.
Lakens D, Scheel AM, Isager PM. (2018). Equivalence testing for psychological research: a tutorial. Advances in Methods and Practices in Psychological Science 1, 259-269.
Lohse K, Sainani K, Taylor JA, Butson ML, Knight E, Vickers A. (2020). Systematic review of the use of "Magnitude-Based Inference" in sports science and medicine. SportRxiv, https://osf.io/preprints/sportrxiv/wugcr/.
Malcata RM, Hopkins WG. (2014). Variability of competitive performance of elite athletes: a systematic review. Sports Medicine 44, 1763-1774.
Mastrandrea MD, Field CB, Stocker TF, Edenhofer O, Ebi KL, Frame DJ, Held H, Kriegler E, Mach KJ, et al. (2010). Guidance note for lead authors of the IPCC fifth assessment report on consistent treatment of uncertainties. Intergovernmental Panel on Climate Change (IPCC), https://pure.mpg.de/rest/items/item_2147184/component/file_2147185/content.
Mengersen KL, Drovandi CC, Robert CP, Pyne DB, Gore CJ. (2016). Bayesian estimation of small effects in exercise and sports science. PloS One 11, e0147311, doi:10.1371/journal.pone.0147311.
Rothman KJ. (2012). Epidemiology: an Introduction (2nd ed.). New York: OUP.
Sainani KL. (2018). The problem with "magnitude-based inference". Medicine and Science in Sports and Exercise 50, 2166-2176.
Sainani KL. (2019). Response. Medicine and Science in Sports and Exercise 51, 600.

tion of the model or models providing the effect statistics. This description should include characterization of the dependent variable, any transformation that was used for the analysis, and the predictors in the model. For mixed models, the random effects and their structure should be described. There should be some attention to the issue of uniformity of effects and errors in the model. If a published spreadsheet was used for the analysis, cite the most recent article accompanying the spreadsheet.

Following the description of the statistical model, there is a section on the MBD method that will depend on whether the journal requires hypothesis testing or accepts instead, or as well, a Bayesian analysis. Journals should be more willing to accept the original Bayesian version of MBD, now that it is clear that MBD is isomorphic with an acceptable frequentist version. Suggested texts for this section are provided below, preceded by advice on presentation of results of MBD analyses.

A final paragraph on statistical analysis can deal with evaluation of observed magnitudes. Suggested text for this paragraph, which is the same for both kinds of journal, is also provided below.

Journals requiring hypothesis testing

The editor should be satisfied if the p value for the usual null-hypothesis test is replaced by the p values for the one-sided tests of substantial magnitudes. Report the p values to three decimal places if ≥0.995 (e.g., 0.997), to two decimal places if ≥0.10 (e.g., 0.63), and with one significant digit if <0.10. Describe a magnitude as clear when the p value for the test of that magnitude is >0.95; for example, "there was a moderate clear reduction in risk of injury (hazard ratio 0.63, 90%CI 0.46 to 0.86, pH=0.002, pB=0.97)." The pB of 0.97 in this example has a frequentist interpretation of moderately compatible with benefit, which could be included for editors who require strictly frequentist reporting; the Bayesian interpretation of very likely beneficial could be included or could replace pB for editors who are comfortable with non-informative or weakly informative priors. If you or the editor want an informative prior, the probabilities of the true magnitudes will differ from the p values. In tables, show a column headed by pH/pT/pB (or p–/pT/p+, or p↓/pT/p↑), with values such as 0.03/0.05/0.92. To save numeric clutter in a table and highlight evidence for magnitudes, clear effects (those with acceptable uncertainty) can be indicated by using the asterisk and superscript-0 system described below for Bayesian reporting, with the Bayesian terms replaced by the frequentist compatibility terms. Decode the asterisks and superscripts in the table's footnote.

Methods section for a strictly frequentist journal. Uncertainty in the estimates of effects
is presented as 90% compatibility intervals [or limits]. Decisions about magnitudes accounting for the uncertainty were based on hypothesis tests for substantial and trivial effects (reference: this article and/or Aisbett et al., 2020). For clinically or practically relevant effects (which could result in implementation of a treatment), hypotheses of harm and benefit were rejected if the respective p values (pH and pB) were less than 0.005 and 0.25 (strong and weak rejection, respectively). For all other effects, hypotheses of substantial decrease and increase were rejected if their respective p values (p– and p+, or for factor effects, p↓ and p↑) were less than 0.05 (moderate rejection). If only one hypothesis was rejected, the effect is described as being ambiguously, weakly, moderately or strongly compatible with the other magnitude when the p value for the test of that magnitude was >0.25, >0.75, >0.95 and >0.995, respectively.

The p values for the above tests were areas of the sampling distribution of the effect statistic to the left or right of the smallest important value (the trivial-small threshold) [and were provided by the Sportscience spreadsheet in percent units as chances of benefit and harm, or substantially positive and negative]. Effects with at least one substantial hypothesis rejected and evidence at least weakly compatible with trivial magnitudes (pT >0.75) are presented as such, where pT is the p value for the test of the trivial hypothesis (the area of the sampling distribution covering trivial values). When the p value for the test of a substantial or trivial magnitude was >0.95, the magnitude is described as clear. The sampling distribution was assumed to be a t distribution for effects derived from a continuous dependent variable, or a z distribution for effects derived from counts or events [modify this sentence for your study]. Where the model providing the effects involved a log or other transformation, the smallest important value was also transformed for evaluation of the p values.

P-value thresholds for rejecting the main preplanned hypothesis in this study [describe it] were not adjusted for the inflation of error that occurs with multiple inferences. For the remaining tests, thresholds were divided by n [state it; a default for a study with many effects could be 10], to give a Bonferroni-type correction equivalent to n independent tests. Effects with hypotheses thus rejected are highlighted in bold in tables and figures; to reduce inflation of error, interpretation of outcomes is focused on these effects.

Minimum desirable sample size was estimated for Type-2 (failed-discovery) error rates set by the p-value thresholds (0.5% and 25% for smallest important harmful and beneficial clinical effects respectively; 5% for substantial negative and positive non-clinical effects) using a spreadsheet (Hopkins, 2006). [Use smaller error rates where relevant for multiple inferences.] Error of measurement [for sample-size estimation of controlled trials and crossovers] was estimated from previously published similar studies [state references for the studies] using the panel of cells for that purpose in the spreadsheet.

Methods section incorporating hypothesis tests and Bayesian analysis. Uncertainty in the estimates of effects is presented as 90% compatibility intervals [or limits]. Decisions about magnitudes accounting for the uncertainty were based on hypothesis tests for substantial and trivial effects (reference: this article and/or Aisbett et al., 2020). For clinically or practically relevant effects (which could result in implementation of a treatment), hypotheses of harm and benefit were rejected if the respective p values (pH and pB) were less than 0.005 and 0.25 (strong and weak rejection, respectively). For all other effects, hypotheses of substantial decrease and increase were rejected if their respective p values (p– and p+, or for factor effects, p↓ and p↑) were less than 0.05 (moderate rejection). If only one hypothesis was rejected, the p value for the other hypothesis corresponds to the posterior probability of the magnitude of the true (large-sample) effect in a reference-Bayesian analysis with a minimally or weakly informative prior (Hopkins, 2019b), so it was interpreted with the following scale: >0.25, possibly; >0.75, likely; >0.95, very likely; >0.995, most likely (Hopkins et al., 2009). [Provide additional text here if an informative prior was used, or if the sample size was so small that a weakly informative prior modified the magnitude-based decision; state that the modified decision is presented (Hopkins, 2019b).] If neither hypothesis was rejected, the effect is described as unclear, with the exception of effects with an odds ratio of benefit to harm >66, which were considered clear and potentially implementable.

The p values for the above tests were areas of the sampling distribution of the effect statistic to the left or right of the smallest important value (the trivial-small threshold) [and were provided by the Sportscience spreadsheet in percent units as chances of benefit and harm, or substantially
positive and negative]. Effects with at least one substantial hypothesis rejected and at least likely trivial magnitudes (pT >0.75) were presented as such, where pT was the p value for the test of the trivial hypothesis (the area of the sampling distribution covering trivial values). When the p value for the test of a substantial or trivial magnitude was >0.95, the magnitude is described as clear. The sampling distribution was assumed to be a t distribution for effects derived from a continuous dependent variable, or a z distribution for effects derived from counts or events [modify this sentence for your study]. Where the model providing the effects involved a log or other transformation, the smallest important value was also transformed for evaluation of the p values.

Include here the two paragraphs on adjustment for multiple inferences and sample-size estimation described above for a strictly frequentist journal.

Journals accepting Bayesian analysis

stantial increase; possibly trivial-small positive (if trivial and substantial positive are both possible). Clear or decisive can be used to describe such effects, but refer to clearly substantial or clearly trivial only for very likely or most likely substantial or trivial. Examples: clearly trivial; a large clear increase. To save space in a table, substantial effects can be indicated with ↓ or ↑, and the probabilistic terms for substantial effects can be replaced (with an appropriate explanatory footnote) by *, **, ***, and **** for possibly, likely, very likely, and most likely substantial; for trivial effects, use 0, 00, 000, and 0000. Examples: large↓****; small↑**; trivial00; small↑ (for an observed small but unclear increase). A possibly trivial possibly substantial effect can be indicated accordingly, to emphasize the uncertainty. Examples: trivial0↓* (if the observed effect was trivial, possibly trivial and possibly negative); small↓*,0 (if the observed effect was small, possibly negative and possibly trivial).

Methods section. Uncertainty in the estimates of effects is presented as 90% compatibility intervals [or limits]. Decisions about magnitudes accounting for the uncertainty were based on a reference-Bayesian analysis with a minimally informative prior (Hopkins, 2019a; Hopkins & Batterham, 2016; Hopkins & Batterham, 2018), which provided estimates of chances that the true magnitude was harmful, trivial and beneficial (for clinically or practically relevant effects, which could result in implementation of a treatment), or chances that the true magnitude was a substantial decrease or negative value, a trivial value, and a substantial increase or positive value (for all other effects). [Provide additional text here if an informative prior was used, or if the sample size was so small that a weakly informative prior modified the magnitude-based decision; state that the modified decision is presented, with the chances from the posterior distribution (Hopkins, 2019b).] Clinically relevant effects were deemed clear (had adequate precision) if the risk of harm was <0.5% and the chance of benefit was >25%; all other effects were deemed clear if the chances of substantial negative and positive magnitudes were both <5%. The chances of the true magnitude were interpreted with the following scale: >0.25, possibly; >0.75, likely; >0.95, very likely; >0.995, most likely (Hopkins et al., 2009). When the chances of a substantial or trivial magnitude were >95%, the magnitude itself is described as clear. Effects with inadequate precision are described as unclear. [If you make ~10 or more decisions about magnitudes, and you do not opt for the exact Bonferroni correction for inflation of error described below, make the following statement here.] Effects with adequate precision defined by 99% compatibility intervals are highlighted in bold in tables and figures; the overall error rate for coverage of 10 independent true values with such intervals is that of a single effect with a 90%CI (10%), and interpretation of outcomes is focused on these effects.

The chances of substantial and trivial magnitudes of the true effect were the percent areas of the sampling distribution of the effect statistic to the left or right of the smallest important value (the trivial-small threshold). The sampling distribution was assumed to be a t distribution for effects derived from a continuous dependent variable, or a z distribution for effects derived from counts or events [modify this sentence for your

for effects with no known relationship to performance, wealth or health, standardized thresholds given by