SPORTSCIENCE · sportsci.org
Perspectives / Research Resources

Magnitude-Based Decisions as Hypothesis Tests
Will G Hopkins
Sportscience 24, 1-16, 2020 (sportsci.org/2020/MBDtests.htm)
Institute for Health and Sport, Victoria University, Melbourne, Australia. Email.

Magnitude-based decisions (MBD) is an inferential method that avoids the problems of null-hypothesis significance testing by interpreting the frequentist compatibility interval and sampling distribution in terms of uncertainty in the true value of an effect. This reference-Bayesian interpretation is defensible for the usual small (and any larger) sample sizes in sport and exercise research, if the statistical model is accurate, the measures are valid, the sample is representative, and the researcher proffers an uninformative or weakly informative prior belief in the magnitude of the effect. For statisticians who dispute the Bayesian interpretation and favor Popperian hypothesis testing, MBD can be formulated as interval hypothesis tests. In the clinical version of MBD, an effect is clear, has acceptable uncertainty, and is considered potentially publishable and implementable in a clinical or practical setting, when the true effect is most unlikely harmful and at least possibly beneficial. This requirement is equivalent to strong rejection of the hypothesis of a harmful effect (pH<0.005) and failure to reject the hypothesis of a beneficial effect (pB>0.25). In non-clinical MBD, an effect is clear and has acceptable uncertainty and publishability, when the true effect is very unlikely to be substantial of one or other sign. This requirement is equivalent to moderate rejection of one or other hypothesis of substantial magnitude (p+<0.05 or p–<0.05).
Clinical MBD and the Hypothesis of Benefit
Hypotheses for Non-Clinical MBD
Combining the Hypotheses
Sample-size Estimation
New Terminology
Type-1 Errors in MBD
Lower P-value Thresholds?
A Practical Application of MBD
Conclusion
References
Appendix: Reporting MBD in Journals

Associate editor's note. This article is now open for post-publication peer review. I invite you to write comments in the template and email to me, Ross Neville. You may also comment on the In-brief item on Moving Forward with Magnitude-Based Decisions. The original version with tracked changes resulting in the current version is available as a docx here. An updated version of a slideshow first presented at the German Sport University in July 2019 is also a succinct summary of this article and the In-brief item.

Introduction
When researchers study a sample, they obtain an estimate of the magnitude of an effect statistic, such as a change in a mean measure of health or performance following an intervention. With the usual but crucial assumptions about representativeness of the sample, validity of the measures, and accuracy of the statistical model, a sample of sufficiently large size yields an accurate estimate of the true or population magnitude, because repeated samples would yield practically identical estimates. Sample sizes are seldom this large, so an assertion about the true magnitude of an effect should account for the fact that the sample value is only an approximate estimate of the true value.

In this article I explain how to make an assertion about the true magnitude via the usual approach of the null-hypothesis significance test (NHST) and via magnitude-based inference (MBI), an approach that has been severely criticized recently (Sainani, 2018; Sainani et al., 2019; Welsh & Knight, 2015). Although the criticisms have been addressed (Hopkins, 2019a; Hopkins & Batterham, 2016; Hopkins & Batterham, 2018), it has been suggested that MBI might be more acceptable to at least some members of the statistical community, if it were presented as hypothesis tests, along with a name change to magnitude-based decisions (MBD; Hopkins, 2019a). This article is my response to that suggestion. I include an appendix with guidelines for presenting MBD as hypothesis tests in a manuscript.

For the benefit of practitioners with little formal training in statistics (myself included), I have written the article with little recourse to statistical jargon. An article more suitable for an audience of statisticians has been submitted for publication by others (Aisbett et al., 2020); for the background to that article, see my posting and other postings each side of it in the datamethods forum, updated in the In-brief item on MBD in this issue.

Inferential Methods
The null-hypothesis significance test is the traditional approach to making an assertion about the true magnitude of an effect. In NHST, the data are converted into a sampling probability distribution for the effect (or a transformation of it), representing expected variation in the effect with repeated sampling (Figure 1). This distribution is well-defined, usually a t or z. Effect values spanning the central region of the distribution represent values that are most compatible with the sample data and the model, and the interval spanning 95% of the values is known as the 95% compatibility interval (95%CI). If the 95%CI includes zero, then the data and model are compatible with a true effect of zero, so the null hypothesis H0 cannot be rejected, as shown in Figure 1a.

Figure 1b shows the same data as Figure 1a, but the region of the distribution to the left of the zero and the matching area on the other tail are shaded red. The total red area defines a probability (p) value, representing evidence against the hypothesis: the smaller the p value, the better the evidence against it. With a sufficiently small p value, you reject the hypothesis. The threshold p value is called the alpha level of the test, 0.05 for a 95%CI. The p value here is 0.04 + 0.04 = 0.08, which is >0.05, so the data and model do not support rejection of H0, and the effect is declared non-significant. Not shown in Figure 1 is the limiting case, when one limit of the interval touches zero, and p = 0.05. Figure 1c
shows another study of an effect where the data and model are not compatible with an effect of zero: the 95%CI does not include zero; equivalently the p value is <0.05, so H0 is rejected and the effect is declared statistically significant. With the usual sample sizes, the compatibility interval will often be wide enough to include substantial values of the effect statistic, implying that substantial values are compatible with the data and model.

Figure 1. (c) The distribution of values and p value (0.02) for a different effect; 0 is not included in the 95%CI, therefore H0 is rejected (p<0.05).
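The NHST arithmetic just described can be sketched with a normal (z) sampling distribution. The effect estimate and standard error below are hypothetical, chosen so that each tail beyond zero has an area of about 0.04, reproducing the p value of 0.08 in Figure 1b:

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)   # 1.96 for a 95% compatibility interval

def nhst(estimate, se, alpha=0.05):
    """Two-sided NHST of H0: true effect = 0, via a z sampling distribution."""
    lower, upper = estimate - Z95 * se, estimate + Z95 * se
    tail = NormalDist(estimate, se).cdf(0)   # area of one tail beyond zero
    p = 2 * min(tail, 1 - tail)              # two-sided p value
    return lower, upper, p, p < alpha

lower, upper, p, significant = nhst(estimate=1.75, se=1.0)
print(f"95%CI {lower:.2f} to {upper:.2f}, p = {p:.2f}, reject H0: {significant}")
# 95%CI -0.21 to 3.71 includes zero and p = 0.08, so H0 is not rejected
```

A t distribution would give a slightly larger p value for the same estimate and standard error with a small sample.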
The prior belief in MBD is effectively uniform over the range of non-negligible likelihood, and thus only weakly informative relative to the likelihood function (S. Greenland, personal communication). When sample sizes are small, researchers should check that a realistic weakly informative prior does indeed make no practical difference to the compatibility interval, as noted in the article (Hopkins, 2019b) on Bayesian analysis with a spreadsheet. If a weakly informative prior results in noticeable "shrinkage" of either of the compatibility limits of an effect, or if the magnitude-based decision is modified by the prior, the researcher should justify such a prior and report the modified decision.

Other researchers have proposed a Bayesian interpretation of the compatibility interval or the sampling distribution similar to those of MBI (Albers et al., 2018; Burton, 1994; Shakespeare et al., 2001). Some researchers interpret the compatibility interval as if it represents precision or uncertainty in the estimation of the value of an effect, but they stop short of identifying their approach as Bayesian (Cumming, 2014; Rothman, 2012). MBI differs from all these approaches by providing specific guidance on what constitutes adequate precision or acceptable uncertainty when making a decision about the magnitude of the effect. Partly for this reason, and partly because of concerns expressed about use of the word inference (Greenland, 2019), MBI is now known as a method for making magnitude-based decisions (Hopkins, 2019a).

MBD nevertheless finds itself in a precarious position, pilloried by Bayesians who insist on informative priors and by frequentists who believe that science advances only in Popperian fashion, by testing and rejecting hypotheses. Sander Greenland's suggestion to reframe MBD in terms of hypothesis testing was motivated by what he sees as the need for the well-defined control of error rates that underlie hypothesis testing. I have been opposed to hypothesis testing, for reasons espoused by many others, apparently as far back as Francis Bacon and Isaac Newton (e.g., Glass, 2010). My preference has been instead for estimation of the magnitude of effects and their uncertainty to make decisions about the true magnitude based on the Bayesian interpretation, but at the same time accounting for decision errors. I will now show that such decisions are equivalent to rejecting or failing to reject several hypotheses, and that the errors arising from making wrong decisions are the same as the errors with hypothesis testing.

Clinical MBD and the Hypothesis of Harm
In a clinical or practical setting, it is important that the outcome of a study does not result in implementation of an intervention that on average could harm the population of interest. The hypothesis that the true effect is harmful is therefore a more relevant hypothesis to reject than the hypothesis that the true effect is zero (the standard null). Figure 2, adapted from Lakens et al. (2018), shows distributions, compatibility intervals and one-sided p values associated with the test of the hypothesis H0 that the true effect is harmful, for three different outcomes that could occur with samples. Formally, the test of the harmful hypothesis is a non-inferiority test, in which rejection of the hypothesis of inferiority (harm) implies the effect is non-inferior (e.g., Castelloe & Watts, 2015).

Figure 2. Three examples of testing an hypothesis H0 that an effect (e.g., of a treatment) is harmful. All harmful values fall in the purple region to the left of the smallest harmful value. The 95% compatibility interval (95%CI) in (a) includes harmful values, so harmful values are compatible with the sample and model, and H0 is not rejected. H0 is only just rejected in (b), and in (c) it is easily rejected with a 95%CI and only just rejected with a 99%CI. P values for the test, pH, are evaluated for only one tail of the probability distribution.
Harmful values are anything to one side of the smallest harmful value, and since this value is negative in the figure, harmful values are anything to the left of this value. Positive harmful values (e.g., a threshold blood pressure for hypertension) would fall to the right of a smallest harmful positive value. The test of the hypothesis that the effect is any harmful value belongs to the class of one-sided interval hypothesis tests. The important point about such tests is that harmful values are compatible with the sample data and model, when the compatibility interval overlaps harmful values, in which case H0 is not rejected and the researcher fails to discover that the effect is harmful.

When a compatibility interval overlaps harmful values, the true effect could be harmful, to use the Bayesian interpretation of the interval. Furthermore, the p value for the test of harm and the probability that the true effect is harmful in MBD are defined by the sampling distribution in exactly the same way. In the clinical version of MBD, a clinically relevant effect is considered for implementation only when the probability that the true effect is harmful is <0.005 (0.5%).
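The one-sided test of harm can be sketched the same way: pH is simply the area of the sampling distribution lying in harmful values. The effect estimate, standard error and smallest harmful value below are hypothetical; the 0.005 threshold is the one used in clinical MBD:

```python
from statistics import NormalDist

def p_harm(estimate, se, smallest_harmful):
    """One-tailed p value for the interval hypothesis 'true effect is harmful',
    equal (under the flat-prior Bayesian reading) to the chance the true
    effect is harmful."""
    return NormalDist(estimate, se).cdf(smallest_harmful)

pH = p_harm(estimate=2.0, se=0.9, smallest_harmful=-1.0)
print(f"pH = {pH:.4f}, reject harm (pH < 0.005): {pH < 0.005}")
# pH is about 0.0004, so the hypothesis of harm is strongly rejected
```

For a positive smallest harmful value (e.g., a rise in blood pressure), the harmful tail would be the upper one instead.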
MBD is also effectively more conservative about avoiding harm than Ken Rothman's approach to precision of estimation. In his introductory text on clinical epidemiology (Rothman, 2012), he refers to 90% confidence intervals three times more often than 95% intervals, and he does not refer to 99% intervals at all.

Clinical MBD and the Hypothesis of Benefit
Making a decision about implementation of a treatment is more than just using a powerful test for rejecting the hypothesis of harm (and thereby ensuring a low risk of implementing a harmful effect, to use the Bayesian interpretation). There also needs to be consideration that the treatment could be beneficial. Once again, a one-sided test is involved, formally a non-superiority test, in which rejection of the hypothesis of superiority (benefit) implies the effect is non-superior (Castelloe & Watts, 2015). See Figure 4.

Figure 4. Three examples of testing an hypothesis H0 that an effect is beneficial. All beneficial values fall in the orange region to the right of the smallest beneficial value. The compatibility interval in (a) excludes beneficial values, so H0 is rejected. H0 fails to be rejected in (b) and (c). Additionally, in (c) the hypothesis of non-benefit is rejected (i.e., H0 is accepted). P values for the test of benefit, pB, are shown in (a) and (b), and for the test of non-benefit, 1–pB, in (c). The level of the compatibility interval is discussed below.

With the same reasoning as for the hypothesis of harm, a compatibility interval that falls short of the smallest beneficial value implies that no beneficial values are consistent with the data and model (or in the Bayesian interpretation, the chance that the true effect is beneficial is too low), so you would not implement it (Figure 4a). A compatibility interval that overlaps beneficial values allows for beneficial values to be compatible with the sample and model (or for the possibility that the true effect is beneficial, to give the Bayesian interpretation), so you cannot reject the hypothesis that the effect is beneficial, so it could be worth implementing (Figure 4b and 4c). Figures 4a and 4b show the p value for the test of benefit, pB. Figure 4c shows the p value for the test of the hypothesis that the effect is not beneficial, 1 – pB: in this example, the compatibility interval falls entirely in the beneficial region, so the hypothesis of non-benefit is rejected.

What level should the researcher choose for the compatibility interval and the associated threshold pB value in the test of the beneficial hypothesis? The level 50% was chosen via the Bayesian interpretation of a compatibility interval that just touches the beneficial region: the area of the tail overlapping into beneficial values is the probability that the true effect is beneficial, and a 50% compatibility interval has a one-sided 25% tail, which was considered a lower threshold for possibly beneficial. In other words, one should consider an effect to be possibly or potentially beneficial and implementable, provided there is a sufficiently low risk of harm.

A threshold or alpha pB value of 0.25 implies an error rate of 25% for failing to discover the smallest beneficial effect (which can be shown with a figure similar to Figure 3). While this error rate may seem high, it is comparable with the 20% Type-2 error rate that underlies the calculation of sample size in conventional null-hypothesis testing. The calculation provides a sample size that would give statistical significance for 80% of studies, or a failed-discovery rate of 20%, when the true effect is the minimum clinically important difference (the smallest beneficial value).

Of course, researchers are free to specify a lower threshold pB to reduce the rate of failing to discover benefit (here, failing to reject benefit), but a lower error rate comes with a cost. For example, if the threshold pB is 0.05, equivalent to
one tail of a 90% compatibility interval, a tail overlapping into the beneficial region with an area of only 6% is regarded as failure to reject the beneficial hypothesis, or from the Bayesian perspective, an observed quite trivial effect with a chance of benefit of only 6% is potentially implementable. Furthermore, a trivial true effect on the margin of smallest important would have a 95% rate of failure to reject the beneficial hypothesis, a very high false-discovery or Type-1 error rate.

Hypotheses for Non-Clinical MBD
I have presented the clinical version of MBD above as two one-sided hypothesis tests. The tests for harm and benefit have to differ in their respective threshold p values, because it is ethically more important to avoid implementing a harmful effect than to miss out on a beneficial one. The hypotheses of substantial positive and negative magnitude can also be recast as one-sided tests, but now the threshold p values are the same, because rejecting the hypothesis of a substantial negative effect is just as important as rejecting the hypothesis of a substantial positive effect. A p value of 0.05, corresponding to one tail of a 90% compatibility interval, was chosen originally for its Bayesian interpretation: rejection of the hypothesis of one of the substantial magnitudes corresponds to a chance of <0.05 that the true effect has that magnitude. When the true value of an effect is substantial of a given sign, the outcome consistent with this effect is failure to reject the hypothesis of that sign and rejection of the hypothesis of opposite sign.

Combining the Hypotheses
In the clinical version, rejection of the hypothesis of harm corresponds to a chance of harm of <0.005, so the risk of harm is at least very unlikely; could be beneficial means pB>0.25, so the chance of benefit is at least possibly; could be substantial means p+>0.05 and p–>0.05, or chances of both magnitudes are at least unlikely.

A simple consideration of sampling distributions similar to those in Figure 3 shows that the minimum desirable sample size for MBD would give a failed-discovery rate of 50% for a true effect that is 2× the smallest important, but the rate is only 5% for a true effect that is 3× the smallest important, which is borderline small-moderate in all my magnitude scales (Hopkins, 2010). The minimum desirable sample size in MBD should therefore satisfy those who promote sample-size estimation for MET. In any case, and regardless of sample size, the hypothesis test underlying MET is automatically available in MBD, as explained below.

A feature of MBD is the level of evidence it conveys for the hypothesis that could not be rejected. For example, if the hypothesis of benefit is not rejected (and harm is rejected), the effect is reported with the probabilistic terms possibly, likely, very likely or most likely preceding beneficial. Each of these Bayesian terms has an equivalent p-value threshold for testing an hypothesis that the effect is not beneficial: possibly, likely, very likely and most likely beneficial correspond to pNB<0.75, pNB<0.25, pNB<0.05 and pNB<0.005, respectively.
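The two clinical tests combine into a single decision rule: strong rejection of harm (pH<0.005) together with failure to reject benefit (pB>0.25). A minimal sketch with hypothetical numbers, assuming positive values are beneficial and the smallest harmful and smallest beneficial values are symmetric about zero:

```python
from statistics import NormalDist

def clinical_mbd(estimate, se, smallest_important):
    """Clinical MBD decision: harm must be strongly rejected (pH < 0.005)
    and benefit must fail to be rejected (pB > 0.25)."""
    dist = NormalDist(estimate, se)
    pH = dist.cdf(-smallest_important)     # chance the true effect is harmful
    pB = 1 - dist.cdf(smallest_important)  # chance the true effect is beneficial
    if pH < 0.005 and pB > 0.25:
        verdict = "potentially implementable (most unlikely harmful, at least possibly beneficial)"
    elif pH >= 0.005:
        verdict = "do not implement: hypothesis of harm not rejected"
    else:
        verdict = "do not implement: hypothesis of benefit rejected"
    return pH, pB, verdict

pH, pB, verdict = clinical_mbd(estimate=1.5, se=0.8, smallest_important=1.0)
print(f"pH = {pH:.4f}, pB = {pB:.2f} -> {verdict}")
# pH is about 0.0009 and pB about 0.73, so the effect is potentially implementable
```

A t sampling distribution, as used in practice for small samples, would give slightly more conservative chances.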
For a marginally substantial negative true effect, the error rate is 5% for deciding that the true effect could be substantially positive (failure to reject the substantial positive hypothesis, with p+>0.05).

Sample-size Estimation
Aisbett et al. (2020) have suggested sample-size estimation based on minimal-effects (superiority) testing (MET) or equivalence testing (ET). I have already shown above that the MBD sample size is consistent with that of MET for the reasonable expectation of a marginally small-moderate effect, so there is no need to replace the MBD sample size with a MET sample size. In ET, the researcher needs a sample size that will deliver a decisively trivial outcome (by rejection of the non-trivial hypothesis), if the true effect is trivial. As in MET, the researcher posits an expected value of the true effect, but now the value has to be somewhat less than the smallest important. Unfortunately, the resulting ET sample size turns out to be impractical: if the researcher posits (unrealistically) an expected true effect of exactly zero, a simple consideration of sampling distributions similar to those in Figure 3 shows that the sample size needs to be 4× that of MBD to deliver a decisively trivial outcome. A more realistic expected trivial effect midway between zero and the smallest important implies an even larger sample size. I therefore see no need for a new method of sample-size estimation for MBD, but I have updated my article (Hopkins, 2020) and spreadsheet for sample-size estimation to include MET and ET.

For clear effects that are possibly trivial and possibly substantial (including beneficial or harmful), I suggest presenting the effect as possibly substantial, regardless of which probability is greater, although stating that the effect is also possibly trivial would emphasize the uncertainty. Effects with adequate precision that are at least likely trivial can be presented as such in tables of results, without mention of the fact that one of the substantial magnitudes is unlikely while the other is at least very unlikely. Rejection of both substantial hypotheses implies a decisively trivial effect, which occurs when the compatibility interval is contained entirely within the trivial range of values. In non-clinical MBD with a 90% interval, this scenario represents an equivalence test, with the non-equivalence hypothesis rejected at the 0.05 level. Thus MBD also includes equivalence testing. A minor point here is that a decisively trivial effect can sometimes be likely trivial; for example, a 90%CI falling entirely in the trivial region, with p– = 0.03 and p+ = 0.04, implies pT = 1 – (p– + p+) = 0.93, which is likely trivial. Very likely trivial effects are, of course, always decisively trivial.

New Terminology
For researchers who dispute or wish to avoid the Bayesian interpretation of evidence for or against magnitudes in MBD, frequentist terms have been suggested, corresponding to p-value thresholds for rejection of the one-sided hypothesis tests: most unlikely, very unlikely and unlikely correspond to rejection of an hypothesis with p<0.005, p<0.05 and p<0.25; possibly corresponds to failure to reject at the 0.25 threshold. However, use of the term clear to describe such effects may be responsible in part for misuse of MBI, whereby researchers omit the probabilistic term describing the magnitude and present it as if it is definitive (Lohse et al., 2020; Sainani et al., 2019). An effect that is clear and only possibly substantial is obviously not clearly substantial. Researchers must therefore be careful to distinguish between clear effects and clear magnitudes: they should refer to a clear effect as being clearly substantial or clearly trivial, when the effect is very likely or most likely substantial or trivial (moderately or strongly compatible with the magnitude).
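The non-clinical decision and the probabilistic terms can be sketched together. The thresholds below are the usual MBD probability scale (Hopkins et al., 2009), and the example p values reproduce the likely trivial example above (p– = 0.03, p+ = 0.04):

```python
# MBD probability scale: chance thresholds and their probabilistic terms
TERMS = [(0.005, "most unlikely"), (0.05, "very unlikely"), (0.25, "unlikely"),
         (0.75, "possibly"), (0.95, "likely"), (0.995, "very likely")]

def term(prob):
    """Probabilistic term for the chance that the true effect has a magnitude."""
    for threshold, name in TERMS:
        if prob < threshold:
            return name
    return "most likely"

def nonclinical_mbd(p_minus, p_plus):
    """Non-clinical MBD: clear if either substantial hypothesis is rejected
    at the 0.05 level; pT is the chance the true effect is trivial."""
    pT = 1 - (p_minus + p_plus)
    clear = p_minus < 0.05 or p_plus < 0.05
    return pT, clear

pT, clear = nonclinical_mbd(p_minus=0.03, p_plus=0.04)
print(f"pT = {pT:.2f}, {term(pT)} trivial, clear: {clear}")
# pT = 0.93, likely trivial; here both substantial hypotheses are rejected,
# so the effect is also decisively trivial
```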
Type-1 Errors in MBD
A practitioner implementing the effect should be in no doubt about the modest level of evidence for the effect being beneficial (Hopkins & Batterham, 2016). The Type-1 error rates are even higher in the less conservative odds-ratio version of clinical MBD, according to which an unclear effect is declared potentially implementable, if there is a sufficiently high chance of benefit compared with the risk of harm (an odds ratio of benefit to harm greater than a threshold value of 66, derived from the odds for marginally beneficial and marginally harmful). Again, the errors occur mainly as possibly or likely beneficial (Hopkins & Batterham, 2016), but the practitioner needs to take into consideration loss of control of the error in rejecting the harmful hypothesis and therefore the increased risk of implementing a harmful effect.

In non-clinical MBD, a Type-1 error occurs when the true effect is trivial and the 90% compatibility interval falls entirely outside trivial values (Hopkins & Batterham, 2016): a clearly substantial effect in the new terminology. The equivalent frequentist interpretation of this disposition of the compatibility interval is rejection of the hypothesis that the effect is not substantially positive (say), with p<0.05. Similarly, in non-clinical MBD a Type-1 error occurs if the true effect is trivial and the trivial hypothesis is rejected. Rejection of the trivial hypothesis occurs when the compatibility interval covers only substantial values, the corresponding Bayesian interpretation being that the true effect is very likely substantial. Therefore possibly substantial or likely substantial outcomes, be they publishable or unclear, represent failure to reject the trivial hypothesis and therefore do not incur a Type-1 error, a crucial point that the detractors of MBI have not acknowledged (Sainani, 2018, 2019; Sainani et al., 2019; Welsh & Knight, 2015).

Janet Aisbett (personal communication) suggested that "an error of sorts also occurs when you fail to reject a hypothesis that you should have." In other words, the outcome with a truly substantial effect should be rejection of the non-substantial hypothesis, and if you fail to reject that hypothesis, you have made an error. A similar error occurs with failure to reject the non-trivial hypothesis, when the true effect is trivial. Janet's suggestion is just another way of justifying sample size with MET or ET, which I have already dealt with.
Lower P-value Thresholds?
The existing probability thresholds determine sample size for non-clinical effects (Hopkins, 2006). If this threshold were revised downward, non-clinical sample size would be greater, which seems unreasonable, so the probability thresholds in clinical MBD would also need revising downwards. For example, if all the p-value thresholds were halved, their S values would move up by one coin toss. For non-clinical MBI very unlikely (0.05 or 5%) would become 0.025 or 2.5% (S = 5.3). For clinical MBI most unlikely (0.005 or 0.5%) would become 0.0025 or 0.25%, and possibly (0.25 or 25%) would become 0.125 or 12.5% (S = 8.6 and 3.0). Sample sizes for clinical and non-clinical MBD would still be practically the same, but they would rise from the existing approximately one-third to about one-half those of NHST for 5% significance and 80% power. A lower p-value threshold for non-clinical effects would reduce the Type-1 error rates for such effects, but a lower pB would increase the Type-1 rates for deciding that trivial true effects could be beneficial. Lower p values require bigger effects for a given sample size, so there could be a risk of substantial bias for publishable effects with lower threshold p values, when the sample size is small. Unclear effects would also be more common and therefore less publishable with some of the unavoidably small sample sizes in sport and exercise science. On balance, I recommend keeping the existing probability thresholds.

A Practical Application of MBD
A colleague who is skeptical about claims of performance enhancement with the substances banned by the International Olympic Committee recently asked me to evaluate an article reporting the results of a placebo-controlled trial of the effects of injections of recombinant human erythropoietin (rHuEPO) on performance of cyclists (Heuberger et al., 2017). The authors concluded that "although rHuEPO treatment improved a laboratory test of maximal exercise, the more clinically relevant submaximal exercise test performance and road race performance were not affected." The net effect of rHuEPO on mean power in the submaximal test (a 45-min trial) was presented as 5.9 W (95%CI -0.9 to 12.7 W, p=0.086), so their conclusion about performance in this test was based presumably on what appeared to be a negligible increase in mean power and, of course, statistical non-significance, along with the claim that their study was "adequately powered".

Effects on endurance performance are best expressed in percent units, and for elite cyclists the smallest important change in mean power in time-trial races (defined as winning an extra medal in every 10 races on average) is 1.0% (Malcata & Hopkins, 2014). I calculated the net effect on power in the submaximal time trial as 1.1%. When I inserted these values into the frequentist and Bayesian versions of the spreadsheet for converting a p value to MBD (Hopkins, 2007), the 90%CI was 0.1 to 2.1%, and the non-clinical decision was a small, possibly (or ambiguously) positive effect (p+ = 0.54, p– = 0.001). The effect was also potentially implementable (possible benefit with really low risk of harm), but a clinical decision would be relevant only for someone considering implementation for an advantage in competitions, which is not an issue here.

The road-race performance was measured only once for each cyclist, after the period of administration of rHuEPO. The authors presented the difference in the mean race time of the two groups in percent units, but without time in a pre-intervention race for comparison, the uncertainty accommodates huge negative and positive effects (0.3%, 95%CI -8.3 to 9.6%). The conclusion that the submaximal test and road race performance were not affected by injections of rHuEPO is obviously not tenable. The researchers assumed that non-significance implies no real effect, which is a reasonable assumption with the right sample size. Unfortunately their approach to estimating sample size left the study underpowered for the submaximal test (as shown by non-significance for an observed substantial effect) and grossly underpowered for road race performance (as shown by huge compatibility limits). Use of MBD leads to a realistic conclusion about the uncertainty in the magnitude of the effects.

Conclusion
If researchers heed the recent call to retire statistical significance (Amrhein et al., 2019), they will need some other hypothesis-based inferential method to make decisions about effects, especially in clinical or practical settings. I have shown that the reference-Bayesian probability thresholds in the magnitude-based decision method are p-value thresholds for rejecting hypotheses about substantial magnitudes, which are assuredly more relevant to real-world outcomes than the null hypothesis. Researchers can therefore make magnitude-based decisions about
Researchers can therefore make magnitude-based decisions about effects in samples, confident that the decisions have a sound frequentist theoretical basis and acceptable error rates. I recommend continued use of the probabilistic terms possibly, likely, and so on to describe magnitudes of clear, decisive or conclusive effects (those with acceptable uncertainty), since these terms can be justified with either reference-Bayesian analyses or hypothesis tests, and they convey uncertainty in an accessible manner. The terms clearly, decisively or conclusively should be reserved for magnitudes that are very likely or most likely trivial or substantial: those with moderate or strong compatibility with the magnitude.

Acknowledgements: I thank Janet Aisbett for important corrections and suggestions for inclusion of additional material. Alan Batterham and Daniel Lakens helped me understand one-sided interval hypothesis tests.

References
Aisbett J, Lakens D, Sainani KL. (2020). Magnitude based inference in relation to one-sided hypotheses testing procedures. SportRxiv, https://osf.io/preprints/sportrxiv/pn9s3/.
Albers CJ, Kiers HA, van Ravenzwaaij D. (2018). Credible confidence: a pragmatic view on the frequentist vs Bayesian debate. Collabra: Psychology 4, 31.
Amrhein V, Greenland S, McShane B. (2019). Retire statistical significance. Nature 567, 305-307.
Burton PR. (1994). Helping doctors to draw appropriate inferences from the analysis of medical studies. Statistics in Medicine 13, 1699-1713.
Castelloe J, Watts D. (2015). Equivalence and noninferiority testing using SAS/STAT® software. Paper SAS1911-2015, 1-23 (https://support.sas.com/resources/papers/proceedings15/SAS1911-2015.pdf).
Cumming G. (2014). The new statistics: why and how. Psychological Science 25, 7-29.
Gelman A, Greenland S. (2019). Are confidence intervals better termed "uncertainty intervals"? BMJ 366, l5381.
Glass DJ. (2010). A critique of the hypothesis, and a defense of the question, as a framework for experimentation. Clinical Chemistry 56, 1080-1085.
Greenland S. (2006). Bayesian perspectives for epidemiological research: I. Foundations and basic methods. International Journal of Epidemiology 35, 765-775.
Greenland S. (2019). Valid P-values behave exactly as they should: some misleading criticisms of P-values and their resolution with S-values. The American Statistician 73, 106-114.
Heuberger JA, Rotmans JI, Gal P, Stuurman FE, van't Westende J, Post TE, Daniels JM, Moerland M, van Veldhoven PL, et al. (2017). Effects of erythropoietin on cycling performance of well trained cyclists: a double-blind, randomised, placebo-controlled trial. The Lancet Haematology 4, e374-e386.
Hopkins WG. (2006). Estimating sample size for magnitude-based inferences. Sportscience 10, 63-70.
Hopkins WG. (2007). A spreadsheet for deriving a confidence interval, mechanistic inference and clinical inference from a p value. Sportscience 11, 16-20.
Hopkins WG. (2010). Linear models and effect magnitudes for research, clinical and practical applications. Sportscience 14, 49-58.
Hopkins WG. (2018). Design and analysis for studies of individual responses. Sportscience 22, 39-51.
Hopkins WG. (2019a). Magnitude-based decisions. Sportscience 23, i-iii.
Hopkins WG. (2019b). A spreadsheet for Bayesian posterior compatibility intervals and magnitude-based decisions. Sportscience 23, 5-7.
Hopkins WG. (2020). Sample-size estimation for various inferential methods. Sportscience 24, 17-27.
Hopkins WG, Batterham AM. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine 46, 1563-1573.
Hopkins WG, Batterham AM. (2018). The vindication of magnitude-based inference. Sportscience 22, 19-29.
Hopkins WG, Marshall SW, Batterham AM, Hanin J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise 41, 3-12.
Lakens D, Scheel AM, Isager PM. (2018). Equivalence testing for psychological research: a tutorial. Advances in Methods and Practices in Psychological Science 1, 259-269.
Lohse K, Sainani K, Taylor JA, Butson ML, Knight E, Vickers A. (2020). Systematic review of the use of "Magnitude-Based Inference" in sports science and medicine. SportRxiv, https://osf.io/preprints/sportrxiv/wugcr/.
Malcata RM, Hopkins WG. (2014). Variability of competitive performance of elite athletes: a systematic review. Sports Medicine 44, 1763-1774.
Mastrandrea MD, Field CB, Stocker TF, Edenhofer O, Ebi KL, Frame DJ, Held H, Kriegler E, Mach KJ, et al. (2010). Guidance note for lead authors of the IPCC fifth assessment report on consistent treatment of uncertainties. Intergovernmental Panel on Climate Change (IPCC), https://pure.mpg.de/rest/items/item_2147184/component/file_2147185/content.
Mengersen KL, Drovandi CC, Robert CP, Pyne DB, Gore CJ. (2016). Bayesian estimation of small effects in exercise and sports science. PLoS One 11, e0147311, doi:10.1371/journal.pone.0147311.
Rothman KJ. (2012). Epidemiology: an Introduction (2nd ed.). New York: OUP.
Sainani KL. (2018). The problem with "magnitude-based inference". Medicine and Science in Sports and Exercise 50, 2166-2176.
Sainani KL. (2019). Response. Medicine and Science in Sports and Exercise 51, 600.

Appendix: Reporting MBD in Journals
The Methods section should include a description of the model or models providing the effect statistics. This description should include characterization of the dependent variable, any transformation that was used for the analysis, and the predictors in the model. For mixed models, the random effects and their structure should be described. There should be some attention to the issue of uniformity of effects and errors in the model. If a published spreadsheet was used for the analysis, cite the most recent article accompanying the spreadsheet.

Following the description of the statistical model, there is a section on the MBD method that will depend on whether the journal requires hypothesis testing or accepts instead, or as well, a Bayesian analysis. Journals should be more willing to accept the original Bayesian version of MBD, now that it is clear that MBD is isomorphic with an acceptable frequentist version. Suggested texts for this section are provided below, preceded by advice on presentation of results of MBD analyses. A final paragraph on statistical analysis can deal with evaluation of observed magnitudes. Suggested text for this paragraph, which is the same for both kinds of journal, is also provided below.

Journals requiring hypothesis testing
The editor should be satisfied if the p value for the usual null-hypothesis test is replaced by the p values for the one-sided tests of substantial magnitudes. Report the p values to three decimal places if ≥0.995 (e.g., 0.997), to two decimal places if ≥0.10 (e.g., 0.63), and with one significant digit if <0.10 (e.g., 0.002); for example, "there was a moderate clear reduction in risk of injury (hazard ratio 0.63, 90%CI 0.46 to 0.86, pH=0.002, pB=0.97)." The pB of 0.97 in this example has a frequentist interpretation of moderately compatible with benefit, which could be included for editors who require strictly frequentist reporting; the Bayesian interpretation of very likely beneficial could be included or replace pB for editors who are comfortable with non-informative or weakly informative priors. If you or the editor want an informative prior, the probabilities of the true magnitudes will differ from the p values. In tables, show a column headed by pH/pT/pB (or p–/pT/p+, or p↓/pT/p↑), with values such as 0.03/0.05/0.92. To save numeric clutter in a table and highlight evidence for magnitudes, clear effects (those with acceptable uncertainty) can be indicated by using the asterisk and superscript 0 system described below for Bayesian reporting, with the Bayesian terms replaced by the frequentist compatibility terms. Decode the asterisks and superscripts in the table's footnote.
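The pH/pT/pB values and the rounding rules just described can be sketched in a few lines of code. This is a hypothetical illustration, not the Sportscience spreadsheet: the effect, standard error, and smallest important values are invented numbers, and a normal (z) sampling distribution is used for simplicity, whereas the suggested text specifies a t distribution for effects derived from a continuous dependent variable.

```python
# Hypothetical illustration of a pH/pT/pB table entry and the suggested
# p-value rounding rules; all numbers and names are invented, not the article's.
from statistics import NormalDist

effect = 1.5              # observed effect (hypothetical raw units)
se = 0.8                  # standard error of the effect
harm_value = -1.0         # smallest important harmful value
benefit_value = 1.0       # smallest important beneficial value

# p values for the one-sided interval tests: areas of the sampling
# distribution beyond the smallest important values.
sampling = NormalDist(mu=effect, sigma=se)
p_harm = sampling.cdf(harm_value)              # pH
p_benefit = 1 - sampling.cdf(benefit_value)    # pB
p_trivial = 1 - p_harm - p_benefit             # pT

def report(p):
    """Rounding per the suggested rules: three decimals if >=0.995,
    two decimals if >=0.10, one significant digit otherwise."""
    if p >= 0.995:
        return f"{p:.3f}"
    if p >= 0.10:
        return f"{p:.2f}"
    return f"{p:.1g}"

print("pH/pT/pB =", "/".join(report(p) for p in (p_harm, p_trivial, p_benefit)))
```

With these invented numbers, pH falls well below 0.005 (strong rejection of harm) and pB stays above 0.25 (benefit not rejected), so such an effect would be reported as clear and potentially implementable.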
Hopkins: MBD as Hypothesis Tests Page 14 is presented as 90% compatibility intervals [or Minimum desirable sample size was esti- limits]. Decisions about magnitudes accounting mated for Type-2 (failed-discovery) error rates for the uncertainty were based on hypothesis set by the p-value thresholds (0.5% and 25% for tests for substantial and trivial effects (reference: smallest important harmful and beneficial clini- this article and/or Aisbett et al., 2020). For clin- cal effects respectively; 5% for substantial nega- ically or practically relevant effects (which could tive and positive non-clinical effects) using a result in implementation of a treatment), hypoth- spreadsheet (Hopkins, 2006). [Use smaller error eses of harm and benefit were rejected if the re- rates where relevant for multiple inferences.] Er- spective p values (pH and pB) were less than ror of measurement [for sample-size estimation 0.005 and 0.25 (strong and weak rejection, re- of controlled trials and crossovers] was esti- spectively). For all other effects, hypotheses of mated from previously published similar studies substantial decrease and increase were rejected [state references for the studies] using the panel if their respective p values (p– and p+, or for fac- of cells for that purpose in the spreadsheet. tor effects, p↓ and p↑) were less than 0.05 (mod- Methods section incorporating hypothesis erate rejection). If only one hypothesis was re- tests and Bayesian analysis. Uncertainty in the jected, the effect is described as being ambigu- estimates of effects is presented as 90% compat- ously, weakly, moderately or strongly compati- ibility intervals [or limits]. Decisions about mag- ble with the other magnitude when the p value nitudes accounting for the uncertainty were for the test of that magnitude was >0.25, >0.75, based on hypothesis tests for substantial and triv- >0.95 and >0.995, respectively. 
ial effects (reference: this article and/or Aisbett The p values for the above tests were areas of et al., 2020). For clinically or practically relevant the sampling distribution of the effect statistic to effects (which could result in implementation of the left or right of the smallest important value a treatment), hypotheses of harm and benefit (the trivial-small threshold) [and were provided were rejected if the respective p values (pH and by the Sportscience spreadsheet in percent units pB) were less than 0.005 and 0.25 (strong and as chances of benefit and harm, or substantially weak rejection, respectively). For all other ef- positive and negative]. Effects with at least one fects, hypotheses of substantial decrease and in- substantial hypothesis rejected and at least crease were rejected if their respective p values weakly compatible evidence with trivial magni- (p– and p+, or for factor effects, p↓ and p↑) were tudes (pT >0.75) are presented as such, where pT less than 0.05 (moderate rejection). If only one is the p value for the test of the trivial hypothesis hypothesis was rejected, the p value for the other (the area of the sampling distribution covering hypothesis corresponds to the posterior probabil- trivial values). When the p value for the test of a ity of the magnitude of the true (large-sample) substantial or trivial magnitude was >0.95, the effect in a reference-Bayesian analysis with a magnitude is described as clear. The sampling minimally or weakly informative prior distribution was assumed to be a t distribution (Hopkins, 2019b), so it was interpreted with the for effects derived from a continuous dependent following scale: >0.25, possibly; >0.75, likely; variable, or a z distribution for effects derived >0.95, very likely; >0.995, most likely (Hopkins from counts or events [modify this sentence for et al., 2009). [Provide additional text here if an your study]. 
Where the model providing the ef- informative prior was used, or if the sample size fects involved a log or other transformation, the was so small that a weakly informative prior smallest important value was also transformed modified the magnitude-based decision; state for evaluation of the p values. that the modified decision is presented (Hopkins, P-value thresholds for rejecting the main pre- 2019b).] If neither hypothesis was rejected, the planned hypothesis in this study [describe it] effect is described as unclear, with the exception were not adjusted for the inflation of error that of effects with an odds ratio of benefit to harm occurs with multiple inferences. For the remain- >66, which were considered clear and poten- ing tests, thresholds were divided by n [state it; a tially implementable. default for study with many effects could be 10], The p values for the above tests were areas of to give a Bonferroni-type correction equivalent the sampling distribution of the effect statistic to to n independent tests. Effects with hypotheses the left or right of the smallest important value thus rejected are highlighted in bold in tables (the trivial-small threshold) [and were provided and figures; to reduce inflation of error, interpre- by the Sportscience spreadsheet in percent units tation of outcomes is focused on these effects. as chances of benefit and harm, or substantially Sportscience 24, 1-18, 2020
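The decision rules in the two suggested texts above can be summarized in a short sketch. The function names and returned labels below are my own simplified inventions, not the article's terminology, and the p values are assumed to have been computed already as areas of the sampling distribution beyond the smallest important values.

```python
# Hypothetical sketch of the MBD decision rules described in the suggested
# Methods texts; labels and function names are illustrative only.

def clinical_decision(p_harm, p_benefit):
    """Clinical MBD: strong rejection of harm (pH < 0.005) plus failure to
    reject benefit (pB >= 0.25) gives a clear, potentially implementable effect."""
    if p_harm >= 0.005:
        return "unclear"                       # harm not strongly rejected
    if p_benefit < 0.25:
        return "clear, not beneficial"         # benefit weakly rejected
    return "clear, potentially implementable"

def nonclinical_decision(p_lower, p_upper):
    """Non-clinical MBD: moderate rejection (p < 0.05) of at least one
    substantial hypothesis makes the effect clear."""
    if p_lower < 0.05 or p_upper < 0.05:
        return "clear"
    return "unclear"

def compatibility(p):
    """Frequentist descriptor for an unrejected magnitude (p > 0.25)."""
    for threshold, term in ((0.995, "strongly"), (0.95, "moderately"),
                            (0.75, "weakly"), (0.25, "ambiguously")):
        if p > threshold:
            return term + " compatible"
    return "not described (p <= 0.25)"

print(clinical_decision(0.002, 0.80))   # harm strongly rejected, benefit not
```

For example, an effect with pH=0.002 and pB=0.80 would come out as clear and potentially implementable under this sketch, and the unrejected benefit hypothesis (p=0.80) would be described as weakly compatible with benefit.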
Effects with at least one substantial hypothesis rejected and at least likely trivial magnitudes (pT >0.75) were presented as such, where pT was the p value for the test of the trivial hypothesis (the area of the sampling distribution covering trivial values). When the p value for the test of a substantial or trivial magnitude was >0.95, the magnitude is described as clear. The sampling distribution was assumed to be a t distribution for effects derived from a continuous dependent variable, or a z distribution for effects derived from counts or events [modify this sentence for your study]. Where the model providing the effects involved a log or other transformation, the smallest important value was also transformed for evaluation of the p values.

Include here the two paragraphs on adjustment for multiple inferences and sample-size estimation described above for a strictly frequentist journal.

Journals accepting Bayesian analysis
[…] substantial increase; possibly trivial-small positive (if trivial and substantial positive are both possible). Clear or decisive can be used to describe such effects, but refer to clearly substantial or clearly trivial only for very likely or most likely substantial or trivial. Examples: clearly trivial; a large clear increase. To save space in a table, substantial effects can be indicated with ↓ or ↑, and the probabilistic terms for substantial effects can be replaced (with an appropriate explanatory footnote) by *, **, ***, and **** for possibly, likely, very likely, and most likely substantial; for trivial effects, use 0, 00, 000, and 0000. Examples: large↓****; small↑**; trivial00; small↑ (for an observed small but unclear increase). A possibly trivial possibly substantial effect can be indicated accordingly, to emphasize the uncertainty. Examples: trivial0↓* (if the observed effect was trivial, possibly trivial and possibly negative); small↓*,0 (if the observed effect was small, possibly negative and possibly trivial).

Methods section. Uncertainty in the estimates of effects is presented as 90% compatibility intervals [or limits]. Decisions about magnitudes accounting for the uncertainty were based on a reference-Bayesian analysis with a minimally informative prior (Hopkins, 2019a; Hopkins & Batterham, 2016; Hopkins & Batterham, 2018), which provided estimates of chances that the true magnitude was harmful, trivial and beneficial (for clinically or practically relevant effects, which could result in implementation of a treatment), or chances that the true magnitude was a substantial decrease or negative value, a trivial value, and a substantial increase or positive value (for all other effects). [Provide additional text here if an informative prior was used, or if the sample size was so small that a weakly informative prior modified the magnitude-based decision; state that the modified decision is presented, with the chances from the posterior distribution (Hopkins, 2019b).] Clinically relevant effects were deemed clear (had adequate precision) if the risk of harm was <0.5% […]. The chances of the true magnitude were interpreted with the following scale: >0.25, possibly; >0.75, likely; >0.95, very likely; >0.995, most likely (Hopkins et al., 2009). When the chances of a substantial or trivial magnitude were >95%, the magnitude itself is described as clear. Effects with inadequate precision are described as unclear. [If you make ~10 or more decisions about magnitudes, and you do not opt for the exact Bonferroni correction for inflation of error described below, make the following statement here.] Effects with adequate precision defined by 99% compatibility intervals are highlighted in bold in tables and figures; the overall error rate for coverage of 10 independent true values with such intervals is that of a single effect with a 90%CI (10%), and interpretation of outcomes is focused on these effects.

The chances of substantial and trivial magnitudes of the true effect were the percent areas of the sampling distribution of the effect statistic to the left or right of the smallest important value (the trivial-small threshold). The sampling distribution was assumed to be a t distribution for effects derived from a continuous dependent variable, or a z distribution for effects derived from counts or events [modify this sentence for your study]. […] for effects with no known relationship to performance, wealth or health, standardized thresholds given by