Presenting the Probabilities of Different Effect Sizes: Towards a Better Understanding and Communication of Statistical Uncertainty

Akisato Suzuki
School of Politics and International Relations, University College Dublin, Belfield, Dublin 4, Ireland
akisato.suzuki@gmail.com
ORCID: 0000-0003-3691-0236

arXiv:2008.07478v2 [stat.AP] 2 Sep 2021
Working paper (August 12, 2021)

Abstract

How should social scientists understand and communicate the uncertainty of statistically estimated causal effects? It is well known that the conventional significance-vs.-insignificance approach is associated with misunderstandings and misuses. Behavioral research suggests people understand uncertainty more appropriately on a numerical, continuous scale than on a verbal, discrete scale. Motivated by this background, I propose presenting the probabilities of different effect sizes. Probability is an intuitive, continuous measure of uncertainty. It allows researchers to better understand and communicate the uncertainty of statistically estimated effects. In addition, unlike the conventional approaches, my approach needs no decision threshold for an uncertainty measure or an effect size, allowing researchers to be agnostic about a decision threshold such as p < 5% and a justification for it. I apply my approach to a previous social scientific study, showing that it enables richer inference than the significance-vs.-insignificance approach taken by the original study. The accompanying R package makes my approach easy to implement.

Keywords: Bayesian, inference, p-value, uncertainty
1 Introduction

How should social scientists understand and communicate the uncertainty of statistically estimated causal effects? They usually use a specific decision threshold of a p-value or confidence/credible interval and conclude whether a causal effect is significant or not, significance meaning that the effect is not (practically) null (Gross 2015; Kruschke 2018). While convenient for making categorical decisions, the significance-vs.-insignificance dichotomy leads us to overlook the full nuance of the statistical measure of uncertainty. This is because uncertainty is a degree of confidence and, therefore, a continuous scale. The dichotomy is also associated with problems such as p-hacking, publication bias, and seeing statistical insignificance as evidence for the null hypothesis (e.g., see Amrhein, Greenland, and McShane 2019; Esarey and Wu 2016; Gerber and Malhotra 2008; McShane and Gal 2016, 2017; Simonsohn, Nelson, and Simmons 2014). While these cited articles discuss these problems stemming from Frequentist Null Hypothesis Significance Testing, Bayesian inference is equally susceptible to the same issues if a specific decision threshold is used to interpret a posterior distribution.

Behavioral research suggests both researchers and decision makers understand uncertainty more appropriately if it is presented on a numerical, continuous scale rather than on a verbal, discrete scale (Budescu et al. 2014; Friedman et al. 2018; Jenkins, Harris, and R. M. Lark 2018; McShane and Gal 2016, 2017; Mislavsky and Gaertig 2019). If researchers are interested in estimating a causal effect, the natural quantity of uncertainty is the probability of an effect, i.e., the probability that a causal factor affects the outcome of interest. Probability is an intuitive quantity used in everyday life, for example, in a weather forecast for the chance of rain. The probability of an effect can be computed from a posterior distribution estimated by Bayesian statistics (or, if appropriate, from a pseudo-Bayesian confidence distribution; see Wood 2019).

The standard ways to summarize a posterior are the plot of a probability density function, the probability of a one-sided hypothesis, and a credible interval. These standard ways have drawbacks. A probability density plot is not the best format for readers to accurately compute a probability mass for a specific range of parameter values (Kay et al. 2016). The probability of a one-sided hypothesis needs a decision threshold for the effect size beyond which the probability is computed (typically, the probability of an effect being greater/smaller than zero). A credible interval also needs a decision threshold for the level of probability based on which the interval is computed (typically, the 95% level). From a decision-theoretic perspective, the use of a decision threshold demands a justification based on a context-specific utility function (Berger 1985; Kruschke 2018, 276–78; Lakens et al. 2018, 170; Mudge et al. 2012). Yet, social scientific research often deals with cases too heterogeneous (e.g., all democracies) to find a single utility function, and may want to be agnostic about it. Motivated by this background, I propose presenting the uncertainty of statistically estimated causal effects as the probabilities of different effect sizes.
More specifically, it is the plot of a complementary cumulative distribution, where the probability is presented for an effect being greater than different effect sizes (here, "greater" is meant in absolute terms: a greater positive value than zero or some positive value, or a greater negative value than zero or some negative value). In this way, it is unnecessary for researchers to use any decision threshold for the "significance," "confidence," or "credible" level of uncertainty or for the effect size beyond which the effect is considered practically relevant. This means researchers can be agnostic about a decision threshold and a justification for
that. In my approach, researchers play the role of an information provider and present different effect sizes and their associated probabilities as such. My approach applies regardless of the type of causal effect estimate (population average, sample average, individual, etc.).

The positive implications of my approach can be summarized as follows. First, my approach could help social scientists avoid the dichotomy of significance vs. insignificance and present statistical uncertainty regarding causal effects as such: a step also recommended by Gelman and Carlin (2017). This could enable social scientists to better understand and communicate the continuous nature of the uncertainty of statistically estimated causal effects. I demonstrate this point by applying my approach to a previous social scientific study.

Second, as a result of social scientists using my approach, decision makers could evaluate whether to use a treatment or not in light of their own utility functions. The conventional thresholds such as p < 5% could produce statistical insignificance for a treatment effect because of a small sample or effect size, even if the true effect were non-zero. Then, decision makers might believe it is evidence for no effect and therefore decide not to use the treatment. This would be a lost opportunity, however, if the probability of the beneficial treatment effect were actually high enough (if not as high as 95%) for these decision makers to use the treatment and accept the risk of failure.

Finally, my approach could help mitigate p-hacking and publication bias, as researchers could feel less need to report only a particular model out of many that produces an uncertainty measure below a certain threshold. It might be argued that my approach will also allow theoretically or methodologically questionable research to claim credibility based on a decent, if not high, probability (e.g., 70%) that some factor affects an outcome. Yet, the threshold of p < 5% has not prevented questionable research from being published, as the replication crisis suggests. A stricter threshold for a p-value (e.g., Benjamin et al. 2017) may reduce, but does not eliminate, the chance of questionable research satisfying the threshold. A stricter threshold also has a side effect: it would increase false negatives and, as a result, publication bias. In short, the important thing is to evaluate the credibility of research not only on an uncertainty measure such as a p-value or probability but also on the entire research design and theoretical arguments: "No single index should substitute for scientific reasoning" (Wasserstein and Lazar 2016, 132). While further consideration and research are necessary to understand what the best practice is for presenting and interpreting statistical uncertainty so as to maximize scientific integrity, I hope my approach contributes to this discussion. In this article, I assume statistically estimated causal effects are not based on questionable research and the model assumptions are plausible.

The rest of the article further explains the motivation for, and the detail of, my approach, and then applies it to a previous social scientific study. All statistical analysis was done in R (R Core Team 2021). The accompanying R package makes my approach easy to implement (see the section "Supplemental Materials").
2 Motivation

Behavioral research suggests the use of probability as a numerical, continuous scale of uncertainty has merits for both decision makers and researchers, compared to a verbal, discrete scale. In communicating the uncertainty of predictions, a numerical scale more
accurately conveys the degree of uncertainty researchers intend to communicate (Budescu et al. 2014; Jenkins, Harris, and R. M. Lark 2018; Mandel and Irwin 2021); improves decision makers' forecasting (Fernandes et al. 2018; Friedman et al. 2018); mitigates the public perception that science is unreliable in case the prediction of unlikely events fails (Jenkins, Harris, and R. Murray Lark 2019); and helps people aggregate different sources of uncertainty information in a mathematically consistent way (Mislavsky and Gaertig 2019). Even researchers familiar with quantitative methods often misinterpret statistical insignificance as evidence for the null effect, while presenting a numerical probability corrects such dichotomous thinking (McShane and Gal 2016). My approach builds on these insights into understanding and communicating uncertainty, adding to the toolkit available to researchers.

Social scientists usually turn the continuous measures of the uncertainty of statistically estimated causal effects, such as a p-value and a posterior, into the dichotomy of significance and insignificance using a decision threshold. Such a dichotomy results in the misunderstandings and misuses of uncertainty measures (e.g., see Amrhein, Greenland, and McShane 2019; Esarey and Wu 2016; Gerber and Malhotra 2008; McShane and Gal 2016, 2017; Simonsohn, Nelson, and Simmons 2014). p-hacking is the practice of searching for a model that produces a statistically significant effect to increase the chance of publication, and it results in a greater likelihood of false positives being published (Simonsohn, Nelson, and Simmons 2014). If only statistically significant effects are to be published, our knowledge based on published studies is biased – the publication bias (e.g., Esarey and Wu 2016; Gerber and Malhotra 2008; Simonsohn, Nelson, and Simmons 2014). Meanwhile, statistical insignificance is often mistaken as evidence for no effect (McShane and Gal 2016, 2017). For example, given a decision threshold of 95%, both 94% probability and 1% probability are categorized as statistically insignificant, although the former presents much smaller uncertainty than the latter. If a study failed to find statistical significance for a beneficial causal effect because of the lack of access to a large sample or because the true effect size is small, it could be a lost opportunity for decision makers. It might be argued that a small effect size is practically irrelevant, but this is not so by definition; it depends on context. For example, if a treatment had a small but beneficial effect and also were cheap, it could be useful for decision makers.

A decision threshold is not always a bad idea. For example, it may sometimes be necessary for experts to suggest a way for non-experts to make a yes/no decision (Kruschke 2018, 271). In such a case, a decision threshold may be justified by a utility function tailored to a particular case (Kruschke 2018, 276; Lakens et al. 2018, 170; Mudge et al. 2012). However, social scientific research often examines cases too heterogeneous to identify a single utility function that applies to every case, and may want to be agnostic about it. This may be one of the reasons why researchers usually resort to a conventional threshold such as whether a point estimate has a p-value of less than 5%, or whether a 95% confidence/credible interval does not include zero.
Yet, the conventional threshold (or any other threshold) is unlikely to be universally optimal for decision making, exactly because how small the uncertainty of a causal effect should be depends on a utility function, which varies across decision-making contexts (Berger 1985; Kruschke 2018, 278; Lakens et al. 2018, 170; Mudge et al. 2012). Even if one adopted the view that research should be free from context-specific utility and should use an "objective" criterion universally to detect a causal effect, p < 5% would not be such a criterion. This is because universally requiring a fixed p-value shifts
subjectivity from the choice of a decision threshold for a p-value to the choice of a sample size, meaning that statistical significance can be obtained by choosing a large enough sample size subjectively (Mudge et al. 2012). If "anything that plausibly could have an effect will not have an effect that is exactly zero" in social science (Gelman 2011, 961), a non-zero effect is by definition "detectable" as the sample size increases. While some propose, as an alternative to the conventional threshold, that we evaluate whether an interval estimate excludes not only zero but also practically null values (Gross 2015; Kruschke 2018), this requires researchers to justify why a certain range of values should be considered practically null, which still depends on a utility function (Kruschke 2018, 276–78).

My approach avoids this problem, because it does not require the use of any decision threshold either for an uncertainty measure or for an effect size. Using a posterior distribution, it simply presents the probabilities of different effect sizes as such. My approach allows researchers to play the role of an information provider for the uncertainty of statistically estimated causal effects, and lets decision makers use this information in light of their own utility functions.

The standard ways to summarize a posterior distribution are a probability density plot, the probability of a one-sided hypothesis, and a credible interval. My approach differs from these in the following respects. A probability density plot shows the full distribution of parameter values. Yet, it is difficult for readers to accurately compute a probability mass for a specific range of parameter values just by looking at a probability density plot (Kay et al. 2016). Perhaps for this reason, researchers sometimes report, together with a density plot, either (1) the probability mass for parameter values greater/smaller than a particular value (usually zero), i.e., the probability of a one-sided hypothesis, or (2) a credible interval. However, these two approaches also have drawbacks. In the case of the probability of a one-sided hypothesis, a specific value needs to be defined as the decision threshold for the effect size beyond which the probability is computed. Therefore, it brings the aforementioned problem, i.e., demanding a justification for that particular threshold based on a utility function. In the case of a credible interval, we must decide what level of probability is used to define the interval, and what counts as practically null values (Kruschke 2018). This also brings the problem of demanding a justification for particular thresholds, one for the level of probability and the other for practically null values, based on a utility function. In addition, when a credible interval includes conflicting values (e.g., both significant positive and significant negative values), it can only imply inconclusive information (Kruschke 2018), although the degree of uncertainty actually differs depending on how large a portion of the interval these conflicting values occupy (e.g., 5% of the interval vs. 50% of the interval).

Experimental studies by Allen et al. (2014), Edwards et al. (2012), and Fernandes et al.
(2018) suggest the plot of a complementary cumulative distribution, such as the one my approach uses (explained in the next section), is one of the best ways to present uncertainty (although their findings are about the uncertainty of predictions rather than that of causal effects). The plot of a complementary cumulative distribution presents the probability of a random variable taking a value greater (in absolute terms) than some specific value. Formally: P(X > x), where X is a random variable and x is a particular value of X. Allen et al. (2014) and Edwards et al. (2012) find the plot of a complementary cumulative distribution is effective both for probability estimation accuracy and for making a correct decision based on the probability estimation, even under time pressure (Edwards et al. 2012) or cognitive load (Allen et al. 2014). Note that Allen et al. (2014) and Edwards
et al. (2012) indicate a complementary cumulative distribution is unsuitable for accurately estimating the mean of a distribution. If the quantities of interest included the mean of a posterior distribution, one could report it alongside. Fernandes et al. (2018) find the plot of a complementary cumulative distribution is one of the best formats for optimal decision making over repeated decision-making contexts, while a probability density plot, a one-sided hypothesis, and a credible interval are not.

I am not arguing that my proposed approach should be used, or is ideal, for reporting the uncertainty of statistically estimated causal effects in every case. For example, if one were interested in the Frequentist coverage of a fixed parameter value, she would rather use a confidence interval over repeated sampling. Note that repeated sampling for the same study is usually uncommon in social science; uncertainty is usually specific to a single study. It is possible that there is no universally best way to present uncertainty (Visschers et al. 2009). It is also possible that "the presentation format hardly has an impact on people in situations where they have time, motivation, and cognitive capacity to process information systematically" (Visschers et al. 2009, 284; but also see Suzuki 2021). Yet, even those people would be unable to evaluate uncertainty properly if the presentation provided only limited information because of its focus on a specific decision threshold without giving any justification for that threshold. For example, if only a 95% credible interval were presented, it would be difficult to see what the range of effect sizes would be if a decision maker were interested in a different probability level (e.g., 85%). My approach conveys the uncertainty of statistically estimated causal effects without any decision threshold.

3 The Proposed Approach

My approach utilizes a posterior distribution estimated by Bayesian statistics (for Bayesian statistics, see for example Gelman et al. 2013; Gill 2015; Kruschke 2015; McElreath 2015). A posterior, denoted as p(θ | D, M), is the probability distribution of a parameter, θ, given data, D, and model assumptions, M, such as a functional form, an identification strategy, and a prior distribution of θ. Here, let us assume θ is a parameter for the effect of a causal factor. Using a posterior distribution, we can compute the probabilities of different effect sizes. More specifically, we can compute the probability that a causal factor has an effect greater (in absolute terms) than some effect size. Formally: P(θ > θ̃+ | D, M) for the positive values of θ, where θ̃+ is zero or some positive value; P(θ < θ̃− | D, M) for the negative values of θ, where θ̃− is zero or some negative value. If we compute this probability while changing θ̃ up to theoretical or practical limits (e.g., up to the theoretical bounds of a posterior or up to the minimum/maximum in posterior samples), the result is a complementary cumulative distribution. As is often the case, a posterior distribution might include both positive and negative values. In such a case, we should compute two complementary cumulative distribution functions, one for the positive values of θ, i.e., P(θ > θ̃+ | D, M), and the other for the negative values of θ, i.e., P(θ < θ̃− | D, M). Whether it is plausible to suspect the effect can be either positive or negative depends on a theory.
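Computationally, these quantities are simple Monte Carlo summaries of posterior samples. The following minimal R sketch (an illustration, not the interface of the accompanying ccdfpost package) estimates P(θ > θ̃+ | D, M) and P(θ < θ̃− | D, M) over grids of thresholds; the vector draws, standing in for posterior samples of θ, and the grid choices are assumptions made for illustration.

# Minimal sketch: estimate P(theta > x) and P(theta < x) from posterior draws.
# `draws` stands in for a numeric vector of posterior samples of the effect theta.
ccdf_positive <- function(draws, thresholds) {
  # share of posterior draws above each threshold x, i.e., P(theta > x)
  vapply(thresholds, function(x) mean(draws > x), numeric(1))
}
ccdf_negative <- function(draws, thresholds) {
  # share of posterior draws below each threshold x, i.e., P(theta < x)
  vapply(thresholds, function(x) mean(draws < x), numeric(1))
}
# Illustrative usage, sweeping thresholds up to the range of the draws:
# pos_grid <- seq(0, max(draws), length.out = 100)
# neg_grid <- seq(min(draws), 0, length.out = 100)
# p_pos <- ccdf_positive(draws, pos_grid)
# p_neg <- ccdf_negative(draws, neg_grid)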
If only one direction of the effect is theoretically plausible (e.g., see VanderWeele and Robins 2010), this should be reflected in the prior of θ, e.g., by putting a bound on the range of the parameter.
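For a concrete numerical illustration of these probabilities, the following R sketch uses 10,000 draws from a normal distribution with mean 1 and standard deviation 1 as stand-in posterior samples; this is the same toy example used for Figure 1 in the next passage, and the reported values are Monte Carlo approximations that vary slightly with the seed.

set.seed(2021)                               # illustrative seed
draws <- rnorm(10000, mean = 1, sd = 1)      # stand-in posterior samples of theta

mean(draws > 0)                              # P(theta > 0), approx. 0.84
mean(draws < 0)                              # P(theta < 0), approx. 0.16
mean(draws > 1) - mean(draws > 3)            # P(theta > 1) - P(theta > 3), approx. 0.47

# Complementary cumulative distribution for the positive direction
grid <- seq(0, max(draws), length.out = 200)
p_pos <- vapply(grid, function(x) mean(draws > x), numeric(1))
plot(grid, p_pos, type = "l",
     xlab = "Minimum predicted change in the outcome",
     ylab = "Probability of an effect greater than this change")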
Figure 1: Example of presenting a posterior as a complementary cumulative distribution plot.

Figure 1 is an example of my approach. I use a distribution of 10,000 draws from a normal distribution with a mean of 1 and a standard deviation of 1. Let us assume these draws are those from a posterior distribution, or "posterior samples," for a coefficient in a linear regression, so that the values represent a change in an outcome variable. The x-axis shows different effect sizes, or different values of the minimum predicted change in the outcome; the y-axis shows the probability of an effect being greater than a given minimum predicted change. The y-axis has the adjective "near" on 0% and 100%, because the normal distribution is unbounded and, therefore, the plot of the posterior samples cannot represent exact 0% and 100% probability. From the figure, it is possible to see the probability of an effect being greater (in absolute terms) than a certain effect size. For example, we can say: "The effect is expected to increase the outcome by more than zero points with a probability of approximately 84%." It is also clear that the positive effect is much more probable than the negative effect: 16% probability for θ < 0 versus 84% probability for θ > 0. Finally, with a little more attention, we can also read the probability of a range of effect sizes. For example, we can compute the probability that the effect increases the outcome by more than one point and up to three points, as P(θ > 1) − P(θ > 3) ≈ .49 − .02 = .47. This computation is useful if there is a "too much" effect, beyond which a treatment becomes counterproductive (e.g., the effect of a diet on weight loss).

For comparison, I also present the standard ways to summarize a posterior, using the same 10,000 draws as in Figure 1: a probability density plot (Figure 2), a credible interval, and one-sided hypotheses (both in Table 1). The probability density plot gives an overall impression that positive values are more probable than negative values. However, it is difficult to compute exactly the probability that an effect is greater than, say, 1. Unlike in my approach, the y-axis is density rather than probability, and density is not an intuitive quantity. The credible interval is computed based on the conventional decision threshold of the 95% credible level. It includes both negative and positive values and, therefore, the conventional approach leads either to the conclusion that the effect is not statistically significant, or to the conclusion that there is no conclusive evidence for the effect (Gross 2015; Kruschke 2018). The one-sided hypotheses are computed based on the decision threshold of the null effect, i.e., P(θ > 0) and P(θ < 0), as commonly done. Because of this threshold, the information about the uncertainty is much more limited than what my
approach presents in Figure 1.

Figure 2: Example of presenting a posterior as a probability density plot.

Table 1: Example of presenting a posterior as a credible interval or as one-sided hypotheses.

Mean   95% Credible Interval   P(θ > 0)   P(θ < 0)
0.98   [−1.01, 2.92]           0.84       0.16

Most importantly, the use of these decision thresholds requires researchers to justify why these thresholds should be used, rather than, say, the 94% credible interval or P(θ > 0.1) and P(θ < −0.1). This problem does not apply to my approach. My approach can be considered a generalized way of using one-sided hypotheses. It graphically summarizes all one-sided hypotheses (up to a practically computable point), using different effect sizes as different thresholds beyond which the probability of an effect is computed.

I note three caveats about my approach. First, it takes up more space than presenting regression tables, the common presentation style in social science. Therefore, if we wanted to present all regression coefficients using my approach, it would be too cumbersome. Yet, the purpose of quantitative causal research in social science is usually to estimate the effect size of a causal factor or two, and the remaining regressors are controls that enable the identification of the causal effect(s) of interest. Indeed, it is often hard to interpret all regression coefficients causally because of complex causal mechanisms (Keele, Stevenson, and Elwert 2020). When researchers use matching instead of regression, they typically report only the effect of a treatment variable (Keele, Stevenson, and Elwert 2020, 1–2), although the identification strategy is the same – selection on observables (Keele 2015, 321–22). Thus, if researchers are interested in estimating causal effects, they usually need to report only one or two causal factors. If so, it is not a problem even if my proposed approach takes more space than regression tables.

Second, if the effect size is scaled in a nonintuitive measure (such as a log odds ratio in logistic regression), researchers will need to convert it to an intuitive scale to make my approach work best (for details, see Sarma and Kay 2020). For example, in the case of logistic regression, researchers can express an effect size as a difference in the predicted likelihood of an outcome variable.

Third, as all models are wrong, Bayesian models are also wrong; they are simplifications of reality. A posterior distribution is conditional on the data used and the model assumptions. It cannot be a reliable estimate if the data used are inadequate for the purpose
of research (e.g., a sample being unrepresentative of the target population or collected with measurement errors), and/or if one or more of the model assumptions are implausible (which is usually the case). Moreover, in practice a posterior usually needs to be computed by a Markov chain Monte Carlo (MCMC) method, and there is no guarantee that the resulting posterior samples precisely mirror the true posterior. Therefore, the estimated probability of an effect should not be considered the "perfect" measure of uncertainty. For example, even if the estimated probability of an effect being greater than zero is 100% given the model and the computational method, it should NOT be interpreted as the certainty of the effect in practice. Given these limitations, the same principle applies to my proposed approach as to the p-value: "No single index should substitute for scientific reasoning" (Wasserstein and Lazar 2016, 132).

What matters is not the "trueness" of a model but the "usefulness" of a model. My approach makes a model more useful than the conventional approaches to evaluating and communicating the uncertainty of statistically estimated causal effects, in the following respects. First, it uses the probability of an effect as an intuitive quantity of uncertainty, for a better understanding and communication. Second, it does not require any decision thresholds for uncertainty measures or effect sizes. Therefore, it allows researchers to be agnostic about the utility function required to justify such decision thresholds, and to be an information provider presenting the probabilities of different effect sizes as such.

4 Application

I exemplify my proposed approach by applying it to a previous social scientific study (Huff and Kruszewska 2016) and using its dataset (Huff and Kruszewska 2015). Huff and Kruszewska (2016) collected a nationally representative sample of 2,000 Polish adults and examined experimentally whether more violent methods of protest by an opposition group increase or decrease public support for the government negotiating with the group. Specifically, I focus on the two analyses in Figure 4 of Huff and Kruszewska (2016), which present (1) the effect of an opposition group using bombing in comparison to occupation, and (2) the effect of an opposition group using occupation in comparison to demonstrations, on the attitude of the experiment participants towards tax policy in favor of the opposition group. The two treatment variables are measured dichotomously, while the attitude towards tax policy as the dependent variable is measured on a 100-point scale. Huff and Kruszewska (2016) use linear regression per treatment variable to estimate its average effect. The model is Y = β0 + β1 D + ε, where Y is the dependent variable, D is the treatment variable, β0 is the constant, β1 captures the size of the average causal effect, and ε is the error term. In the application, I convert the model to the following equivalent Bayesian linear regression model:

y_i ~ Normal(μ, σ),
μ = β0 + β1 d_i,
β0 ~ Normal(μ_β0 = 50, σ_β0 = 20),
β1 ~ Normal(μ_β1 = 0, σ_β1 = 5),
σ ~ Exponential(rate = 0.5),
where y_i and d_i are respectively the outcome Y and the treatment D for an individual i. For the quantity of interest β1, I use a weakly informative prior of β1 ~ Normal(μ_β1 = 0, σ_β1 = 5). This prior reflects the point that the original study presents no prior belief in favor of a negative or positive average effect of a more violent method by an opposition group on public support, as it hypothesizes both effects as plausible. I also use a weakly informative prior for the constant term, β0 ~ Normal(μ_β0 = 50, σ_β0 = 20), which means that, without the treatment, respondents are expected to have a neutral attitude on average, but the baseline attitude of each individual may well vary. σ ~ Exponential(rate = 0.5) is a weakly informative prior for the model standard deviation parameter, implying that any stochastic factor is unlikely to change the predicted outcome value by more than 10 points. I use four MCMC chains; per chain, 10,000 iterations are run, the first 1,000 of which are discarded. The MCMC algorithm used is Stan (Stan Development Team 2019), implemented via the rstanarm package version 2.21.1 (Goodrich et al. 2020) on R version 4.1.0 (R Core Team 2021); a minimal sketch of this estimation setup appears below. The R̂ was approximately 1.00 for every estimated parameter, suggesting the models did not fail to converge. The effective sample size exceeded 15,000 for every estimated parameter.

Table 2 presents the results in a conventional way: the mean of the posterior of β1 and the 95% credible interval for the model examining the effect of bombing in comparison to occupation, and those for the model examining the effect of occupation in comparison to demonstrations.

Table 2: Results using the 95% credible interval. R̂ ≈ 1.00 for all parameters.

                               Mean β1   95% Credible Interval   N
bombing vs. occupation         −2.49     [−5.48, 0.49]           996
occupation vs. demonstration   −0.53     [−3.31, 2.32]           985

While the mean is a negative value in both models, the 95% credible interval includes not only negative values but also zero and positive values. This means the typical interval approach (e.g., Gross 2015; Kruschke 2018) would lead us to simply conclude there is no conclusive evidence for the average effect.

Figure 3 uses my approach for the effect of bombing in comparison to occupation. It enables richer inference than the above conventional approach. If we focus on the probability of β1 < 0, for example, the model expects that if an opposition group uses bombing instead of occupation, it should reduce public support for tax policy in favor of the group by more than 0 points, with a probability of approximately 95%. Thus, the negative effect of bombing is much more likely (95% probability) than the positive effect (5% probability). The original conclusion was that the effect was not statistically significant at p < 5%, the threshold set at the pre-registration stage (Huff and Kruszewska 2016, 1794–1795). However, the authors add a post hoc caveat that if the threshold of statistical significance had been set at 10%, the effect would have been regarded as statistically significant (Huff and Kruszewska 2016, 1795). This interpretation is inconsistent with either the Fisherian paradigm of p-values or the Neyman-Pearson paradigm of p-values (Lew 2012).
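Before unpacking that point, here is the minimal R sketch of the estimation setup referenced above, using rstanarm. The data frame dat and the variable names support (the 0-100 attitude measure) and bombing (the dichotomous treatment indicator) are hypothetical placeholders, since the replication data's column names are not reproduced here; the sketch illustrates the stated priors, MCMC settings, and the computation of P(β1 < 0), not the original study's exact code.

library(rstanarm)

# Hypothetical data frame `dat` with outcome `support` (0-100 scale) and a
# dichotomous treatment indicator `bombing` (1 = bombing, 0 = occupation).
fit <- stan_glm(
  support ~ bombing, data = dat, family = gaussian(),
  prior = normal(location = 0, scale = 5, autoscale = FALSE),              # beta_1
  prior_intercept = normal(location = 50, scale = 20, autoscale = FALSE),  # beta_0
  prior_aux = exponential(rate = 0.5, autoscale = FALSE),                  # sigma
  chains = 4, iter = 10000, warmup = 1000, seed = 2021
)

beta1 <- as.matrix(fit)[, "bombing"]   # posterior draws of the treatment effect
mean(beta1 < 0)                        # P(beta_1 < 0 | D, M); approx. 0.95 in the analysis above

# Complementary cumulative distribution for the negative direction (cf. Figure 3)
grid <- seq(min(beta1), 0, length.out = 200)
p_neg <- vapply(grid, function(x) mean(beta1 < x), numeric(1))
plot(grid, p_neg, type = "l",
     xlab = "Effect size threshold (change in the outcome)",
     ylab = "Probability of an effect below this threshold")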
According to the Fisherian paradigm, the preset threshold of statistical significance is unnecessary, because a p-value in this paradigm is a local measure and not a global false positive rate – the rate of false positives over repeated sampling from the same data distribution (Lew 2012, 1562–63). An exact p-value should be interpreted as such – although it is difficult to make intuitive sense of because it is not the probability of an
effect but the probability of obtaining data as extreme as, or more extreme than, those that are observed, given the null hypothesis being true (Lew 2012, 1560; Wasserstein and Lazar 2016, 131). According to the Neyman-Pearson paradigm, no post hoc adjustment to the preset threshold of statistical significance should be made, because a p-value in this paradigm is used as a global false positive rate and not as a local measure (Lew 2012, 1562–63). The estimates from my approach are more straightforward to interpret. We need no a priori decision threshold of a posterior to determine significance, and can post hoc evaluate the probability of an effect (Kruschke 2015, 328).

Figure 3: Effect of bombing in comparison to occupation.

Figure 4: Effect of occupation in comparison to demonstration.

Figure 4 uses my approach for the effect of occupation in comparison to demonstrations. The model expects that if an opposition group uses occupation instead of demonstrations, it should reduce public support for tax policy in favor of the group by more than 0 points, with a probability of approximately 64%. This suggests weak evidence, rather than no evidence, for the negative effect of occupation, meaning that the negative effect is more likely (64% probability) than the positive effect (36% probability). Meanwhile, the original conclusion was simply that the effect was not statistically significant at p < 5% (Huff and Kruszewska 2016, 1777, 1795).

In short, my approach presents richer information about the uncertainty of the statistically estimated causal effects than the significance-vs.-insignificance approach taken by
the original study. It helps a better understanding and communication of the uncertainty of the average causal effects of violent protesting methods.

5 Conclusion

I have proposed an alternative approach for social scientists to present the uncertainty of statistically estimated causal effects: the probabilities of different effect sizes via the plot of a complementary cumulative distribution function. Unlike the conventional significance-vs.-insignificance approach and the standard ways to summarize a posterior distribution, my approach does not require any preset decision threshold for the "significance," "confidence," or "credible" level of uncertainty or for the effect size beyond which the effect is considered practically relevant. It therefore allows researchers to be agnostic about a decision threshold and a justification for it. In my approach, researchers play the role of an information provider and present different effect sizes and their associated probabilities as such. I have shown through the application to the previous study that my approach presents richer information about the uncertainty of statistically estimated causal effects than the conventional significance-vs.-insignificance approach, helping a better understanding and communication of that uncertainty.

My approach has implications for problems in current (social) scientific practice, such as p-hacking, publication bias, and seeing statistical insignificance as evidence for the null effect (Amrhein, Greenland, and McShane 2019; Esarey and Wu 2016; Gerber and Malhotra 2008; McShane and Gal 2016, 2017; Simonsohn, Nelson, and Simmons 2014). First, if the uncertainty of statistically estimated causal effects were reported by my approach, both researchers and non-experts would be able to understand it intuitively, as probability is used as a continuous measure of uncertainty in everyday life (e.g., a weather forecast for the chance of rain). This could help mitigate the dichotomous thinking of there being an effect or no effect, commonly associated with the significance-vs.-insignificance approach. Second, if research outlets such as journals accepted my approach as a way to present the uncertainty of statistically estimated causal effects, researchers could feel less need to report only a model that produces an uncertainty measure below a certain threshold, such as p < 5%. This could help address the problem of p-hacking and publication bias.

Presenting the uncertainty of statistically estimated causal effects as such has implications for decision makers as well. If decision makers had access to the probabilities of different effect sizes, they could use this information in light of their own utility functions and decide whether to use a treatment or not. It is possible that even when the conventional threshold of the 95% credible level produced statistical insignificance for a treatment effect, decision makers could have a utility function such that they prefer using the treatment over doing nothing, given a level of probability that does not reach the conventional threshold but that they see as "high enough." It is also possible that even when a causal effect reaches the conventional threshold of the 95% credible level, decision makers could have a utility function such that they do not see the probability as high enough (e.g., it must be 99% rather than 95%) and prefer doing nothing over using the treatment.
I acknowledge that my proposed approach may not suit everyone's needs. Yet, I hope this article provides a useful insight into how to understand and communicate the uncertainty of statistically estimated causal effects, and that the accompanying R package helps other researchers implement my approach without difficulty.
Acknowledgments

I would like to thank Johan A. Elkink, Jeff Gill, Zbigniew Truchlewski, Alexandru Moise, and participants in the 2019 PSA Political Methodology Group Annual Conference, the 2019 EPSA Annual Conference, and seminars at Dublin City University and University College Dublin, for their helpful comments. I would like to acknowledge the receipt of funding from the Irish Research Council (grant number GOIPD/2018/328) for the development of this work. The views expressed are my own unless otherwise stated, and do not necessarily represent those of the institutes/organizations to which I am/have been related.

Supplemental Materials

The R code to reproduce the results in this article and the R package to implement my approach ("ccdfpost") are available on my website at https://akisatosuzuki.github.io.

References

Allen, Pamela M., John A. Edwards, Frank J. Snyder, Kevin A. Makinson, and David M. Hamby. 2014. "The Effect of Cognitive Load on Decision Making with Graphically Displayed Uncertainty Information." Risk Analysis 34 (8): 1495–1505.

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. "Retire Statistical Significance." Nature 567: 305–307.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2017. "Redefine Statistical Significance." Nature Human Behaviour 2 (January): 6–10.

Berger, James O. 1985. Statistical Decision Theory and Bayesian Analysis. 2nd ed. New York, NY: Springer.

Budescu, David V., Han-Hui Por, Stephen B. Broomell, and Michael Smithson. 2014. "The Interpretation of IPCC Probabilistic Statements around the World." Nature Climate Change 4 (6): 508–512.

Edwards, John A., Frank J. Snyder, Pamela M. Allen, Kevin A. Makinson, and David M. Hamby. 2012. "Decision Making for Risk Management: A Comparison of Graphical Methods for Presenting Quantitative Uncertainty." Risk Analysis 32 (12): 2055–2070.

Esarey, Justin, and Ahra Wu. 2016. "Measuring the Effects of Publication Bias in Political Science." Research & Politics 3 (3): 1–9.

Fernandes, Michael, Logan Walls, Sean Munson, Jessica Hullman, and Matthew Kay. 2018. "Uncertainty Displays Using Quantile Dotplots or CDFs Improve Transit Decision-Making." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3173574.3173718.
Friedman, Jeffrey A., Joshua D. Baker, Barbara A. Mellers, Philip E. Tetlock, and Richard Zeckhauser. 2018. "The Value of Precision in Probability Assessment: Evidence from a Large-Scale Geopolitical Forecasting Tournament." International Studies Quarterly 62 (2): 410–422.

Gelman, Andrew. 2011. "Causality and Statistical Learning." American Journal of Sociology 117 (3): 955–966.

Gelman, Andrew, and John Carlin. 2017. "Some Natural Solutions to the P-Value Communication Problem—and Why They Won't Work." Journal of the American Statistical Association 112 (519): 899–901.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC Press.

Gerber, Alan, and Neil Malhotra. 2008. "Do Statistical Reporting Standards Affect What Is Published? Publication Bias in Two Leading Political Science Journals." Quarterly Journal of Political Science 3 (3): 313–326.

Gill, Jeff. 2015. Bayesian Methods: A Social and Behavioral Sciences Approach. 3rd ed. Boca Raton, FL: CRC Press.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2020. "rstanarm: Bayesian Applied Regression Modeling via Stan." R package version 2.21.1. https://mc-stan.org/rstanarm.

Gross, Justin H. 2015. "Testing What Matters (If You Must Test at All): A Context-Driven Approach to Substantive and Statistical Significance." American Journal of Political Science 59 (3): 775–788.

Huff, Connor, and Dominika Kruszewska. 2015. "Replication Data for: Banners, Barricades, and Bombs: The Tactical Choices of Social Movements and Public Opinion, V1." Harvard Dataverse. Accessed April 17, 2019. https://doi.org/10.7910/DVN/DMMQ4L.

———. 2016. "Banners, Barricades, and Bombs: The Tactical Choices of Social Movements and Public Opinion." Comparative Political Studies 49 (13): 1774–1808.

Jenkins, Sarah C., Adam J. L. Harris, and R. M. Lark. 2018. "Understanding 'Unlikely (20% Likelihood)' or '20% Likelihood (Unlikely)' Outcomes: The Robustness of the Extremity Effect." Journal of Behavioral Decision Making 31 (4): 572–586.

Jenkins, Sarah C., Adam J. L. Harris, and R. Murray Lark. 2019. "When Unlikely Outcomes Occur: The Role of Communication Format in Maintaining Communicator Credibility." Journal of Risk Research 22 (5): 537–554.

Kay, Matthew, Tara Kola, Jessica R. Hullman, and Sean A. Munson. 2016. "When (Ish) Is My Bus? User-Centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems." In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 5092–5103. https://doi.org/10.1145/2858036.2858558.

Keele, Luke. 2015. "The Statistics of Causal Inference: A View from Political Methodology." Political Analysis 23 (3): 313–335.
Keele, Luke, Randolph T. Stevenson, and Felix Elwert. 2020. "The Causal Interpretation of Estimated Associations in Regression Models." Political Science Research and Methods 8 (1): 1–13.

Kruschke, John K. 2015. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. London: Academic Press.

———. 2018. "Rejecting or Accepting Parameter Values in Bayesian Estimation." Advances in Methods and Practices in Psychological Science 1 (2): 270–280.

Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A. J. Apps, Shlomo E. Argamon, Thom Baguley, et al. 2018. "Justify Your Alpha." Nature Human Behaviour 2: 168–171.

Lew, Michael J. 2012. "Bad Statistical Practice in Pharmacology (and Other Basic Biomedical Disciplines): You Probably Don't Know P." British Journal of Pharmacology 166 (5): 1559–1567.

Mandel, David R., and Daniel Irwin. 2021. "Facilitating Sender-Receiver Agreement in Communicated Probabilities: Is It Best to Use Words, Numbers or Both?" Judgment and Decision Making 16 (2): 363–393.

McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: CRC Press.

McShane, Blakeley B., and David Gal. 2016. "Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence." Management Science 62 (6): 1707–1718.

———. 2017. "Statistical Significance and the Dichotomization of Evidence." Journal of the American Statistical Association 112 (519): 885–895.

Mislavsky, Robert, and Celia Gaertig. 2019. "Combining Probability Forecasts: 60% and 60% Is 60%, but Likely and Likely Is Very Likely." SSRN. https://ssrn.com/abstract=3454796.

Mudge, Joseph F., Leanne F. Baker, Christopher B. Edge, and Jeff E. Houlahan. 2012. "Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests." PLoS ONE 7 (2): 1–7.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Sarma, Abhraneel, and Matthew Kay. 2020. "Prior Setting in Practice: Strategies and Rationales Used in Choosing Prior Distributions for Bayesian Analysis." In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3313831.3376377.

Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. "P-Curve: A Key to the File-Drawer." Journal of Experimental Psychology: General 143 (2): 534–547.

Stan Development Team. 2019. Stan Reference Manual, Version 2.27. https://mc-stan.org/docs/2_27/reference-manual/index.html.
Suzuki, Akisato. 2021. "Which Type of Statistical Uncertainty Helps Evidence-Based Policymaking? An Insight from a Survey Experiment in Ireland." arXiv: 2108.05100v1 [stat.OT]. https://arxiv.org/abs/2108.05100.

VanderWeele, Tyler J., and James M. Robins. 2010. "Signed Directed Acyclic Graphs for Causal Inference." Journal of the Royal Statistical Society Series B (Statistical Methodology) 72 (1): 111–127.

Visschers, Vivianne H. M., Ree M. Meertens, Wim W. F. Passchier, and Nanne N. K. De Vries. 2009. "Probability Information in Risk Communication: A Review of the Research Literature." Risk Analysis 29 (2): 267–287.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. "The ASA's Statement on p-Values: Context, Process, and Purpose." The American Statistician 70 (2): 129–133.

Wood, Michael. 2019. "Simple Methods for Estimating Confidence Levels, or Tentative Probabilities, for Hypotheses Instead of p Values." Methodological Innovations 12 (1): 1–9.