Statistically Significant Result
A Statistically Significant Result is a Result/Outcome that passes a Significance Test (based on a statistical significance measure and statistical significance level).
- AKA: Statistically Significant Outcome.
- Context
- It can be simply defined as a statistical test result in which $p \leq \alpha$, where $p$ is a p-value and $\alpha$ is a significance level.
- Example(s):
- The differences between the two groups were Statistically Significant by Student's t-Test (P < 0.005).
- …
- Counter-Example(s):
- See: Statistical Significance, Resampling Algorithm.
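The $p \leq \alpha$ rule from the Context above can be sketched with a resampling (permutation) test. This is a minimal illustration, not a reference implementation: the function names, the two samples, and the default $\alpha = 0.05$ are all assumptions introduced here.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffling them simulates the null distribution.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


def is_statistically_significant(p_value, alpha=0.05):
    """Apply the decision rule p <= alpha."""
    return p_value <= alpha


# Hypothetical measurements, invented for illustration only.
treatment = [5.1, 5.8, 6.2, 6.0, 5.9, 6.4, 6.1, 5.7]
control = [4.2, 4.8, 4.5, 5.0, 4.1, 4.6, 4.9, 4.4]

p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}; significant at alpha = 0.05: {is_statistically_significant(p)}")
```

The significance level is passed in rather than hard-coded because, as the references below note, it is chosen per study before data collection.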
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Statistical_significance Retrieved:2020-2-1.
- In statistical hypothesis testing,[1] [2] a result has statistical significance when it is very unlikely to have occurred given the null hypothesis[3],[4]. More precisely, a study's defined significance level, denoted by $\alpha$, is the probability of the study rejecting the null hypothesis, given that the null hypothesis were assumed to be true;[5] and the p-value of a result, $p$, is the probability of obtaining a result at least as extreme, given that the null hypothesis were true. The result is statistically significant, by the standards of the study, when $p \le \alpha$ [6] [7] [8] [9] [10] [11] [12]. The significance level for a study is chosen before data collection, and is typically set to 5%[13] or much lower, depending on the field of study[14].
In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone[15] [16]. But if the p-value of an observed effect is less than (or equal to) the significance level, an investigator may conclude that the effect reflects the characteristics of the whole population, thereby rejecting the null hypothesis[17].
This technique for testing the statistical significance of results was developed in the early 20th century. The term significance does not imply importance here, and the term statistical significance is not the same as research, theoretical, or practical significance[18]. For example, the term clinical significance refers to the practical importance of a treatment effect[19].
- ↑ Sirkin, R. Mark (2005). "Two-sample t tests". Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 271–316. ISBN 978-1-412-90546-6.
- ↑ Borror, Connie M. (2009). "Statistical decision making". The Certified Quality Engineer Handbook (3rd ed.). Milwaukee, WI: ASQ Quality Press. pp. 418–472. ISBN 978-0-873-89745-7.
- ↑ Myers, Jerome L.; Well, Arnold D.; Lorch Jr., Robert F. (2010). "Developing fundamentals of hypothesis testing using the binomial distribution". Research Design and Statistical Analysis (3rd ed.). New York, NY: Routledge. pp. 65–90. ISBN 978-0-805-86431-1.
- ↑ "A Primer on Statistical Significance". Math Vault. 2017-04-30. Retrieved 2019-11-11.
- ↑ Dalgaard, Peter (2008). "Power and the computation of sample size". Introductory Statistics with R. Statistics and Computing. New York: Springer. pp. 155–56. doi:10.1007/978-0-387-79054-1_9. ISBN 978-0-387-79053-4.
- ↑ Johnson, Valen E. (October 9, 2013). "Revised standards for statistical evidence". Proceedings of the National Academy of Sciences. 110 (48): 19313–19317. doi:10.1073/pnas.1313476110. PMC 3845140. PMID 24218581. Retrieved 3 July 2014.
- ↑ Redmond, Carol; Colton, Theodore (2001). "Clinical significance versus statistical significance". Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics (3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd. pp. 35–36. ISBN 978-0-471-82211-0.
- ↑ Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, USA: Routledge. pp. 27–28.
- ↑ Krzywinski, Martin; Altman, Naomi (30 October 2013). "Points of significance: Significance, P values and t-tests". Nature Methods. 10 (11): 1041–1042. doi:10.1038/nmeth.2698. PMID 24344377.
- ↑ Sham, Pak C.; Purcell, Shaun M (17 April 2014). "Statistical power and significance testing in large-scale genetic studies". Nature Reviews Genetics. 15 (5): 335–346. doi:10.1038/nrg3706. PMID 24739678.
- ↑ Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York, USA: Chapman & Hall/CRC. pp. 167. ISBN 978-0412276309.
- ↑ Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 978-0-538-73352-6.
- ↑ Craparo, Robert M. (2007). "Significance level". In Salkind, Neil J. (ed.). Encyclopedia of Measurement and Statistics. 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 978-1-412-91611-0.
- ↑ Sproull, Natalie L. (2002). "Hypothesis testing". Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science (2nd ed.). Lanham, MD: Scarecrow Press, Inc. pp. 49–64. ISBN 978-0-810-84486-5.
- ↑ Babbie, Earl R. (2013). "The logic of sampling". The Practice of Social Research (13th ed.). Belmont, CA: Cengage Learning. pp. 185–226. ISBN 978-1-133-04979-1.
- ↑ Faherty, Vincent (2008). "Probability and statistical significance". Compassionate Statistics: Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS) (1st ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 127–138. ISBN 978-1-412-93982-9.
- ↑ McKillup, Steve (2006). "Probability helps you make a decision about your results". Statistics Explained: An Introductory Guide for Life Scientists (1st ed.). Cambridge, United Kingdom: Cambridge University Press. pp. 44–56. ISBN 978-0-521-54316-3.
- ↑ Myers, Jerome L.; Well, Arnold D.; Lorch Jr., Robert F. (2010). "The t distribution and its applications". Research Design and Statistical Analysis (3rd ed.). New York, NY: Routledge. pp. 124–153. ISBN 978-0-805-86431-1.
- ↑ Leung, W.-C. (2001-03-01). "Balancing statistical and clinical significance in evaluating treatment effects". Postgraduate Medical Journal. 77 (905): 201–204. doi:10.1136/pmj.77.905.201. ISSN 0032-5473. PMC 1741942. PMID 11222834.
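The reading of $\alpha$ in the Wikipedia passage, as the probability of rejecting the null hypothesis when it is in fact true, can be checked by simulation. The sketch below is one possible illustration under stated assumptions: a two-sided z-test with known standard deviation, standard-normal data, and hypothetical function names and trial counts.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a z-test for the mean with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha=0.05, n=20, n_trials=5000, seed=1):
    """Fraction of null-true experiments declared significant at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # The null hypothesis (mean 0) is true for every simulated sample.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) <= alpha:
            rejections += 1
    return rejections / n_trials


rate = false_positive_rate()
print(f"share of null-true experiments with p <= 0.05: {rate:.3f}")
```

Because every simulated sample satisfies the null hypothesis, the printed share should land near the chosen $\alpha$, up to simulation noise.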
2019a
- (Amrhein et al., 2019) ⇒ Valentin Amrhein, Sander Greenland, Blake McShane (2019). "Scientists Rise Up Against Statistical Significance". In: Nature 567, 305-307 (2019). DOI: 10.1038/d41586-019-00857-9
- QUOTE: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.
What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, $P = 0.021$ or $P = 0.13$) — without adornments such as stars or letters to denote statistical significance and not as binary inequalities ($P < 0.05$ or $P > 0.05$). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.
Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
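The reporting style described in the quote (an estimate, its interval limits, and a P value given with sensible precision rather than as a binary inequality) might look like the following sketch. The data, the function name, and the normal-approximation interval are assumptions made for illustration; with samples this small, a t-based interval would usually be preferred.

```python
from statistics import NormalDist, mean, variance


def difference_with_interval(group_a, group_b, level=0.95):
    """Return (difference in means, lower limit, upper limit, two-sided P).

    Uses an unpooled standard error and a normal approximation.
    """
    diff = mean(group_a) - mean(group_b)
    se = (variance(group_a) / len(group_a)
          + variance(group_b) / len(group_b)) ** 0.5
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(diff) / se))
    return diff, diff - z_crit * se, diff + z_crit * se, p_value


# Hypothetical measurements, invented for illustration only.
treated = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 11.8]
untreated = [11.6, 11.0, 12.2, 12.4, 11.1, 11.9, 12.6, 11.3]

diff, low, high, p_value = difference_with_interval(treated, untreated)
# Report the estimate and its uncertainty; no stars, no "P < 0.05".
print(f"difference = {diff:.2f} (95% interval {low:.2f} to {high:.2f}), P = {p_value:.3f}")
```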
2019b
- (McShane et al., 2019) ⇒ Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett (2019). "Abandon Statistical Significance". In: The American Statistician, 73(sup1). DOI:10.1080/00031305.2018.1527253.
- QUOTE: We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.
2010
- (Lee, 2010) ⇒ J. Jack Lee (2010). "Demystify Statistical Significance-Time to Move on From the P Value to Bayesian Analysis". In: Journal of the National Cancer Institute (JNCI), Volume 103, Issue 1. DOI:10.1093/jnci/djq493.
- QUOTE: What is statistical significance? In the mind of many medical researchers, statistical significance means that the P value less than or equal to .05. It also translates to a “positive” result; hence, an article can be published in a journal, a grant successfully reviewed, and a drug approved by the FDA. When P value is less than or equal to .05, it is assumed that there is sufficient evidence that the drug works, thus it should be approved. This overly simplistic view is, of course, erroneous. But unfortunately, it permeates the medical research community. The commentary by Ocana and Tannock[1] points out that statistical significance does not equate to a clinically meaningful difference. They propose that in addition to a statistically significant P value, to declare that a trial is “positive,” the observed difference in a survival outcome must equal or exceed a prespecified clinically important value. In reviewing 18 randomized trials, Ocana and Tannock found that four trials did not report the predefined hazard ratio and another six trials had an observed hazard ratio being less than the predefined hazard ratio (Ocana & Tannock, Table 1 ). Comparing the observed treatment effect with the prespecified effect is certainly a step in the right direction. However, making inference based on a single point estimate is inadequate because the observed treatment effect itself carries uncertainty. Furthermore, requiring the observed treatment effect to be greater than $\delta$ can be overly conservative because the trial will have 50% power even if the true treatment effect is equal to $\delta$. A frequentist confidence interval helps in gauging this uncertainty, but a complete solution can only be obtained by taking the Bayesian approach.
In the frequentist hypothesis-testing framework, the P value is defined as the probability of observing events as extreme or more extreme than the observed data, given that the null hypothesis ($H_0$) is true. If the P value is small enough (conventionally, $P \leq .05$), the data provide evidence against the null hypothesis, so we reject the null hypothesis. The P value is not the probability that the null hypothesis is true. It is an indirect measure to assess whether the null hypothesis is true or not. So, what is the probability that the null hypothesis is true and what is the probability that the alternative hypothesis ($H_1$) is true? The Bayesian approach addresses these questions directly and provides coherent answers. Bayesian methods treat an unknown parameter (eg, the real treatment effect) as random and the data as fixed and known, which they are. The Bayesian approach calculates the probability of the parameter(s) given the data, whereas the frequentist approach computes the probability of the data given the parameter(s). Since the parameters are unknown and the data have been observed, it makes more sense to formulate a problem using the Bayesian approach.
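The contrast drawn in the quote, between the probability of the data given the parameter and the probability of the parameter given the data, can be made concrete with a small Beta-Binomial sketch. Everything specific here is an assumption for illustration: the uniform Beta(1, 1) prior, the hypothetical trial of 14 responders out of 20, the 0.5 threshold, and the Monte Carlo approach.

```python
import random


def posterior_prob_exceeds(successes, failures, threshold=0.5,
                           prior_a=1.0, prior_b=1.0,
                           n_draws=20_000, seed=2):
    """Monte Carlo estimate of P(rate > threshold | data).

    With a Beta(prior_a, prior_b) prior and binomial data, the posterior
    for the rate is Beta(prior_a + successes, prior_b + failures).
    """
    rng = random.Random(seed)
    a = prior_a + successes
    b = prior_b + failures
    hits = sum(1 for _ in range(n_draws) if rng.betavariate(a, b) > threshold)
    return hits / n_draws


# Hypothetical trial: 14 responders and 6 non-responders.
prob = posterior_prob_exceeds(successes=14, failures=6)
print(f"P(response rate > 0.5 | data) is approximately {prob:.3f}")
```

Unlike a P value, the printed quantity is a direct probability statement about the parameter, though it depends on the prior that was assumed.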
1980
- (Bentler & Bonett, 1980) ⇒ P. M. Bentler, Douglas G. Bonett. (1980). "Significance Tests and Goodness of Fit in the Analysis of Covariance Structures." In: Psychological Bulletin, 88(3). doi:10.1037/0033-2909.88.3.588
- ABSTRACT: Factor analysis, path analysis, structural equation modeling, and related multivariate statistical methods are based on maximum likelihood or generalized least squares estimation developed for covariance structure models (CSMs). Large-sample theory provides a chi-square goodness-of-fit test for comparing a model (M) against a general alternative M based on correlated variables. It is suggested that this comparison is insufficient for M evaluation. A general null M based on modified independence among variables is proposed as an additional reference point for the statistical and scientific evaluation of CSMs. Use of the null M in the context of a procedure that sequentially evaluates the statistical necessity of various sets of parameters places statistical methods in covariance structure analysis into a more complete framework. The concepts of ideal Ms and pseudo chi-square tests are introduced, and their roles in hypothesis testing are developed. The importance of supplementing statistical evaluation with incremental fit indices associated with the comparison of hierarchical Ms is also emphasized. Normed and nonnormed fit indices are developed and illustrated.
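The normed and nonnormed fit indices mentioned in the abstract compare a model's chi-square statistic against that of the null (independence) model. The sketch below uses the forms in which these indices are commonly stated; the abstract itself gives no formulas, so the expressions, the function names, and the chi-square values are assumptions for illustration.

```python
def normed_fit_index(chi2_null, chi2_model):
    """Share of the null model's chi-square removed by the fitted model."""
    return (chi2_null - chi2_model) / chi2_null


def nonnormed_fit_index(chi2_null, df_null, chi2_model, df_model):
    """Same comparison on chi-square per degree of freedom.

    This index is not confined to the range [0, 1].
    """
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)


# Hypothetical chi-square statistics, invented for illustration only.
nfi = normed_fit_index(chi2_null=500.0, chi2_model=50.0)
nnfi = nonnormed_fit_index(chi2_null=500.0, df_null=45,
                           chi2_model=50.0, df_model=35)
print(f"NFI = {nfi:.3f}, NNFI = {nnfi:.3f}")
```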
- ↑ Ocana A, Tannock IF. When are “positive” clinical trials in oncology truly positive? J Natl Cancer Inst. 2011; vol. 103(1): pp. 16-20.