Statistical Significance Measure
A Statistical Significance Measure is a quantitative measure that assesses the probability that an observed data pattern occurred by chance (i.e., by a random process) alone, given a particular hypothesis.
- Context:
- Measure Output (optional): a Significance Level ([math]\displaystyle{ \alpha }[/math]) or a P-Value ([math]\displaystyle{ p }[/math]).
- It can be an input to a Statistical Significance Test to determine whether a result is statistically significant (i.e., whether $p \leq \alpha$); a minimal sketch of this decision rule appears after this list.
- It can (typically) be used in Statistical Inference.
- It can (typically) assume the Null Hypothesis is true.
- It can (typically) quantify the strength of evidence against a null hypothesis.
- It can (often) be applied to various types of data and research questions.
- It can (often) be influenced by Sample Size.
- It can (often) be influenced by Effect Size.
- It can (often) be influenced by Statistical Power.
- It can range from being a Classical Statistical Significance Test to being a Bayesian Evidence Evaluation Approach.
- It can range from being a Parametric Statistical Test to being a Non-Parametric Statistical Test.
- It can range from being a Univariate Statistical Test to being a Multivariate Statistical Test.
- It can be an input to a Statistical Hypothesis Testing Task (providing the quantitative basis for deciding whether to reject or fail to reject the null hypothesis).
- It can be associated with Type I Error and Type II Error considerations.
- It can be subject to ongoing debate and refinement in statistical practice.
- It can be used in Multiple Testing scenarios, requiring Multiple Comparison Correction.
- …
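The decision rule referenced above can be illustrated with a minimal sketch (the data and the choice of a two-sample t-test are assumptions for illustration, not drawn from any cited source):

```python
# A minimal sketch of the p <= alpha decision rule on made-up data,
# using SciPy's two-sample t-test.
from scipy import stats

alpha = 0.05                      # significance level, chosen before seeing the data
group_a = [2.1, 2.5, 2.8, 3.0, 2.6]
group_b = [3.2, 3.8, 3.5, 3.9, 3.4]

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # equal-variance t-test by default
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("statistically significant" if p_value <= alpha else "not statistically significant")
```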
- Example(s):
- a P-value that quantifies the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true (several of the measures below are computed in the sketch after this list).
- a T-statistic for two continuous variables, used in Student's t-test to compare the means of two groups.
- an F-statistic for three or more continuous variables, employed in Analysis of Variance (ANOVA) for comparing the means of the groups.
- a Chi-Square Measure for two categorical variables, which tests the independence of the variables.
- a Z-test for two population proportions (assumed normally distributed), which determines whether the proportions are significantly different.
- a Likelihood Ratio for comparing the fit of two statistical models.
- an ANOVA test for three or more continuous variables, which compares the means of the groups.
- a Fisher's Exact Test for two categorical variables, which determines the significance of associations in small sample sizes.
- a Mann-Whitney U test for two independent samples, which compares differences between the samples on a continuous or ordinal dependent variable.
- ...
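Several of the measures above can be computed with standard scipy.stats routines; the following sketch uses made-up data and is an illustration, not an implementation drawn from the sources:

```python
# A minimal sketch computing several of the listed measures on made-up data.
import numpy as np
from scipy import stats

a = np.array([4.1, 5.0, 4.6, 5.2, 4.8])
b = np.array([5.5, 6.1, 5.8, 6.4, 5.9])
c = np.array([6.8, 7.2, 7.0, 6.5, 7.4])
table = np.array([[12, 8], [5, 15]])      # 2x2 table for two categorical variables

t, p_t = stats.ttest_ind(a, b)                         # T-statistic: two group means
f, p_f = stats.f_oneway(a, b, c)                       # F-statistic: one-way ANOVA
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)   # Chi-square test of independence
odds, p_fisher = stats.fisher_exact(table)             # Fisher's exact test (small samples)
u, p_u = stats.mannwhitneyu(a, b)                      # Mann-Whitney U: two independent samples

for name, stat, p in [("t", t, p_t), ("F", f, p_f), ("chi2", chi2, p_chi2),
                      ("odds ratio", odds, p_fisher), ("U", u, p_u)]:
    print(f"{name} = {stat:.3f}, p = {p:.4f}")
```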
- Counter-Example(s):
- An Effect Size Measure, which quantifies the magnitude of an effect rather than its statistical significance (see the contrast sketched after this list).
- A Descriptive Statistic, which summarizes data without making inferences about a larger population.
- A Confidence Interval, which provides a range of plausible values for a parameter rather than a single significance measure.
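The contrast with an effect-size measure can be made concrete: with a large enough sample, a negligible difference yields a tiny p-value while Cohen's d stays near zero. A minimal sketch on simulated data (all numbers are assumptions for illustration):

```python
# A minimal sketch: significance (p-value) vs. effect size (Cohen's d) on
# simulated data. With n = 20,000 per group, a 0.05-SD true difference is
# typically "significant" at 0.05 even though the effect is negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.00, 1.0, 20_000)
b = rng.normal(0.05, 1.0, 20_000)

t, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd
print(f"p = {p:.2e}, Cohen's d = {cohens_d:.3f}")
```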
See: Sampling Error, Null Hypothesis Significance Testing, Resampling Algorithm, Statistical Inference, Statistical Hypothesis Testing, Bayesian Inference, False Discovery Rate, Meta-analysis, Replication Crisis.
References
2024
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Statistical_significance Retrieved:2024-7-23.
- In statistical hypothesis testing, a result has statistical significance when a result at least as "extreme" would be very infrequent if the null hypothesis were true.[1] More precisely, a study's defined significance level, denoted by [math]\displaystyle{ \alpha }[/math] , is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, [math]\displaystyle{ p }[/math] , is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when [math]\displaystyle{ p \le \alpha }[/math] .[2] [3] The significance level for a study is chosen before data collection, and is typically set to 5% or much lower—depending on the field of study.
In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. But if the p-value of an observed effect is less than (or equal to) the significance level, an investigator may conclude that the effect reflects the characteristics of the whole population, thereby rejecting the null hypothesis.
This technique for testing the statistical significance of results was developed in the early 20th century. The term significance does not imply importance here, and the term statistical significance is not the same as research significance, theoretical significance, or practical significance. [4] For example, the term clinical significance refers to the practical importance of a treatment effect.
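In symbols (a restatement of the quoted definitions, for a test statistic [math]\displaystyle{ T }[/math] with observed value [math]\displaystyle{ t_{\mathrm{obs}} }[/math], where large values count as extreme): [math]\displaystyle{ \alpha = \Pr(\text{reject } H_0 \mid H_0),\qquad p = \Pr(T \geq t_{\mathrm{obs}} \mid H_0),\qquad \text{significant} \iff p \leq \alpha. }[/math]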
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Statistical_significance Retrieved:2020-2-1.
- In statistical hypothesis testing,[5] [6] a result has statistical significance when it is very unlikely to have occurred given the null hypothesis[1],[7]. More precisely, a study's defined significance level, denoted by [math]\displaystyle{ \alpha }[/math] , is the probability of the study rejecting the null hypothesis, given that the null hypothesis were assumed to be true;[8] and the p-value of a result, [math]\displaystyle{ p }[/math] , is the probability of obtaining a result at least as extreme, given that the null hypothesis were true. The result is statistically significant, by the standards of the study, when [math]\displaystyle{ p \le \alpha }[/math]. The significance level for a study is chosen before data collection, and is typically set to 5%[9] or much lower—depending on the field of study[10]. ...
2019a
- (Amrhein et al., 2019) ⇒ Valentin Amrhein, Sander Greenland, Blake McShane (2019). "Scientists Rise Up Against Statistical Significance". In: Nature 567, 305-307 (2019). DOI: 10.1038/d41586-019-00857-9
- QUOTE: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.
What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, $P = 0.021$ or $P = 0.13$) — without adornments such as stars or letters to denote statistical significance and not as binary inequalities ($P < 0.05$ or $P > 0.05$). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.
Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
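The reporting style the authors recommend (an estimate, its interval read as a compatibility interval, and an unthresholded P value) might look like the following minimal sketch on made-up paired differences; this is an illustration, not code from the source:

```python
# A minimal sketch of interval-first reporting on made-up paired differences.
import numpy as np
from scipy import stats

diff = np.array([0.8, 1.4, -0.2, 1.1, 0.9, 1.6, 0.4, 1.2])
n = len(diff)
mean, se = diff.mean(), diff.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)  # compatibility interval
t, p = stats.ttest_1samp(diff, 0.0)
print(f"estimate = {mean:.2f}, 95% interval = ({lo:.2f}, {hi:.2f}), P = {p:.3f}")
```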
2019b
- (McShane et al., 2019) ⇒ Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett (2019). "Abandon Statistical Significance". In: The American Statistician, 73(sup1). DOI:10.1080/00031305.2018.1527253.
- QUOTE: We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.
2010
- (Lee, 2010) ⇒ J. Jack Lee (2010). "Demystify Statistical Significance-Time to Move on From the P Value to Bayesian Analysis" . In: Journal of the National Cancer Institute (JNCI), Volume 103, Issue 1. DOI:10.1093/jnci/djq493.
- QUOTE: What is statistical significance? In the mind of many medical researchers, statistical significance means that the P value less than or equal to .05. It also translates to a “positive” result; hence, an article can be published in a journal, a grant successfully reviewed, and a drug approved by the FDA. When P value is less than or equal to .05, it is assumed that there is sufficient evidence that the drug works, thus it should be approved. This overly simplistic view is, of course, erroneous. But unfortunately, it permeates the medical research community. The commentary by Ocana and Tannock[11] points out that statistical significance does not equate to a clinically meaningful difference. They propose that in addition to a statistically significant P value, to declare that a trial is “positive,” the observed difference in a survival outcome must equal or exceed a prespecified clinically important value. In reviewing 18 randomized trials, Ocana and Tannock found that four trials did not report the predefined hazard ratio and another six trials had an observed hazard ratio being less than the predefined hazard ratio (Ocana & Tannock, Table 1 ). Comparing the observed treatment effect with the prespecified effect is certainly a step in the right direction. However, making inference based on a single point estimate is inadequate because the observed treatment effect itself carries uncertainty. Furthermore, requiring the observed treatment effect to be greater than $\delta$ can be overly conservative because the trial will have 50% power even if the true treatment effect is equal to $\delta$. A frequentist confidence interval helps in gauging this uncertainty, but a complete solution can only be obtained by taking the Bayesian approach.
In the frequentist hypothesis-testing framework, the P value is defined as the probability of observing events as extreme or more extreme than the observed data, given that the null hypothesis ($H_0$ ) is true. If the P value is small enough (conventionally, $P \leq .05$), the data provide evidence against the null hypothesis, so we reject the null hypothesis. The P value is not the probability that the null hypothesis is true. It is an indirect measure to assess whether the null hypothesis is true or not. So, what is the probability that the null hypothesis is true and what is the probability that the alternative hypothesis ($H_1$ ) is true? The Bayesian approach addresses these questions directly and provides coherent answers. Bayesian methods treat an unknown parameter (eg, the real treatment effect) as random and the data as fixed and known, which they are. The Bayesian approach calculates the probability of the parameter(s) given the data, whereas the frequentist approach computes the probability of the data given the parameter(s). Since the parameters are unknown and the data have been observed, it makes more sense to formulate a problem using the Bayesian approach.
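The Bayesian calculation the author describes, computing the probability of the parameter given the data, can be sketched with a conjugate normal model (the observed effect, standard error, and prior below are made-up numbers; this is an illustration, not Lee's analysis):

```python
# A minimal sketch of a conjugate normal-normal Bayesian update: posterior
# Pr(true effect > 0 | data), as opposed to the frequentist Pr(data | H0).
from scipy import stats

effect_hat, se = 0.30, 0.15        # observed treatment effect and its standard error
prior_mean, prior_sd = 0.0, 1.0    # weakly informative normal prior

post_prec = 1 / prior_sd**2 + 1 / se**2          # precisions add under conjugacy
post_var = 1 / post_prec
post_mean = post_var * (prior_mean / prior_sd**2 + effect_hat / se**2)

prob_positive = 1 - stats.norm.cdf(0.0, loc=post_mean, scale=post_var**0.5)
print(f"Pr(true effect > 0 | data) = {prob_positive:.3f}")
```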
1980
- (Bentler & Bonett, 1980) ⇒ P. M. Bentler, Douglas G. Bonett (1980). “Significance Tests and Goodness of Fit in the Analysis of Covariance Structures.” In: Psychological Bulletin, 88(3). doi:10.1037/0033-2909.88.3.588
- ABSTRACT: Factor analysis, path analysis, structural equation modeling, and related multivariate statistical methods are based on maximum likelihood or generalized least squares estimation developed for covariance structure models (CSMs). Large-sample theory provides a chi-square goodness-of-fit test for comparing a model (M) against a general alternative M based on correlated variables. It is suggested that this comparison is insufficient for M evaluation. A general null M based on modified independence among variables is proposed as an additional reference point for the statistical and scientific evaluation of CSMs. Use of the null M in the context of a procedure that sequentially evaluates the statistical necessity of various sets of parameters places statistical methods in covariance structure analysis into a more complete framework. The concepts of ideal Ms and pseudo chi-square tests are introduced, and their roles in hypothesis testing are developed. The importance of supplementing statistical evaluation with incremental fit indices associated with the comparison of hierarchical Ms is also emphasized. Normed and nonnormed fit indices are developed and illustrated.
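The normed fit index the abstract introduces compares a fitted model's chi-square with that of the null (modified independence) model; a minimal sketch of the usual Bentler-Bonett form, [math]\displaystyle{ \mathrm{NFI} = (\chi^2_{\text{null}} - \chi^2_{\text{model}}) / \chi^2_{\text{null}} }[/math], with made-up chi-square values:

```python
# A minimal sketch of the Bentler-Bonett normed fit index (NFI); the
# chi-square values below are made up for illustration.
def normed_fit_index(chi2_null: float, chi2_model: float) -> float:
    """Values near 1 indicate the model improves greatly on the null model."""
    return (chi2_null - chi2_model) / chi2_null

print(normed_fit_index(chi2_null=480.0, chi2_model=52.0))  # -> about 0.89
```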
- ↑ Myers, Jerome L.; Well, Arnold D.; Lorch Jr., Robert F. (2010). “Developing fundamentals of hypothesis testing using the binomial distribution”. In: Research Design and Statistical Analysis (3rd ed.). New York, NY: Routledge. pp. 65–90. ISBN 978-0-805-86431-1.
- ↑ Johnson.
- ↑ Cumming, p. 27.
- ↑ Myers, Well & Lorch (2010), p. 124.
- ↑ Sirkin, R. Mark (2005). “Two-sample t tests". Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 271–316. ISBN 978-1-412-90546-6.
- ↑ Borror, Connie M. (2009). “Statistical decision making". The Certified Quality Engineer Handbook (3rd ed.). Milwaukee, WI: ASQ Quality Press. pp. 418–472. ISBN 978-0-873-89745-7.
- ↑ "A Primer on Statistical Significance". Math Vault. 2017-04-30. Retrieved 2019-11-11.
- ↑ Dalgaard, Peter (2008). “Power and the computation of sample size". Introductory Statistics with R. Statistics and Computing. New York: Springer. pp. 155–56. doi:10.1007/978-0-387-79054-1_9. ISBN 978-0-387-79053-4.
- ↑ Craparo, Robert M. (2007). “Significance level". In Salkind, Neil J. (ed.). Encyclopedia of Measurement and Statistics. 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 978-1-412-91611-0.
- ↑ Sproull, Natalie L. (2002). “Hypothesis testing". Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science (2nd ed.). Lanham, MD: Scarecrow Press, Inc. pp. 49–64. ISBN 978-0-810-84486-5.
- ↑ Ocana, A.; Tannock, I. F. (2011). “When are ‘positive’ clinical trials in oncology truly positive?”. In: J Natl Cancer Inst, 103(1), pp. 16–20.