P-Value
A P-Value is a probability measure of obtaining, if the null hypothesis were true, an estimate at least as far from the hypothesized value (in the direction of the alternative hypothesis) as the estimate computed from the observed data.
- Context:
- It can be defined as a Summary Statistic/Estimated Probability from a statistical significance test (on an observed sample) of getting results at least as extreme as the ones you observed, given a correct null hypothesis.
- It can be used to Reject a Null Hypothesis, i.e. when the p-value is less than a predefined significance level, the null hypothesis is rejected.
- It can indicate how likely a result at least this extreme would be if the Null Hypothesis were true.
- Example(s):
- Assume a coin-toss experiment on a coin that you suspect is weighted toward heads, with a fair-coin null hypothesis. If there are more head events than tail events after x coin tosses, then the p-value is the probability that one would get at least as many head events if the coin were indeed a fair coin.
- …
- Counter-Example(s):
- a Bayes Factor.
- See: t-Test, Statistical Score, Statistical Significance, Bayesian P-Value, Q Value/False Discovery Rate, Type I Error Rate, Frequentist Inference, Bayesian Inference.
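The coin-toss example above can be made concrete with a small calculation. The following is a minimal sketch, not a reference implementation: it assumes the number of heads follows a binomial distribution under the fair-coin null hypothesis, and the function name `binomial_p_value_upper` and the counts (8 heads in 10 tosses) are illustrative choices that do not come from the sources quoted on this page.

```python
from math import comb


def binomial_p_value_upper(heads: int, tosses: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= heads) for X ~ Binomial(tosses, p_null).

    Hypothetical helper for illustration; p_null = 0.5 encodes the
    fair-coin null hypothesis.
    """
    return sum(
        comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


# Illustrative data (an assumption, not from the source): 8 heads in 10 tosses.
p_value = binomial_p_value_upper(heads=8, tosses=10)
print(f"p-value = {p_value:.4f}")  # 56/1024, prints "p-value = 0.0547"
```

With these illustrative counts the p-value is slightly above 0.05, so a test at the 5% significance level would not reject the fair-coin hypothesis even though heads outnumber tails four to one.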
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/P-value Retrieved:2019-10-15.
- In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. The use of p-values in statistical hypothesis testing is common in many fields of research such as physics, economics, finance, political science, psychology, biology, criminal justice, criminology, and sociology. [1] The misuse of p-values is a controversial topic in metascience. Italicisation, capitalisation and hyphenation of the term varies. For example, AMA style uses "P value", APA style uses "p value", and the American Statistical Association uses "p-value". [2]
2016a
- (Stat Trek, 2016) ⇒ http://stattrek.com/statistics/dictionary.aspx?definition=P-value Retrieved: 2016-10-09
- QUOTE: A P-value measures the strength of evidence in support of a null hypothesis. Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
2016b
- (Statistical Analysis Glossary, 2016) ⇒ http://www.quality-control-plan.com/StatGuide/sg_glos.htm Retrieved: 2016-10-09
- QUOTE: In a statistical hypothesis test, the P value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to the pre-selected significance level of the test. If the P value is smaller than the significance level, the null hypothesis is rejected, and the test result is termed significant. The P value depends on both the null hypothesis and the alternative hypothesis. In particular, a test with a one-sided alternative hypothesis will generally have a lower P value (and thus be more likely to be significant) than a test with a two-sided alternative hypothesis. However, one-sided tests require more stringent assumptions than two-sided tests. They should only be used when those assumptions apply.
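The difference between one-sided and two-sided P values described in the quote above can be illustrated numerically. This is a minimal sketch under stated assumptions: a binomial test of a fair coin, with the two-sided P value taken as twice the smaller tail probability (one common convention among several). The function names and the data (8 heads in 10 tosses) are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def one_sided_p(heads: int, n: int) -> float:
    """Upper-tail P value: alternative hypothesis 'coin favours heads'."""
    return sum(binom_pmf(k, n) for k in range(heads, n + 1))


def two_sided_p(heads: int, n: int) -> float:
    """Two-sided P value as twice the smaller tail, capped at 1.

    Doubling the smaller tail is an assumed convention for this sketch.
    """
    upper = sum(binom_pmf(k, n) for k in range(heads, n + 1))
    lower = sum(binom_pmf(k, n) for k in range(0, heads + 1))
    return min(1.0, 2 * min(upper, lower))


# Illustrative data (an assumption): 8 heads in 10 tosses.
print(f"one-sided: {one_sided_p(8, 10):.4f}")  # 56/1024, prints 0.0547
print(f"two-sided: {two_sided_p(8, 10):.4f}")  # 112/1024, prints 0.1094
```

For these counts the two-sided P value is double the one-sided one. The one-sided value is only the smaller of the two when the observed effect lies in the direction of the alternative hypothesis.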
2015
- (Leek & Peng, 2015) ⇒ Jeffrey T. Leek, and Roger D. Peng. (2015). “Statistics: P values are just the tip of the iceberg.” In: Nature, 520(7549).
- QUOTE: There is no statistic more maligned than the P value. Hundreds of papers and blogposts have been written about what some statisticians deride as 'null hypothesis significance testing' (NHST; see, for example, http://go.nature.com/pfvgqe). NHST deems whether the results of a data analysis are important on the basis of whether a summary statistic (such as a P value) has crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory the banning of NHST methods (and all of statistical inference) in the journal Basic and Applied Social Psychology in February.
2010
- http://en.wikipedia.org/wiki/P-value
- … The lower the p-value, the less likely the result, assuming the Null Hypothesis, so the more "significant" the result, in the sense of Statistical Significance – one often uses p-values of 0.05 or 0.01, corresponding to a 5% chance or 1% of an outcome that extreme, given the null hypothesis. It should be noted, however, that the idea of more or less significance is here only being used for illustrative purposes. The result of a test of significance is either "statistically significant" or "not statistically significant"; there are no shades of gray.
More technically, a p-value of an experiment is a random variable defined over the Sample Space of the experiment such that its distribution under the null hypothesis is uniform on the interval [0,1]. Many p-values can be defined for the same experiment.
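The statement that a p-value is uniformly distributed on [0,1] under the null hypothesis can be checked by simulation. The sketch below assumes a two-sided z-test on normally distributed data with known standard deviation, where the test statistic is continuous; the function names, sample size, and number of simulated experiments are illustrative choices, not part of the quoted source.

```python
import random
from statistics import NormalDist, fmean


def two_sided_z_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a z-test for the mean with known sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / n**0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_null_p_values(n_experiments=5000, n=20, seed=0):
    """Draw every sample from the null distribution N(0, 1) and collect p-values."""
    rng = random.Random(seed)
    return [
        two_sided_z_p_value([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_experiments)
    ]


p_values = simulate_null_p_values()
for alpha in (0.01, 0.05, 0.5):
    # Under a uniform distribution, about a fraction alpha of p-values fall at or below alpha.
    fraction = sum(p <= alpha for p in p_values) / len(p_values)
    print(f"alpha = {alpha}: fraction of p-values <= alpha is {fraction:.3f}")
```

Each printed fraction should be close to its alpha, which is the sense in which a 0.05 threshold corresponds to a 5% chance of rejection when the null hypothesis is true.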
2009
- (Sun & Wu, 2009) ⇒ Yijun Sun, and Dapeng Wu. (2009). “Feature Extraction Through Local Learning.” In: Statistical Analysis and Data Mining, 2(1). doi:10.1002/sam.10028
- QUOTE: … In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset, whereas in filter methods criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., p-value of t-test), instead of optimizing the performance of any specific learning algorithm directly.
2001
- (Sterne & Smith, 2001) ⇒ Jonathan A C Sterne, and George Davey Smith. (2001). “Sifting the Evidence — What's wrong with significance tests?". In: BMJ, 322(7280). doi:10.1136/bmj.322.7280.226
- QUOTE: P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis
An arbitrary division of the results, into “significant” or “non-significant” according to the P value, was not the intention of the founders of statistical inference
A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P<0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds
Results of the medical research should not be reported as “significant” or “non-significant” but should be interpreted in the context of the type of the study and other available evidence. Bias or confounding should always be considered for findings with low P values
1999
- (Goodman, 1999) ⇒ Steven N. Goodman. (1999). “Toward Evidence-based Medical Statistics. 1: The P Value Fallacy.” In: Annals of Internal Medicine, 130(12).
- ABSTRACT: An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain "error rates," without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used -- the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.
1995
- (Bland & Altman, 1995) ⇒ J Martin Bland, and Douglas G Altman. (1995). “Multiple Significance Tests: the Bonferroni method.” In: BMJ 1995;310:170
- QUOTE: Many published papers include large numbers of significance tests. These may be difficult to interpret because if we go on testing long enough we will inevitably find something which is "significant." We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in 20 which we expect by chance alone. ...
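The "one in 20" remark above can be quantified. The sketch below assumes independent tests, all with a true null hypothesis, and uses the Bonferroni correction named in the paper's title (comparing each P value with the significance level divided by the number of tests). The function names and the example P values are hypothetical.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one 'significant' result among m independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha: float = 0.05):
    """Flag each test whose P value is below the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# With 20 independent tests at the 5% level, a chance finding is more likely than not.
print(f"{familywise_error_rate(20):.4f}")  # prints 0.6415

# Illustrative P values (an assumption): only the first survives the correction,
# because the threshold is 0.05 / 3, roughly 0.0167.
print(bonferroni_reject([0.001, 0.04, 0.03]))  # prints [True, False, False]
```

The correction is conservative: it keeps the chance of any false positive at or below the chosen level, at the cost of reduced power for each individual test.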
1925
- (Fisher, 1925) ⇒ Ronald A. Fisher. (1925). “Statistical Methods for Research Workers.” Oliver & Boyd.
- ↑ Babbie, E. (2007). The practice of social research 11th ed. Thomson Wadsworth: Belmont, California.
- ↑ http://magazine.amstat.org/wp-content/uploads/STATTKadmin/style%5B1%5D.pdf