Conditional Probability Function
A conditional probability function is a probability function, [math]\displaystyle{ P(X|Y) }[/math], that reports the probability that event [math]\displaystyle{ x \in X }[/math] will occur given that event [math]\displaystyle{ y \in Y }[/math] occurs.
- AKA: Conditional Distribution.
- Context:
- input: one or more Random Variable Events, [math]\displaystyle{ x \in X }[/math].
- input: A Posteriori Knowledge of Guaranteed Events, [math]\displaystyle{ Y_1,...,Y_n }[/math].
- range: a Conditional Probability Value.
- It can range from being a Univariate Conditional Probability Function to being a Multivariate Conditional Probability Function, [math]\displaystyle{ P(X|Y_1,...,Y_n) }[/math].
- It can range from being a Conditional Probability Mass Function to being a Conditional Probability Density Function.
- It can range from being a Conditional Probability Function Structure (e.g. an estimated conditional probability) to being an Abstract Conditional Probability Function.
- It can (often) be a member of a Conditional Probability Function Set (such as defined by a conditional probability distribution family).
- It can (typically) assume that events [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y_1,...,Y_n }[/math] are from the same non-empty Sample Space [math]\displaystyle{ \mathcal{S} }[/math].
- It assumes that the conditioning events [math]\displaystyle{ Y_1,...,Y_n }[/math] do not form an Impossible Event, i.e. [math]\displaystyle{ P(Y_1,...,Y_n) \gt 0 }[/math].
- It can be calculated as [math]\displaystyle{ P(X|Y_1,...,Y_n) = P(X \cap Y_1 \cap \ldots \cap Y_n) / P(Y_1 \cap \ldots \cap Y_n) }[/math], i.e. the probability that [math]\displaystyle{ X }[/math] occurs given that [math]\displaystyle{ Y_1,...,Y_n }[/math] have occurred is the probability that all of these events occur jointly, divided by the probability that the conditioning events [math]\displaystyle{ Y_1,...,Y_n }[/math] occur (a short enumeration sketch follows the See: list below).
- Example(s):
- [math]\displaystyle{ P(\text{2 heads}|\text{1 head}) }[/math] for a Two-Coin Experiment given knowledge of one of the coin flips.
- a Trained Conditional Probability Function, such as a trained CRF model.
- a Conditional Expectation Function.
- …
- Counter-Example(s):
- a Marginal Probability Function.
- a Class Conditional Probability Function.
- a Joint Probability Function, [math]\displaystyle{ \mathrm{P}(X_1,\ldots,X_n) }[/math].
- See: Conditional Statement, Conditional Likelihood, Bayes Rule, Statistical Independence.
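The defining ratio above can be checked by enumeration. The following is a minimal Python sketch (not part of the original definition), assuming a fair two-coin experiment as in the example above; the helper name conditional_probability and the event predicates are illustrative choices, not a standard API.

```python
from itertools import product

def conditional_probability(outcomes, event_x, event_y):
    """P(X|Y) = P(X and Y) / P(Y), computed by enumerating equally likely outcomes."""
    y_outcomes = [o for o in outcomes if event_y(o)]
    if not y_outcomes:                      # P(Y) = 0: conditional probability undefined
        raise ValueError("Conditioning event has probability zero")
    xy_outcomes = [o for o in y_outcomes if event_x(o)]
    return len(xy_outcomes) / len(y_outcomes)

# Sample space of a fair two-coin experiment.
sample_space = list(product(["H", "T"], repeat=2))

# P(2 heads | first flip is heads) = 1/2
print(conditional_probability(sample_space,
                              lambda o: o == ("H", "H"),
                              lambda o: o[0] == "H"))

# P(2 heads | at least one head) = 1/3 -- the choice of conditioning event matters.
print(conditional_probability(sample_space,
                              lambda o: o == ("H", "H"),
                              lambda o: "H" in o))
```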
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/conditional_probability Retrieved:2015-6-2.
- In probability theory, a conditional probability measures the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred.
For example, the probability that any given person has a cough on any given day may be only 5%. But if we know or assume that the person has a cold, then they are much more likely to be coughing. The conditional probability of coughing given that you have a cold might be a much higher 75%.
If the event of interest is A and the event B is known or assumed to have occurred, "the conditional probability of A given B", or "the probability of A under the condition B", is usually written as P(A|B), or sometimes P_B(A).
The concept of conditional probability is one of the most fundamental and one of the most important concepts in probability theory.[1]
But conditional probabilities can be quite slippery and require careful interpretation.[2] For example, there need not be a causal or temporal relationship between A and B.
In general P(A|B) is not equal to P(B|A). For example, if you have cancer you might have a 90% chance of testing positive for cancer, but if you test positive for cancer you might have only a 10% chance of actually having cancer, because cancer is very rare. Falsely equating the two probabilities causes various errors of reasoning such as the base rate fallacy. Conditional probabilities can be correctly reversed using Bayes' Theorem.
P(A|B) (the conditional probability of A given B) may or may not be equal to P(A) (the unconditional probability of A). If P(A|B) = P(A), A and B are said to be independent.
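The quoted cancer example can be made concrete with Bayes' Theorem. In the sketch below, only the 90% sensitivity figure comes from the quote; the 1% prevalence and 8% false-positive rate are illustrative assumptions chosen so that the reversed probability lands near the quoted 10%.

```python
# Reversing a conditional probability with Bayes' Theorem:
# P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)

p_pos_given_cancer = 0.90   # from the quoted example
p_cancer = 0.01             # assumed prevalence (illustrative)
p_pos_given_healthy = 0.08  # assumed false-positive rate (illustrative)

p_positive = (p_pos_given_cancer * p_cancer
              + p_pos_given_healthy * (1 - p_cancer))
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_positive

print(round(p_cancer_given_pos, 3))  # ~0.102: roughly the 10% mentioned in the quote
```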
2005
- (Ladd et al., 2005) ⇒ Andrew M. Ladd, Kostas E. Bekris, Algis Rudys, Lydia E. Kavraki, and Dan S. Wallach. (2005). “Robotics-based Location Sensing Using Wireless Ethernet.” In: Wireless Networks Journal, 11(1-2). doi:10.1007/s11276-004-4755-8
- QUOTE: … While these were good fits for hallways 1 and 2, they failed to model the noisiness of the static localizer on hallways 3 and 4. A conditional probability function trained to the actual points would likely provide better results. …
2004
- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
- QUOTE: In supervised classification, inputs [math]\displaystyle{ x }[/math] and their labels [math]\displaystyle{ y }[/math] arise from an unknown joint probability [math]\displaystyle{ p(x,y) }[/math]. If we can approximate [math]\displaystyle{ p(x,y) }[/math] using a parametric family of models [math]\displaystyle{ G = \{p_\theta(x,y), \theta \in \Theta\} }[/math], then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.
However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]\displaystyle{ p(y \vert x) }[/math].
Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models, e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than discriminative ones.
Conversely, the generative approach converges to the best model for the joint distribution p(x,y) but the resulting conditional density is usually a biased classifier unless its pθ(x) part is an accurate model for p(x). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density p(x, y) [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
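The generative-discriminative pair named in the quote (LDA vs. linear logistic regression) can be compared on synthetic data. This is a minimal sketch using scikit-learn, which the paper does not reference; both fitted models expose their estimated conditional distribution p(y|x) via predict_proba, and the data-generating choices below are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two Gaussian classes with equal covariance: the setting where LDA is well specified.
n = 200
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# Generative: model p(x, y), then derive p(y | x) by Bayes' rule.
lda = LinearDiscriminantAnalysis().fit(X, y)
# Discriminative: model p(y | x) directly, assuming nothing about p(x).
logreg = LogisticRegression().fit(X, y)

x_new = np.array([[0.5, 0.0]])
print("LDA    p(y|x):", lda.predict_proba(x_new))
print("LogReg p(y|x):", logreg.predict_proba(x_new))
```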
2001
- (Halpern, 2001) ⇒ Joseph Y. Halpern. (2001). “Lexicographic Probability, Conditional Probability, and Nonstandard Probability.” In: Proceedings of the 8th conference on Theoretical aspects of rationality and knowledge. ISBN:1-55860-791-9
- QUOTE: A conditional probability measure takes pairs [math]\displaystyle{ U,V }[/math] of subsets as arguments; [math]\displaystyle{ \mu(V,U) }[/math] is generally written [math]\displaystyle{ \mu(V \mid U) }[/math] to stress the conditioning aspects. The first argument comes from some algebra [math]\displaystyle{ \mathcal{F} }[/math] of subsets of a space [math]\displaystyle{ W }[/math]; if [math]\displaystyle{ W }[/math] is infinite, [math]\displaystyle{ \mathcal{F} }[/math] is often taken to be a [math]\displaystyle{ \sigma }[/math]-algebra.
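As a minimal sketch of the quoted definition, a conditional probability measure over a finite space W can be represented as a function of pairs of subsets, with the algebra taken to be the full power set; the names W, prob, mu, and mu_conditional and the uniform measure are illustrative assumptions.

```python
from fractions import Fraction

# A finite space W with a probability assignment on its points;
# the algebra F is taken to be the full power set of W.
W = {"a", "b", "c", "d"}
prob = {w: Fraction(1, 4) for w in W}          # uniform measure, for illustration

def mu(event):
    """Unconditional measure of a subset of W."""
    return sum(prob[w] for w in event)

def mu_conditional(V, U):
    """mu(V | U) = mu(V & U) / mu(U), defined only when mu(U) > 0."""
    if mu(U) == 0:
        raise ValueError("mu(U) = 0: conditional measure undefined here")
    return mu(V & U) / mu(U)

print(mu_conditional({"a", "b"}, {"b", "c", "d"}))   # Fraction(1, 3)
```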