Conditional Random Fields (CRFs) Model Family

A Conditional Random Fields (CRFs) Model Family is a discriminative undirected conditional graphical model family with jointly distributed random variables [math]\displaystyle{ \bf{X} }[/math] and [math]\displaystyle{ \bf{Y} }[/math], where [math]\displaystyle{ \bf{X} }[/math] are observed variables (covariates), and [math]\displaystyle{ \bf{Y} }[/math] are unobserved variables that are globally conditioned on [math]\displaystyle{ \bf{X} }[/math].
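
A minimal way to make the "globally conditioned" phrase concrete (an editorial illustration consistent with the Hammersley–Clifford factorization cited in the references below, not a quotation): the family models [math]\displaystyle{ p(\bf{Y} \mid \bf{X}) = \frac{1}{Z(\bf{X})}\prod_{C}\Psi_C(\bf{Y}_C, \bf{X}) }[/math] directly, with per-observation normalizer [math]\displaystyle{ Z(\bf{X}) = \sum_{\bf{Y}'}\prod_{C}\Psi_C(\bf{Y}'_C, \bf{X}) }[/math] taken over the cliques [math]\displaystyle{ C }[/math] of the graph on [math]\displaystyle{ \bf{Y} }[/math]; the distribution of [math]\displaystyle{ \bf{X} }[/math] itself is left unmodeled, which is what makes the family discriminative rather than generative.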



References

2018

  • http://wikipedia.org/wiki/Conditional_random_field#Description
    • Lafferty, McCallum and Pereira define a CRF on observations [math]\displaystyle{ \boldsymbol{X} }[/math] and random variables [math]\displaystyle{ \boldsymbol{Y} }[/math] as follows:

      Let [math]\displaystyle{ G = (V , E) }[/math] be a graph such that [math]\displaystyle{ \boldsymbol{Y} = (\boldsymbol{Y}_v)_{v\in V} }[/math], so that [math]\displaystyle{ \boldsymbol{Y} }[/math] is indexed by the vertices of [math]\displaystyle{ G }[/math].

      Then [math]\displaystyle{ (\boldsymbol{X}, \boldsymbol{Y}) }[/math] is a conditional random field when the random variables [math]\displaystyle{ \boldsymbol{Y}_v }[/math], conditioned on [math]\displaystyle{ \boldsymbol{X} }[/math], obey the Markov property with respect to the graph: [math]\displaystyle{ p(\boldsymbol{Y}_v |\boldsymbol{X}, \boldsymbol{Y}_w, w \neq v) = p(\boldsymbol{Y}_v |\boldsymbol{X}, \boldsymbol{Y}_w, w \sim v) }[/math], where [math]\displaystyle{ \mathit{w} \sim v }[/math] means that [math]\displaystyle{ w }[/math] and [math]\displaystyle{ v }[/math] are neighbors in [math]\displaystyle{ G }[/math].

      What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets [math]\displaystyle{ \boldsymbol{X} }[/math] and [math]\displaystyle{ \boldsymbol{Y} }[/math], the observed and output variables, respectively; the conditional distribution [math]\displaystyle{ p(\boldsymbol{Y}|\boldsymbol{X}) }[/math] is then modeled.
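
    • A minimal numerical check of this Markov property (an editorial sketch, not part of the quoted definition; the chain length, label set, observation values, and potentials below are made-up assumptions) is the following Python snippet, which builds a tiny linear-chain CRF by brute-force enumeration and confirms that the conditional of one label given all of the others equals its conditional given only its chain neighbours:

      import itertools
      import math

      # Tiny linear-chain CRF over labels {0, 1} with a fixed observation sequence x.
      # The node and edge potentials are arbitrary illustrative choices.
      LABELS = (0, 1)
      x = (0.3, -1.2, 0.7, 0.1)   # hypothetical observation sequence
      n = len(x)

      def node_score(v, y_v):
          return 0.8 * y_v * x[v]                # made-up node feature weight

      def edge_score(y_u, y_v):
          return 0.5 if y_u == y_v else -0.5     # made-up transition feature

      def unnormalized(y):
          s = sum(node_score(v, y[v]) for v in range(n))
          s += sum(edge_score(y[v], y[v + 1]) for v in range(n - 1))
          return math.exp(s)

      Z = sum(unnormalized(y) for y in itertools.product(LABELS, repeat=n))

      def p(y):
          return unnormalized(y) / Z

      def cond_full(v, y):
          # p(Y_v | X, Y_w for all w != v): condition on every other label.
          return p(y) / sum(p(y[:v] + (lab,) + y[v + 1:]) for lab in LABELS)

      def cond_neighbours(v, y):
          # p(Y_v | X, Y_w for w ~ v): condition only on the chain neighbours,
          # summing out the remaining labels.
          nbrs = {w for w in (v - 1, v + 1) if 0 <= w < n}
          free = [w for w in range(n) if w != v and w not in nbrs]

          def marginal(lab):
              total = 0.0
              for assignment in itertools.product(LABELS, repeat=len(free)):
                  full = list(y)
                  full[v] = lab
                  for w, lw in zip(free, assignment):
                      full[w] = lw
                  total += p(tuple(full))
              return total

          return marginal(y[v]) / sum(marginal(lab) for lab in LABELS)

      y = (1, 0, 1, 1)
      print(cond_full(2, y))        # p(Y_2 | X, Y_0, Y_1, Y_3)
      print(cond_neighbours(2, y))  # p(Y_2 | X, Y_1, Y_3): equal, by the Markov property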

2012

  • (Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Conditional_random_field
    • … a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets [math]\displaystyle{ \boldsymbol{X} }[/math] and [math]\displaystyle{ \boldsymbol{Y} }[/math], the observed and output variables, respectively; the conditional distribution [math]\displaystyle{ p(\boldsymbol{Y}|\boldsymbol{X}) }[/math] is then modeled. The Markov property means that for any pair of nodes [math]\displaystyle{ v, w \in \boldsymbol{Y} }[/math] there exists a single neighbour [math]\displaystyle{ w' }[/math] of [math]\displaystyle{ v }[/math] such that [math]\displaystyle{ v }[/math] is conditionally independent of [math]\displaystyle{ w }[/math], given [math]\displaystyle{ w' }[/math] and [math]\displaystyle{ \boldsymbol{X} }[/math]. This means that the graphical structure of [math]\displaystyle{ \boldsymbol{Y} }[/math] (ignoring [math]\displaystyle{ \boldsymbol{X} }[/math]) has to be cycle-free, and is tree-structured by definition. Given the tree-structure of the output graph, [math]\displaystyle{ p(\boldsymbol{Y}|\boldsymbol{X})\propto\exp\left(\sum_{v\in V,k}\lambda_kf_k(v, \boldsymbol{Y}_v, \boldsymbol{X}) + \sum_{e\in E,l}\mu_lg_l(e,\boldsymbol{Y}|_e, \boldsymbol{X})\right) }[/math] where [math]\displaystyle{ \boldsymbol{Y}|_e }[/math] is the two endpoints of the edge [math]\displaystyle{ e }[/math], according to the Hammersley–Clifford theorem. As opposed to a general undirected model, parameter estimation and exact inference are both tractable in this kind of model.

      For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is basically the same as for an MRF and the same arguments hold. For graphs with chain or tree structure, exact inference is possible. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithm for the case of HMMs.

      Learning the parameters [math]\displaystyle{ \theta }[/math] is usually done by maximum likelihood learning for [math]\displaystyle{ p(Y_i|X_i; \theta) }[/math]. If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. It can be solved, for example, using gradient descent algorithms or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. This is intractable to do exactly in general graphs, so approximations have to be used.
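
    • As an editorial illustration of the factorization above and of why chain-structured inference stays exact (the tokens, feature functions, and weight values below are made-up assumptions, not taken from the quoted article), the following Python sketch scores label sequences with node features [math]\displaystyle{ f_k }[/math], edge features [math]\displaystyle{ g_l }[/math], and weights [math]\displaystyle{ \lambda_k, \mu_l }[/math], and computes the normalizer [math]\displaystyle{ Z(\boldsymbol{X}) }[/math] exactly with a forward recursion:

      import itertools
      import math

      LABELS = (0, 1)
      x = ("the", "cat", "sat")   # hypothetical observation sequence
      n = len(x)
      lam = [0.7, -0.3]           # weights for the node features f_k
      mu = [1.1]                  # weights for the edge features g_l

      def f(k, v, y_v):
          # Made-up binary node features f_k(v, Y_v, X); X is the enclosing x.
          if k == 0:
              return 1.0 if (y_v == 1 and x[v] == "cat") else 0.0
          return 1.0 if (y_v == 0 and v == 0) else 0.0

      def g(j, y_u, y_v):
          # Made-up binary edge feature g_l(e, Y|_e, X): label agreement.
          return 1.0 if y_u == y_v else 0.0

      def node_pot(v, y_v):
          return sum(lam[k] * f(k, v, y_v) for k in range(len(lam)))

      def edge_pot(y_u, y_v):
          return sum(mu[j] * g(j, y_u, y_v) for j in range(len(mu)))

      def score(y):
          return (sum(node_pot(v, y[v]) for v in range(n))
                  + sum(edge_pot(y[v], y[v + 1]) for v in range(n - 1)))

      def log_Z():
          # Forward recursion: exact on chains (and trees), as the quote notes.
          alpha = [node_pot(0, y0) for y0 in LABELS]
          for v in range(1, n):
              alpha = [
                  math.log(sum(math.exp(alpha[i] + edge_pot(LABELS[i], y_v))
                               for i in range(len(LABELS)))) + node_pot(v, y_v)
                  for y_v in LABELS
              ]
          return math.log(sum(math.exp(a) for a in alpha))

      def p(y):
          return math.exp(score(y) - log_Z())

      # Sanity check against brute-force enumeration over all label sequences.
      brute = math.log(sum(math.exp(score(y)) for y in itertools.product(LABELS, repeat=n)))
      assert abs(log_Z() - brute) < 1e-9
      print(p((0, 1, 0)))

      With binary features of this kind the conditional log-likelihood is concave in the weights, which is why the fully observed training problem described above can be solved to a global optimum with gradient-based or quasi-Newton methods such as L-BFGS.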

2009

  • http://www.inference.phy.cam.ac.uk/hmw26/crf/
    • QUOTE: Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.

2008

  • (Pereira, 2008) ⇒ Fernando Pereira's Homepage http://www.cis.upenn.edu/~pereira/
    • QUOTE: Many sequence-processing problems involve breaking a sequence into subsequences (person name vs. other), or labeling sequence elements (parts of speech). Higher-level analyses involve parsing the sequence into a tree or graph representing its hierarchical structure. Previous approaches using probabilistic generative models like HMMs and PCFGs have difficulty in dealing with correlated features of the input sequence. We have been developing and applying structured linear models, starting with conditional random fields, as a more flexible, effective approach to learning how to segment and parse sequences.

2004

  • http://crf.sourceforge.net/introduction/models.html
  • (McDonald et al., 2004) ⇒ Ryan T. McDonald, R. Scott Winters, Mark Mandel, Yang Jin, Peter S. White, and Fernando Pereira. (2004). “An Entity Tagger for Recognizing Acquired Genomic Variations in Cancer Literature.” In: Bioinformatics, 20(17). doi:10.1093/bioinformatics/bth350.
    • QUOTE: These models are convenient because they allow us to combine the effects of many potentially informative features and have previously been successfully used for other biomedical named entity taggers (McDonald and Pereira, 2004).

      CRFs model the conditional probability of a tag sequence given an observation sequence, [math]\displaystyle{ P(T|O) }[/math], where [math]\displaystyle{ O }[/math] is an observation sequence, in our case a sequence of tokens in the abstract, and [math]\displaystyle{ T = t_1, t_2, \ldots, t_n }[/math] is a corresponding tag sequence in which each tag labels the corresponding token with one of TYPE, LOCATION, INITIAL-STATE, ALTERED-STATE and OTHER. CRFs are log-linear models based on a set of feature functions, [math]\displaystyle{ f_i(t_j, t_{j-1}, O) }[/math], that map predicates on observation/tag-transition pairs to binary values.

      Each feature has an associated weight, [math]\displaystyle{ \lambda_i }[/math], that measures its effect on the overall choice of tags.

      Given a trained model, the optimal tag sequence for new examples is found with the Viterbi algorithm (Rabiner, 1993).
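
    • To make the decoding step concrete (an editorial sketch; the tokens, feature predicates, and weights below are hypothetical and are not taken from McDonald et al.), the following Python snippet runs the Viterbi recursion over binary feature functions of the form [math]\displaystyle{ f_i(t_j, t_{j-1}, O) }[/math] with weights [math]\displaystyle{ \lambda_i }[/math] and returns the highest-scoring tag sequence:

      TAGS = ("TYPE", "LOCATION", "OTHER")
      tokens = ("codon", "175", "of", "TP53")   # hypothetical observation sequence O

      def features(t_cur, t_prev, O, j):
          # Binary predicates on observation/tag-transition pairs (made up).
          return {
              "cur-is-digit&tag=LOCATION": float(O[j].isdigit() and t_cur == "LOCATION"),
              "cur=codon&tag=TYPE": float(O[j].lower() == "codon" and t_cur == "TYPE"),
              "prev=TYPE&cur=LOCATION&cur-is-digit": float(t_prev == "TYPE" and t_cur == "LOCATION" and O[j].isdigit()),
              "tag=OTHER": float(t_cur == "OTHER"),
          }

      weights = {                               # hypothetical weights, one per feature
          "cur-is-digit&tag=LOCATION": 2.0,
          "cur=codon&tag=TYPE": 1.5,
          "prev=TYPE&cur=LOCATION&cur-is-digit": 1.0,
          "tag=OTHER": 0.2,
      }

      def local_score(t_cur, t_prev, O, j):
          return sum(weights[name] * value
                     for name, value in features(t_cur, t_prev, O, j).items())

      def viterbi(O):
          # delta[t] = best score of any tag sequence for O[0..j] ending in tag t.
          delta = {t: local_score(t, None, O, 0) for t in TAGS}
          backpointers = []
          for j in range(1, len(O)):
              new_delta, ptr = {}, {}
              for t in TAGS:
                  best_prev = max(TAGS, key=lambda q: delta[q] + local_score(t, q, O, j))
                  new_delta[t] = delta[best_prev] + local_score(t, best_prev, O, j)
                  ptr[t] = best_prev
              delta = new_delta
              backpointers.append(ptr)
          # Trace the best path backwards.
          path = [max(TAGS, key=lambda t: delta[t])]
          for ptr in reversed(backpointers):
              path.append(ptr[path[-1]])
          return list(reversed(path))

      print(viterbi(tokens))   # ['TYPE', 'LOCATION', 'OTHER', 'OTHER'] under these weights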

2003b

  • (Zelenko et al., 2003) ⇒ Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. (2003). “Kernel Methods for Relation Extraction.” In: Journal of Machine Learning Research, 3.
    • QUOTE: CRFs are an example of exponential models (Berger et al., 1996); as such, they enjoy a number of attractive properties (e.g., global likelihood maximum) and are better suited for modeling sequential data, as contrasted with other conditional models (Lafferty et al., 2001).
