CRF Training Algorithm
A CRF Training Algorithm is a graphical model training algorithm that accepts a CRF model family and can be applied by a CRF training system (to solve a CRF training task and produce a CRF structure).
- AKA: Conditional Random Field Trainer.
- Context:
- It can range from being a Fully-Supervised CRF Training Algorithm to being a Semi-Supervised CRF Training Algorithm.
- It can be a Part Of a CRF-based Tagging Algorithm.
- It can require a CRF Feature Template.
- …
- Example(s):
- It can be based on an Iterative Scaling Algorithm (Lafferty et al., 2001).
- It can be based on a Limited-Memory Quasi-Newton Algorithm/L-BFGS (Tsai et al., 2006), as sketched below.
- It can be based on a Convex Optimization Algorithm (Sha & Pereira, 2003a).
- It can be based on a Stochastic Gradient Optimization Algorithm (Vishwanathan et al., 2006).
- …
- Counter-Example(s):
- a MEMM (Maximum-entropy Markov Model) Training Algorithm.
- a CRF Inference Algorithm.
- an HMM Training Algorithm.
- an RNN Training Algorithm.
- See: CRF Label Bias, CRF Length Bias.
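A minimal training sketch follows (not taken from the referenced papers). It assumes the third-party sklearn-crfsuite package, a thin wrapper around CRFsuite, and uses hypothetical toy data, but it shows the typical inputs of a fully-supervised CRF trainer (per-token feature dicts plus label sequences) and the L-BFGS optimizer listed among the examples above.

```python
# Minimal linear-chain CRF training sketch (illustrative only).
# Assumes the third-party sklearn-crfsuite package; the data is a toy example.
import sklearn_crfsuite

# Each training instance is a sequence of per-token feature dicts
# (the kind of representation a CRF Feature Template produces) ...
X_train = [
    [{"word.lower": "john", "is_title": True},
     {"word.lower": "lives", "is_title": False},
     {"word.lower": "in", "is_title": False},
     {"word.lower": "paris", "is_title": True}],
]
# ... paired with one output label per token.
y_train = [["B-PER", "O", "O", "B-LOC"]]

# L-BFGS (a limited-memory quasi-Newton method) with elastic-net
# regularization: c1 weights the L1 penalty, c2 the L2 penalty.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,
    c2=0.1,
    max_iterations=100,
)
crf.fit(X_train, y_train)

# The trained model can then tag new sequences.
print(crf.predict([[{"word.lower": "john", "is_title": True}]]))
```

In practice the feature dicts are generated by a CRF Feature Template applied at each token position, and the same fit/predict interface is used regardless of which underlying optimizer is selected.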
References
2016
- http://www.datasciencecentral.com/profiles/blogs/conditional-random-fields-crf-short-survey
- QUOTE: The time complexity of the training process is large enough: [math]\displaystyle{ O(mNTQ^2nS) }[/math], where:
- m is the number of training iterations
- N is the number of training data sequences
- T is the average length of training sequences
- Q is the number of class labels
- n is the number of CRF features
- S is the searching time of the optimization algorithm (for example, L-BFGS algorithm, which is considered good for this).
- In practical implementation, the computational time is often larger due to many other operations like numerical scaling, smoothing etc.
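To make the quoted cost formula concrete, here is a rough back-of-the-envelope calculation; every value below is hypothetical and chosen only to illustrate how the factors multiply.

```python
# Back-of-the-envelope estimate of the quoted O(m*N*T*Q^2*n*S) training cost.
# All values are hypothetical, for illustration only.
m = 100       # training iterations
N = 10_000    # training data sequences
T = 20        # average length of a training sequence
Q = 10        # class labels (the Q^2 factor comes from label-pair transitions)
n = 1_000     # CRF features
S = 1         # search cost of the optimization algorithm, treated as unit cost

ops = m * N * T * Q**2 * n * S
print(f"roughly {ops:.1e} elementary operations")  # roughly 2.0e+12
```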
2010
- (Lavergne et al., 2010) ⇒ Thomas Lavergne, Olivier Cappé, and François Yvon. (2010). “Practical Very Large Scale CRFs.” In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, (ACL 2010).
- QUOTE: In this paper, we address the issue of training very large CRFs, containing up to hundreds output labels and several billion features. Efficiency stems here from the sparsity induced by the use of a l1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy.
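As an illustration of the mechanism this quote relies on (not of the authors' own implementation), CRFsuite-style toolkits expose the l1 penalty as a coefficient; raising it drives most feature weights to exactly zero, so only the surviving features need to be stored and evaluated. The snippet assumes sklearn-crfsuite and toy data.

```python
# Illustration of L1-induced sparsity in CRF training (sklearn-crfsuite).
# Toy data; a very-large-scale setup would have billions of candidate features.
import sklearn_crfsuite

X_toy = [[{"w": "paris"}, {"w": "is"}, {"w": "nice"}]]
y_toy = [["B-LOC", "O", "O"]]

# c1 is the L1 coefficient: larger values yield a sparser trained model.
sparse_crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=1.0, c2=0.0,
                                  max_iterations=100)
sparse_crf.fit(X_toy, y_toy)

# state_features_ maps (attribute, label) pairs to learned weights; under a
# strong L1 penalty the stored model retains far fewer of them.
print(len(sparse_crf.state_features_), "retained state features")
```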
2006
- (Vishwanathan et al., 2006) ⇒ S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. (2006). “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In: Proceedings of the 23rd International Conference on Machine learning (ICML-2006). doi:10.1145/1143844.1143966
- QUOTE: We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.
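Stochastic Meta-Descent itself is not implemented in the common CRF toolkits, so the following is only a stand-in for the stochastic-gradient family the quote refers to: CRFsuite's plain SGD trainer with an L2 penalty ("l2sgd"), again via sklearn-crfsuite on toy data.

```python
# Stand-in illustration of stochastic-gradient CRF training (not the paper's SMD):
# CRFsuite's "l2sgd" trainer, i.e. SGD with an L2 penalty.
import sklearn_crfsuite

X_toy = [[{"w": "john"}, {"w": "smith"}],
         [{"w": "new"}, {"w": "york"}]]
y_toy = [["B-PER", "I-PER"], ["B-LOC", "I-LOC"]]

sgd_crf = sklearn_crfsuite.CRF(algorithm="l2sgd", c2=0.1, max_iterations=50)
sgd_crf.fit(X_toy, y_toy)
print(sgd_crf.predict(X_toy))
```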
2003
- (Sha & Pereira, 2003a) ⇒ Fei Sha, and Fernando Pereira. (2003). “Shallow Parsing with Conditional Random Fields.” In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003). doi:10.3115/1073445.1073473
- QUOTE: To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties.
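For context, the objective behind this switch is the penalized conditional log-likelihood of the training data; in standard notation (not quoted from the paper), for a CRF with parameter vector θ and regularization constant λ:

[math]\displaystyle{ \mathcal{L}(\theta) = \sum_{i=1}^{N} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) - \frac{\lambda}{2} \lVert \theta \rVert_2^2 }[/math]

This function is concave in θ, so minimizing its negation is a convex optimization problem with a single global optimum, which is what allows the original iterative scaling updates to be replaced by general-purpose optimizers such as L-BFGS or conjugate gradient.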