Bidirectional LSTM/CRF Training Algorithm
A Bidirectional LSTM/CRF Training Algorithm is a supervised sequence labeling algorithm that combines a bidirectional LSTM training algorithm with a CRF training algorithm.
- Context:
- It can be implemented by a Bidirectional LSTM/CRF Training System.
- …
- Example(s):
- …
- Counter-Example(s):
- See: LSTM System, neuroner.com.
References
2017
- (Reimers & Gurevych, 2017a) ⇒ Nils Reimers, and Iryna Gurevych. (2017). "Optimal hyperparameters for deep lstm-networks for sequence labeling tasks". arXiv preprint arXiv:1707.06799.
- QUOTE: LSTM-Networks are a popular choice for linguistic sequence tagging and show a strong performance in many tasks. Figure 1 shows the principle architecture of a BiLSTM-model for sequence tagging. A detailed explanation of the model can be found in (Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016). (...)
Table 12 compares the two options when all other hyperparameters are kept the same. It confirms the impression that CRF leads to superior results in most cases, except for the event detection task. The improvement by using a CRF classifier instead of a softmax classifier lies between 0.19 percentage points and 0.85 percentage points for the evaluated tasks (...)
Figure 1: Architecture of the BiLSTM network with a CRF-classifier. A fixed sized character-based representation is derived either with a Convolutional Neural Network or with a BiLSTM network.
Table 12: Network configurations were sampled randomly and each was evaluated with each classifier as a last layer. The first number in a cell depicts in how many cases each classifier produced better results than the others. The second number shows the median difference to the best option for each task. Statistically significant differences with p < 0.01 are marked with †.
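The quoted comparison concerns only the last layer of the tagger: a per-token softmax classifier versus a CRF over the whole tag sequence. The following minimal sketch (PyTorch, using the third-party pytorch-crf package; all names and dimensions are illustrative assumptions, not the authors' setup) shows how the two heads differ on top of the same BiLSTM emission scores.

```python
# Minimal sketch: a BiLSTM tagger whose last layer is either a per-token
# softmax classifier or a CRF. The CRF uses the third-party `pytorch-crf`
# package (torchcrf.CRF); names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: pip install pytorch-crf

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags, use_crf=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.use_crf = use_crf
        self.crf = CRF(num_tags, batch_first=True) if use_crf else None

    def loss(self, tokens, tags, mask):
        # tokens, tags: (batch, seq_len); mask: (batch, seq_len) bool
        feats = self.emissions(self.bilstm(self.embed(tokens))[0])
        if self.use_crf:
            # CRF negative log-likelihood over the whole tag sequence
            return -self.crf(feats, tags, mask=mask, reduction='mean')
        # softmax classifier: independent per-token cross-entropy
        return nn.functional.cross_entropy(feats[mask], tags[mask])
```

In either case the BiLSTM features are identical; only the training objective (and, at test time, the decoding step) changes between the two heads.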
2016a
- (Lample et al., 2016) ⇒ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. (2016). “Neural Architectures for Named Entity Recognition.” In: Proceedings of NAACL-HLT.
- QUOTE:
Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. [math]\displaystyle{ l_i }[/math] represents the word i and its left context, [math]\displaystyle{ r_i }[/math] represents the word i and its right context. Concatenating these two vectors yields a representation of the word i in its context, [math]\displaystyle{ c_i }[/math].
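As a minimal illustration of the quoted figure, the sketch below (PyTorch; all dimensions are illustrative assumptions) builds l_i with a left-to-right LSTM, r_i with a right-to-left LSTM, and concatenates them into the contextual representation c_i.

```python
# Minimal sketch of the contextual word representation c_i: a forward LSTM
# produces l_i (word i plus its left context), a backward LSTM produces r_i
# (word i plus its right context), and the two are concatenated.
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len = 100, 50, 7          # illustrative sizes
word_embeddings = torch.randn(1, seq_len, emb_dim)  # one sentence, batch_first

fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

l, _ = fwd_lstm(word_embeddings)                       # l[:, i] ~ l_i
r_rev, _ = bwd_lstm(torch.flip(word_embeddings, [1]))  # run right-to-left
r = torch.flip(r_rev, [1])                             # r[:, i] ~ r_i

c = torch.cat([l, r], dim=-1)  # c[:, i] ~ c_i, shape (1, seq_len, 2*hidden_dim)
```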
2016b
- (Ma & Hovy, 2016) ⇒ Xuezhe Ma, and Eduard Hovy (2016). "End-to-end sequence labeling via bi-directional lstm-cnns-crf". arXiv preprint arXiv:1603.01354.
- QUOTE: ... we construct our neural network model by feeding the output vectors of BLSTM into a CRF layer. Figure 3 illustrates the architecture of our network in detail. For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs. Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network. Finally, the output vectors of BLSTM are fed to the CRF layer to jointly decode the best label sequence. As shown in Figure 3, dropout layers are applied on both the input and output vectors of BLSTM.
Figure 1: The convolution neural network for extracting character-level representations of words. Dashed arrows indicate a dropout layer applied before character embeddings are input to CNN.
Figure 3: The main architecture of our neural network. The character representation for each word is computed by the CNN in Figure 1. Then the character representation vector is concatenated with the word embedding before feeding into the BLSTM network. Dashed arrows indicate dropout layers applied on both the input and output vectors of BLSTM.
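A compact sketch of the described pipeline follows (PyTorch; hyperparameters and names are illustrative assumptions, not the paper's exact configuration): a character-level CNN yields a fixed-size vector per word, which is concatenated with the word embedding; dropout is applied before the BiLSTM input and after its output, and the BiLSTM outputs would then be fed to a CRF layer as in Figure 3.

```python
# Minimal sketch of the char-CNN + word-embedding + BiLSTM encoder described
# above. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=100, char_dim=30, n_filters=30, kernel=3,
                 word_vocab=10000, word_dim=100, hidden=200, dropout=0.5):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel, padding=1)
        self.word_embed = nn.Embedding(word_vocab, word_dim)
        self.dropout = nn.Dropout(dropout)
        self.bilstm = nn.LSTM(word_dim + n_filters, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_embed(char_ids).view(b * s, w, -1).transpose(1, 2)
        # max-pool over character positions -> fixed-size vector per word
        char_repr = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_embed(word_ids), char_repr], dim=-1)
        out, _ = self.bilstm(self.dropout(x))   # dropout on BiLSTM input
        return self.dropout(out)                # dropout on BiLSTM output; fed to CRF
```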
2015
- (Huang, Xu & Yu, 2015) ⇒ Zhiheng Huang, Wei Xu, Kai Yu (2015). "Bidirectional LSTM-CRF models for sequence tagging (PDF)". arXiv preprint arXiv:1508.01991.
- QUOTE: … we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations. ...
... Similar to a LSTM-CRF network, we combine a bidirectional LSTM network and a CRF network to form a BI-LSTM-CRF network (Fig. 7). In addition to the past input features and sentence level tag information used in a LSTM-CRF model, a BILSTM-CRF model can use the future input features. The extra features can boost tagging accuracy as we will show in experiments.
Figure 7: A BI-LSTM-CRF model
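The "sentence level tag information" contributed by the CRF layer can be illustrated with a minimal sketch (PyTorch tensors; names and shapes are illustrative assumptions): the score of a tag sequence adds a learned tag-to-tag transition matrix to the BiLSTM emission scores, and decoding searches for the highest-scoring sequence with Viterbi.

```python
# Minimal sketch of CRF sequence scoring and Viterbi decoding on top of
# BiLSTM emission scores. Names and shapes are illustrative assumptions.
import torch

def sequence_score(emissions, transitions, tags):
    # emissions: (seq_len, num_tags) from the BiLSTM; tags: (seq_len,) gold tags
    # transitions[p, c] is the learned score of moving from tag p to tag c.
    score = emissions[0, tags[0]]
    for i in range(1, emissions.size(0)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def viterbi_decode(emissions, transitions):
    # Standard Viterbi over tag transitions; returns the best tag sequence.
    seq_len, num_tags = emissions.shape
    score = emissions[0]                       # best score ending in each tag
    backpointers = []
    for i in range(1, seq_len):
        # total[p, c] = best score ending in p, then transitioning to c at step i
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```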