Bidirectional LSTM/CRF Training Algorithm
A Bidirectional LSTM/CRF Training Algorithm is a supervised sequence labeling algorithm that combines a bidirectional LSTM training algorithm with a CRF training algorithm.
- Context:
- It can be implemented by a Bidirectional LSTM/CRF Training System.
- …
- Example(s):
- …
- Counter-Example(s):
- See: LSTM System, neuroner.com.
References
2017
- (Reimers & Gurevych, 2017a) ⇒ Nils Reimers, and Iryna Gurevych. (2017). "Optimal hyperparameters for deep lstm-networks for sequence labeling tasks". arXiv preprint arXiv:1707.06799.
- QUOTE: LSTM-Networks are a popular choice for linguistic sequence tagging and show a strong performance in many tasks. Figure 1 shows the principle architecture of a BiLSTM-model for sequence tagging. A detailed explanation of the model can be found in (Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016). (...)
Table 12 compares the two options when all other hyperparameters are kept the same. It confirms the impression that CRF leads to superior results in most cases, except for the event detection task. The improvement by using a CRF classifier instead of a softmax classifier lies between 0.19 percentage points and 0.85 percentage points for the evaluated tasks (...)
Figure 1: Architecture of the BiLSTM network with a CRF-classifier. A fixed sized character-based representation is derived either with a Convolutional Neural Network or with a BiLSTM network.
Table 12: Network configurations were sampled randomly and each was evaluated with each classifier as a last layer. The first number in a cell depicts in how many cases each classifier produced better results than the others. The second number shows the median difference to the best option for each task. Statistically significant differences with p < 0.01 are marked with †.
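The quoted comparison concerns only the last layer of the tagger: a per-token softmax classifier versus a CRF over the whole tag sequence. The following minimal sketch (PyTorch, using the third-party pytorch-crf package; all names and dimensions are illustrative assumptions, not the authors' setup) shows how the two heads differ on top of the same BiLSTM emission scores.

```python
# Minimal sketch: a BiLSTM tagger whose last layer is either a per-token
# softmax classifier or a CRF. The CRF uses the third-party `pytorch-crf`
# package (torchcrf.CRF); names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: pip install pytorch-crf

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags, use_crf=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.use_crf = use_crf
        self.crf = CRF(num_tags, batch_first=True) if use_crf else None

    def loss(self, tokens, tags, mask):
        # tokens, tags: (batch, seq_len); mask: (batch, seq_len) bool
        feats = self.emissions(self.bilstm(self.embed(tokens))[0])
        if self.use_crf:
            # CRF negative log-likelihood over the whole tag sequence
            return -self.crf(feats, tags, mask=mask, reduction='mean')
        # softmax classifier: independent per-token cross-entropy
        return nn.functional.cross_entropy(feats[mask], tags[mask])
```

In either case the BiLSTM features are identical; only the training objective (and, at test time, the decoding step) changes between the two heads.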
2016a
- (Lample et al., 2016) ⇒ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. (2016). “Neural Architectures for Named Entity Recognition.” In: Proceedings of NAACL-HLT.
- QUOTE:
Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. [math]\displaystyle{ l_i }[/math] represents the word i and its left context, [math]\displaystyle{ r_i }[/math] represents the word i and its right context. Concatenating these two vectors yields a representation of the word i in its context, [math]\displaystyle{ c_i }[/math].
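As a minimal illustration of the quoted figure, the sketch below (PyTorch; all dimensions are illustrative assumptions) builds l_i with a left-to-right LSTM, r_i with a right-to-left LSTM, and concatenates them into the contextual representation c_i.

```python
# Minimal sketch of the contextual word representation c_i: a forward LSTM
# produces l_i (word i plus its left context), a backward LSTM produces r_i
# (word i plus its right context), and the two are concatenated.
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len = 100, 50, 7          # illustrative sizes
word_embeddings = torch.randn(1, seq_len, emb_dim)  # one sentence, batch_first

fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

l, _ = fwd_lstm(word_embeddings)                       # l[:, i] ~ l_i
r_rev, _ = bwd_lstm(torch.flip(word_embeddings, [1]))  # run right-to-left
r = torch.flip(r_rev, [1])                             # r[:, i] ~ r_i

c = torch.cat([l, r], dim=-1)  # c[:, i] ~ c_i, shape (1, seq_len, 2*hidden_dim)
```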
2016b
- (Ma & Hovy, 2016) ⇒ Xuezhe Ma, and Eduard Hovy (2016). "End-to-end sequence labeling via bi-directional lstm-cnns-crf". arXiv preprint arXiv:1603.01354.
- QUOTE: ... we construct our neural network model by feeding the output vectors of BLSTM into a CRF layer. Figure 3 illustrates the architecture of our network in detail. For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs. Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network. Finally, the output vectors of BLSTM are fed to the CRF layer to jointly decode the best label sequence. As shown in Figure 3, dropout layers are applied on both the input and output vectors of BLSTM.
Figure 1: The convolution neural network for extracting character-level representations of words. Dashed arrows indicate a dropout layer applied before character embeddings are input to CNN.
Figure 3: The main architecture of our neural network. The character representation for each word is computed by the CNN in Figure 1. Then the character representation vector is concatenated with the word embedding before feeding into the BLSTM network. Dashed arrows indicate dropout layers applied on both the input and output vectors of BLSTM.
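A compact sketch of the described pipeline follows (PyTorch; hyperparameters and names are illustrative assumptions, not the paper's exact configuration): a character-level CNN yields a fixed-size vector per word, which is concatenated with the word embedding; dropout is applied before the BiLSTM input and after its output, and the BiLSTM outputs would then be fed to a CRF layer as in Figure 3.

```python
# Minimal sketch of the char-CNN + word-embedding + BiLSTM encoder described
# above. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=100, char_dim=30, n_filters=30, kernel=3,
                 word_vocab=10000, word_dim=100, hidden=200, dropout=0.5):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel, padding=1)
        self.word_embed = nn.Embedding(word_vocab, word_dim)
        self.dropout = nn.Dropout(dropout)
        self.bilstm = nn.LSTM(word_dim + n_filters, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_embed(char_ids).view(b * s, w, -1).transpose(1, 2)
        # max-pool over character positions -> fixed-size vector per word
        char_repr = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_embed(word_ids), char_repr], dim=-1)
        out, _ = self.bilstm(self.dropout(x))   # dropout on BiLSTM input
        return self.dropout(out)                # dropout on BiLSTM output; fed to CRF
```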
2015
- (Huang, Xu & Yu, 2015) ⇒ Zhiheng Huang, Wei Xu, Kai Yu (2015). "Bidirectional LSTM-CRF models for sequence tagging (PDF)". arXiv preprint arXiv:1508.01991.
- QUOTE: … we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations. ...
... Similar to a LSTM-CRF network, we combine a bidirectional LSTM network and a CRF network to form a BI-LSTM-CRF network (Fig. 7). In addition to the past input features and sentence level tag information used in a LSTM-CRF model, a BILSTM-CRF model can use the future input features. The extra features can boost tagging accuracy as we will show in experiments.
Figure 7: A BI-LSTM-CRF model
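The "sentence level tag information" contributed by the CRF layer can be illustrated with a minimal sketch (PyTorch tensors; names and shapes are illustrative assumptions): the score of a tag sequence adds a learned tag-to-tag transition matrix to the BiLSTM emission scores, and decoding searches for the highest-scoring sequence with Viterbi.

```python
# Minimal sketch of CRF sequence scoring and Viterbi decoding on top of
# BiLSTM emission scores. Names and shapes are illustrative assumptions.
import torch

def sequence_score(emissions, transitions, tags):
    # emissions: (seq_len, num_tags) from the BiLSTM; tags: (seq_len,) gold tags
    # transitions[p, c] is the learned score of moving from tag p to tag c.
    score = emissions[0, tags[0]]
    for i in range(1, emissions.size(0)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def viterbi_decode(emissions, transitions):
    # Standard Viterbi over tag transitions; returns the best tag sequence.
    seq_len, num_tags = emissions.shape
    score = emissions[0]                       # best score ending in each tag
    backpointers = []
    for i in range(1, seq_len):
        # total[p, c] = best score ending in p, then transitioning to c at step i
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```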