EMNLP 2017 BiLSTM-CNN-CRF Training System
An EMNLP 2017 BiLSTM-CNN-CRF Training System is a Bidirectional LSTM-CNN-CRF Training System developed by Reimers & Gurevych (2017).
- Example(s):
- Counter-Example(s):
- See: Bidirectional Neural Network, Convolutional Neural Network, Conditional Random Field, Bidirectional Recurrent Neural Network, Dynamic Neural Network.
References
2018
- (Reimers & Gurevych, 2018) ⇒ EMNLP 2017 BiLSTM-CNN-CRF repository: https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf Retrieved: 2018-07-08.
- QUOTE: This code can be used to run the systems proposed in the following papers:
- Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging: you can choose between a softmax and a CRF classifier.
- Ma and Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: character-based word representations using CNNs are achieved by setting the parameter charEmbeddings to CNN.
- Lample et al., Neural Architectures for Named Entity Recognition: character-based word representations using LSTMs are achieved by setting the parameter charEmbeddings to LSTM.
- Søgaard and Goldberg, Deep Multi-task Learning with Low Level Tasks Supervised at Lower Layers: train multiple tasks and supervise them on different levels.
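The configuration choices quoted above can be sketched as a small parameter snippet. This is a hedged illustration rather than the repository's documented training script: only the charEmbeddings values (CNN vs. LSTM) and the softmax-vs-CRF choice come from the README quote; the classifier key and the commented usage lines are assumptions about the repository's API.

```python
# Hedged sketch of a configuration for the emnlp2017-bilstm-cnn-crf code.
# Only `charEmbeddings` (CNN vs. LSTM) and the softmax-vs-CRF choice are
# documented in the README quote above; the `classifier` key and the usage
# shown in comments are assumptions, not verified API.
custom_params = {
    "charEmbeddings": "CNN",  # CNN character representations (Ma and Hovy, 2016);
                              # set to "LSTM" for Lample et al. (2016)-style representations
    "classifier": ["CRF"],    # assumed key for the softmax vs. CRF classifier choice
}

# Assumed usage, loosely following the repository's training scripts:
# model = BiLSTM(custom_params)
# model.setDataset(datasets, data)
# model.fit(epochs=25)
```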
2017
- (Reimers & Gurevych, 2017) ⇒ Nils Reimers, and Iryna Gurevych. (2017). "Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging". arXiv preprint arXiv:1707.09861.
- QUOTE: We use a BiLSTM-network for sequence tagging as described in (Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016). To be able to evaluate a large number of different network configurations, we optimized our implementation for efficiency, reducing by a factor of 6 the time required per epoch compared to Ma and Hovy (2016).
2016a
- (Lample et al., 2016) ⇒ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. (2016). “Neural Architectures for Named Entity Recognition.” In: Proceedings of NAACL-HLT.
- QUOTE: Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. [math]\displaystyle{ l_i }[/math] represents the word i and its left context, [math]\displaystyle{ r_i }[/math] represents the word i and its right context. Concatenating these two vectors yields a representation of the word i in its context, [math]\displaystyle{ c_i }[/math].
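The concatenation described in this caption can be illustrated with a short sketch. This is a hedged illustration (not the authors' code) assuming TensorFlow/Keras; the embedding and hidden sizes are arbitrary. It only shows how the forward output l_i and the backward output r_i of a bidirectional LSTM are concatenated into the contextual representation c_i.

```python
# Minimal sketch of Figure 1's concatenation: for each token i, the forward
# LSTM output l_i and the backward LSTM output r_i are concatenated into c_i.
# Sizes are illustrative assumptions; this is not the authors' implementation.
import tensorflow as tf

embedding_dim, lstm_units = 100, 100
word_embeddings = tf.keras.Input(shape=(None, embedding_dim))  # (tokens, embedding_dim)

# merge_mode="concat" yields c_i = [l_i; r_i] for every token i
contextual = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, return_sequences=True),
    merge_mode="concat",
)(word_embeddings)

model = tf.keras.Model(word_embeddings, contextual)
model.summary()  # per-token output dimension: 2 * lstm_units
```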
2016b
- (Ma & Hovy, 2016) ⇒ Xuezhe Ma, and Eduard Hovy. (2016). "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF". arXiv preprint arXiv:1603.01354.
- QUOTE: Finally, we construct our neural network model by feeding the output vectors of BLSTM into a CRF layer. Figure 3 illustrates the architecture of our network in detail. For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs. Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network. Finally, the output vectors of BLSTM are fed to the CRF layer to jointly decode the best label sequence. As shown in Figure 3, dropout layers are applied on both the input and output vectors of BLSTM.
Figure 1: The convolution neural network for extracting character-level representations of words. Dashed arrows indicate a dropout layer applied before character embeddings are input to CNN.
Figure 3: The main architecture of our neural network. The character representation for each word is computed by the CNN in Figure 1. Then the character representation vector is concatenated with the word embedding before feeding into the BLSTM network. Dashed arrows indicate dropout layers applied on both the input and output vectors of BLSTM.
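The pipeline described in this passage (character-level CNN, concatenation with the word embedding, dropout, BLSTM, dropout, CRF decoding) can be sketched as follows. This is a hedged TensorFlow/Keras illustration, not Ma and Hovy's implementation: all vocabulary sizes, layer sizes, kernel widths, and dropout rates are assumptions, and the final CRF layer is only indicated in a comment because it is not part of core Keras.

```python
# Hedged sketch of the BLSTM-CNNs-CRF pipeline described above; all sizes and
# rates are illustrative assumptions, not the paper's hyperparameters.
import tensorflow as tf

max_word_len, n_chars, n_words = 20, 100, 10000
char_dim, word_dim, n_filters, lstm_units, n_labels = 30, 100, 30, 200, 17

# Character-level CNN (Figure 1): embed characters, apply dropout before the
# CNN, convolve, and max-pool over each word's characters.
char_ids = tf.keras.Input(shape=(None, max_word_len), dtype="int32")  # (tokens, chars)
char_emb = tf.keras.layers.Embedding(n_chars, char_dim)(char_ids)
char_emb = tf.keras.layers.Dropout(0.5)(char_emb)
char_conv = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv1D(n_filters, kernel_size=3, padding="same", activation="relu")
)(char_emb)
char_repr = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalMaxPooling1D())(char_conv)

# Concatenate the character-level representation with the word embedding.
word_ids = tf.keras.Input(shape=(None,), dtype="int32")  # (tokens,)
word_emb = tf.keras.layers.Embedding(n_words, word_dim)(word_ids)
features = tf.keras.layers.Concatenate()([word_emb, char_repr])

# Dropout on both the input and the output vectors of the BLSTM (Figure 3).
features = tf.keras.layers.Dropout(0.5)(features)
blstm_out = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, return_sequences=True)
)(features)
blstm_out = tf.keras.layers.Dropout(0.5)(blstm_out)

# Per-token emission scores; in the paper these feed a CRF layer that jointly
# decodes the best label sequence (a CRF layer requires an add-on library).
emissions = tf.keras.layers.Dense(n_labels)(blstm_out)
model = tf.keras.Model([word_ids, char_ids], emissions)
```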
2015
- (Huang, Xu & Yu, 2015) ⇒ Zhiheng Huang, Wei Xu, and Kai Yu. (2015). "Bidirectional LSTM-CRF Models for Sequence Tagging". arXiv preprint arXiv:1508.01991.
- QUOTE: In sequence tagging task, we have access to both past and future input features for a given time, we can thus utilize a bidirectional LSTM network (Figure 4) as proposed in (Graves et al., 2013). In doing so, we can efficiently make use of past features (via forward states) and future features (via backward states) for a specific time frame. We train bidirectional LSTM networks using backpropagation through time (BPTT) (Boden, 2002). The forward and backward passes over the unfolded network over time are carried out in a similar way to regular network forward and backward passes, except that we need to unfold the hidden states for all time steps. We also need a special treatment at the beginning and the end of the data points. In our implementation, we do forward and backward for whole sentences and we only need to reset the hidden states to 0 at the beginning of each sentence. We have batch implementation which enables multiple sentences to be processed at the same time.
Figure 4: A bidirectional LSTM network.
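Two implementation points in this quote, processing whole sentences with hidden states reset to zero at the start of each sentence and batching several sentences together, can be illustrated with a short sketch. This is a hedged Keras illustration, not Huang et al.'s implementation: a non-stateful LSTM starts every sequence from zero states, and masking lets zero-padded sentences of different lengths share a batch; sizes and the padding convention are assumptions.

```python
# Hedged sketch: whole sentences are processed at once, states start from zero
# for each sentence (stateful=False is the Keras default), and zero-padding
# with masking allows several sentences per batch. Sizes are assumptions.
import tensorflow as tf

vocab_size, embedding_dim, lstm_units = 10000, 100, 100

word_ids = tf.keras.Input(shape=(None,), dtype="int32")  # batch of padded sentences
embedded = tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True)(word_ids)
outputs = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, return_sequences=True)
)(embedded)
model = tf.keras.Model(word_ids, outputs)
```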