ELMo-BiLSTM-CNN-CRF Training System
An ELMo-BiLSTM-CNN-CRF Training System is a Deep Bidirectional LSTM-CNN Training System that uses ELMo Word Representation.
- Example(s):
- bilm-tf [1] - a TensorFlow implementation of the pretrained biLM used to compute ELMo Word Representations.
- allennlp.modules.elmo [2] - ELMo representations using PyTorch and AllenNLP.
- Keras ELMo Tutorial [3]:
def create_elmo_embeddings(elmo, documents, max_sentences=1000):
    (...)
    return embeddings, labels

elmo = ElmoEmbedder(cuda_device=1)  # Set cuda_device to the ID of your GPU if you have one
train_x, train_y = create_elmo_embeddings(elmo, train_data, 1000)
test_x, test_y = create_elmo_embeddings(elmo, test_data, 1000)
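A hedged sketch of what the elided create_elmo_embeddings body might look like (the tutorial's actual implementation may differ): it assumes each item in documents is a (tokens, label) pair and uses AllenNLP's ElmoEmbedder.embed_sentence, which returns a (3, num_tokens, 1024) array of biLM layer activations that are here averaged to one vector per token.

from allennlp.commands.elmo import ElmoEmbedder  # also provides the ElmoEmbedder used above

def create_elmo_embeddings(elmo, documents, max_sentences=1000):
    # Assumed input format: documents is an iterable of (tokens, label) pairs.
    embeddings, labels = [], []
    for tokens, label in documents[:max_sentences]:
        layer_activations = elmo.embed_sentence(tokens)    # shape: (3, num_tokens, 1024)
        embeddings.append(layer_activations.mean(axis=0))  # average the three biLM layers
        labels.append(label)
    return embeddings, labels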
- Counter-Example(s):
- See: Bidirectional Neural Network, Convolutional Neural Network, Conditional Random Field, Bidirectional Recurrent Neural Network, Dynamic Neural Network.
References
2018a
- (Peters et al., 2018) ⇒ Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (2018). "Deep contextualized word representations". arXiv preprint arXiv:1802.05365.
- QUOTE: ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM. More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer. Combining the internal states in this manner allows for very rich word representations. Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging).
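Concretely, the paper defines this task-specific combination as [math]\displaystyle{ \mathbf{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM} }[/math], where the [math]\displaystyle{ s_j^{task} }[/math] are softmax-normalized weights and [math]\displaystyle{ \gamma^{task} }[/math] is a learned scalar. A minimal NumPy sketch of this "scalar mix" (shapes and variable names are illustrative assumptions, not code from the paper):

import numpy as np

def scalar_mix(layer_activations, s_weights, gamma):
    # layer_activations: (num_layers, num_tokens, dim) biLM activations for one sentence.
    # Softmax-normalize the learned per-layer weights.
    w = np.exp(s_weights - np.max(s_weights))
    w = w / w.sum()
    # Weighted sum over the layer axis, scaled by the learned scalar gamma.
    return gamma * np.tensordot(w, layer_activations, axes=(0, 0))

layer_activations = np.random.randn(3, 50, 1024)  # 3 biLM layers, 50 tokens, 1024 dimensions
elmo_task = scalar_mix(layer_activations, s_weights=np.zeros(3), gamma=1.0)  # -> (50, 1024)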
2018b
- (Reimers & Gurevych, 2018a) ⇒ ELMo-BiLSTM-CNN-CRF repository: https://github.com/UKPLab/elmo-bilstm-cnn-crf Retrieved: 2018-08-05
- QUOTE: For an IPython Notebook with a simple example of how to use ELMo representations for sentence classification, see: Keras_ELMo_Tutorial.ipynb.
This code is an extension of the emnlp2017-bilstm-cnn-crf implementation. Most examples can be used with only slight adaptation. Please also see that repository for an explanation of the definition of the datasets, the configuration of the hyperparameters, how to use it for multi-task learning, and how to create custom features. Most aspects from emnlp2017-bilstm-cnn-crf work the same in this implementation.
2018c
- (Reimers & Gurevych, 2018b) ⇒ Keras ELMo Tutorial: https://github.com/UKPLab/elmo-bilstm-cnn-crf/blob/master/Keras_ELMo_Tutorial.ipynb Retrieved: 2018-08-05
- This IPython Notebook contains an example of how the ELMo embeddings from the paper Deep contextualized word representations (Peters et al., 2018) can be used for document classification.
As the computation of the embeddings is computationally expensive, we include it in a preprocessing step:
- We read in the dataset (here the IMDB dataset).
- The text is tokenized and truncated to a fixed length.
- Each text is fed as a sentence to the AllenNLP ElmoEmbedder to get a 1024-dimensional embedding for each word in the document.
- These embeddings are then fed to the neural network that we train (a minimal sketch of such a downstream model follows below).
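A minimal sketch of such a downstream classifier over the precomputed 1024-dimensional ELMo token embeddings (this is not the tutorial's exact model; the pooling layer, layer sizes, fixed token count, and the binary IMDB output are illustrative assumptions):

import numpy as np
from tensorflow.keras import layers, models

max_tokens, elmo_dim = 100, 1024  # assumes documents were truncated/padded to 100 tokens
model = models.Sequential([
    layers.GlobalAveragePooling1D(input_shape=(max_tokens, elmo_dim)),  # pool token vectors to one document vector
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # binary sentiment label (IMDB)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(np.asarray(train_x), np.asarray(train_y), epochs=3, validation_split=0.1)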
- Computing the embeddings once in the pre-processing step significantly reduces the overall computation time; otherwise, they would be recomputed for each epoch. However, this requires that enough memory is available, as our transformed dataset will consist of [math]\displaystyle{ \text{number_of_tokens} \cdot 1024 }[/math] float32 numbers (see the worked estimate below).
Note: Our simple tokenization process ignores sentence boundaries and the complete document is fed as one single sentence to the ElmoEmbedder. As the ELMo embeddings are defined sentence-wise, it would be better to first identify the sentences in a document and process it sentence by sentence to get the correct embeddings.
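An illustrative back-of-the-envelope estimate of that memory requirement (the document count and length below are assumptions, not taken from the tutorial):

# Assumed: 25,000 IMDB documents truncated to 100 tokens each; one float32 (4 bytes) per dimension.
num_documents, tokens_per_doc, elmo_dim, bytes_per_float32 = 25_000, 100, 1024, 4
total_bytes = num_documents * tokens_per_doc * elmo_dim * bytes_per_float32
print(f"{total_bytes / 1024**3:.1f} GiB")  # ~9.5 GiB just for the precomputed embeddings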