Sequence-to-Sequence Learning Task
A Sequence-to-Sequence Learning Task is a Neural Sequence Learning Task that maps an input sequence dataset to an output sequence dataset.
- AKA: seq2seq Learning Task.
- Context:
- Task Input Requirement: a Sequence Dataset.
- Task Output Requirement: a Sequence Dataset (a toy example of such paired input/output sequences is sketched below, just before the References section).
- Task Models and Subtasks Requirements:
- a seq2seq model, usually based on a Deep Bidirectional Neural Network or an Encoder-Decoder Neural Network,
- a Vectorized Word Representation, usually produced by a Word/Sentence Representation Learning Task.
- It can be solved by a Sequence-to-Sequence Learning System (by implementing a sequence-to-sequence learning algorithm).
- It can range from being a Word-to-Phrase/Phrase-to-Word Learning Task, to being a Sentence-to-Sentence Learning Task, to being a Sentence-to-Video/Video-to-Sentence Learning Task, to being a Video-to-Video Learning Task.
- It can range from being a Supervised Sequence-to-Sequence Learning Task, to being a Semi-Supervised Sequence-to-Sequence Learning Task, to being an Unsupervised Sequence-to-Sequence Learning Task.
- It can be used in a Neural Conversational Modelling Task and a Neural Machine Translation Task.
- Example(s):
- an Encoder-Decoder Sequence-to-Sequence Learning Task,
- a Convolutional Sequence-to-Sequence Learning Task,
- a Connectionist Sequence Classification Task,
- a Multi-modal Sequence-to-Sequence Learning Task,
- a Sequence-to-Sequence Learning Task with a Variational Auto-Encoder,
- a Sequence-to-Sequence Learning Task via Shared Latent Representation,
- a Sequence-to-Sequence Translation Task with an Attention Mechanism.
- …
- Counter-Example(s):
- See: Natural Language Processing Task, Sequence Learning Task, Word Sense Disambiguation, LSTM, Deep Neural Network, Memory Augmented Neural Network Training System, Deep Sequence Learning Task, Bidirectional LSTM.
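The task's input and output requirements can be made concrete with a small, illustrative dataset of paired variable-length sequences. The following Python sketch is an editorial illustration only; the toy language pair, special symbols, and helper names are assumptions, not taken from any of the referenced papers.

```python
# A minimal, illustrative seq2seq dataset: each example pairs a variable-length
# input token sequence with a variable-length output token sequence.
# The language pair and vocabulary below are made up for illustration.
seq2seq_dataset = [
    (["ich", "bin", "müde"],            ["i", "am", "tired"]),
    (["wie", "geht", "es", "dir", "?"], ["how", "are", "you", "?"]),
    (["danke"],                         ["thank", "you"]),
]

def build_vocab(sequences, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Map every token (plus special symbols) to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

src_vocab = build_vocab(src for src, _ in seq2seq_dataset)
tgt_vocab = build_vocab(tgt for _, tgt in seq2seq_dataset)

# Encode one training pair as id sequences; the decoder side is framed with
# <sos>/<eos> so a model can learn where the output sequence starts and ends.
src_ids = [src_vocab[t] for t in seq2seq_dataset[0][0]]
tgt_ids = ([tgt_vocab["<sos>"]]
           + [tgt_vocab[t] for t in seq2seq_dataset[0][1]]
           + [tgt_vocab["<eos>"]])
print(src_ids, tgt_ids)
```

A seq2seq model is then trained on such pairs to emit the output sequence given the input sequence, with no requirement that the two sequences have the same length.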
References
2018
- (Liao et al., 2018) ⇒ Binbing Liao, Jingqing Zhang, Chao Wu, Douglas McIlwraith, Tong Chen, Shengwen Yang, Yike Guo, and Fei Wu. (2018). “Deep Sequence Learning with Auxiliary Information for Traffic Prediction.” In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ISBN:978-1-4503-5552-0 doi:10.1145/3219819.3219895
- QUOTE: In this paper, we effectively utilise three kinds of auxiliary information in an encoder-decoder sequence to sequence (Seq2Seq) [7, 32] learning manner as follows: a wide linear model is used to encode the interactions among geographical and social attributes, a graph convolution neural network is used to learn the spatial correlation of road segments, and the query impact is quantified and encoded to learn the potential influence of online crowd queries (...)
Figure 4 shows the architecture of the Seq2Seq model for traffic prediction. The encoder embeds the input traffic speed sequence [math]\displaystyle{ \{v_1,v_2, \cdots ,v_t \} }[/math] and the final hidden state of the encoder is fed into the decoder, which learns to predict the future traffic speed [math]\displaystyle{ \{\tilde{v}_{t+1},\tilde{v}_{t+2}, \cdots,\tilde{v}_{t+t'} \} }[/math]. Hybrid model that integrates the auxiliary information will be proposed based on the Seq2Seq model.
Figure 4: Seq2Seq: The Sequence to Sequence model predicts future traffic speed [math]\displaystyle{ \{\tilde{v}_{t+1},\tilde{v}_{t+2}, \cdots ,\tilde{v}_{t+t'} \} }[/math], given the previous traffic speed [math]\displaystyle{ \{v_1,v_2, \cdots ,v_t \} }[/math].
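As an editorial illustration of the architecture the figure describes (not the authors' code), the following PyTorch sketch encodes an observed speed sequence with one LSTM and lets a second LSTM, initialized with the encoder's final state, roll out the future speeds autoregressively. The hidden size, horizon, and random toy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrafficSeq2Seq(nn.Module):
    """Minimal Seq2Seq regressor in the spirit of Figure 4: an encoder LSTM reads
    the observed speeds v_1..v_t, and its final state initializes a decoder LSTM
    that rolls out predictions for v_{t+1}..v_{t+t'}. Sizes are illustrative."""

    def __init__(self, hidden_size=64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, 1)        # hidden state -> predicted speed

    def forward(self, past_speeds, horizon):
        # past_speeds: (batch, t, 1)
        _, (h, c) = self.encoder(past_speeds)        # final encoder state summarizes v_1..v_t
        step_input = past_speeds[:, -1:, :]          # seed the decoder with the last observed speed
        outputs = []
        for _ in range(horizon):                     # autoregressive roll-out
            out, (h, c) = self.decoder(step_input, (h, c))
            pred = self.proj(out)                    # (batch, 1, 1)
            outputs.append(pred)
            step_input = pred                        # feed the prediction back in
        return torch.cat(outputs, dim=1)             # (batch, horizon, 1)

model = TrafficSeq2Seq()
past = torch.randn(8, 12, 1)                         # 8 road segments, 12 past time steps
future = model(past, horizon=6)                      # predict 6 future steps
print(future.shape)                                  # torch.Size([8, 6, 1])
```

The paper's hybrid model additionally fuses the auxiliary information (geographical/social attributes, spatial correlation, query impact) into this Seq2Seq backbone; that fusion is not shown in the sketch.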
2017
- (Ramachandran et al., 2017) ⇒ Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. (2017). “Unsupervised Pretraining for Sequence to Sequence Learning.” In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). arXiv:1611.02683
- QUOTE: Therefore, the basic procedure of our approach is to pretrain both the seq2seq encoder and decoder networks with language models, which can be trained on large amounts of unlabeled text data. This can be seen in Figure 1, where the parameters in the shaded boxes are pretrained. In the following we will describe the method in detail using machine translation as an example application.
Figure 1: Pretrained sequence to sequence model. The red parameters are the encoder and the blue parameters are the decoder. All parameters in a shaded box are pretrained, either from the source side (light red) or target side (light blue) language model. Otherwise, they are randomly initialized.
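The pretraining idea in the quoted passage can be sketched roughly as follows. This is an editorial, much-simplified PyTorch illustration rather than the authors' implementation: a language model is trained on unlabeled text, and its parameters are then copied into a decoder of the same shape before supervised seq2seq training. The vocabulary size, dimensions, and the single random "unlabeled" batch are assumptions.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256

class LSTMLanguageModel(nn.Module):
    """A plain LSTM language model; the seq2seq decoder is assumed to share this shape."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                            # next-token logits

# 1) Pretrain on unlabeled text (one gradient step shown for brevity).
lm = LSTMLanguageModel()
optimizer = torch.optim.Adam(lm.parameters(), lr=1e-3)
unlabeled_batch = torch.randint(0, VOCAB, (32, 20))   # stand-in for real monolingual data
logits = lm(unlabeled_batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   unlabeled_batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()

# 2) Initialize the seq2seq decoder from the pretrained LM before supervised training.
decoder = LSTMLanguageModel()                         # same shape as the seq2seq decoder stack
decoder.load_state_dict(lm.state_dict())              # copy the pretrained parameters
```

In the paper, the same recipe is applied on both sides: the encoder is initialized from a source-side language model and the decoder from a target-side one, and both are then fine-tuned jointly on the labeled parallel data.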
2016
- (Luong et al., 2016) ⇒ Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. (2016). “Multi-task Sequence to Sequence Learning.” In: Proceedings of the 4th International Conference on Learning Representations (ICLR-2016).
- QUOTE: ... for dealing with variable-length inputs and outputs. ... to map variable-length input sequences to variable-length output sequences. ... state-of-the-art results in not only its original application – machine translation – (Luong et al., 2015b; Jean et al., 2015a; Luong et al., 2015a; Jean et al., 2015b; Luong & Manning, 2015), but also image caption generation (Vinyals et al., 2015b), and constituency parsing (Vinyals et al., 2015a).
2015
- (Vinyals & Le, 2015) ⇒ Oriol Vinyals, and Quoc V. Le. (2015). “A Neural Conversational Model.” In: Proceedings of Deep Leaning Workshop.
- QUOTE: Our approach makes use of the sequence-to-sequence (seq2seq) framework described in (Sutskever et al., 2014). The model is based on a recurrent neural network which reads the input sequence one token at a time, and predicts the output sequence, also one token at a time. During training, the true output sequence is given to the model, so learning can be done by backpropagation. The model is trained to maximize the cross entropy of the correct sequence given its context. During inference, given that the true output sequence is not observed, we simply feed the predicted output token as input to predict the next output. This is a “greedy” inference approach. A less greedy approach would be to use beam search, and feed several candidates at the previous step to the next step. The predicted sequence can be selected based on the probability of the sequence.
Figure 1. Using the seq2seq framework for modeling conversations. Concretely, suppose that we observe a conversation with two turns: the first person utters “ABC”, and second person replies “WXYZ”. We can use a recurrent neural network and train to map “ABC” to “WXYZ” as shown in Figure 1 above. The hidden state of the model when it receives the end of sequence symbol “<eos>” can be viewed as the thought vector because it stores the information of the sentence, or thought, “ABC”.
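The greedy and beam-search inference strategies contrasted in the quote can be sketched in plain Python as follows. This is an editorial illustration: the toy vocabulary, the stand-in `step_log_probs` scoring function, and the beam width are assumptions rather than the authors' setup; in practice `step_log_probs` would be one decoder step of a trained seq2seq model conditioned on the input.

```python
import math

VOCAB = ["<eos>", "W", "X", "Y", "Z"]

def step_log_probs(prefix):
    """Toy stand-in for a trained decoder step: emit W, X, Y, Z in order, then <eos>."""
    preferred = len(prefix) + 1 if len(prefix) < 4 else 0
    logits = [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]
    norm = math.log(sum(math.exp(l) for l in logits))
    return [l - norm for l in logits]

def greedy_decode(max_len=10):
    """Feed the single most probable token back in at every step (the 'greedy' approach)."""
    prefix = []
    for _ in range(max_len):
        probs = step_log_probs(prefix)
        best = probs.index(max(probs))
        if VOCAB[best] == "<eos>":
            break
        prefix.append(best)
    return [VOCAB[i] for i in prefix]

def beam_decode(beam_width=3, max_len=10):
    """Keep the `beam_width` best partial sequences and return the highest-scoring one."""
    beams = [([], 0.0)]                              # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and VOCAB[prefix[-1]] == "<eos>":
                candidates.append((prefix, score))   # finished hypothesis, keep as-is
                continue
            probs = step_log_probs(prefix)
            for i, lp in enumerate(probs):
                candidates.append((prefix + [i], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best = beams[0][0]
    return [VOCAB[i] for i in best if VOCAB[i] != "<eos>"]

print(greedy_decode())   # ['W', 'X', 'Y', 'Z']
print(beam_decode())     # ['W', 'X', 'Y', 'Z']
```

With a real model the two strategies can diverge: beam search keeps several candidate prefixes alive and selects the final sequence by its overall probability, which often recovers outputs a greedy decoder would miss.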
2014a
- (Sutskever et al., 2014) ⇒ Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. (2014). “Sequence to Sequence Learning with Neural Networks.” In: Advances in Neural Information Processing Systems. arXiv:1409.3215
- QUOTE: Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30 ] except that it is conditioned on the input sequence. The LSTM’s ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).
Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.
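The two-LSTM arrangement described above, including the source-reversal trick and teacher-forced training, can be sketched in PyTorch roughly as follows. This is an editorial illustration with assumed vocabulary sizes, dimensions, and random toy batches, not the paper's implementation.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128

src_embed = nn.Embedding(SRC_VOCAB, EMB)
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.LSTM(EMB, HID, batch_first=True)        # reads the source into a fixed-size state
decoder = nn.LSTM(EMB, HID, batch_first=True)        # a conditional language model over the target
generator = nn.Linear(HID, TGT_VOCAB)

src = torch.randint(0, SRC_VOCAB, (4, 7))            # toy batch of source token ids ("A B C")
tgt = torch.randint(0, TGT_VOCAB, (4, 9))            # toy batch of target token ids ("W X Y Z <eos>")

# Reverse the source, as in the paper, to introduce short-term dependencies
# between the start of the source and the start of the target.
src_reversed = torch.flip(src, dims=[1])

_, state = encoder(src_embed(src_reversed))          # fixed-dimensional summary of the source
dec_out, _ = decoder(tgt_embed(tgt[:, :-1]), state)  # teacher forcing: feed the gold prefix
logits = generator(dec_out)                          # predict the next target token at each step

loss = nn.functional.cross_entropy(logits.reshape(-1, TGT_VOCAB),
                                   tgt[:, 1:].reshape(-1))
loss.backward()                                      # gradients flow through both LSTMs
print(loss.item())
```

Conditioning the second LSTM on the encoder's final state is what turns an ordinary recurrent language model into a sequence-to-sequence model.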
2014b
- (Cho et al., 2014a) ⇒ Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. (2014). “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (EMNLP-2014). arXiv:1406.1078
- QUOTE: In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence.
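The encode-to-a-fixed-length-vector and decode-back idea can be sketched roughly as follows. This is an editorial PyTorch illustration; the GRU choice mirrors the paper's recurrent unit in spirit, but the sizes, the start-symbol convention, and the greedy read-out are assumptions made for brevity.

```python
import torch
import torch.nn as nn

EMB, HID, VOCAB = 32, 64, 500
embed = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
readout = nn.Linear(HID, VOCAB)

for length in (3, 11):                              # variable-length inputs ...
    x = torch.randint(0, VOCAB, (1, length))
    _, c = encoder(embed(x))                        # ... all map to the same fixed-length summary c
    print(length, "->", tuple(c.shape))             # (1, 1, 64) regardless of input length

    # Decode a 5-step output from c alone, feeding each prediction back in.
    token = torch.zeros(1, 1, dtype=torch.long)     # assume id 0 is a start symbol
    h = c
    for _ in range(5):
        out, h = decoder(embed(token), h)
        token = readout(out).argmax(dim=-1)         # greedy next-token choice
        print("  decoded token id:", token.item())
```

The print-out makes the bottleneck explicit: whatever the input length, the decoder sees only the fixed-length vector c, which is the property later addressed by attention mechanisms.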