Transformer-based Character-Level Language Model (LM)
A Transformer-based Character-Level Language Model (LM) is a neural character-level LM that uses a transformer architecture (self-attention rather than recurrence) to predict the next character in a text sequence.
- Context:
- It can be produced by a Transformer-based Character-Level Language Modeling System.
- Example(s):
- …
- Counter-Example(s):
- an RNN-based Character-Level LM, such as an LSTM-based character-level LM.
- See: Transformer-based Word-Level Language Model.
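As an illustration of the definition above, the following is a minimal sketch (assuming PyTorch) of a transformer-based character-level LM: text is split into individual characters, and a stack of self-attention layers with a causal mask maps each prefix to a distribution over the next character. The vocabulary, layer sizes, and variable names are illustrative only, and positional information is omitted for brevity; this is not a reference implementation of any of the models cited below.

```python
# Minimal character-level transformer LM sketch (illustrative, assuming PyTorch).
import torch
import torch.nn as nn

text = "hello world"
vocab = sorted(set(text))                        # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([[stoi[ch] for ch in text]])  # (1, seq_len) character ids

d_model, vocab_size, seq_len = 64, len(vocab), ids.size(1)
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(d_model, vocab_size)

# Causal (backward-looking) mask: position i may only attend to positions <= i.
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
hidden = encoder(embed(ids), mask=causal)        # note: positional encoding omitted
next_char_probs = head(hidden).softmax(dim=-1)   # (1, seq_len, vocab_size)
# next_char_probs[0, i] is the model's distribution over the character after position i.
```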
References
2019a
- (Horev, 2019) ⇒ Rani Horev. (2019). "Transformer-XL Explained: Combining Transformers and RNNs into a State-of-the-art Language Model." January 17, 2019.
- QUOTE: ... Transformer-XL heavily relies on the vanilla Transformer (Al-Rfou et al.) but introduces two innovative techniques — Recurrence Mechanism and Relative Positional Encoding — to overcome vanilla’s shortcomings. An additional advantage over the vanilla Transformer is that it can be used for both word-level and character-level language modeling. ...
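The recurrence mechanism mentioned in this quote can be illustrated with a minimal sketch (assuming PyTorch): hidden states from the previous segment are cached, detached from the computation graph, and reused as extra attention context for the current segment. Relative positional encoding is omitted, and the class and variable names below are illustrative, not Transformer-XL's actual code.

```python
# Illustrative segment-level recurrence sketch (assuming PyTorch); not Transformer-XL's API.
import torch
import torch.nn as nn

class SegmentRecurrentAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, mem=None):
        # x:   (batch, seg_len, d_model)  current segment
        # mem: (batch, mem_len, d_model)  cached hidden states from the previous segment
        context = x if mem is None else torch.cat([mem, x], dim=1)
        seg_len, ctx_len = x.size(1), context.size(1)
        # Causal mask: position i in the current segment may attend to all cached
        # memory positions and to current-segment positions <= i.
        mask = torch.triu(
            torch.ones(seg_len, ctx_len, dtype=torch.bool, device=x.device),
            diagonal=ctx_len - seg_len + 1,
        )
        out, _ = self.attn(x, context, context, attn_mask=mask)
        # Cache the current segment's hidden states (no gradient) for the next segment;
        # memory length is simplified to equal the segment length here.
        new_mem = x.detach()
        return out, new_mem

# Usage: memory links consecutive segments of a longer character sequence.
layer = SegmentRecurrentAttention()
mem = None
for segment in torch.randn(4, 2, 16, 64).unbind(0):  # 4 consecutive segments
    out, mem = layer(segment, mem)
```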
2019b
- (Dai et al., 2019) ⇒ Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” In: CoRR, abs/1901.02860.
2018
- (Al-Rfou et al., 2018) ⇒ Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. (2018). “Character-Level Language Modeling with Deeper Self-Attention.” In: CoRR, abs/1808.04444.
- QUOTE: ... In this paper, we show that a non-recurrent model can achieve strong results on character-level language modeling.
Specifically, we use a deep network of transformer self-attention layers (Vaswani et al. 2017) with causal (backward-looking) attention to process fixed-length inputs and predict upcoming characters. The model is trained on mini-batches of sequences from random positions in the training corpus, with no information passed from one batch to the next.
Our primary finding is that the transformer architecture is well-suited to language modeling over long sequences and could replace RNNs in this domain. We speculate that the transformer’s success here is due to its ability to “quickly” propagate information over arbitrary distances; by comparison, RNNs need to learn to pass relevant information forward step by step.
We also find that some modifications to the basic transformer architecture are beneficial in this domain. Most importantly, we add three auxiliary losses, requiring the model to predict upcoming characters (i) at intermediate sequence positions, (ii) from intermediate hidden representations, and (iii) at target positions multiple steps in the future. These losses speed up convergence, and make it possible to train deeper networks.
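The auxiliary-loss idea quoted above can be sketched as follows (assuming PyTorch): every character position is a prediction target, and intermediate layers also feed a shared classifier so that each layer contributes its own next-character loss. The multi-step-ahead loss (iii) is omitted, and all names and sizes are illustrative rather than the paper's actual implementation.

```python
# Illustrative sketch of per-position and per-layer auxiliary losses (assuming PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharTransformerWithAuxLosses(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab_size)  # shared classifier for all layers

    def forward(self, chars):
        # chars: (batch, seq_len) integer character ids
        b, t = chars.shape
        h = self.embed(chars) + self.pos(torch.arange(t, device=chars.device))
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=chars.device), 1)
        per_layer_logits = []
        for layer in self.layers:
            h = layer(h, src_mask=causal)
            per_layer_logits.append(self.head(h))  # intermediate layers also predict
        return per_layer_logits

def training_loss(model, chars):
    # Predict character k+1 from characters <= k, at every position and every layer.
    inputs, targets = chars[:, :-1], chars[:, 1:]
    losses = [
        F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        for logits in model(inputs)
    ]
    return sum(losses)  # final-layer loss plus intermediate-layer auxiliary losses
```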