2018 Breaking the Softmax Bottleneck: A High-rank RNN Language Model


Subject Headings: Softmax Layer.

Notes

Cited By

2018

  • https://openreview.net/forum?id=HkwZSG-CZ
    • REVIEW: Viewing language modeling as a matrix factorization problem, the authors argue that the low rank of word embeddings used by such models limits their expressivity and show that replacing the softmax in such models with a mixture of softmaxes provides an effective way of overcoming this bottleneck. This is an interesting and well-executed paper that provides potentially important insight. It would be good to at least mention prior work related to the language modeling as matrix factorization perspective (e.g. Levy & Goldberg, 2014).
    • REVIEW: This paper uncovers a fundamental issue with large vocabularies and goes beyond just analyzing the issue by proposing a helpful method of addressing this.
    • REVIEW: Language models are important components of many NLP tasks. The current state-of-the-art language models are based on recurrent neural networks, which compute the probability of a word given all previous words using a softmax over a linear function of the RNN's hidden state. This paper argues the softmax is not expressive enough and proposes to use a more flexible mixture of softmaxes. The use of a mixture of softmaxes is motivated from a theoretical point of view by translating language modeling into matrix factorization (see the sketch after this list).
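
The matrix-factorization argument mentioned in these reviews can be summarized compactly. The LaTeX sketch below is a condensed restatement, not a quotation from the paper; the notation ($h_c$ for the context vector, $w_x$ for the word embedding, $d$ for the embedding dimension, $M$ for the vocabulary size, $K$ for the number of mixture components) follows the paper's usage at a high level.

  % A softmax language model scores word x in context c with an inner product
  % of a context vector and a word embedding:
  \[
    P_\theta(x \mid c) \;=\; \frac{\exp\!\big(h_c^\top w_x\big)}{\sum_{x'} \exp\!\big(h_c^\top w_{x'}\big)}
  \]
  % Stacking all N contexts and M vocabulary words, the model must factor the
  % true log-probability matrix A (up to row-wise constant shifts) as
  \[
    H_\theta W_\theta^\top \approx A', \qquad \operatorname{rank}\!\big(H_\theta W_\theta^\top\big) \le d,
  \]
  % while rank(A) can be as large as M >> d for highly context-dependent
  % language: this gap is the softmax bottleneck.
  %
  % Mixture of Softmaxes (MoS): mixing K softmaxes with context-dependent
  % weights makes the log-probabilities nonlinear in h_c, so the rank-d bound
  % on the log-probability matrix no longer applies:
  \[
    P_\theta(x \mid c) \;=\; \sum_{k=1}^{K} \pi_{c,k}\,
      \frac{\exp\!\big(h_{c,k}^\top w_x\big)}{\sum_{x'} \exp\!\big(h_{c,k}^\top w_{x'}\big)},
    \qquad \sum_{k=1}^{K} \pi_{c,k} = 1 .
  \]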

Quotes

Abstract

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
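
The "simple and effective method" referred to in the abstract is the Mixture of Softmaxes (MoS) output layer. The following is a minimal PyTorch-style sketch of such a layer, written here for illustration only: the class and parameter names (MixtureOfSoftmaxes, num_mixtures, the hidden/embedding sizes) are placeholders chosen for this sketch and do not come from the authors' released code.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MixtureOfSoftmaxes(nn.Module):
      """Illustrative mixture-of-softmaxes output layer.

      Given an RNN hidden state, produce K component-specific context vectors,
      apply a softmax over the vocabulary inside each component, and combine
      the component distributions with context-dependent mixture weights.
      Because the log of a mixture of softmaxes is not linear in the hidden
      state, the resulting log-probability matrix is not limited to rank d.
      """

      def __init__(self, hidden_dim, embed_dim, vocab_size, num_mixtures=5):
          super().__init__()
          self.num_mixtures = num_mixtures
          self.embed_dim = embed_dim
          # Mixture-weight (prior) logits: one weight per component, per context.
          self.prior = nn.Linear(hidden_dim, num_mixtures)
          # Project the hidden state into K component-specific context vectors.
          self.latent = nn.Linear(hidden_dim, num_mixtures * embed_dim)
          # Output word embeddings shared across all components.
          self.decoder = nn.Linear(embed_dim, vocab_size)

      def forward(self, hidden):
          # hidden: (batch, hidden_dim)
          batch = hidden.size(0)
          # Context-dependent mixture weights pi_{c,k}, summing to 1 over k.
          pi = F.softmax(self.prior(hidden), dim=-1)                 # (batch, K)
          # K component context vectors h_{c,k}.
          h = torch.tanh(self.latent(hidden))                        # (batch, K*embed_dim)
          h = h.view(batch, self.num_mixtures, self.embed_dim)       # (batch, K, embed_dim)
          # Per-component softmax over the vocabulary.
          comp = F.softmax(self.decoder(h), dim=-1)                  # (batch, K, vocab)
          # Mix the component distributions into one valid distribution.
          probs = torch.bmm(pi.unsqueeze(1), comp).squeeze(1)        # (batch, vocab)
          # Return log-probabilities for use with an NLL loss.
          return torch.log(probs + 1e-8)

  # Usage sketch: sizes are arbitrary placeholders, not the paper's settings.
  layer = MixtureOfSoftmaxes(hidden_dim=650, embed_dim=280, vocab_size=10000)
  log_probs = layer(torch.randn(32, 650))   # shape: (32, 10000)

In this sketch the mixing happens in probability space rather than logit space, which is the design choice that lifts the rank restriction of a single softmax.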

References


Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. (2018). "Breaking the Softmax Bottleneck: A High-rank RNN Language Model." In: Proceedings of the Sixth International Conference on Learning Representations (ICLR 2018). https://openreview.net/forum?id=HkwZSG-CZ