Beam Search-based Decoding System
A Beam Search-based Decoding System is a decoding system that implements a beam search-based decoding algorithm.
- …
- Example(s):
- See: Beam Search, Beam Search-based System, Greedy Search Decoder.
References
2018
- https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/
- QUOTE: Natural language processing tasks, such as caption generation and machine translation, involve generating sequences of words.
Models developed for these problems often operate by generating probability distributions across the vocabulary of output words and it is up to decoding algorithms to sample the probability distributions to generate the most likely sequences of words.
In this tutorial, you will discover the greedy search and beam search decoding algorithms that can be used on text generation problems.
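The following is a minimal sketch of the two decoders the tutorial contrasts, operating on a toy matrix of per-step probability distributions over a small vocabulary. The data, function names, and beam width are illustrative assumptions, not taken from the tutorial.

```python
import numpy as np

def greedy_decode(probs):
    """Pick the single most likely word index at each step."""
    return [int(np.argmax(step)) for step in probs]

def beam_search_decode(probs, beam_width=3):
    """Keep the beam_width best partial sequences, scored by summed log probability."""
    beams = [([], 0.0)]  # (token index sequence, cumulative log probability)
    for step in probs:
        candidates = []
        for seq, score in beams:
            for idx, p in enumerate(step):
                candidates.append((seq + [idx], score + np.log(p)))
        # discard all but the beam_width highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy example: 5 decoding steps over a 5-word vocabulary.
probs = np.array([[0.1, 0.2, 0.3, 0.4, 0.05],
                  [0.5, 0.4, 0.3, 0.2, 0.10],
                  [0.1, 0.2, 0.3, 0.4, 0.50],
                  [0.5, 0.4, 0.3, 0.2, 0.10],
                  [0.1, 0.2, 0.3, 0.4, 0.50]])
probs = probs / probs.sum(axis=1, keepdims=True)  # normalize each step

print(greedy_decode(probs))
print(beam_search_decode(probs, beam_width=3))
```

With a beam width of 1 the beam search reduces to the greedy decoder; wider beams trade extra computation for a better approximation of the most likely sequence.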
2016
- (Xie et al., 2016) ⇒ Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. (2016). “Neural Language Correction with Character-Based Attention.” In: CoRR, abs/1603.09727.
- QUOTE: ... For inference we use a beam search decoder combining the neural network and the language model likelihood. Similar to Hannun et al. (2014), at step k, we rank the hypotheses on the beam using the score s_k(y_{1:k}|x) = log P_NN(y_{1:k}|x) + λ log P_LM(y_{1:k}), where the hyper-parameter λ determines how much the language model is weighted. To avoid penalizing longer hypotheses, we additionally normalize scores by the number of words in the hypothesis |y|. Since decoding is done at the character level, the language model probability P_LM(·) is only incorporated after a space or end-of-sentence symbol is encountered. …
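A minimal sketch of the scoring rule quoted above: the neural model's log probability is combined with a λ-weighted language-model term and normalized by the hypothesis length in words. The function name, candidate strings, and log-probability values are made-up illustrations, not the authors' code.

```python
def hypothesis_score(nn_logprob, lm_logprob, num_words, lam=0.5):
    """Rank a hypothesis by the weighted sum of neural-model and language-model
    log probabilities, normalized by its length in words (lam is the LM weight)."""
    return (nn_logprob + lam * lm_logprob) / max(num_words, 1)

# Illustrative comparison of two candidate corrections (log probabilities are invented).
candidates = [
    # (text, log P_NN, log P_LM)
    ("their going home", -4.1, -9.5),
    ("they're going home", -4.3, -7.2),
]
lam = 0.5
ranked = sorted(
    candidates,
    key=lambda c: hypothesis_score(c[1], c[2], len(c[0].split()), lam),
    reverse=True,
)
for text, nn, lm in ranked:
    print(text, hypothesis_score(nn, lm, len(text.split()), lam))
```

In the character-level setting described in the quote, the language-model term would only be added once a space or end-of-sentence symbol completes a word.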
2014
- (Sutskever et al., 2014) ⇒ Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. (2014). “Sequence to Sequence Learning with Neural Networks.” In: Advances in Neural Information Processing Systems.
- QUOTE: We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>” symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1).
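Below is a minimal sketch of the left-to-right beam search the quote describes: B partial hypotheses are each extended with every vocabulary word, only the B most likely extensions are kept, and any hypothesis that emits "<EOS>" is moved to the set of complete hypotheses. The `log_prob` model interface and the toy bigram table are assumptions made for illustration.

```python
import math

def beam_search(log_prob, vocab, beam_size=2, max_len=20, eos="<EOS>"):
    """log_prob(prefix, word) -> model log probability of `word` given `prefix`.
    Returns hypotheses sorted by total log probability."""
    beams = [([], 0.0)]          # (prefix, cumulative log probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word in vocab:
                candidates.append((prefix + [word], score + log_prob(prefix, word)))
        # discard all but the beam_size most likely partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                completed.append((prefix, score))   # hypothesis is finished
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return sorted(completed + beams, key=lambda c: c[1], reverse=True)

# Toy model: a fixed bigram table over a small vocabulary.
table = {
    (None, "the"): 0.6, (None, "a"): 0.3, (None, "<EOS>"): 0.1,
    ("the", "cat"): 0.7, ("the", "dog"): 0.2, ("the", "<EOS>"): 0.1,
    ("a", "cat"): 0.4, ("a", "dog"): 0.5, ("a", "<EOS>"): 0.1,
    ("cat", "<EOS>"): 0.9, ("cat", "the"): 0.05, ("cat", "a"): 0.05,
    ("dog", "<EOS>"): 0.9, ("dog", "the"): 0.05, ("dog", "a"): 0.05,
}
vocab = ["the", "a", "cat", "dog", "<EOS>"]

def log_prob(prefix, word):
    prev = prefix[-1] if prefix else None
    return math.log(table.get((prev, word), 1e-9))

print(beam_search(log_prob, vocab, beam_size=2))
```

Setting beam_size=1 gives the greedy decoder mentioned in the quote; as the authors note, even small beams (size 1 or 2) recover most of the benefit of beam search.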