2017 Lecture10NeuralMachineTranslati
- (Manning & Socher, 2017) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 10 - Neural Machine Translation and Models with Attention.” In: Lecture in Natural Language Processing with Deep Learning - Stanford CS224N Ling284 (2017).
Subject Headings: Neural seq2seq, Neural MT, Neural Models with Attention. Manning-Socher Neural Machine Translation System, WMT-14 SMT Shared Task, WMT-15 SMT Shared Task.
Notes
Cited By
2017
- (Manning & Socher, 2017k) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 11 - Further Topics in Neural Machine Translation and Recurrent Models.” Natural Language Processing with Deep Learning - Stanford CS224N Ling284.
Quotes
- Lecture Plan
- Going forwards and backwards
- Translation, Machine Translation, Neural Machine Translation
- Research highlight: Google's new NMT
- Sequence models with attention
- Sequence model decoders
The classic test of language understanding!
Both language analysis & generation
Big MT needs … for humanity … and commerce
Translation is a US$40 billion a year industry
Huge in Europe, growing in Asia
Large social/government/military as well as commercial needs
Huge commercial use
Google translates over 100 billion words a day.
Facebook in 2016 rolled out new homegrown MT: "When we turned [MT] off for some people, they went nuts!"
eBay uses MT to enable cross-border trade.
http://www.commonsenseadvisory.com/AbstractView.aspx?ArticleID=36540
https://googleblog.blogspot.com/2016/04/ten-years-of-google-translate.html
https://techcrunch.com/2016/05/23/facebook-translation/
What is Neural MT (NMT)?
Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network*
- But sometimes we compromise this goal a little
Neural encoder-decoder architectures
(Figure: Input text → Encoder → an intermediate vector representation, e.g. [−0.2 −0.1 0.1 0.4 −0.3 1.1] → Decoder → Translated text.)
[Allen 1987 IEEE 1st ICNN]
3310 En-Es pairs constructed on 31 En, 40 Es words, max 10/11 word sentence; 33 used as test set
The grandfather offered the little girl a book ➔ El abuelo le ofrecio un libro a la nina pequena
Binary encoding of words – 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Ave WER: 1.3 words
[Chrisman 1992 Connection Science]
Dual-ported RAAM architecture [Pollack 1990 Artificial Intelligence] applied to corpus of 216 parallel pairs of simple En-Es sentences:
You are not angry ⬌ Usted no esta furioso
Split 50/50 as train/test, 75% of sentences correctly translated!
Modern Sequence Models for NMT
[Sutskever et al. 2014, cf. Bahdanau et al. 2014, et seq.]
(Figure: the encoder reads "I am a student _" and the decoder generates "Je suis étudiant _", each generated word feeding the next decoder step.)
(Figure: a deep recurrent neural network for NMT. The encoder builds up the sentence meaning of the source "Die Proteste waren am Wochenende eskaliert <EOS>" through stacked hidden-state vectors; the decoder then generates the translation "The protests escalated over the weekend <EOS>".)
Conditional Recurrent Language Model
(Figure: feeding in the last word at each step. The encoder maps "Le chat assis sur le tapis." to a summary vector Y; the decoder then generates "The cat sat on the mat.")
Recurrent Neural Network Encoder
(Figure: hidden states h0, h1, h2, h3, …, h7 computed one symbol at a time over the source "Le chat assis … .")
Read a source sentence one symbol at a time.
The last hidden state Y summarizes the entire source sentence.
Any recurrent activation function can be used:
Hyperbolic tangent (tanh)
Gated recurrent unit [Cho et al., 2014]
Long short-term memory [Sutskever et al., 2014]
Convolutional network [Kalchbrenner & Blunsom, 2013]
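As a minimal sketch (assuming PyTorch and hypothetical class/variable names; the lecture does not prescribe an implementation), a GRU encoder that reads the source one symbol at a time and returns its last hidden state as the summary Y might look like:

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Reads the source sentence one symbol at a time; the final hidden
    state plays the role of the sentence summary Y from the slides."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        emb = self.embed(src_ids)                # (batch, src_len, emb_dim)
        states, last = self.rnn(emb)             # states: (batch, src_len, hidden_dim)
        Y = last.squeeze(0)                      # (batch, hidden_dim), the summary
        return states, Y                         # states are kept for attention later
```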
Decoder: Recurrent Language Model
(Figure: decoder states z0, z1, z2, z3, … initialized from Y = h7; the words generated so far, "The cat sat", are fed back in and the next word "on" is predicted.)
Usual recurrent language model, except the transition also conditions on the source summary Y:
Transition: $z_t = f(z_{t-1}, x_t, Y)$
Backpropagation: gradients $\partial z_t / \partial Y$ flow back into Y from every decoder step $t$.
Same learning strategy as usual: MLE with SGD

$$L(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n, Y)$$
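A rough sketch of the conditional decoder and this MLE objective, under the same assumptions as the encoder sketch above (PyTorch, hypothetical names, teacher forcing):

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Recurrent language model whose transition also sees the source summary Y,
    i.e. roughly z_t = f(z_{t-1}, x_t, Y)."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # every step is fed the previous target word and the summary Y
        self.rnn = nn.GRU(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_in, Y):                # tgt_in: (batch, tgt_len); Y: (batch, hidden_dim)
        emb = self.embed(tgt_in)                 # (batch, tgt_len, emb_dim)
        Y_rep = Y.unsqueeze(1).expand(-1, emb.size(1), -1)
        z, _ = self.rnn(torch.cat([emb, Y_rep], dim=-1), Y.unsqueeze(0))
        return self.out(z)                       # logits over the target vocabulary

# MLE with SGD: cross-entropy over target words is exactly -log p(x_t | x_<t, Y)
# logits = decoder(tgt_in, Y)
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
```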
Progress in Machine Translation
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]
(Chart: Cased BLEU, 0–25, for Phrase-based SMT, Syntax-based SMT, and Neural MT over 2013–2016; Neural MT, first entered in 2015, scores highest by 2016.)
From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
Neural MT went from a fringe research activity in 2014 to the widely-adopted leading way to do MT in 2016. Amazing!
Four big wins of Neural MT
1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output.
2. Distributed representations share strength: better exploitation of word and phrase similarities.
3. Better exploitation of context: NMT can use a much bigger context – both source and partial target text – to translate more accurately.
4. More fluent text generation: deep learning text generation is much higher quality.
What wasn’t on that list?
- Black box component models for reordering, transliteration, etc.
- Explicit use of syntactic or semantic structures
- Explicit use of discourse structure, anaphora, etc.
Statistical/Neural Machine Translation
A marvelous use of big data but….
1519年600名西班牙人在墨西哥登陆,去征服几百万人口的阿兹特克帝国,初次交锋他们损兵三分之二。
In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.
- translate.google.com (2009): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.
- translate.google.com (2011): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the initial loss of soldiers, two thirds of their encounters.
- translate.google.com (2013): 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.
- translate.google.com (2014/15/16): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.
- translate.google.com (2017): In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec empire, the first confrontation they killed two-thirds.
Adoption!!! NMT aggressively rolled out by industry!
2016/02, Microsoft launches deep neural network MT running offline on Android/iOS. [Link to blog]
2016/08, Systran launches purely NMT model [Link to press release]
2016/09, Google launches NMT [Link to blog post]
With much more hype and gross overclaims of equaling human translation quality
Great New York Times Magazine feature
Paper on the research: https://arxiv.org/abs/1611.04558
Google’s Multilingual Neural Machine Translation System:
Enabling Zero-Shot Translation
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean
Presented by: Emma Peng
State-of-the-art: Neural Machine Translation (NMT)
Multilingual NMT? Previously …
Multiple Encoders → Multiple Decoders [1]
Shared Encoder → Multiple Decoders [2]
Multiple Encoders → Shared Decoder [3]
(Figure: e.g. separate En-Es, En-NL, En-Fr encoder-decoder pairs, contrasted with Es-En, NL-En, Fr-En encoders feeding a single shared decoder.)
Benefits:
- Simplicity: single model
- Low-resource language improvements
- Zero-shot translation
Artificial token at the beginning of the input sentence to indicate the target language
Add <2es> to indicate that Spanish is the target language
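A minimal illustration of the trick, with a hypothetical helper name:

```python
def add_target_token(source_tokens, target_lang):
    """Prepend an artificial token such as <2es> so that one shared model
    knows which language to translate into (hypothetical helper)."""
    return ["<2{}>".format(target_lang)] + list(source_tokens)

print(add_target_token(["Hello", ",", "how", "are", "you", "?"], "es"))
# ['<2es>', 'Hello', ',', 'how', 'are', 'you', '?']
```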
WMT’14:
Comparable performance: English → French
State-of-the-art: English → German, French → English
WMT’15:
State-of-the-art: German → English
Train:
Portuguese → English, English → Spanish (Model 1)
Or, English ←→ {Portuguese, Spanish} (Model 2)
Test:
Portuguese → Spanish Zero-Shot!
Thank you!
Introducing Attention: Vanilla seq2seq & long sentences
(Figure: vanilla seq2seq: the encoder reads "I am a student _" and the decoder must generate "Je suis étudiant _" from a single vector.)
Problem: fixed-dimensional representation Y
Attention Mechanism
Started in computer vision! [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012]
Solution: random access memory into a pool of source states. Retrieve as needed.
(Figure: while generating "Je suis …", the decoder can look back at the encoder states for "I am a student _".)
Word alignments
Phrase-based SMT aligned words in a preprocessing step, usually using EM.
(Figure: word alignment matrix between the French "Le reste appartenait aux autochtones" and the English "The balance was the territory of the aboriginal people".)
Learning both translation & alignment
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR'15.
Attention Mechanism
Simplified version of (Bahdanau et al., 2015). While generating the next target word ("suis" after "I am a student _ Je"), an attention layer computes a context vector over the pool of source states:
1. Scoring: compare the current target hidden state with each source hidden state (e.g. scores 3, 5, 1, 1).
2. Normalization: convert the scores into alignment weights (e.g. 0.3, 0.5, 0.1, 0.1).
3. Context: build the context vector as the weighted average of the source states.
4. Hidden state: compute the next hidden state from the context vector and the current decoder state.
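A compact NumPy sketch of these four steps, using dot-product scoring as the comparison (shapes and names are illustrative assumptions, not the lecture's notation):

```python
import numpy as np

def attention_step(target_hidden, source_states):
    """One attention step: score, normalize, build context, combine.
    target_hidden: (d,) current decoder state; source_states: (src_len, d)."""
    # 1. Scoring: compare target and source hidden states (dot product here)
    scores = source_states @ target_hidden            # (src_len,)
    # 2. Normalization: softmax turns scores into alignment weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Context vector: weighted average of the source states
    context = weights @ source_states                 # (d,)
    # 4. Attentional hidden state: combine context and decoder state
    #    (Luong et al. use tanh(W_c [context; target]); here we just concatenate)
    attentional_hidden = np.concatenate([context, target_hidden])
    return attentional_hidden, weights
```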
Simplified mechanism & more functions:
Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP'15.
Bilinear form: well-adopted.
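For reference, the bilinear ("general") score from Luong et al., 2015 is $\text{score}(h_t, \bar{h}_s) = h_t^\top W \bar{h}_s$; a small sketch with assumed shapes:

```python
import numpy as np

def bilinear_score(target_hidden, source_states, W):
    """score(h_t, h_s) = h_t^T W h_s for every source position.
    target_hidden: (d_tgt,), source_states: (src_len, d_src), W: (d_tgt, d_src)."""
    return source_states @ (W.T @ target_hidden)      # (src_len,)
```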
Avoid focusing on everything at each time:
- Global attention: all source states.
- Local attention: a subset of source states. Potential for long sequences!
Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP'15.
Attention
(Plot: BLEU as a function of sentence length, 10–70 words. Legend: ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT'14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6). The no-attention system degrades sharply on long sentences.)
source Orlando Bloom and Miranda Kerr still love each other
human Orlando Bloom und Miranda Kerr lieben sich noch immer
+attn Orlando Bloom und Miranda Kerr lieben einander noch immer .
base Orlando Bloom und Lucas Miranda lieben einander noch immer .
Translates names correctly.
source: We ′re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , said Roger Dow , CEO of the U.S. Travel Association .
human: Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht , sagte Roger Dow , CEO der U.S. Travel Association .
+attn: Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist , sagte Roger Dow , CEO der US - die .
base: Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit , sagte Roger Cameron , CEO der US - <unk> .
Translates a doubly-negated phrase correctly.
More Attention! The idea of coverage
Caption generation: how not to miss an important image patch?
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML'15.
Doubly attention
Attention weights sum to 1 in both dimensions: per caption word across image patches (= 1), and per image patch summed across caption words (≈ 1).
Coverage sets existed a long time ago in SMT!
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML'15.
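A small sketch of a coverage-style penalty in the spirit of Xu et al.'s doubly stochastic regularization (the function name and weight `lam` are assumptions, not the lecture's notation):

```python
import numpy as np

def coverage_penalty(alphas, lam=1.0):
    """alphas: (num_words, num_patches) attention weights; each row sums to 1.
    Penalize image patches whose total attention over all caption words is
    far from 1, so no important patch is skipped."""
    coverage = alphas.sum(axis=0)                     # total attention per patch
    return lam * np.sum((1.0 - coverage) ** 2)
```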
Extending attention with linguistic ideas previously used in alignment models
[Tu, Lu, Liu, Liu, Li, ACL’16]: NMT model with coverage-based attention
[Cohn, Hoang, Vymolova, Yao, Dyer, Haffari, NAACL’16]: More substantive models of attention using: position (IBM2) + Markov (HMM) + fertility (IBM3-5) + alignment symmetry (BerkeleyAligner)
(Figure: attention extended with per-source-word coverage and fertility.)
Sequence Model Decoders
Decoding (0) – Exhaustive Search
Simple and exact decoding algorithm
Score each and every possible translation
Pick the best one
DO NOT EVEN THINK of TRYING IT OUT!*
* Perhaps with quantum computer and quantum annealing?
Decoding (1) – Ancestral Sampling
Sample one symbol at a time from
$$\tilde{x}_t \sim x_t \mid x_{t-1}, \ldots, x_1, Y$$
until $\tilde{x}_t = \langle\text{eos}\rangle$, and repeat to draw multiple candidate translations.
Pros:
1. Efficient and unbiased (asymptotically exact)
Cons:
1. High variance
2. Pretty inefficient
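A toy sketch of ancestral sampling, assuming a hypothetical `step_fn(prefix)` that returns the model's next-word probability vector $p(x_t \mid x_{<t}, Y)$:

```python
import numpy as np

def ancestral_sample(step_fn, eos_id, max_len=50, seed=0):
    """Sample one symbol at a time from p(x_t | x_<t, Y) until <eos>."""
    rng = np.random.default_rng(seed)
    prefix = []
    for _ in range(max_len):
        probs = step_fn(prefix)                       # next-word distribution
        x_t = int(rng.choice(len(probs), p=probs))    # sample one symbol
        prefix.append(x_t)
        if x_t == eos_id:
            break
    return prefix
```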
Decoding (2) – Greedy Decoding
Efficient, but heavily suboptimal search: pick the most likely symbol each time,
$$\tilde{x}_t = \arg\max_x \log p(x \mid x_{<t}, Y)$$
until $\tilde{x}_t = \langle\text{eos}\rangle$.
Pros:
1. Super-efficient (both computation and memory)
Cons:
1. Heavily suboptimal
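Under the same assumed `step_fn` interface, greedy decoding differs only by taking the argmax instead of sampling:

```python
import numpy as np

def greedy_decode(step_fn, eos_id, max_len=50):
    """x_t = argmax_x log p(x | x_<t, Y) at every step; stop at <eos>."""
    prefix = []
    for _ in range(max_len):
        x_t = int(np.argmax(np.log(step_fn(prefix))))
        prefix.append(x_t)
        if x_t == eos_id:
            break
    return prefix
```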
Decoding (3) – Beam Search
Pretty effective, but not very efficient
Maintain K hypotheses at a time
$$H_{t-1} = \left\{ (\tilde{x}_1^1, \tilde{x}_2^1, \ldots, \tilde{x}_{t-1}^1),\; (\tilde{x}_1^2, \tilde{x}_2^2, \ldots, \tilde{x}_{t-1}^2),\; \ldots,\; (\tilde{x}_1^K, \tilde{x}_2^K, \ldots, \tilde{x}_{t-1}^K) \right\}$$
Expand each hypothesis
$$H_t^k = \left\{ (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_1),\; (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_2),\; \ldots,\; (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_{|V|}) \right\}$$
Pick top-K hypotheses from the union:
$$H_t = \bigcup_{k=1}^{K} B_k, \quad \text{where} \quad B_k = \underset{\tilde{X} \in A_k}{\arg\max}\, \log p(\tilde{X} \mid Y), \quad A_k = A_{k-1} - B_{k-1}, \quad A_1 = \bigcup_{k'=1}^{K} H_t^{k'}$$
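A simple sketch of this procedure with the same assumed `step_fn` interface (no length normalization or other refinements):

```python
import numpy as np

def beam_search(step_fn, eos_id, beam_size=5, max_len=50):
    """Keep the K best partial hypotheses; expand each with every vocabulary
    symbol, then keep the top K of the union."""
    beams = [([], 0.0)]                                # (prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = np.log(step_fn(prefix))
            for v, lp in enumerate(log_probs):         # expand each hypothesis
                candidates.append((prefix + [v], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:   # pick top-K from the union
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                  # every survivor ended in <eos>
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```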
Decoding (3) – Beam Search
Asymptotically exact, as K → ∞
But, not necessarily monotonic improvement w.r.t. K
K should be selected to maximize the translation quality on a validation set.
En-Cz: 12m training sentence pairs [Cho, arXiv 2016]

Strategy | Chains | Valid NLL | Valid BLEU | Test NLL | Test BLEU
---|---|---|---|---|---
Ancestral Sampling | 50 | 22.98 | 15.64 | 26.25 | 16.76
Greedy Decoding | - | 27.88 | 15.50 | 26.49 | 16.66
Beamsearch | 5 | 20.18 | 17.03 | 22.81 | 18.56
Beamsearch | 10 | 19.92 | 17.13 | 22.44 | 18.59
Greedy Search: computationally efficient, but not great quality.
Beam Search: computationally expensive and not easy to parallelize, but much better quality.
Beam search with a small beam is the de facto standard in NMT.
References
Author | Title | Year
---|---|---
Christopher D. Manning, Richard Socher | Lecture 10 - Neural Machine Translation and Models with Attention | 2017