2017 Lecture10NeuralMachineTranslati
- (Manning & Socher, 2017) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 10 - Neural Machine Translation and Models with Attention.” In: Lecture in Natural Language Processing with Deep Learning - Stanford CS224N Ling284 (2017).
Subject Headings: Neural seq2seq, Neural MT, Neural Models with Attention. Manning-Socher Neural Machine Translation System, WMT-14 SMT Shared Task, WMT-15 SMT Shared Task.
Notes
Cited By
2017
- (Manning & Socher, 2017k) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 11 - Further Topics in Neural Machine Translation and Recurrent Models.” Natural Language Processing with Deep Learning - Stanford CS224N Ling284.
Quotes
- Lecture Plan
- Going forwards and backwards
- Translation, Machine Translation, Neural Machine Translation
- Research highlight: Google's new NMT
- Sequence models with attention
- Sequence model decoders
The classic test of language understanding!
Both language analysis & generation
Big MT needs … for humanity … and commerce
Translation is a US$40 billion a year industry
Huge in Europe, growing in Asia
Large social/government/military as well as commercial needs
Huge commercial use
Google translates over 100 billion words a day.
Facebook in 2016 rolled out new homegrown MT: "When we turned [MT] off for some people, they went nuts!"
eBay uses MT to enable cross-border trade.
http://www.commonsenseadvisory.com/AbstractView.aspx?ArticleID=36540
https://googleblog.blogspot.com/2016/04/ten-years-of-google-translate.html
https://techcrunch.com/2016/05/23/facebook-translation/
What is Neural MT (NMT)?
Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network*
- But sometimes we compromise this goal a little
Neural encoder-decoder architectures
(Figure: Input text → Encoder → an intermediate vector representation, e.g. [−0.2 −0.1 0.1 0.4 −0.3 1.1] → Decoder → Translated text.)
[Allen 1987 IEEE 1st ICNN]
3310 En-Es pairs constructed on 31 En, 40 Es words, max 10/11 word sentence; 33 used as test set
The grandfather offered the little girl a book ➔ El abuelo le ofrecio un libro a la nina pequena
Binary encoding of words – 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Ave WER: 1.3 words
[Chrisman 1992 Connection Science]
Dual-ported RAAM architecture [Pollack 1990 Artificial Intelligence] applied to corpus of 216 parallel pairs of simple En-Es sentences:
You are not angry ⬌ Usted no esta furioso
Split 50/50 as train/test, 75% of sentences correctly translated!
Modern Sequence Models for NMT
[Sutskever et al. 2014, cf. Bahdanau et al. 2014, et seq.]
(Figure: the encoder reads "I am a student _" and the decoder generates "Je suis étudiant _", each generated word feeding the next decoder step.)
(Figure: a deep recurrent neural network for NMT. The encoder builds up the sentence meaning of the source "Die Proteste waren am Wochenende eskaliert <EOS>" through stacked hidden-state vectors; the decoder then generates the translation "The protests escalated over the weekend <EOS>".)
Conditional Recurrent Language Model
(Figure: feeding in the last word at each step. The encoder maps "Le chat assis sur le tapis." to a summary vector Y; the decoder then generates "The cat sat on the mat.")
Recurrent Neural Network Encoder
(Figure: hidden states h0, h1, h2, h3, …, h7 computed one symbol at a time over the source "Le chat assis … .")
Read a source sentence one symbol at a time.
The last hidden state Y summarizes the entire source sentence.
Any recurrent activation function can be used:
Hyperbolic tangent (tanh)
Gated recurrent unit [Cho et al., 2014]
Long short-term memory [Sutskever et al., 2014]
Convolutional network [Kalchbrenner & Blunsom, 2013]
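As a minimal sketch (assuming PyTorch and hypothetical class/variable names; the lecture does not prescribe an implementation), a GRU encoder that reads the source one symbol at a time and returns its last hidden state as the summary Y might look like:

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Reads the source sentence one symbol at a time; the final hidden
    state plays the role of the sentence summary Y from the slides."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        emb = self.embed(src_ids)                # (batch, src_len, emb_dim)
        states, last = self.rnn(emb)             # states: (batch, src_len, hidden_dim)
        Y = last.squeeze(0)                      # (batch, hidden_dim), the summary
        return states, Y                         # states are kept for attention later
```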
Decoder: Recurrent Language Model
(Figure: decoder states z0, z1, z2, z3, … initialized from Y = h7; the words generated so far, "The cat sat", are fed back in and the next word "on" is predicted.)
Usual recurrent language model, except the transition also conditions on the source summary Y:
Transition: $z_t = f(z_{t-1}, x_t, Y)$
Backpropagation: gradients $\partial z_t / \partial Y$ flow back into Y from every decoder step $t$.
Same learning strategy as usual: MLE with SGD

$$L(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n, Y)$$
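A rough sketch of the conditional decoder and this MLE objective, under the same assumptions as the encoder sketch above (PyTorch, hypothetical names, teacher forcing):

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Recurrent language model whose transition also sees the source summary Y,
    i.e. roughly z_t = f(z_{t-1}, x_t, Y)."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # every step is fed the previous target word and the summary Y
        self.rnn = nn.GRU(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_in, Y):                # tgt_in: (batch, tgt_len); Y: (batch, hidden_dim)
        emb = self.embed(tgt_in)                 # (batch, tgt_len, emb_dim)
        Y_rep = Y.unsqueeze(1).expand(-1, emb.size(1), -1)
        z, _ = self.rnn(torch.cat([emb, Y_rep], dim=-1), Y.unsqueeze(0))
        return self.out(z)                       # logits over the target vocabulary

# MLE with SGD: cross-entropy over target words is exactly -log p(x_t | x_<t, Y)
# logits = decoder(tgt_in, Y)
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
```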
Progress in Machine Translation
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]
(Chart: Cased BLEU, 0–25, for Phrase-based SMT, Syntax-based SMT, and Neural MT over 2013–2016; Neural MT, first entered in 2015, scores highest by 2016.)
From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
Neural MT went from a fringe research activity in 2014 to the widely-adopted leading way to do MT in 2016. Amazing!
Four big wins of Neural MT
1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output.
2. Distributed representations share strength: better exploitation of word and phrase similarities.
3. Better exploitation of context: NMT can use a much bigger context – both source and partial target text – to translate more accurately.
4. More fluent text generation: deep learning text generation is much higher quality.
What wasn’t on that list?
- Black box component models for reordering, transliteration, etc.
- Explicit use of syntactic or semantic structures
- Explicit use of discourse structure, anaphora, etc.
Statistical/Neural Machine Translation
A marvelous use of big data but….
1519年600名西班牙人在墨西哥登陆,去征服几百万人口的阿兹特克帝国,初次交锋他们损兵三分之二。
In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.
- translate.google.com (2009): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.
- translate.google.com (2011): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the initial loss of soldiers, two thirds of their encounters.
- translate.google.com (2013): 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.
- translate.google.com (2014/15/16): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.
- translate.google.com (2017): In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec empire, the first confrontation they killed two-thirds.
Adoption!!! NMT aggressively rolled out by industry!
2016/02, Microsoft launches deep neural network MT running offline on Android/iOS. [Link to blog]
2016/08, Systran launches purely NMT model [Link to press release]
2016/09, Google launches NMT [Link to blog post]
With much more hype and gross overclaims of equaling human translation quality
Great New York Times Magazine feature
Paper on the research: https://arxiv.org/abs/1611.04558
Google’s Multilingual Neural Machine Translation System:
Enabling Zero-Shot Translation
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean
Presented by: Emma Peng
State-of-the-art: Neural Machine Translation (NMT)
Multilingual NMT? Previously …
Multiple Encoders → Multiple Decoders [1]
Shared Encoder → Multiple Decoders [2]
Multiple Encoders → Shared Decoder [3]
(Figure: e.g. separate En-Es, En-NL, En-Fr encoder-decoder pairs, contrasted with Es-En, NL-En, Fr-En encoders feeding a single shared decoder.)
Benefits:
- Simplicity: single model
- Low-resource language improvements
- Zero-shot translation
Artificial token at the beginning of the input sentence to indicate the target language
Add <2es> to indicate that Spanish is the target language
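A minimal illustration of the trick, with a hypothetical helper name:

```python
def add_target_token(source_tokens, target_lang):
    """Prepend an artificial token such as <2es> so that one shared model
    knows which language to translate into (hypothetical helper)."""
    return ["<2{}>".format(target_lang)] + list(source_tokens)

print(add_target_token(["Hello", ",", "how", "are", "you", "?"], "es"))
# ['<2es>', 'Hello', ',', 'how', 'are', 'you', '?']
```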
WMT’14:
Comparable performance: English → French
State-of-the-art: English → German, French → English
WMT’15:
State-of-the-art: German → English
Train:
Portuguese → English, English → Spanish (Model 1)
Or, English ←→ {Portuguese, Spanish} (Model 2)
Test:
Portuguese → Spanish Zero-Shot!
Thank you!
Introducing Attention: Vanilla seq2seq & long sentences
(Figure: vanilla seq2seq: the encoder reads "I am a student _" and the decoder must generate "Je suis étudiant _" from a single vector.)
Problem: fixed-dimensional representation Y
Attention Mechanism
Started in computer vision! [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012]
Solution: random access memory into a pool of source states. Retrieve as needed.
(Figure: while generating "Je suis …", the decoder can look back at the encoder states for "I am a student _".)
Word alignments
Phrase-based SMT aligned words in a preprocessing step, usually using EM.
(Figure: word alignment matrix between the French "Le reste appartenait aux autochtones" and the English "The balance was the territory of the aboriginal people".)
Learning both translation & alignment
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR'15.
Attention Mechanism
Simplified version of (Bahdanau et al., 2015). While generating the next target word ("suis" after "I am a student _ Je"), an attention layer computes a context vector over the pool of source states:
1. Scoring: compare the current target hidden state with each source hidden state (e.g. scores 3, 5, 1, 1).
2. Normalization: convert the scores into alignment weights (e.g. 0.3, 0.5, 0.1, 0.1).
3. Context: build the context vector as the weighted average of the source states.
4. Hidden state: compute the next hidden state from the context vector and the current decoder state.
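A compact NumPy sketch of these four steps, using dot-product scoring as the comparison (shapes and names are illustrative assumptions, not the lecture's notation):

```python
import numpy as np

def attention_step(target_hidden, source_states):
    """One attention step: score, normalize, build context, combine.
    target_hidden: (d,) current decoder state; source_states: (src_len, d)."""
    # 1. Scoring: compare target and source hidden states (dot product here)
    scores = source_states @ target_hidden            # (src_len,)
    # 2. Normalization: softmax turns scores into alignment weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Context vector: weighted average of the source states
    context = weights @ source_states                 # (d,)
    # 4. Attentional hidden state: combine context and decoder state
    #    (Luong et al. use tanh(W_c [context; target]); here we just concatenate)
    attentional_hidden = np.concatenate([context, target_hidden])
    return attentional_hidden, weights
```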
Simplified mechanism & more functions:
Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP'15.
Bilinear form: well-adopted.
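For reference, the bilinear ("general") score from Luong et al., 2015 is $\text{score}(h_t, \bar{h}_s) = h_t^\top W \bar{h}_s$; a small sketch with assumed shapes:

```python
import numpy as np

def bilinear_score(target_hidden, source_states, W):
    """score(h_t, h_s) = h_t^T W h_s for every source position.
    target_hidden: (d_tgt,), source_states: (src_len, d_src), W: (d_tgt, d_src)."""
    return source_states @ (W.T @ target_hidden)      # (src_len,)
```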
Avoid focusing on everything at each time:
- Global attention: all source states.
- Local attention: a subset of source states. Potential for long sequences!
Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP'15.
Attention
(Plot: BLEU as a function of sentence length, 10–70 words. Legend: ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT'14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6). The no-attention system degrades sharply on long sentences.)
source Orlando Bloom and Miranda Kerr still love each other
human Orlando Bloom und Miranda Kerr lieben sich noch immer
+attn Orlando Bloom und Miranda Kerr lieben einander noch immer .
base Orlando Bloom und Lucas Miranda lieben einander noch immer .
Translates names correctly.
source: We ′re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , said Roger Dow , CEO of the U.S. Travel Association .
human: Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht , sagte Roger Dow , CEO der U.S. Travel Association .
+attn: Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist , sagte Roger Dow , CEO der US - die .
base: Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit , sagte Roger Cameron , CEO der US - <unk> .
Translates a doubly-negated phrase correctly.
More Attention! The idea of coverage
Caption generation: how not to miss an important image patch?
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML'15.
Doubly attention
Attention weights sum to 1 in both dimensions: per caption word across image patches (= 1), and per image patch summed across caption words (≈ 1).
Coverage sets existed a long time ago in SMT!
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML'15.
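A small sketch of a coverage-style penalty in the spirit of Xu et al.'s doubly stochastic regularization (the function name and weight `lam` are assumptions, not the lecture's notation):

```python
import numpy as np

def coverage_penalty(alphas, lam=1.0):
    """alphas: (num_words, num_patches) attention weights; each row sums to 1.
    Penalize image patches whose total attention over all caption words is
    far from 1, so no important patch is skipped."""
    coverage = alphas.sum(axis=0)                     # total attention per patch
    return lam * np.sum((1.0 - coverage) ** 2)
```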
Extending attention with linguistic ideas previously used in alignment models
[Tu, Lu, Liu, Liu, Li, ACL’16]: NMT model with coverage-based attention
[Cohn, Hoang, Vymolova, Yao, Dyer, Haffari, NAACL’16]: More substantive models of attention using: position (IBM2) + Markov (HMM) + fertility (IBM3-5) + alignment symmetry (BerkeleyAligner)
(Figure: attention extended with per-source-word coverage and fertility.)
Sequence Model Decoders
Decoding (0) – Exhaustive Search
Simple and exact decoding algorithm
Score each and every possible translation
Pick the best one
DO NOT EVEN THINK of TRYING IT OUT!*
* Perhaps with quantum computer and quantum annealing?
Decoding (1) – Ancestral Sampling
Sample one symbol at a time from
$$\tilde{x}_t \sim x_t \mid x_{t-1}, \ldots, x_1, Y$$
until $\tilde{x}_t = \langle\text{eos}\rangle$, and repeat to draw multiple candidate translations.
Pros:
1. Efficient and unbiased (asymptotically exact)
Cons:
1. High variance
2. Pretty inefficient
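A toy sketch of ancestral sampling, assuming a hypothetical `step_fn(prefix)` that returns the model's next-word probability vector $p(x_t \mid x_{<t}, Y)$:

```python
import numpy as np

def ancestral_sample(step_fn, eos_id, max_len=50, seed=0):
    """Sample one symbol at a time from p(x_t | x_<t, Y) until <eos>."""
    rng = np.random.default_rng(seed)
    prefix = []
    for _ in range(max_len):
        probs = step_fn(prefix)                       # next-word distribution
        x_t = int(rng.choice(len(probs), p=probs))    # sample one symbol
        prefix.append(x_t)
        if x_t == eos_id:
            break
    return prefix
```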
Decoding (2) – Greedy Decoding
Efficient, but heavily suboptimal search: pick the most likely symbol each time,
$$\tilde{x}_t = \arg\max_x \log p(x \mid x_{<t}, Y)$$
until $\tilde{x}_t = \langle\text{eos}\rangle$.
Pros:
1. Super-efficient (both computation and memory)
Cons:
1. Heavily suboptimal
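Under the same assumed `step_fn` interface, greedy decoding differs only by taking the argmax instead of sampling:

```python
import numpy as np

def greedy_decode(step_fn, eos_id, max_len=50):
    """x_t = argmax_x log p(x | x_<t, Y) at every step; stop at <eos>."""
    prefix = []
    for _ in range(max_len):
        x_t = int(np.argmax(np.log(step_fn(prefix))))
        prefix.append(x_t)
        if x_t == eos_id:
            break
    return prefix
```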
Decoding (3) – Beam Search
Pretty effective, but not very efficient
Maintain K hypotheses at a time
$$H_{t-1} = \left\{ (\tilde{x}_1^1, \tilde{x}_2^1, \ldots, \tilde{x}_{t-1}^1),\; (\tilde{x}_1^2, \tilde{x}_2^2, \ldots, \tilde{x}_{t-1}^2),\; \ldots,\; (\tilde{x}_1^K, \tilde{x}_2^K, \ldots, \tilde{x}_{t-1}^K) \right\}$$
Expand each hypothesis
$$H_t^k = \left\{ (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_1),\; (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_2),\; \ldots,\; (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_{|V|}) \right\}$$
Pick top-K hypotheses from the union:
$$H_t = \bigcup_{k=1}^{K} B_k, \quad \text{where} \quad B_k = \underset{\tilde{X} \in A_k}{\arg\max}\, \log p(\tilde{X} \mid Y), \quad A_k = A_{k-1} - B_{k-1}, \quad A_1 = \bigcup_{k'=1}^{K} H_t^{k'}$$
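A simple sketch of this procedure with the same assumed `step_fn` interface (no length normalization or other refinements):

```python
import numpy as np

def beam_search(step_fn, eos_id, beam_size=5, max_len=50):
    """Keep the K best partial hypotheses; expand each with every vocabulary
    symbol, then keep the top K of the union."""
    beams = [([], 0.0)]                                # (prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = np.log(step_fn(prefix))
            for v, lp in enumerate(log_probs):         # expand each hypothesis
                candidates.append((prefix + [v], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:   # pick top-K from the union
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                  # every survivor ended in <eos>
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```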
Decoding (3) – Beam Search
Asymptotically exact, as K → ∞
But, not necessarily monotonic improvement w.r.t. K
K should be selected to maximize the translation quality on a validation set.
En-Cz: 12m training sentence pairs [Cho, arXiv 2016]

Strategy | Chains | Valid NLL | Valid BLEU | Test NLL | Test BLEU
---|---|---|---|---|---
Ancestral Sampling | 50 | 22.98 | 15.64 | 26.25 | 16.76
Greedy Decoding | - | 27.88 | 15.50 | 26.49 | 16.66
Beamsearch | 5 | 20.18 | 17.03 | 22.81 | 18.56
Beamsearch | 10 | 19.92 | 17.13 | 22.44 | 18.59
Greedy Search: computationally efficient, but not great quality.
Beam Search: computationally expensive and not easy to parallelize, but much better quality.
Beam search with a small beam is the de facto standard in NMT.
References
Author | Title | Year
---|---|---
Christopher D. Manning, Richard Socher | Lecture 10 - Neural Machine Translation and Models with Attention | 2017