2017 Lecture11FurtherTopicsinNeuralM
- (Manning & Socher, 2017k) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 11 - Further Topics in Neural Machine Translation and Recurrent Models.” Lecture in Natural Language Processing with Deep Learning - Stanford CS224N Ling284 (2017).
Notes
- Lecture Video: https://www.youtube.com/watch?v=6_MO12fPC-0
- Lecture 11 provides a final look at gated recurrent units like GRUs/LSTMs followed by machine translation evaluation, dealing with large vocabulary output, and sub-word and character-based models.
- Also includes the research highlight "Lip Reading Sentences in the Wild."
- Key phrases: Seq2Seq and Attention Mechanisms, Neural Machine Translation, Speech Processing.
- Google Presentation (version): https://docs.google.com/presentation/d/1XJDrDrf8w2yWqdLQtuPzNBpIAxiubcRxHpXH3rl-a2o/
Cited By
2017
- (Manning & Socher, 2017j) ⇒ Christopher Manning, and Richard Socher. (2017). “Lecture 10 - Neural Machine Translation and Models with Attention.” In: Natural Language Processing with Deep Learning - Stanford CS224N Ling284.
Quotes
Lecture Plan: Going forwards and backwards
1. A final look at gated recurrent units like GRUs/LSTMs
2. Research highlight: Lip reading sentences in the wild
3. Machine translation evaluation
4. Dealing with the large output vocabulary
5. Sub-word and character-based models
1. How Gated Units Fix Things: Backpropagation Through Time
Intuitively, what happens with RNNs?
1. Measure the influence of $h_t$ on the future prediction:
$$\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \, \frac{\partial g}{\partial h_{t+n}} \, \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$$
2. How does a perturbation at time t affect $p(x_{t+n} \mid x_{<t+n})$?
- Backpropagation through Time
- The vanishing gradient is super-problematic: when we only observe a near-zero gradient norm,
$$\left\| \frac{\partial h_{t+n}}{\partial h_t} \right\| = \left\| \prod_{i=t}^{t+n-1} U^\top \frac{\partial \tanh(a_i)}{\partial a_i} \right\| \approx 0,$$
we cannot tell whether
  - there is no dependency between t and t+n in the data, or
  - the parameters are misconfigured (the vanishing gradient condition), i.e., the largest eigenvalue of U satisfies
$$e_{\max}(U) < \frac{1}{\max_x \tanh'(x)}$$
- Is the problem with the naïve transition function?
$$f(h_{t-1}, x_t) = \tanh(W[x_t] + U h_{t-1} + b)$$
- With it, the temporal derivative is
$$\frac{\partial h_{t+1}}{\partial h_t} = U^\top \frac{\partial \tanh(a)}{\partial a}$$
- It implies that the error must backpropagate through all the intermediate nodes, as the quick numerical sketch below illustrates.
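The following is a minimal NumPy sketch (not from the lecture; the dimensions, scaling, and random inputs are illustrative assumptions) that multiplies the per-step Jacobians of the naïve transition function and prints the norm of $\partial h_{t+n} / \partial h_t$, which collapses when the largest singular value of $U$ is below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                    # hidden size (illustrative)
U = rng.normal(size=(d, d))
U *= 0.9 / np.linalg.norm(U, 2)           # largest singular value of U is now 0.9 < 1

h = np.zeros(d)
jac = np.eye(d)                           # accumulates d h_(t+n) / d h_t
for n in range(1, 51):
    a = rng.normal(size=d) + U @ h        # pre-activation; rng stands in for W[x_t] + b
    h = np.tanh(a)
    step_jac = np.diag(1.0 - h ** 2) @ U  # d h_(t+1) / d h_t = diag(tanh'(a)) U
    jac = step_jac @ jac
    if n % 10 == 0:
        print(f"n = {n:2d}   ||d h_(t+n) / d h_t|| = {np.linalg.norm(jac, 2):.2e}")
```

Because $|\tanh'| \le 1$, the printed norm decays roughly geometrically, which is exactly the vanishing-gradient condition stated above.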
- Perhaps we can create shortcut connections.
- Perhaps we can create adaptive shortcut connections.
- Candidate update: $\tilde{h}_t = \tanh(W[x_t] + U h_{t-1} + b)$
- Update gate: $u_t = \sigma(W_u[x_t] + U_u h_{t-1} + b_u)$
- Let the net prune unnecessary connections:
  - Candidate update: $\tilde{h}_t = \tanh(W[x_t] + U(r_t \odot h_{t-1}) + b)$
  - Reset gate: $r_t = \sigma(W_r[x_t] + U_r h_{t-1} + b_r)$
  - Update gate: $u_t = \sigma(W_u[x_t] + U_u h_{t-1} + b_u)$
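These equations translate directly into code. Below is a minimal NumPy sketch of a single GRU step; the parameter shapes, random initialization, and toy sizes are illustrative assumptions, not the lecture's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step following the equations above."""
    W, U, b = params["W"], params["U"], params["b"]           # candidate
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]     # reset gate
    Wu, Uu, bu = params["Wu"], params["Uu"], params["bu"]     # update gate

    r = sigmoid(Wr @ x + Ur @ h_prev + br)            # reset gate r_t
    u = sigmoid(Wu @ x + Uu @ h_prev + bu)            # update gate u_t
    h_tilde = np.tanh(W @ x + U @ (r * h_prev) + b)   # candidate; reset prunes connections
    return u * h_tilde + (1.0 - u) * h_prev           # h_t: adaptive shortcut to h_{t-1}

# Toy usage with illustrative sizes.
rng = np.random.default_rng(1)
d_in, d_h = 8, 16
params = {k: rng.normal(scale=0.1, size=(d_h, d_in if k in ("W", "Wr", "Wu") else d_h))
          for k in ("W", "U", "Wr", "Ur", "Wu", "Uu")}
params.update({k: np.zeros(d_h) for k in ("b", "br", "bu")})
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), params)
print(h.shape)   # (16,)
```

The reset gate prunes connections feeding the candidate, and the update gate acts as an adaptive shortcut back to $h_{t-1}$.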
- (Diagrams contrast the computation graph of a tanh-RNN with that of a Gated Recurrent Unit (GRU).)
Gated recurrent units are much more realistic! Note that there is some overlap in ideas with attention.
The two most widely used gated recurrent units:
- Gated Recurrent Unit [Cho et al., EMNLP 2014; Chung, Gulcehre, Cho & Bengio, DLUFL 2014]:
$$h_t = u_t \odot \tilde{h}_t + (1 - u_t) \odot h_{t-1}$$
$$\tilde{h}_t = \tanh(W[x_t] + U(r_t \odot h_{t-1}) + b)$$
$$u_t = \sigma(W_u[x_t] + U_u h_{t-1} + b_u)$$
$$r_t = \sigma(W_r[x_t] + U_r h_{t-1} + b_r)$$
- Long Short-Term Memory [Hochreiter & Schmidhuber, NC 1997; Gers, Thesis 2001]:
$$h_t = o_t \odot \tanh(c_t)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$\tilde{c}_t = \tanh(W_c[x_t] + U_c h_{t-1} + b_c)$$
$$o_t = \sigma(W_o[x_t] + U_o h_{t-1} + b_o)$$
$$i_t = \sigma(W_i[x_t] + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f[x_t] + U_f h_{t-1} + b_f)$$
- The LSTM
The LSTM gates all operations, so information can be forgotten or ignored rather than everything being crammed on top of everything else.
The non-linear update for the next time step is just like in a plain RNN.
LSTM
This part is the secret! (Of other recent things like ResNets too!) Rather than multiplying, we get $c_t$ by adding the non-linear stuff and $c_{t-1}$: there is a direct, linear connection between $c_t$ and $c_{t-1}$ (the memory cell).
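For comparison, here is a minimal NumPy sketch of one LSTM step (shapes and parameter names are illustrative assumptions); note that $c_t$ is produced by an additive update of $c_{t-1}$, the direct linear connection described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the equations above; p maps names to weight arrays."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate memory
    # The key step: the new cell state is an *additive* blend of the old cell state
    # and the candidate, so gradients have a direct linear path from c_t to c_{t-1}.
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)                                           # gated, squashed output
    return h, c

# Toy usage with illustrative sizes.
rng = np.random.default_rng(2)
d_in, d_h = 8, 16
p = {name: rng.normal(scale=0.1, size=(d_h, d_in)) for name in ("Wi", "Wf", "Wo", "Wc")}
p.update({name: rng.normal(scale=0.1, size=(d_h, d_h)) for name in ("Ui", "Uf", "Uo", "Uc")})
p.update({name: np.zeros(d_h) for name in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
```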
1. Use an LSTM or GRU: it makes your life so much simpler!
2. Initialize recurrent matrices to be orthogonal.
3. Initialize other matrices with a sensible (small!) scale.
4. Initialize the forget gate bias to 1: default to remembering.
5. Use adaptive learning rate algorithms: Adam, AdaDelta, …
6. Clip the norm of the gradient: 1–5 seems to be a reasonable threshold when used together with Adam or AdaDelta.
7. Either only dropout vertically or learn how to do it right.
8. Be patient!
[Saxe et al., ICLR 2014; Kingma & Ba, ICLR 2015; Zeiler, arXiv 2012; Pascanu et al., ICML 2013]
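A hedged PyTorch sketch (not the lecture's code; model size, data, and loss are placeholders) of how tips 2-6 look in practice: orthogonal recurrent initialization, small-scale input initialization, a forget-gate bias of 1, Adam, and gradient-norm clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=1)

# Tips 2-3: orthogonal recurrent matrix, sensibly scaled input matrix.
nn.init.orthogonal_(lstm.weight_hh_l0)
nn.init.xavier_uniform_(lstm.weight_ih_l0)

# Tip 4: forget-gate bias = 1 (PyTorch orders the gates as input, forget, cell, output).
H = lstm.hidden_size
with torch.no_grad():
    lstm.bias_ih_l0[H:2 * H].fill_(1.0)   # forget-gate slice
    lstm.bias_hh_l0[H:2 * H].zero_()

# Tip 5: adaptive learning rate (Adam).
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

# One dummy training step with tip 6: clip the gradient norm (threshold in the 1-5 range).
x = torch.randn(10, 4, 64)                # (seq_len, batch, input_size)
out, _ = lstm(x)
loss = out.pow(2).mean()                  # placeholder loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)
optimizer.step()
```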
- Ensembling
- Train 8–10 nets and average their predictions
- It’s easy to do and usually gives good gains!
- Ensemble of Conditional Recurrent LM
- Step-wise ensemble:
$$p(x_t^{\text{ens}} \mid x_{<t}^{\text{ens}}, Y) = \bigoplus_{m=1}^{M} p(x_t^m \mid x_{<t}^m, Y)$$
- Ensemble operator implementations:
  1. Majority voting scheme (OR): $\bigoplus_{m=1}^{M} p^m = \frac{1}{M}\sum_{m=1}^{M} p^m$
  2. Consensus building scheme (AND): $\bigoplus_{m=1}^{M} p^m = \left(\prod_{m=1}^{M} p^m\right)^{1/M}$
- (Bar chart: BLEU of a single (median) model vs. an 8-model OR-ensemble for En-De, En-Cs, En-Ru, and En-Fi.) [Jung, Cho & Bengio, ACL 2016]
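A small NumPy sketch, illustrative rather than the paper's implementation, of the two step-wise ensemble operators applied to the per-model next-word distributions; the AND version is renormalized here so it remains a proper distribution.

```python
import numpy as np

def or_ensemble(probs):
    """Majority-voting (OR): arithmetic mean of the M distributions."""
    return probs.mean(axis=0)

def and_ensemble(probs, eps=1e-12):
    """Consensus (AND): geometric mean of the M distributions, renormalized."""
    geo = np.exp(np.log(probs + eps).mean(axis=0))
    return geo / geo.sum()

# Toy example: M = 3 models, vocabulary of 4 words.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.60, 0.20, 0.10, 0.10],
                  [0.05, 0.80, 0.10, 0.05]])   # one disagreeing model
print(or_ensemble(probs))    # OR is forgiving of the disagreeing model
print(and_ensemble(probs))   # AND punishes words that any model considers unlikely
```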
- Lip Reading Sentences in the Wild. Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Presented by Michael Fang.
- Outline: Model Architecture (Watch, Listen, Attend and Spell), Training Strategies, Dataset, (Professional-Surpassing!) Results.
- (Diagram: the Watch, Listen, Attend and Spell architecture decoding the example "The cat sat …".)
- Training strategy: slowly increase the length of the training sequences; training converges faster and overfitting decreases.
- Training strategy: randomly sample from the previous prediction instead of the ground truth during training; this makes the training scenario more similar to testing (see the sketch below).
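The second strategy is essentially scheduled sampling. The following self-contained sketch is an assumption-laden illustration: `decoder_step` is a random stand-in for the real decoder (it ignores its inputs' content), and `sample_prob` controls how often the model's own previous prediction is fed back in place of the ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 20

def decoder_step(prev_token, state):
    """Stand-in for one decoder step: returns a next-token distribution and a new state."""
    logits = rng.normal(size=VOCAB) + 0.01 * state
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state + 1

def train_decode(ground_truth, sample_prob):
    """Run the decoder over one target sequence, mixing teacher forcing and sampling."""
    state, prev = 0, ground_truth[0]          # assume index 0 is a <start> token
    inputs_used = []
    for t in range(1, len(ground_truth)):
        probs, state = decoder_step(prev, state)
        model_prediction = int(np.argmax(probs))
        # With probability sample_prob, condition on the model's own prediction
        # instead of the ground truth (closer to the test-time setting).
        prev = model_prediction if rng.random() < sample_prob else ground_truth[t]
        inputs_used.append(prev)
    return inputs_used

print(train_decode(ground_truth=[0, 5, 7, 3, 9], sample_prob=0.25))
```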
3. MT Evaluation
- Manual (the best!?):
- SSER (subjective sentence error rate)
- Correct/Incorrect
- Adequacy and Fluency (5 or 7 point scales)
- Error categorization
- Comparative ranking of translations
- Testing in an application that uses MT as one sub-component
- E.g., question answering from foreign language documents
- May not test many aspects of the translation (e.g., cross-lingual IR)
- Automatic metric:
- WER (word error rate) – why problematic?
- BLEU (Bilingual Evaluation Understudy)
- N-gram precision (score is between 0 & 1)
  - What percent of machine n-grams can be found in the reference translation?
  - An n-gram is a sequence of n words.
  - For each n-gram size, you are not allowed to match an identical portion of the reference translation more than once (two MT words "airport" are only correct if the reference has two words "airport"; you can't cheat by typing out "the the the the the").
- Brevity Penalty
– Can’t just type out single word “the” (precision 1.0!)
- It was thought hard to “game” the metric (i.e., to find a way to change MT output so that BLEU goes up, but quality doesn’t)
- BLEU is a weighted geometric mean of n-gram precision (is the translation in the reference?), with a brevity penalty factor added.
- BLEU4 counts n-grams up to length k = 4.
$$p_n = \frac{\#\,\text{matched } n\text{-grams}}{\#\,\text{MT } n\text{-grams}}, \qquad w_n = \text{weight, e.g., } w_1 = 1,\ w_2 = \tfrac{1}{2},\ w_3 = \tfrac{1}{4},\ w_4 = \tfrac{1}{8}$$
$$BP = \exp\!\left(\min\!\left(0,\; 1 - \frac{\mathrm{len}_{\text{ref}}}{\mathrm{len}_{\text{MT}}}\right)\right)$$
$$\log \mathrm{BLEU} = \log p_1 + 0.5\,\log p_2 + 0.25\,\log p_3 + 0.125\,\log p_4 - \max\!\left(\frac{\text{words-in-ref}}{\text{words-in-MT}} - 1,\; 0\right)$$
Note: only works at corpus level (zeroes kill it); there’s a smoothed variant for sentence-level
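A minimal Python sketch of the corpus-level formula above, with a single reference per sentence, clipped n-gram counts, and the slide's weights; it is an illustration and omits the smoothing needed for sentence-level scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(mt_corpus, ref_corpus, weights=(1.0, 0.5, 0.25, 0.125)):
    """Corpus-level BLEU with one reference per MT sentence (tokenized as lists)."""
    log_score = 0.0
    for n, w in enumerate(weights, start=1):
        matched = total = 0
        for hyp, ref in zip(mt_corpus, ref_corpus):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an identical reference n-gram may only be matched once
            # per occurrence, so "the the the the the" cannot inflate precision.
            matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total += sum(hyp_counts.values())
        log_score += w * math.log(matched / total) if matched else float("-inf")
    len_mt = sum(len(h) for h in mt_corpus)
    len_ref = sum(len(r) for r in ref_corpus)
    log_score -= max(len_ref / len_mt - 1.0, 0.0)   # brevity penalty in log space
    return math.exp(log_score)

mt = [["the", "gunman", "was", "shot", "dead", "by", "the", "police", "."]]
ref = [["the", "gunman", "was", "shot", "to", "death", "by", "the", "police", "."]]
print(round(bleu(mt, ref), 3))
```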
BLEU in Action
- Foreign original: 枪手被警方击毙。
- Reference translation: the gunman was shot to death by the police .
- Candidate machine translations:
  1. the gunman was police kill .
  2. wounded police jaya of
  3. the gunman was shot dead by the police .
  4. the gunman arrested by police kill .
  5. the gunmen were killed .
  6. the gunman was shot to death by the police .
  7. gunmen were killed by police
  8. al by the police .
  9. the ringer is killed by the police .
  10. police killed the gunman .
- (In the original slide, green marks a 4-gram match (good!) and red marks a word not matched (bad!).)
Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .
Initial results showed that BLEU predicts human judgments well
- (Chart: BLEU score plotted against human judgments; slide from G. Doddington, NIST.)
Automatic evaluation of MT
- People started optimizing their systems to maximize BLEU score
- BLEU scores improved rapidly
- The correlation between BLEU and human judgments of quality went way, way down
- MT BLEU scores now approach those of human translations but their true quality remains far below human translations
- Coming up with automatic MT evaluations has become its own research field
- There are many proposals: TER, METEOR, MaxSim, SEPIA, our own RTE-MT
- TERpA is a representative good one that handles some word choice variation.
- MT research requires some automatic metric to allow a rapid development and evaluation cycle.
4. The word generation problem: dealing with a large output vocab
- (Diagram: a seq2seq decoder for "I am a student _" → "Je …"; the hidden state is multiplied with the softmax parameters over the full vocabulary |V| to compute P(Je | …).)
- The word generation problem: the softmax computation over a large output vocabulary is expensive.
- Word generation problem: even if vocabs are modest, e.g., 50K, rare words become <unk> on both sides:
  - The ecotax portico in Pont-de-Buis ↔ Le portique écotaxe de Pont-de-Buis
  - With a limited vocabulary: The <unk> portico in <unk> ↔ Le <unk> <unk> de <unk>
- Lots of ideas from the neural LM literature!
- Hierarchical models: tree-structured vocabulary
- [Morin & Bengio, AISTATS’05], [Mnih & Hinton, NIPS’09].
- Complex, sensitive to tree structures.
- Noise-contrastive estimation: binary classification
- [Mnih & Teh, ICML’12], [Vaswani et al., EMNLP’13].
- Different noise samples per training example (we'll mention a simple fix for this!).
- The approach of Jean et al. (ACL 2015):
  - GPU-friendly.
  - Training: use only a subset of the vocabulary at a time.
  - Testing: be smart about the set of possible translations.
- Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. ACL 2015.
- Training: each time, train on a smaller vocab Vʹ ≪ V, so the softmax is computed only over |Vʹ| words.
- Partition the training data into subsets:
  - Each subset has 𝜏 distinct target words, |Vʹ| = 𝜏.
  - Sequentially select examples until |Vʹ| = 5 (toy illustration), e.g.:
    - Vʹ = {she, loves, cats, he, likes}
    - Vʹ = {cats, have, tails, dogs, chase}
    - Vʹ = {she, loves, dogs, cats, hate}
- Practice: |V| = 500K, |Vʹ| = 30K or 50K.
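An illustrative sketch (toy corpus, 𝜏 = 5, target side only) of the sequential partitioning: sentences are added to the current partition until admitting the next one would push the number of distinct target words above 𝜏.

```python
def partition_by_target_vocab(sentences, tau):
    """Sequentially group target sentences so each group uses at most tau distinct words."""
    partitions, current, vocab = [], [], set()
    for sent in sentences:
        words = set(sent.split())
        if current and len(vocab | words) > tau:
            partitions.append((current, vocab))   # close this partition, start a new one
            current, vocab = [], set()
        current.append(sent)
        vocab |= words
    if current:
        partitions.append((current, vocab))
    return partitions

corpus = ["she loves cats", "he likes cats",
          "cats have tails", "dogs chase cats",
          "she loves dogs", "cats hate dogs"]
for sents, vocab in partition_by_target_vocab(corpus, tau=5):
    print(sorted(vocab), "<-", sents)
```

On this toy corpus the three partitions reproduce exactly the three example subsets listed above.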
- Testing: select candidate words.
  - The K most frequent words (unigram prob.).
  - Plus candidate target words: Kʹ choices per source word, e.g., Kʹ = 3.
  - (Diagram: for the source "She loves cats", the frequent words and the per-source-word candidates are combined into the candidate list.)
- Produce translations within the candidate list
- Practice: Kʹ = 10 or 20, K = 15k, 30k, or 50k.
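A hedged sketch of building the test-time candidate list: the K most frequent target words plus the top Kʹ lexicon translations of each source word. The tiny frequency list and bilingual lexicon here are made-up placeholders, not real resources.

```python
def candidate_list(source_tokens, freq_words, lexicon, K, K_prime):
    """Union of the K most frequent target words and K' translations per source word."""
    candidates = set(freq_words[:K])
    for w in source_tokens:
        candidates.update(lexicon.get(w, [])[:K_prime])
    return candidates

# Toy placeholders (illustrative only).
freq_words = ["the", "a", "of", "and", "is", "cat", "love", "she"]
lexicon = {"She": ["elle", "celle", "lui"],
           "loves": ["aime", "adore", "aiment"],
           "cats": ["chats", "chat", "félins"]}
print(candidate_list(["She", "loves", "cats"], freq_words, lexicon, K=5, K_prime=3))
```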
- “BlackOut: Speeding up Recurrent Neural Network Language Models with very Large Vocabularies” – [Ji, Vishwanathan, Satish, Anderson, Dubey, ICLR’16].
- Good survey over many techniques.
- “Simple, Fast Noise Contrastive Estimation for Large RNN Vocabularies” – [Zoph, Vaswani, May, Knight, NAACL’16].
- Use the same samples per minibatch. GPU efficient.
2nd thought on word generation
- Scaling softmax is insufficient:
- New names, new numbers, etc., at test time.
- But previous MT models can copy words.
- Recall the Pointer Sentinel Mixture Models (Merity et al. 2017) that Richard mentioned
- Gulcehre, Ahn, Nallapati, Zhou, Bengio (2016) Pointing the Unknown Words
- Caution from the Google NMT paper: in principle one can train a "copy model", but this approach is both unreliable at scale (the attention mechanism is unstable when the network is deep) and copying may not always be the best strategy for rare words (sometimes transliteration is more appropriate). A sketch of the generate-vs-copy mixture idea follows below.
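To make the pointing/copying idea concrete, here is an illustrative NumPy sketch (in the spirit of the pointer models above, not any paper's exact formulation): the next-word distribution mixes the decoder's vocabulary softmax with a copy distribution given by the attention weights over source positions, weighted by a generation probability `p_gen`.

```python
import numpy as np

def mixture_distribution(vocab_probs, attention, source_ids, p_gen, vocab_size):
    """Mix generation and copying: p(w) = p_gen * p_vocab(w) + (1 - p_gen) * p_copy(w)."""
    p_copy = np.zeros(vocab_size)
    for pos, word_id in enumerate(source_ids):
        p_copy[word_id] += attention[pos]        # attention mass becomes copy probability
    return p_gen * vocab_probs + (1.0 - p_gen) * p_copy

# Toy example: vocabulary of 6 ids, source sentence of 3 tokens (ids 4, 2, 5).
vocab_probs = np.array([0.3, 0.2, 0.1, 0.1, 0.2, 0.1])
attention = np.array([0.7, 0.2, 0.1])            # decoder currently attends to source word 4
print(mixture_distribution(vocab_probs, attention, source_ids=[4, 2, 5],
                           p_gen=0.6, vocab_size=6))
```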
- “Copy” mechanisms are not sufficient.
- Transliteration: Christopher ↦ Kryštof
- Multi-word alignment: Solar system ↦ Sonnensystem
- Need to handle large, open vocabulary
- Rich morphology: nejneobhospodařovávatelnějšímu
(“to the worst farmable one”)
- Informal spelling: goooooood morning !!!!!
- Same seq2seq architecture:
- Use smaller units.
- [Sennrich, Haddow, Birch, ACL’16a], [Chung, Cho, Bengio, ACL’16].
- Hybrid architectures:
- RNN for words + something else for characters.
- [Costa-Jussà & Fonollosa, ACL’16], [Luong & Manning, ACL’16].
- A compression algorithm:
- Most frequent byte pair ↦ a new byte.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
- A word segmentation algorithm (byte pair encoding):
  - Start with a vocabulary of characters.
  - Most frequent n-gram pairs ↦ a new n-gram.
  - Repeat the merge on the dictionary of word frequencies until the desired vocabulary size is reached.
- (Worked example from Sennrich: the dictionary and the subword vocabulary are shown after each successive merge.)
- Automatically decide vocabs for NMT
- https://github.com/rsennrich/nematus
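A compact Python sketch of the BPE merge-learning loop, following the style of the pseudo-code published in Sennrich et al. (2016); the toy dictionary and the number of merges are illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy dictionary: words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(8):                        # number of merges controls the subword vocab size
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(pair, vocab)
    print("merged:", pair)
print(vocab)
```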
- GNMT uses a variant of this, the wordpiece model, which is generally similar but uses a greedy approximation to maximizing language model log likelihood to choose the pieces
Character-based LSTM
- (Diagram: a character-level LSTM composes the characters of a rare word such as "unfortunately" ("u n … l y") into a word representation, which feeds a word-level LSTM over "the bank was closed".)
Hybrid NMT
- A best-of-both-worlds architecture:
- Translate mostly at the word level
- Only go to the character level when needed.
- More than 2 BLEU improvement over a copy mechanism.
Thang Luong and Chris Manning. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ACL 2016.
Hybrid NMT
- Word-level model (4 layers); end-to-end training with 8 stacked LSTM layers.
- (Diagram: the word-level model translates a sentence containing the rare word "cute"; character-level components build its source-side representation and spell out the target word produced for <unk> in "un <unk> chat".)
- 2-stage decoding:
  - Word-level beam search.
  - Char-level beam search for <unk>, initialized with the word-level hidden states.
- Train on WMT'15 data (12M sentence pairs); evaluate on newstest2015.
- (Chart: BLEU comparison against systems trained on 30x the data and against a large-vocab + copy-mechanism baseline.)
Sample English → Czech translations:

| System | Output |
|---|---|
| source | Her 11-year-old daughter , Shani Bart , said it felt a little bit weird |
| human | Její jedenáctiletá dcera Shani Bartová prozradila , že je to trochu zvláštní |
| word (with <unk>) | Její <unk> dcera <unk> <unk> řekla , že je to trochu divné |
| word (after copying) | Její 11-year-old dcera Shani , řekla , že je to trochu divné |
| hybrid (with <unk>) | Její <unk> dcera , <unk> <unk> , řekla , že je to <unk> <unk> |
| hybrid (after char-level generation) | Její jedenáctiletá dcera , Graham Bart , řekla , že cítí trochu divný |

- Word-based: identity copy fails (the English "11-year-old" is copied into the output verbatim).
- Hybrid: correct, 11-year-old ↦ jedenáctiletá.
References
| | Author | Title | Year |
|---|---|---|---|
| 2017 Lecture11FurtherTopicsinNeuralM | Christopher D. Manning, Richard Socher | Lecture 11 - Further Topics in Neural Machine Translation and Recurrent Models | 2017 |