Averaged Perceptron Algorithm
An Averaged Perceptron Algorithm is a Perceptron Algorithm that returns the average of all intermediate parameter vectors produced during training (the "averaged parameters" method of Collins, 2002), rather than only the final parameter vector; a minimal sketch appears after the lists below.
- Example(s):
- Counter-Example(s):
- See: Perceptron Algorithm, Generalized Perceptron Model, Perceptron, Perceptron Training Algorithm.
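To make the definition concrete, the following is a minimal sketch of an averaged perceptron for binary classification. The function name, the NumPy representation, and the {-1, +1} label convention are illustrative assumptions; Collins' own setting, quoted below, is structured tagging.

```python
import numpy as np

def train_averaged_perceptron(X, y, T=10):
    """Train an averaged perceptron for binary classification.

    X: (n, d) array of feature vectors; y: (n,) labels in {-1, +1};
    T: number of passes over the training set.
    Returns the averaged parameter vector gamma.
    """
    n, d = X.shape
    alpha = np.zeros(d)   # current parameters (alpha^{t,i})
    total = np.zeros(d)   # running sum of alpha after each example
    for t in range(T):
        for i in range(n):
            if y[i] * np.dot(alpha, X[i]) <= 0:   # mistake (or on the boundary)
                alpha += y[i] * X[i]              # standard perceptron update
            total += alpha                        # accumulate alpha^{t,i}
    return total / (n * T)                        # gamma = sum / (nT)

# Usage on a toy linearly separable set:
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
gamma = train_averaged_perceptron(X, y)
print(np.sign(X @ gamma))   # -> [ 1.  1. -1. -1.]
```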
References
2002
- (Collins, 2002b) ⇒ Michael Collins. (2002). “Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm.” In: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). doi:10.3115/1118693.1118694
- QUOTE: There is a simple refinement to the algorithm in figure 1, called the “averaged parameters” method. Define $\alpha^{t,i}_s$ to be the value for the $s$'th parameter after the $i$'th training example has been processed in pass $t$ over the training data. Then the “averaged parameters” are defined as $\gamma_s = \sum_{t=1 \cdots T , i=1 \cdots n} \alpha^{t,i}_s / nT$ for all $s = 1 \cdots d$. It is simple to modify the algorithm to store this additional set of parameters. Experiments in section 4 show that the averaged parameters perform significantly better than the final parameters $\alpha^{T,n}_s$. The theory in the next section gives justification for the averaging method.
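In an implementation, $\gamma$ does not require storing every intermediate $\alpha^{t,i}$, and even the per-example summation can be avoided. The following is a sketch of a common bookkeeping trick, under binary-classification assumptions; this particular trick is not from Collins' paper. It pays extra cost only on mistakes:

```python
import numpy as np

def train_averaged_perceptron_lazy(X, y, T=10):
    """Averaged perceptron with the counter trick: the accumulator u is
    touched only on mistakes, instead of summing alpha after every example."""
    n, d = X.shape
    w = np.zeros(d)   # current parameters (alpha in Collins' notation)
    u = np.zeros(d)   # counter-weighted sum of the updates
    c = 1             # example counter across all passes
    for t in range(T):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:
                w += y[i] * X[i]
                u += c * y[i] * X[i]
            c += 1
    # w - u/c equals the averaged parameters up to a positive scale factor,
    # which leaves the sign of every score, and hence every prediction, unchanged.
    return w - u / c
```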
Figure 1 (from the paper): the perceptron training algorithm for tagging.
Inputs: A training set of tagged sentences, $\left(w^i_{[1:n_i]}, t^i_{[1:n_i]}\right)$ for $i = 1 \cdots n$. A parameter $T$ specifying the number of iterations over the training set. A “local representation” $\phi$ which is a function that maps history/tag pairs to $d$-dimensional feature vectors. The global representation $\Phi$ is defined through $\phi$ as in Eq. 1.
Initialization: Set parameter vector $\overline{\alpha} = 0$.
Algorithm: For $t = 1 \cdots T$, $i = 1 \cdots n$: use the Viterbi algorithm to find the output of the model on the $i$'th training sentence with the current parameter settings, $z_{[1:n_i]} = \arg\max_{u_{[1:n_i]}} \Phi\left(w^i_{[1:n_i]}, u_{[1:n_i]}\right) \cdot \overline{\alpha}$; if $z_{[1:n_i]} \neq t^i_{[1:n_i]}$, update $\overline{\alpha} = \overline{\alpha} + \Phi\left(w^i_{[1:n_i]}, t^i_{[1:n_i]}\right) - \Phi\left(w^i_{[1:n_i]}, z_{[1:n_i]}\right)$.
Output: Parameter vector $\overline{\alpha}$.
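The loop in this figure can be sketched as follows, with the averaging from the quote above folded in. Here `phi` and `viterbi_decode` are hypothetical placeholders for the global feature map $\Phi$ and a Viterbi decoder over tag sequences; neither is implemented here.

```python
import numpy as np

def train_structured_perceptron(sentences, tag_seqs, phi, viterbi_decode, d, T=10):
    """Sketch of the training loop in the figure above.

    phi(words, tags) -> d-dimensional global feature vector (Phi in the figure);
    viterbi_decode(words, alpha) -> highest-scoring tag sequence under alpha.
    Both are hypothetical stand-ins that the caller must supply.
    """
    alpha = np.zeros(d)
    total = np.zeros(d)
    n = len(sentences)
    for t in range(T):
        for i in range(n):
            z = viterbi_decode(sentences[i], alpha)   # decode with current alpha
            if list(z) != list(tag_seqs[i]):          # mistake on sentence i
                # Move alpha toward the gold features, away from the predicted ones
                alpha += phi(sentences[i], tag_seqs[i]) - phi(sentences[i], z)
            total += alpha                            # accumulate alpha^{t,i}
    return total / (n * T)                            # averaged parameters gamma
```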