Stacked Denoising Autoencoding (SdA) Algorithm
A Stacked Denoising Autoencoding (SdA) Algorithm is a feed-forward neural network learning algorithm that produces a stacked denoising autoencoding network (consisting of layers of denoising autoencoders in which the outputs of each layer are wired to the inputs of the successive layer).
- Context:
- It can learn Robust Representations of the input data.
- It can be implemented by a Stacked Denoising Autoencoding System (that solves a stacked denoising autoencoding task to produce a stacked denoising autoencoding network).
- Example(s):
- …
- Counter-Example(s):
- See: Loss Function, Stochastic Mapping, Minimization Algorithm, Cross-Entropy, Restricted Boltzmann Machine, Multilayer Neural Network.
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Autoencoder#Denoising_autoencoder Retrieved:2017-6-5.
- Denoising autoencoders take a partially corrupted input whilst training to recover the original undistorted input. This technique has been introduced with a specific approach to good representation. A good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. This definition contains the following implicit assumptions:
- The higher level representations are relatively stable and robust to the corruption of the input;
- It is necessary to extract features that are useful for representation of the input distribution.
- To train an autoencoder to denoise data, it is necessary to perform a preliminary stochastic mapping [math]\displaystyle{ \mathbf{x}\rightarrow\mathbf{\tilde{x}} }[/math] in order to corrupt the data and use [math]\displaystyle{ \mathbf{\tilde{x}} }[/math] as input for a normal autoencoder, with the only exception being that the loss should still be computed against the initial input, [math]\displaystyle{ \mathcal{L}(\mathbf{x},\mathbf{\tilde{x}'}) }[/math], instead of [math]\displaystyle{ \mathcal{L}(\mathbf{\tilde{x}},\mathbf{\tilde{x}'}) }[/math], where [math]\displaystyle{ \mathbf{\tilde{x}'} }[/math] denotes the reconstruction produced from the corrupted input.
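The difference between scoring a reconstruction against the clean input and against the corrupted one can be illustrated with a minimal NumPy sketch. Everything named here is an assumption for illustration only: masking noise stands in for the stochastic mapping, squared error stands in for the loss, and an identity "network" that copies its input stands in for a trained autoencoder.

```python
import numpy as np

def mask_corrupt(x, drop_prob, rng):
    """Stochastic mapping x -> x_tilde: zero each entry with probability drop_prob.
    Masking noise is an assumed choice; other corruption processes are possible."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def squared_error(target, reconstruction):
    """Mean squared error between a target and a reconstruction."""
    return float(np.mean((target - reconstruction) ** 2))

rng = np.random.default_rng(0)
x = rng.random((4, 6))                 # clean inputs (hypothetical toy data)
x_tilde = mask_corrupt(x, 0.3, rng)    # corrupted inputs

# Stand-in for an autoencoder's output: an identity map that copies x_tilde.
x_tilde_prime = x_tilde.copy()

denoising_loss = squared_error(x, x_tilde_prime)      # loss against the clean input
plain_loss = squared_error(x_tilde, x_tilde_prime)    # loss against the corrupted input

print("loss vs clean input:    ", denoising_loss)
print("loss vs corrupted input:", plain_loss)
```

Under these assumptions, copying the corrupted input is a perfect solution for the second loss but not for the first, which is why the denoising criterion rules out the trivial identity solution.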
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Deep_learning#Stacked Retrieved:2017-6-5.
- The auto encoder idea is motivated by the concept of a good representation. For example, for a classifier, a good representation can be defined as one that will yield a better performing classifier.
An encoder is a deterministic mapping [math]\displaystyle{ f_\theta }[/math] that transforms an input vector x into hidden representation y, where [math]\displaystyle{ \theta = \{\boldsymbol{W}, b\} }[/math] , [math]\displaystyle{ \boldsymbol{W} }[/math] is the weight matrix and b is an offset vector (bias). A decoder maps back the hidden representation y to the reconstructed input z via [math]\displaystyle{ g_\theta }[/math] . The whole process of auto encoding is to compare this reconstructed input to the original and try to minimize this error to make the reconstructed value as close as possible to the original.
In stacked denoising auto encoders, the partially corrupted input is cleaned (de-noised). This idea was introduced in 2010 by Vincent et al. with a specific approach to good representation: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:
- The higher level representations are relatively stable and robust to input corruption;
- It is necessary to extract features that are useful for representation of the input distribution.
- The algorithm consists of multiple steps. It starts with a stochastic mapping of [math]\displaystyle{ \boldsymbol{x} }[/math] to [math]\displaystyle{ \tilde{\boldsymbol{x}} }[/math] through [math]\displaystyle{ q_D(\tilde{\boldsymbol{x}}|\boldsymbol{x}) }[/math]; this is the corrupting step. The corrupted input [math]\displaystyle{ \tilde{\boldsymbol{x}} }[/math] then passes through a basic auto encoder process and is mapped to a hidden representation [math]\displaystyle{ \boldsymbol{y} = f_\theta(\tilde{\boldsymbol{x}}) = s(\boldsymbol{W}\tilde{\boldsymbol{x}}+b) }[/math] . From this hidden representation, we can reconstruct [math]\displaystyle{ \boldsymbol{z} = g_\theta(\boldsymbol{y}) }[/math] . In the last stage, a minimization algorithm is run to bring z as close as possible to the uncorrupted input [math]\displaystyle{ \boldsymbol{x} }[/math] . The reconstruction error [math]\displaystyle{ L_H(\boldsymbol{x},\boldsymbol{z}) }[/math] may be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.
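The steps just described (corrupt, encode, decode, minimize the reconstruction error against the clean input) might be sketched as follows. This is an illustrative sketch, not a reference implementation: the class name `DenoisingAutoencoder`, the use of masking noise for the corruption process, the sigmoid for s, the untied decoder weights, plain full-batch gradient descent, and the toy data are all assumptions. The loss is the cross-entropy option with an affine-sigmoid decoder.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    """Hypothetical single-layer denoising autoencoder (illustration only)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.1, (n_visible, n_hidden))      # encoder weights
        self.b = np.zeros(n_hidden)                               # encoder bias
        self.W_dec = rng.normal(0.0, 0.1, (n_hidden, n_visible))  # decoder weights
        self.b_dec = np.zeros(n_visible)                          # decoder bias

    def corrupt(self, x, drop_prob):
        # Corrupting step: masking noise (an assumed choice of corruption).
        return x * (self.rng.random(x.shape) >= drop_prob)

    def encode(self, x):
        # y = s(W x + b), written row-wise for a batch of inputs.
        return sigmoid(x @ self.W + self.b)

    def decode(self, y):
        # z = g(y), here an affine map followed by a sigmoid.
        return sigmoid(y @ self.W_dec + self.b_dec)

    def loss(self, x, z):
        # Cross-entropy between the clean input x and the reconstruction z.
        eps = 1e-9
        ce = x * np.log(z + eps) + (1.0 - x) * np.log(1.0 - z + eps)
        return float(-np.mean(np.sum(ce, axis=1)))

    def train_step(self, x, drop_prob=0.3, lr=0.5):
        x_tilde = self.corrupt(x, drop_prob)
        y = self.encode(x_tilde)
        z = self.decode(y)
        n = x.shape[0]
        dz = (z - x) / n                               # gradient at decoder pre-activation
        dy = (dz @ self.W_dec.T) * y * (1.0 - y)       # gradient at encoder pre-activation
        self.W_dec -= lr * (y.T @ dz)
        self.b_dec -= lr * dz.sum(axis=0)
        self.W -= lr * (x_tilde.T @ dy)
        self.b -= lr * dy.sum(axis=0)
        return self.loss(x, z)                         # error measured against clean x

# Toy binary data: four prototype patterns, each repeated 16 times.
prototypes = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                       [0, 0, 0, 0, 1, 1, 1, 1],
                       [1, 1, 0, 0, 1, 1, 0, 0],
                       [0, 0, 1, 1, 0, 0, 1, 1]], dtype=float)
x = np.repeat(prototypes, 16, axis=0)

rng = np.random.default_rng(0)
dae = DenoisingAutoencoder(n_visible=8, n_hidden=5, rng=rng)
losses = [dae.train_step(x) for _ in range(500)]
print("first loss:", losses[0], "last loss:", losses[-1])
```

Because a fresh corruption is drawn at every step, individual loss values fluctuate, but on this toy data they should trend downward as training proceeds.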
In order to make a deep architecture, auto encoders are stacked one on top of another.[1] Once the encoding function [math]\displaystyle{ f_\theta }[/math] of the first denoising auto encoder has been learned and used to uncorrupt the input (corrupted input), the second level can be trained.
Once the stacked auto encoder is trained, its output can be used as the input to a supervised learning algorithm such as a support vector machine classifier or multi-class logistic regression.
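Greedy layer-by-layer stacking could look roughly like the sketch below. The helper names (`train_dae_layer`, `pretrain_stack`, `encode_stack`), the tied encoder/decoder weights, the masking noise, the layer sizes, and the random toy data are all hypothetical. The sketch also assumes that corruption is applied only while a layer is being trained, and that each trained encoder is then applied to uncorrupted data to produce the next layer's training input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae_layer(data, n_hidden, rng, drop_prob=0.3, lr=0.5, steps=300):
    """Train one tied-weight denoising autoencoder and return its encoder (W, b).
    Tied weights and masking noise are assumptions made to keep the sketch short."""
    n, n_in = data.shape
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(steps):
        noisy = data * (rng.random(data.shape) >= drop_prob)  # corrupt this layer's input
        y = sigmoid(noisy @ W + b)                            # hidden representation
        z = sigmoid(y @ W.T + c)                              # reconstruction
        dz = (z - data) / n                                   # cross-entropy gradient vs clean data
        dy = (dz @ W) * y * (1.0 - y)
        W -= lr * (noisy.T @ dy + dz.T @ y)                   # encoder and decoder share W
        b -= lr * dy.sum(axis=0)
        c -= lr * dz.sum(axis=0)
    return W, b

def pretrain_stack(x, layer_sizes, rng):
    """Train each layer in turn on the clean output of the layers below it."""
    encoders, layer_input = [], x
    for n_hidden in layer_sizes:
        W, b = train_dae_layer(layer_input, n_hidden, rng)
        encoders.append((W, b))
        layer_input = sigmoid(layer_input @ W + b)  # uncorrupted encoding feeds the next layer
    return encoders

def encode_stack(x, encoders):
    """Run the learned encoders in forward order to obtain top-level features."""
    h = x
    for W, b in encoders:
        h = sigmoid(h @ W + b)
    return h

rng = np.random.default_rng(0)
x = (rng.random((32, 10)) > 0.5).astype(float)   # hypothetical binary toy data
encoders = pretrain_stack(x, layer_sizes=[6, 3], rng=rng)
features = encode_stack(x, encoders)
print(features.shape)
```

The resulting `features` array is what would be handed to a downstream supervised learner; alternatively, the learned `(W, b)` pairs could initialize a deep network that is then fine-tuned.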
- ↑ Dana H. Ballard (1987). Modular learning in neural networks. Proceedings of AAAI, pages 279–284.
2016
- (Wu et al., 2016) ⇒ Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. (2016). “Collaborative Denoising Auto-Encoders for Top-N Recommender Systems.” In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ISBN:978-1-4503-3716-8 doi:10.1145/2835776.2835837
- QUOTE: Most real-world recommender services measure their performance based on the top-N results shown to the end users. Thus, advances in top-N recommendation have far-ranging consequences in practical applications. In this paper, we present a novel method, called Collaborative Denoising Auto-Encoder (CDAE), for top-N recommendation that utilizes the idea of Denoising Auto-Encoders.
2011
- http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
- QUOTE: The greedy layerwise approach for pretraining a deep network works by training each layer in turn. In this page, you will find out how autoencoders can be "stacked" in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network.
A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer is wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with n layers. Using notation from the autoencoder section, let [math]\displaystyle{ W^{(k, 1)}, W^{(k, 2)}, b^{(k, 1)}, b^{(k, 2)} }[/math] denote the parameters [math]\displaystyle{ W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)} }[/math] for kth autoencoder. Then the encoding step for the stacked autoencoder is given by running the encoding step of each layer in forward order:
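The excerpt ends before the equations it introduces. As a hedged reconstruction from the notation above, and assuming that [math]\displaystyle{ a^{(k)} }[/math] denotes the activation entering the kth autoencoder (with [math]\displaystyle{ a^{(1)} = x }[/math]), [math]\displaystyle{ z^{(k+1)} }[/math] its pre-activation output, and f the activation function (symbols not defined in the excerpt), the forward encoding pass would take a form such as:

```latex
\begin{aligned}
z^{(k+1)} &= W^{(k,1)}\, a^{(k)} + b^{(k,1)}, \\
a^{(k+1)} &= f\!\left(z^{(k+1)}\right), \qquad k = 1, \dots, n .
\end{aligned}
```

Under this reading, only the encoder parameters [math]\displaystyle{ W^{(k,1)}, b^{(k,1)} }[/math] of each layer are used in the forward pass, and [math]\displaystyle{ a^{(n+1)} }[/math] is the top-level representation.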
2011
- (Glorot et al., 2011a) ⇒ Xavier Glorot, Antoine Bordes, and Yoshua Bengio. (2011). “Domain Adaptation for Large-scale Sentiment Classification: A Deep Learning Approach.” In: Proceedings of the 28th International Conference on Machine Learning (ICML-11).
- QUOTE: The basic framework for our models is the Stacked Denoising Auto-encoder (Vincent et al., 2008). An auto-encoder is comprised of an encoder function [math]\displaystyle{ h(\cdot) }[/math] and a decoder function [math]\displaystyle{ g(\cdot) }[/math], typically with the dimension of [math]\displaystyle{ h(\cdot) }[/math] smaller than that of its argument. The reconstruction of input x is given by r (x) = g (h (x)), and auto-encoders are typically trained to minimize a form of reconstruction error loss (x; r (x)). Examples of reconstruction error include the squared error, or like here, when the elements of x or r (x) can be considered as probabilities of a discrete event, the Kullback-Liebler divergence between elements of x and elements of r (x). When the encoder and decoder are linear and the reconstruction error is quadratic, one recovers in h (x) the space of the principal components (PCA) of x. Once an auto-encoder has been trained, one can stack another auto-encoder on top of it, by training a second one which sees the encoded output of the first one as its training data. Stacked auto-encoders were one of the first methods for building deep architectures (Bengio et al., 2006), along with Restricted Boltzmann Machines (RBMs) (Hinton et al., 2006). Once a stack of auto-encoders or RBMs has been trained, their parameters describe multiple levels of representation for x and can be used to initialize a supervised deep neural network (Bengio, 2009) or directly feed a classifier, as we do in this paper.
An interesting alternative to the ordinary autoencoder is the Denoising Auto-encoder (Vincent et al., 2008) or DAE, in which the input vector x is stochastically corrupted into a vector ~x, and the model is trained to denoise, i.e., to minimize a denoising reconstruction error loss (x; r (~x)). Hence the DAE cannot simply copy its input ~x in its code layer h (~x), even if the dimension of h (~x) is greater than that of ~x. The denoising error can be linked in several ways to the likelihood of a generative model of the distribution of the uncorrupted examples x (Vincent, 2011).
2008
- (Vincent et al., 2008) ⇒ Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. (2008). “Extracting and Composing Robust Features with Denoising Autoencoders.” In: Proceedings of the 25th International Conference on Machine learning (ICML 2008).