Neural Auto-Encoding Network
A Neural Auto-Encoding Network is an encoding/decoding neural network whose input and output are drawn from the same space.
- Context:
- It can (typically) use a reconstruction loss, such as mean squared error or binary cross-entropy, to measure the quality of the reconstruction.
- It can (often) include a Bottleneck Layer that forces the network to learn compressed representations.
- It can (often) help in anomaly detection by comparing the reconstruction error for normal vs. abnormal inputs (see the sketch that follows this outline).
- ...
- It can be trained by an Auto-Encoder Training System (that implements an auto-encoder training algorithm).
- ...
- Example(s):
- a Stacked Auto-Encoder: An architecture where multiple autoencoders are stacked on top of each other. The encoder output from one autoencoder becomes the input to the next layer, progressively learning more abstract representations.
- a Denoising Auto-Encoder: A network trained to reconstruct inputs from partially corrupted versions of the data. It introduces noise to the input and learns to remove that noise during the reconstruction.
- a Variational Auto-Encoder (VAE): A probabilistic autoencoder that models the data using a latent variable distribution. It uses both an encoder that maps inputs to a latent space and a decoder that samples from the latent space to generate data.
- a Sparse Autoencoder: An autoencoder with a sparsity constraint on the activation of the neurons in the hidden layer, encouraging the model to learn a sparse representation of the input.
- a Contractive Autoencoder (CAE): A variant that adds a regularization term to the loss function to make the encoding less sensitive to small changes in the input data, enforcing robustness in feature extraction.
- a Deep Convolutional Autoencoder: A convolutional neural network-based autoencoder, where the encoder and decoder are built with convolutional and deconvolutional layers, commonly used for image reconstruction tasks.
- …
- Counter-Example(s):
- a Neural seq2seq Model.
- Principal Components Analysis.
- a word2vec Algorithm, which does not auto-encode "context windows" to "context windows".
- See: Variational Auto-Encoder, Stacked Model, Auto-Encoding Algorithm, Feature Learning, Distributed Representation, Dimensionality Reduction, Non-Linear Mapping, Bottleneck Layer.
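The following is a minimal, illustrative sketch (not taken from any of the cited sources) of a bottleneck auto-encoder trained with a mean-squared-error reconstruction loss and then used for reconstruction-error-based anomaly scoring. It assumes PyTorch is available; the layer sizes, the random stand-in data, and the mean-plus-three-standard-deviations threshold are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Toy bottleneck auto-encoder: 20-d input -> 4-d code -> 20-d reconstruction.
# All dimensions and data below are illustrative assumptions.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=20, bottleneck_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, bottleneck_dim))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 16), nn.ReLU(), nn.Linear(16, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction loss

x_train = torch.randn(256, 20)               # stand-in for real "normal" data
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), x_train)  # input and target are the same
    loss.backward()
    optimizer.step()

# Anomaly scoring: inputs with unusually high reconstruction error are flagged.
with torch.no_grad():
    errors = ((model(x_train) - x_train) ** 2).mean(dim=1)
threshold = errors.mean() + 3 * errors.std()  # illustrative cutoff
```

For binary-valued inputs, the MSE loss above would typically be replaced by a binary cross-entropy reconstruction loss (e.g., nn.BCEWithLogitsLoss).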
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Autoencoder Retrieved:2018-11-5.
- QUOTE: An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of data [1]. Some of the most powerful AI in the 2010s have involved sparse autoencoders stacked inside of deep neural networks.
- ↑ Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Autoencoder#Purpose Retrieved:2018-11-5.
- QUOTE: An autoencoder learns to compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction, for example by learning how to ignore noise. Some architectures use stacked sparse autoencoder layers for image recognition. The first autoencoder might learn to encode easy features like corners, the second to analyze the first layer's output and then encode less local features like the tip of a nose, the third might encode a whole nose, etc., until the final autoencoder encodes the whole image into a code that matches (for example) the concept of "cat". An alternative use is as a generative model: for example, if a system is manually fed the codes it has learned for "cat" and "flying", it may attempt to generate an image of a flying cat, even if it has never seen a flying cat before.
2018c
- https://qr.ae/TUhH8k
- QUOTE: An autoencoder (or auto-associator, as it was classically known as) is a special case of an encoder-decoder architecture — first, the target space is the same as the input space (i.e., English inputs to English targets) and second, the target is to be equal to the input. So we would be mapping something like vectors to vectors (note that this could still be a sequence, as they are recurrent autoencoders, but you are now in this case, not predicting the future but simply reconstructing the present given a state/memory and the present). Now, an autoencoder is really meant to do auto-association, so we are essentially trying to build a model to “recall” the input, which allows the autoencoder to do things like pattern completion so if we give our autoencoder a partially corrupted input, it would be able to “retrieve” the correct pattern from memory.
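As a small, hedged illustration of the corruption-and-reconstruction setup described above (and of the Denoising Auto-Encoder listed under Example(s)), the snippet below builds (corrupted input, clean target) training pairs with masking noise; the data, the corruption rate, and the function names are illustrative assumptions, not part of the quoted source.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3):
    """Masking noise: randomly zero a fraction of the input entries."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

x_clean = rng.random((100, 20))   # stand-in data
x_tilde = corrupt(x_clean)        # partially corrupted network input

# A denoising auto-encoder is trained on pairs (input, target) = (x_tilde, x_clean),
# so that its reconstruction of x_tilde "retrieves" the uncorrupted pattern x_clean.
```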
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/autoencoder Retrieved:2015-1-19.
- QUOTE: An autoencoder, autoassociator or Diabolo network is an artificial neural network used for learning efficient codings.
The aim of an auto-encoder is to learn a compressed, distributed representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.
2012
- (Sainath et al., 2012) ⇒ Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. (2012). “Auto-Encoder Bottleneck Features using Deep Belief Networks.” In: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 4153-4156. IEEE.
- ABSTRACT: Neural network (NN) bottleneck (BN) features are typically created by training a NN with a middle bottleneck layer. Recently, an alternative structure was proposed which trains a NN with a constant number of hidden units to predict output targets, and then reduces the dimensionality of these output probabilities through an auto-encoder, to create auto-encoder bottleneck (AE-BN) features. The benefit of placing the BN after the posterior estimation network is that it avoids the loss in frame classification accuracy incurred by networks that place the BN before the softmax. In this work, we investigate the use of pre-training when creating AE-BN features. Our experiments indicate that with the AE-BN architecture, pre-trained and deeper NNs produce better AE-BN features. On a 50-hour English Broadcast News task, the AE-BN features provide over a 1% absolute improvement compared to a state-of-the-art GMM / HMM with a WER of 18.8% and pre-trained NN hybrid system with a WER of 18.4%. In addition, on a larger 430-hour Broadcast News task, AE-BN features provide a 0.5% absolute improvement over a strong GMM / HMM baseline with a WER of 16.0%. Finally, system combination with the GMM/HMM baseline and AE-BN systems provides an additional 0.5% absolute on 430 hours over the AE-BN system alone, yielding a final WER of 15.0%.
2011
- (Glorot et al., 2011a) ⇒ Xavier Glorot, Antoine Bordes, and Yoshua Bengio. (2011). “Domain Adaptation for Large-scale Sentiment Classification: A Deep Learning Approach.” In: Proceedings of the 28th International Conference on Machine Learning (ICML-11).
- QUOTE: The basic framework for our models is the Stacked Denoising Auto-encoder (Vincent et al., 2008). An auto-encoder is comprised of an encoder function [math]\displaystyle{ h(\cdot) }[/math] and a decoder function [math]\displaystyle{ g(\cdot) }[/math], typically with the dimension of [math]\displaystyle{ h(\cdot) }[/math] smaller than that of its argument. The reconstruction of input x is given by r(x) = g(h(x)), and auto-encoders are typically trained to minimize a form of reconstruction error loss(x; r(x)). Examples of reconstruction error include the squared error, or like here, when the elements of x or r(x) can be considered as probabilities of a discrete event, the Kullback-Leibler divergence between elements of x and elements of r(x). When the encoder and decoder are linear and the reconstruction error is quadratic, one recovers in h(x) the space of the principal components (PCA) of x. Once an auto-encoder has been trained, one can stack another auto-encoder on top of it, by training a second one which sees the encoded output of the first one as its training data. Stacked auto-encoders were one of the first methods for building deep architectures (Bengio et al., 2006), along with Restricted Boltzmann Machines (RBMs) (Hinton et al., 2006). Once a stack of auto-encoders or RBMs has been trained, their parameters describe multiple levels of representation for x and can be used to initialize a supervised deep neural network (Bengio, 2009) or directly feed a classifier, as we do in this paper.
An interesting alternative to the ordinary autoencoder is the Denoising Auto-encoder (Vincent et al., 2008) or DAE, in which the input vector x is stochastically corrupted into a vector ~x, and the model is trained to denoise, i.e., to minimize a denoising reconstruction error loss(x; r(~x)). Hence the DAE cannot simply copy its input ~x in its code layer h(~x), even if the dimension of h(~x) is greater than that of ~x. The denoising error can be linked in several ways to the likelihood of a generative model of the distribution of the uncorrupted examples x (Vincent, 2011).
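The notation in the quote can be made concrete with a short sketch; the sigmoid encoder/decoder, the 50-to-10 dimensionality, and the random (untrained) weights below are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder h(.) and decoder g(.), with the code dimension smaller than the input.
W_enc = rng.normal(scale=0.1, size=(50, 10))   # 50-d input -> 10-d code
W_dec = rng.normal(scale=0.1, size=(10, 50))

def h(x):
    return sigmoid(x @ W_enc)

def g(code):
    return sigmoid(code @ W_dec)

def r(x):                                      # reconstruction r(x) = g(h(x))
    return g(h(x))

x = rng.random((5, 50))                        # toy inputs with entries in [0, 1]
x_hat = r(x)

# Two common reconstruction-error losses loss(x; r(x)):
squared_error = np.mean((x - x_hat) ** 2)
eps = 1e-9                                     # numerical guard for the logs
# Cross-entropy form of the element-wise loss; it differs from the
# Kullback-Leibler divergence in the quote only by a term independent of r(x).
cross_entropy = -np.mean(x * np.log(x_hat + eps)
                         + (1 - x) * np.log(1 - x_hat + eps))

# Stacking: a second auto-encoder would be trained on the codes produced here.
codes = h(x)   # training data for the next auto-encoder in the stack
```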
2006
- (Hinton & Salakhutdinov, 2006) ⇒ Geoffrey E. Hinton, and Ruslan R. Salakhutdinov. (2006). “Reducing the Dimensionality of Data with Neural Networks.” In: Science, 313(5786). doi:10.1126/science.1127647
- QUOTE: Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. A simple and widely used method is principal components analysis (PCA), which finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions. We describe a nonlinear generalization of PCA that uses an adaptive, multilayer "encoder" network.
… Starting with random weights in the two networks, they can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is called an "autoencoder" and is depicted in Fig. 1. It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers.
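A minimal numerical sketch of the training procedure described here, assuming a single linear hidden layer, a squared-error discrepancy, and plain gradient descent (the toy data, sizes, and step size are arbitrary); as in the quote, the error derivatives are backpropagated first through the decoder weights and then through the encoder weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear auto-encoder: 20-d data, 5-d code; all sizes are illustrative.
n, d, k = 100, 20, 5
X = rng.random((n, d))
W_enc = rng.normal(scale=0.01, size=(d, k))
W_dec = rng.normal(scale=0.01, size=(k, d))
lr = 0.1

for step in range(500):
    H = X @ W_enc              # encoder output (the code)
    R = H @ W_dec              # decoder output (the reconstruction)
    E = R - X                  # discrepancy between data and reconstruction
    loss = np.mean(E ** 2)     # quantity being minimized

    # Chain rule: error derivatives flow through the decoder first...
    dR = 2.0 * E / E.size
    dW_dec = H.T @ dR
    # ...and then through the encoder.
    dH = dR @ W_dec.T
    dW_enc = X.T @ dH

    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc
```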