Mixture of Experts (MoE) Model
A Mixture of Experts (MoE) Model is a machine learning model in which multiple experts (learners) divide the problem space into homogeneous regions, with a gating network deciding which expert handles each region.
References
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Mixture_of_experts Retrieved:2022-3-4.
- Mixture of experts (MoE) refers to a machine learning technique where multiple experts (learners) are used to divide the problem space into homogeneous regions. An example from the computer vision domain is combining a neural network model for human detection with another for pose estimation. If the output is conditioned on multiple levels of probabilistic gating functions, the mixture is called a hierarchical mixture of experts.
A gating network decides which expert to use for each input region. Learning thus consists of 1) learning the parameters of individual learners and 2) learning the parameters of the gating network.
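The gating idea described above can be illustrated with a minimal sketch. It assumes linear experts and a softmax gate over linear scores; the names `moe_predict`, `expert_weights`, and `gate_weights` are hypothetical and chosen only for this example. The parameters are set by hand to show inference only, so the two learning steps (fitting the experts and fitting the gate) are not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def moe_predict(x, expert_weights, gate_weights):
    """Return (prediction, gate probabilities) for one input vector x.

    Assumption for illustration: each expert is a linear model w . x,
    and the gating network is a linear layer followed by a softmax.
    """
    gate_probs = softmax([dot(g, x) for g in gate_weights])
    expert_outputs = [dot(w, x) for w in expert_weights]
    prediction = sum(p * y for p, y in zip(gate_probs, expert_outputs))
    return prediction, gate_probs

# Inputs are [x1, bias]. Expert 0 models y = x1, expert 1 models y = -x1.
expert_weights = [[1.0, 0.0], [-1.0, 0.0]]
# Hand-set gate: favours expert 0 when x1 > 0 and expert 1 when x1 < 0.
gate_weights = [[5.0, 0.0], [-5.0, 0.0]]

pred_pos, probs_pos = moe_predict([2.0, 1.0], expert_weights, gate_weights)
pred_neg, probs_neg = moe_predict([-2.0, 1.0], expert_weights, gate_weights)
```

With these hand-set weights the gate splits the input space at x1 = 0, and the combined model approximates |x1| even though each expert alone is linear.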
2021
- (Fedus et al., 2021) ⇒ William Fedus, Barret Zoph, and Noam Shazeer. (2021). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” In: The Journal of Machine Learning Research, 23(1). DOI:10.5555/3586589.3586709.
- QUOTE: ... In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. ...
2018
- (Nguyen and Chamroukhi, 2018) ⇒ Hien D. Nguyen, Faicel Chamroukhi (2018). “Practical and theoretical aspects of mixture‐of‐experts modeling: An overview.” In: Wiley Interdisciplinary Reviews. [doi:xxxxxxx]
- QUOTE: "In Section 2, we discuss the construction of MoE models via the choice of gating and expert functions. In Section 3, we present some of the aforementioned recent theoretical results in a …"
2017
- (Shazeer et al., 2017) ⇒ Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, ... (2017). “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” In: arXiv preprint. [doi:xxxxxxx]
- QUOTE: "We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination …"
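A sparse combination of experts can be sketched as plain top-k gating: keep the k largest gate logits, apply a softmax over only those, and evaluate only the selected experts. This is a simplified illustration and not the paper's full method (it omits any noise term or load balancing). The function names, the toy experts, and the fixed `gate_logits` are hypothetical; in a trained layer the logits would be computed from the input by the gating network.

```python
import math

def top_k_gate(logits, k):
    """Map each of the k largest logits to a softmax weight; drop the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sparse_moe(x, experts, gate_logits, k):
    """Evaluate only the selected experts and return (output, expert indices)."""
    gates = top_k_gate(gate_logits, k)
    # Experts that were not selected are never called.
    output = sum(w * experts[i](x) for i, w in gates.items())
    return output, sorted(gates)

# Toy experts standing in for feed-forward sub-networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # assumed fixed here for illustration

y, used = sparse_moe(3.0, experts, gate_logits, k=2)
y1, used1 = sparse_moe(3.0, experts, gate_logits, k=1)
```

Because only k experts run per input, the cost of a forward pass depends on k rather than on the total number of experts.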
2016
- (Chamroukhi, 2016) ⇒ Faicel Chamroukhi (2016). “Robust mixture of experts modeling using the t distribution.” In: Neural Networks. [doi:xxxxxxx]
- QUOTE: "... Mixture of experts usually uses normal expert, that is, expert components following the Gaussian distribution. Along this paper, we will call it the normal mixture of experts, ... of normal experts may be …"
2014
- (Peralta and Soto, 2014) ⇒ Benjamin Peralta, Alexis Soto (2014). “Embedded local feature selection within mixture of experts.” In: Information Sciences. [doi:xxxxxxx]
- QUOTE: "Accordingly, this work contributes with a regularized variant of mixture of experts that incorporates an embedded process for local feature selection using L1 regularization. Experiments using …"