Gradient-Descent Optimization Algorithm

A Gradient-Descent Optimization Algorithm is an iterative optimization algorithm that uses a differentiable function (a gradient function).

AKA: Method of Steepest Descent, Inverse Gradient Ascent.
Context:
- It can range from being a Batch Gradient Descent Algorithm to being an Online Gradient Descent Algorithm.
- It can range from being an Exact Gradient Descent Algorithm to being a Stochastic Gradient Descent Algorithm.
- It can be implemented in a Gradient Descent-based Optimization System (that can solve a Gradient Descent-based Optimization Task).
Example(s):
See: Boosting Algorithm, Finite Element Algorithm, Error Surface.

References

2014

http://en.wikipedia.org/wiki/Gradient_descent
- Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
  Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.

2012

http://en.wikipedia.org/wiki/Gradient_descent#Description
- QUOTE: Gradient descent is based on the observation that if the multivariable function [math]\displaystyle{ F(\mathbf{x}) }[/math] is defined and differentiable in a neighborhood of a point [math]\displaystyle{ \mathbf{a} }[/math], then [math]\displaystyle{ F(\mathbf{x}) }[/math] decreases fastest if one goes from [math]\displaystyle{ \mathbf{a} }[/math] in the direction of the negative gradient of [math]\displaystyle{ F }[/math] at [math]\displaystyle{ \mathbf{a} }[/math], [math]\displaystyle{ -\nabla F(\mathbf{a}) }[/math]. It follows that, if :[math]\displaystyle{ \mathbf{b} = \mathbf{a}-\gamma\nabla F(\mathbf{a}) }[/math] for [math]\displaystyle{ \gamma \to 0 }[/math] a small enough number, then [math]\displaystyle{ F(\mathbf{a})\geq F(\mathbf{b}) }[/math]. With this observation in mind, one starts with a guess [math]\displaystyle{ \mathbf{x}_0 }[/math] for a local minimum of [math]\displaystyle{ F }[/math], and considers the sequence [math]\displaystyle{ \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots }[/math] such that :[math]\displaystyle{ \mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n),\ n \ge 0. }[/math] We have :[math]\displaystyle{ F(\mathbf{x}_0)\ge F(\mathbf{x}_1)\ge F(\mathbf{x}_2)\ge \cdots, }[/math] so hopefully the sequence [math]\displaystyle{ (\mathbf{x}_n) }[/math] converges to the desired local minimum. Note that the value of the step size [math]\displaystyle{ \gamma }[/math] is allowed to change at every iteration. With certain assumptions on the function [math]\displaystyle{ F }[/math] (for example, [math]\displaystyle{ F }[/math] convex and [math]\displaystyle{ \nabla F }[/math] Lipschitz) and particular choices of [math]\displaystyle{ \gamma }[/math] (e.g., chosen via a line search that satisfies the Wolfe conditions), convergence to a local minimum can be guaranteed. When the function [math]\displaystyle{ F }[/math] is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
  This process is illustrated in the picture to the right. Here [math]\displaystyle{ F }[/math] is assumed to be defined on the plane, and that its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of [math]\displaystyle{ F }[/math] is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is orthogonal to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function [math]\displaystyle{ F }[/math] is minimal.

1999

(Orr, 1999a) ⇒ Genevieve Orr. (1999). "Linear Neural Networks." In: "CS-449: Neural Networks." Fall 99
- To find the gradient G for the entire data set, we sum at each weight the contribution given by equation 6 over all the data points. We can then subtract a small proportion µ (called the learning rate) of G from the weights to perform gradient descent.
- 1. Initialize all weights to small random values.
- 2. REPEAT until done
  - 1. For each weight wij set
  - 2. For each data point (x, t)p
    - 1. set input units to x
    - 2. compute value of output units
    - 3. For each weight wij set
- 3. For each weight wij set
- An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.

Gradient-Descent Optimization Algorithm

References

2014

2012

1999

Navigation menu

Search