One-half Squared-Error Cost Function
A One-half Squared-Error Cost Function is a Squared-Error Cost Function that, for a single training example, equals one-half of the squared error and, over a training set, is an average sum-of-squares error term plus a regularization term.
- Example(s):
- For a training set of $m$ examples, this cost function is:
$\displaystyle J(W,b) = \left[ \dfrac{1}{m} \sum_{i=1}^m \left( \dfrac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \dfrac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2$,
- For a single training example $(x,y)$ this cost function is (see the sketch after this list):
$J(W,b; x,y) = \dfrac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.$
- …
- Counter Example(s):
- See: Square Loss Function, Squared Error Function, Cross-Entropy Measure, Mean Absolute Error, Mean Squared Error, Learning Cost Function.
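Below is a minimal sketch (in Python/NumPy, not taken from the referenced source) of the single-example one-half squared-error cost; it assumes the network output $h_{W,b}(x)$ has already been computed and is passed in as `h_x`.

```python
import numpy as np

def half_squared_error(h_x, y):
    """One-half squared-error cost for a single example:
    J(W,b; x, y) = (1/2) * || h_{W,b}(x) - y ||^2."""
    residual = np.asarray(h_x, dtype=float) - np.asarray(y, dtype=float)
    return 0.5 * float(np.dot(residual, residual))

# Example: network output h_{W,b}(x) = [0.8, 0.1], target y = [1.0, 0.0]
print(half_squared_error([0.8, 0.1], [1.0, 0.0]))  # 0.5 * (0.04 + 0.01) = 0.025
```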
References
2014
- (DL, 2014) ⇒ http://deeplearning.stanford.edu/wiki/index.php/Backpropagation_Algorithm
- QUOTE: Suppose we have a fixed training set [math]\displaystyle{ \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} }[/math] of [math]\displaystyle{ m }[/math] training examples. We can train our neural network using batch gradient descent. In detail, for a single training example [math]\displaystyle{ (x,y) }[/math], we define the cost function with respect to that single example to be:
[math]\displaystyle{ \begin{align} J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2. \end{align} }[/math]
This is a (one-half) squared-error cost function. Given a training set of [math]\displaystyle{ m }[/math] examples, we then define the overall cost function to be:
[math]\displaystyle{ \begin{align} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \\ &= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \end{align} }[/math]
The first term in the definition of [math]\displaystyle{ J(W,b) }[/math] is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
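As a hedged illustration of the quoted overall cost $J(W,b)$ (a sketch under stated assumptions, not code from the referenced page), the function below averages the per-example one-half squared errors and adds the weight-decay term over the layer weight matrices $W^{(l)}$ only; the `predict` argument is a hypothetical stand-in for the hypothesis $h_{W,b}$, and the bias terms $b$ are not regularized, matching the quoted definition.

```python
import numpy as np

def overall_cost(predict, layer_weights, xs, ys, lam):
    """Overall cost J(W,b): average one-half squared error over m examples
    plus the weight-decay term (lambda/2) * sum_l sum_ij (W_ji^(l))^2."""
    m = len(xs)
    data_term = sum(0.5 * np.sum((predict(x) - y) ** 2) for x, y in zip(xs, ys)) / m
    decay_term = 0.5 * lam * sum(np.sum(W ** 2) for W in layer_weights)  # biases excluded
    return data_term + decay_term

# Hypothetical one-layer "network" h_{W,b}(x) = W x + b, used only to exercise the cost.
W = np.array([[0.5, -0.2]])
b = np.array([0.1])
predict = lambda x: W @ x + b
xs = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
ys = [np.array([0.3]), np.array([-0.1])]
print(overall_cost(predict, [W], xs, ys, lam=0.01))  # ≈ 0.00395
```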