
In Bishop's book Pattern Recognition and Machine Learning, the following paragraph appears on page 239 (I have included the equation he refers to):

In the quadratic approximation to the error function, $$\displaystyle E(\mathbf{w}) \simeq E(\mathbf{\hat w}) + (\mathbf{w} - \mathbf{\hat w})^{\textrm T}\mathbf{b} + \frac 1 2 (\mathbf w - \mathbf{\hat w})^{\textrm T}\mathbf{H}(\mathbf{w} - \mathbf{\hat w}),$$ with $\mathbf{b}$ the gradient and $\mathbf{H}$ the Hessian, the error surface is specified by the quantities $\mathbf{b}$ and $\mathbf{H}$, which contain a total of $W(W + 3)/2$ independent elements (because the matrix $\mathbf{H}$ is symmetric), where $W$ is the dimensionality of $\mathbf{w}$ (i.e., the total number of adaptive parameters in the network). The location of the minimum of this quadratic approximation therefore depends on $O(W^2)$ parameters, and we should not expect to be able to locate the minimum until we have gathered $O(W^2)$ independent pieces of information. If we do not make use of gradient information, we would expect to have to perform $O(W^2)$ function evaluations, each of which would require $O(W)$ steps. Thus, the computational effort needed to find the minimum using such an approach would be $O(W^3)$.

Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of $\nabla E$ brings $W$ items of information, we might hope to find the minimum of the function in $O(W)$ gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only $O(W)$ steps and so the minimum can now be found in $O(W^2)$ steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.
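
For reference, the $W(W+3)/2$ count is just the $W$ entries of $\mathbf{b}$ plus the $W(W+1)/2$ independent entries of the symmetric matrix $\mathbf{H}$:

$$W + \frac{W(W+1)}{2} = \frac{W(W+3)}{2} = O(W^2).$$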

I can't see the intuition he's referring to. What is he referring to as information, and how is he estimating the number of steps to the minimum?

Gilles
asked Sep 2, 2015 at 9:21
  • Double check your equation, in the book you have $\simeq$ instead of equality. Note also the $\widehat{\mathbf{W}}$ notation. Commented Sep 2, 2015 at 11:39

3 Answers


In this section of the book, Bishop argues why we should care about an algorithm that only considers the gradient of the error function, $\nabla E$, as opposed to an algorithm that considers the Hessian matrix of the error function, $\nabla^2 E$.

To do so, he talks in terms of the information provided by each of these two quantities. By information, he is referring to what $\nabla E$ or $\nabla^2 E$ can tell us about the surface of the error function $E$. (The analysis of these two quantities is covered in a prior section.)

Now, in order to fully obtain the information from $\nabla^2 E$, we have to consider a quadratic number of terms, $O(W^2)$; i.e., we need to compute $\frac{\partial^2 E}{\partial w_i\,\partial w_j}$ for all $i,j\in\{1,\ldots,W\}$. In a simple, yet inefficient, program this amounts to a double for-loop that fills in each element of $\nabla^2 E$:

# .. Initialize H as an empty W x W matrix, e.g. H = np.empty((W, W))
# .. Consider a vector w, with W weights
for i in range(W):
    for j in range(W):
        # Compute the second partial of E(w) w.r.t. w_i and w_j and
        # store it in the (i, j)-th entry of H
        # (partial is a placeholder for such a routine)
        H[i, j] = partial(E, w, i, j)

Next, for a given vector $\bf w$, we would like to update $\bf w$ with the information now provided by $\nabla^2 E$. Again, in a computer program, this amounts to updating each entry of the weight vector ${\bf w}$ while considering each entry of $\nabla^2 E$, i.e., three nested for-loops:

# .. Consider H (now filled with second-derivative information)
# .. Consider a vector w, with W weights
for i in range(W):
    for j in range(W):
        for k in range(W):
            update_w(E, w, k, i, j)

Here, update_w updates the $k$-th entry of the vector $\bf w$ based on the information provided by the $(i,j)$-th entry of $\nabla^2 E$.

This last program performs a cubic number of operations, so its complexity is $O(W^3)$.

Note that in order to update $\bf w$ we did not use the information provided by $\nabla E$. If we instead use the information from $\nabla E$ and neglect the information in $\nabla^2 E$, a similar argument (see the sketch below) leads to an algorithm with complexity $O(W^2)$ instead of $O(W^3)$.
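
In the same placeholder style as above, a minimal sketch of that gradient-based counterpart might look as follows (grad_E and update_w_from_grad are hypothetical helpers, not functions from the book):

# .. Consider a vector w, with W weights
for step in range(W):                # O(W) gradient evaluations
    # grad_E is a placeholder: with backpropagation, one call returns all
    # W components of the gradient of E at w in O(W) steps
    g = grad_E(E, w)
    for k in range(W):               # use each of the W components
        update_w_from_grad(w, k, g[k])

Two nested loops, so the total work is $O(W^2)$.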

answered Jul 7, 2019 at 22:15
  • I think Bishop compares algorithms based on $\nabla E$ and $E$, not $\nabla E$ and $\nabla \nabla E$ like you're saying. The key element is that one evaluation of a gradient can be done in $O(W)$ (using backpropagation), as opposed to the $O(W^2)$ we would naively expect from approximating each component $\nabla E_i$ by a finite difference. Evaluating $E$ also takes $O(W)$ (each of the $W$ parameters must be used). Paradoxically, computing $\nabla E$ is as cheap as computing $E$, but yields $W$ "pieces" instead of just 1ドル$! Commented Jun 3, 2021 at 19:53
  • Now, I am not 100% sure, but I suppose that $\nabla \nabla E$ can also be computed using backpropagation in $O(W^2)$, as opposed to the naively expected $O(W^3)$. This is why Bishop is not comparing $\nabla \nabla E$ to $\nabla E$: both can give $O(W^2)$ "pieces of information" in $O(W^2)$ time. Commented Jun 3, 2021 at 20:00
  • Finally, I have serious doubts about your claim that $O(W^3)$ operations are necessary to update a vector $\mathbf w$ based on a matrix $\mathbf H$. For example, a matrix-vector multiplication takes $O(W^2)$; that is, $\mathbf w^\tau := \mathbf w + \mathbf H \mathbf w$ can be computed in $O(W^2)$ without 3 nested for-loops. Commented Jun 3, 2021 at 20:13

The way I understand this passage is as follows.

Assume the error function $E({\bf w})$ can be approximated well by a quadratic function in ${\bf w}$, $${\bf w^Tb} + \frac{1}{2} {\bf w^T H w}.$$ The minimum of such a function is uniquely determined by the coefficients contained in ${\bf b}$ and ${\bf H}$, of which there are $W(W+3)/2 = O(W^2)$.

Now we have two ways of collecting $O(W^2)$ "pieces of information" to retrieve the $O(W^2)$ coefficients of the error function.

We could just do $O(W^2)$ point evaluations of $E({\bf w})$ at different ${\bf w}$. Each point evaluation in turn requires $O(W)$ computations, which can be seen as follows. In both the sum-of-squares and the cross-entropy error functions you need to evaluate, for each of the $N$ terms in the sum, an expression involving $y_n({\bf w},{\bf x})$, and evaluating $y_n({\bf w},{\bf x})$ is $O(W)$. Assuming the data set is fixed, $N$ is just a constant, so evaluating $E({\bf w})$ at a point is $N \cdot O(W) = O(W)$. Altogether, retrieving all the coefficients this way needs $O(W^2) \cdot O(W) = O(W^3)$ computations.
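
As a rough illustration of this accounting (a sketch that assumes $y_n$ is linear in the $W$ parameters; a real network forward pass is likewise $O(W)$ per data point):

import numpy as np

def E(w, X, t):
    # Sum-of-squares error 0.5 * sum_n (y(x_n, w) - t_n)^2 with y(x_n, w) = x_n . w.
    # X has shape (N, W), so one evaluation costs O(N * W) operations,
    # i.e. O(W) for a fixed data set; probing O(W^2) points then costs O(W^3).
    y = X @ w
    return 0.5 * np.sum((y - t) ** 2)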

Another way is to collect $W$ units of information at a time by computing the gradient vector $\nabla E$ (instead of a point evaluation, which gives one number, i.e., one unit of information). So to collect the $O(W^2)$ pieces of information needed to reconstruct the function, we now need only $O(W)$ evaluations of the gradient vector, since $W \cdot O(W) = O(W^2)$. The interesting part is that backpropagation allows the gradient vector to be computed in just $O(W)$ computations (intuitively, one might have thought that if we compute one number in $O(W)$, we would compute a vector of length $W$ in $O(W^2)$ steps; but this is the magic of backpropagation). So now, miraculously, using $O(W)$ computations you obtain a vector $\nabla E$ instead of a single number $E({\bf w})$, and to collect $O(W^2)$ pieces of information you perform $O(W)$ evaluations instead of $O(W^2)$. Each evaluation still costs $O(W)$, so the total cost is now $O(W^2)$ computations.
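
Putting the two tallies side by side (illustrative numbers only; constants and the fixed data-set size $N$ are dropped):

W = 10_000  # number of adaptive parameters (illustrative value)

# Function values only: O(W^2) point evaluations, each costing O(W)
cost_function_values = W**2 * W      # O(W^3)

# Gradients via backpropagation: O(W) gradient evaluations, each costing O(W)
cost_gradients = W * W               # O(W^2)

print(cost_function_values // cost_gradients)  # the ratio grows like W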

answered Dec 30, 2020 at 19:43

The information in the gradient that the author is referring to is basically what is used when applying Newton's method iteratively, which uses the Hessian matrix $\mathbf H$ and produces excellent local convergence (even direct convergence in the case of a quadratic cost function).

The application of Newton's method for training large neural networks is limited by the significant computational burden it imposes. The number of elements in the Hessian is quadratic in the number of parameters, so with $k$ parameters (and even for very small neural networks the number of parameters $k$ can be in the millions), Newton's method would require the inversion of a $k \times k$ matrix, with computational complexity $O(k^3)$. The following is a screenshot from the Deep Learning book by Goodfellow et al.; you can read more about Newton's method there. Link to the chapter: http://www.deeplearningbook.org/contents/optimization.html

[Screenshot of the relevant passage on Newton's method from the Deep Learning book's chapter on optimization.]
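
For concreteness, a single Newton update of the kind described above might be sketched as follows (grad_E and hessian_E are hypothetical callables returning $\nabla E$ and $\mathbf H$ at the current parameters; this is not code from either book):

import numpy as np

def newton_step(w, grad_E, hessian_E):
    # One Newton update: w <- w - H^{-1} g.
    # H has k^2 entries, and solving the k x k linear system costs O(k^3),
    # which is the computational burden mentioned above.
    g = grad_E(w)          # shape (k,)
    H = hessian_E(w)       # shape (k, k)
    return w - np.linalg.solve(H, g)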

answered Apr 3, 2018 at 14:44
  • This doesn't actually appear to answer the question, I'm afraid. The phrase "the information in gradient ... is basically applying Newton's method" doesn't explain what gradient information means, nor do you address the question of how he estimates the number of steps required. Commented Apr 3, 2018 at 16:31
