CCNBook/Learning/Backpropagation

Back to Learning Chapter

In this subtopic, we trace the mathematical progression of error-driven learning from a simple two-layer network, which was developed in 1960, to a network with three or more layers, which took 26 years to be invented for the last time (several others invented it earlier, but it didn't really catch on). In the process, we develop a much more rigorous understanding of what error-driven learning is, which can also be applied directly to understanding what the XCAL learning function in its error-driven mode is doing. We start off with a high-level conceptual summary, working backward from XCAL, that should be accessible to those with a basic mathematical background (requiring only basic algebra), and then get progressively more into the math, where we take advantage of concepts from calculus (namely, the notion of a partial derivative).

The highest-level summary is that XCAL provides a very good approximation to an optimal form of error-driven learning, called error backpropagation, which works by directly minimizing a computed error statistic through steepest gradient descent. In other words, backpropagation is mathematically designed to learn whatever you throw at it in the most direct way possible, and XCAL basically does the same thing. If you want to first understand the principled math behind backpropagation, skip down to read the Gradient Descent on Error and the Delta Rule section below, and then return here to see how XCAL approximates this function.

The critical difference is that XCAL uses bidirectional activation dynamics to communicate error signals throughout the network, whereas backpropagation uses a biologically implausible procedure that propagates error signals backward across weight values, in the opposite direction of the way that activation typically flows (hence the name). As discussed in the main chapter, the XCAL network experiences a sequence of activation states, going from an expectation to experiencing a subsequent outcome, and learns on the difference between these two states. In contrast, backpropagation computes a single error delta value that is effectively the difference between the outcome and the expectation, and then sends this single value backwards across the weights. In the following math, we show how these two ways of doing error-driven learning are approximately equivalent.

The primary value of this exercise is to first establish that XCAL can perform a powerful, effective form of error-driven learning, and also to obtain further insights into the essential character of this error-driven learning by understanding how it is derived from first principles. One of the most important intuitive ideas that emerges from this analysis is the notion of credit assignment -- you are encouraged to read up through that section (or just skip ahead to it if you really can't stand the following math, which again only requires basic algebra).

From XCAL to Backpropagation via GeneRec

To begin, the error-driven aspect of XCAL is effectively:

  •  \Delta w \approx \langle xy \rangle_s - \langle xy \rangle_m

which reflects a contrast between the average firing rate during the outcome, represented by the first term, and that over the expectation, represented by the second term.
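
To make these terms concrete, here is a minimal sketch in Python (with illustrative variable names that are not from any particular codebase) of just this error-driven piece of XCAL, assuming we already have the short-term (outcome) and medium-term (expectation) average activities of the sending and receiving units:

```python
import numpy as np

def xcal_err_dwt(x_s, y_s, x_m, y_m, lrate=0.1):
    """Error-driven piece of XCAL: dw ~ <xy>_s - <xy>_m.

    x_s, y_s: short-term (outcome) average activities of sending and
              receiving units (1-d arrays).
    x_m, y_m: medium-term (expectation) average activities.
    Returns a weight-change matrix (receivers x senders).
    """
    outcome_coprod = np.outer(y_s, x_s)       # <xy>_s per synapse
    expectation_coprod = np.outer(y_m, x_m)   # <xy>_m per synapse
    return lrate * (outcome_coprod - expectation_coprod)
```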

We will see how this XCAL rule is related to the backpropagation error-minimizing rule, while achieving this function in a more biologically constrained way. This was also the goal of earlier attempts, including the GeneRec (generalized recirculation) algorithm (O'Reilly, 1996), which is equivalent to the Contrastive Hebbian Learning (CHL) equation (Movellan, 1990):

  •  \Delta w = \left(x^+ y^+\right) - \left(x^- y^-\right) .

Here, the first term is the activity of the sending and receiving units during the outcome (in the plus phase), while the second term is the activity during the expectation (in the minus phase). CHL is so-named because it involves the contrast or difference between two Hebbian-like terms. As you can see, XCAL is essentially equivalent to CHL, despite a few differences:

  • XCAL actually uses the XCAL dWt function instead of a direct subtraction, which causes weight changes to go to zero when short-term activity is zero (as dictated by the biology).
  • XCAL is based on average activations across the entire evolution of attractors (reflected by accumulated Ca++ levels), rather than on single points of activation (i.e., the final attractor state in each of two phases, as used somewhat unrealistically in CHL -- how would the plasticity rules 'know' exactly what counts as the final state of each phase?).

Both of these factors are discussed more in the subtopic on Implementational Details. But for the present purposes, we can safely ignore them, which allows us to leverage all of the analysis that went into understanding GeneRec -- itself a large step towards biological plausibility relative to backpropagation.

The core of this analysis revolves around the following simpler version of the GeneRec equation, which we call the GeneRec delta equation:

  •  \Delta w = x^- \left(y^+ - y^- \right)

where the weight change is driven only by the delta in activity on the receiving unit y between the plus (outcome) and minus (expectation) phases, multiplied by the sending unit activation x. One can derive the full CHL equation from this simpler GeneRec delta equation by adding a constraint that the weight changes computed by the sending unit to the receiving unit be the same as those of the receiving unit to the sending unit (i.e., a symmetry constraint based on bidirectional connectivity), and by replacing the minus phase activation for the sending unit with the average of the minus and plus phase activations (which ends up being equivalent to the midpoint method for integrating a differential equation). You can find the actual mathematics of this derivation later in this subtopic, but you can take our word for it for the time being.
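
As a concrete (and purely illustrative) sketch, the GeneRec delta equation and the full CHL equation can each be written in a couple of lines for a full weight matrix; here x and y are 1-d arrays of sending and receiving activations in each phase, and the function names are made up for this example:

```python
import numpy as np

def generec_delta(x_minus, y_plus, y_minus, lrate=0.1):
    """GeneRec delta equation: dw = x^- (y^+ - y^-), per synapse."""
    return lrate * np.outer(y_plus - y_minus, x_minus)

def chl(x_plus, x_minus, y_plus, y_minus, lrate=0.1):
    """Full CHL equation: dw = (x^+ y^+) - (x^- y^-), per synapse."""
    return lrate * (np.outer(y_plus, x_plus) - np.outer(y_minus, x_minus))
```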

Interestingly, the GeneRec delta equation is equivalent in form to the delta rule, which we derive below as the optimal way to reduce error in a two layer network (input units sending to output units, with no hidden units in between). The delta rule was originally derived in 1960 by Widrow and Hoff, and it is also basically equivalent to a gradient descent solution to linear regression. This is very basic old-school math.

But two-layer networks are very limited in what they can compute. As we discussed in the Networks Chapter, you really need those hidden layers to form higher-level ways of re-categorizing the input, to solve challenging problems (you will also see this directly in the simulation explorations in this chapter). As we discuss more below, the limitations of the delta rule and two-layer networks were highlighted in a very critical paper by Minsky and Papert in 1969, which brought research in the field of neural network models nearly to a standstill for roughly 20 years.

Figure 1: Illustration of backpropagation computation in three-layer network. First, the feedforward activation pass generates a pattern of activations across the units in the network, cascading from input to hidden to output. Then, "delta" error values are propagated backward across the same weights, in the reverse direction. The delta sum is broken out in the hidden layer to facilitate comparison with the GeneRec algorithm as shown in the next figure.
Figure 2: Illustration of GeneRec/XCAL computation in three-layer network, for comparison with previous figure showing backpropagation. Activations settle in the expectation/minus phase, in response to input activations presented to the input layer. Activation flows bidirectionally, so that the hidden units are driven both by inputs and activations that arise on the output units. In the outcome/plus phase, "target" values drive the output unit activations, and due to the bidirectional connectivity, these also influence the hidden units in the plus phase. Mathematically, changing the weights based on the difference in hidden layer activation states between the plus and minus phases results in a close approximation to the delta value computed by backpropagation. This same rule is then used to change the weights into the hidden units from the input units (delta times sending activation), which is the same form used in backpropagation, and identical in form to the delta rule.

In 1986, David Rumelhart and colleagues published a landmark paper on the backpropagation learning algorithm, which essentially extended the delta rule to networks with three or more layers (Figure 1). In principle, these models have no limitations on what they can learn, and they opened up a huge revival in neural network research, with backpropagation neural networks providing practical and theoretically interesting solutions to a very wide range of problems.

The essence of the backpropagation (also called "backprop") algorithm is captured in this delta backpropagation equation:

  •  \Delta w = x \left( \sum_k \delta_k w_k \right) y'

where x is again the sending activity value, \delta is the error derivative for the units in the next layer above the layer containing the current receiving unit y (with each such unit indexed by the subscript k), and w_k is the weight from the receiving unit y to the k'th such unit in the next layer above (see Figure 1). Ignore the y' term for the time being -- it is the derivative of the receiving unit's activation function, and it will come in handy in a bit.

So we're propagating this "delta" (error) value backward across the weights, in the opposite direction that the activation typically flows in the "feedforward" direction, which is from the input to the hidden to the output (backprop networks are typically feedforward, though bidirectional versions have been developed as discussed below). This is the origin of the "backpropagation" name.

Before we unpack this equation a bit more, let's consider what happens at the output layer in a standard three-layer backprop network like the one pictured in Figure 1. In these networks, there is no outcome/plus phase; instead, the output activity of units in the output layer (effectively the expectation) is compared externally against the target activity values t, and the difference between them is the delta value:

  •  \delta = t - z

This delta is used to drive learning: the change in the weight from sending unit y in the hidden layer to a given output unit z is:

  •  \Delta w = y \delta = y (t - z)

You should recognize that this is exactly the delta rule as described above (where we keep in mind that y is now a sending activation to the output units). The delta rule is really the essence of all error-driven learning methods.
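
As a minimal sketch (again with illustrative names, not from any particular implementation), this output-layer computation is just:

```python
import numpy as np

def output_delta_rule(y, t, z, lrate=0.1):
    """Delta rule at the output layer: delta = t - z, dw = y * delta.

    y: hidden-layer (sending) activations, shape (n_hidden,)
    t: target activations, shape (n_output,)
    z: actual output activations, shape (n_output,)
    Returns a weight-change matrix of shape (n_output, n_hidden).
    """
    delta = t - z
    return lrate * np.outer(delta, y)
```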

Now let's get back to the delta backpropagation equation, and see how we can get from it to GeneRec (and thus to XCAL). We just need to replace the \delta_k term with the value for the output units, and then do some basic rearranging of terms, and we get very close to the GeneRec delta equation:

  •  \Delta w = x \left( \sum_k (t_k - z_k) w_k \right) y'
  •  \Delta w = x \left( \sum_k t_k w_k - \sum_k z_k w_k \right) y'

If you compare this last equation with the GeneRec delta equation, they would be equivalent (except for the y' term that we're still ignoring) if we made the following definitions:

  •  y^+ = \sum_k t_k w_k
  •  y^- = \sum_k z_k w_k
  •  x^- = x

Interestingly, these sum terms are identical to the net input that unit y would receive from the output units z if the weights went the other way, or, critically, if y also received symmetric, bidirectional connections from those z units, in addition to sending activity to them. Thus, we arrive at the critical insight behind the GeneRec algorithm relative to the backpropagation algorithm:

  • Symmetric bidirectional connectivity can convey error signals as the difference between two activity states (plus/outcome vs. minus/expectation), instead of sending a single "delta" error value backward down a single weight in the opposite (backpropagation) direction.

The only wrinkle in this argument at this point is that we had to assign the activation states of the receiving unit to be equal to those net-input-like terms (even though we use non-linear, thresholded activation functions), and also that those net input terms ignore the other inputs that the receiving unit should also receive from the sending units in the input layer. The second problem is easily dispensed with, because those inputs from the input layer are common to both "phases" of activation, and thus they cancel out when we subtract  y^+ - y^- . The first problem can be solved by finally no longer ignoring the y' term -- it turns out that the difference between a function evaluated at two different points can be approximated as the difference between the two points, times the derivative of the function:

  •  f(a) - f(b) \approx f'(a) (a-b)
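
This is just a first-order Taylor approximation. A quick numerical check (a throwaway example, not from the text) shows how it behaves for a sigmoidal function when a and b are reasonably close:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

a, b = 0.7, 0.4
exact = sigmoid(a) - sigmoid(b)
# the derivative of the sigmoid is f(v) * (1 - f(v))
approx = sigmoid(a) * (1.0 - sigmoid(a)) * (a - b)
print(exact, approx)  # prints two values that are close (roughly 0.07 and 0.067)
```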

So we can now say that the activation states of y are a function of these net input terms:

  •  y^+ =  f \left( \sum_k t_k w_k \right)
  •  y^- =  f \left( \sum_k z_k w_k \right)

and thus their difference can be approximated by the difference in net inputs times the activation function derivative:

  •  y^+ - y^- \approx y' \left( \sum_k t_k w_k - \sum_k z_k w_k \right)

This gets us right back to the GeneRec delta equation as a good approximation to the delta backpropagation equation:

  •  \Delta w = x^- \left(y^+ - y^- \right)  \approx x \left( \sum_k \delta_k w_k \right) y'
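
If you want to check this approximation numerically, the following sketch (illustrative only; not the actual textbook software) compares the backpropagation delta for a single hidden unit y against the GeneRec-style difference between its plus- and minus-phase activations, assuming symmetric weights w_k to a layer of linear output units:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_out = 5
w_k = rng.normal(scale=0.5, size=n_out)  # symmetric weights between hidden unit y and outputs
t_k = rng.uniform(size=n_out)            # target (outcome) activations
z_k = rng.uniform(size=n_out)            # actual output (expectation) activations
eta_in = 0.2                             # net input to y from the input layer (same in both phases)

# GeneRec/XCAL style: bidirectional connections let the output state drive y in each phase
y_minus = sigmoid(eta_in + np.dot(z_k, w_k))   # expectation (minus) phase
y_plus = sigmoid(eta_in + np.dot(t_k, w_k))    # outcome (plus) phase
generec_diff = y_plus - y_minus

# Backpropagation style: send the single delta = (t - z) back across the same weights
y_prime = y_minus * (1.0 - y_minus)            # sigmoid derivative at the minus-phase state
backprop_delta = y_prime * np.dot(t_k - z_k, w_k)

print(generec_diff, backprop_delta)  # same sign, and similar magnitude when the difference is small
```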

So if you've followed along to this point, you can now rest easy by knowing that the GeneRec (and thus XCAL) learning functions are actually very good approximations to error backpropagation. As we noted at the outset, XCAL uses bidirectional activation dynamics to communicate error signals throughout the network, in terms of averaged activity over two distinct states of activation (expectation followed by outcome), whereas backpropagation uses a biologically implausible procedure that propagates a single error value (outcome - expectation) backward across weight values, in the opposite direction of the way that activation typically flows.

Gradient Descent on Error and the Delta Rule

Now, we'll back up a bit and trace more of a historical trajectory through error-driven learning, starting by deriving the delta rule through the principle of steepest gradient descent on an error function. To really understand the mathematics here, you'll need to understand calculus and the notion of a derivative. Interestingly, we only need the most basic forms of derivatives to do this math -- it really isn't very fancy. The basic strategy is to define an error function which tells you how poorly your network is doing at a task, and then take the negative of the derivative of this error function relative to the synaptic weights in the network, which then tells you how to adjust the synaptic weights so as to minimize error. This is what error-driven learning does, and mathematically, we take the simplest, most direct approach.

A very standard error function, commonly used in statistics, is the sum squared error (SSE):

  •  SSE = \sum_k \left( t_k - z_k \right)^2

which is the sum over output units (indexed by k) of the target activation t minus the actual output activation that the network produced (z), squared. There is typically an extra sum here too, over all the different input/output patterns that the network is being trained on, but the weight changes from each pattern simply add up, so this outer sum does not change any of the following math and we can safely ignore it.

In the context of the expectation and outcome framework of the main chapter, the outcomes are the targets, and the expectations are the output activity of the network.

For the time being, we assume that the output units have a linear activation function of the activations of the sending units y, and that we just have a simple two-layer network with these sending units projecting directly to the output units:

  •  z_k = \sum_j y_j w_{jk}

We take the negative of the derivative of SSE with respect to the weight w_{jk}, which is easiest to compute by breaking it into two parts using the chain rule: first take the derivative of SSE with respect to the output activation z_k, and then multiply that by the derivative of z_k with respect to the weight:

  •  \Delta w_{jk} =  -\frac{\partial SSE}{\partial w_{jk}} =  -\frac{\partial SSE}{\partial z_k} \frac{\partial z_k}{\partial w_{jk}}   = (t_k - z_k) y_j

When you break down each step separately, it is all very straightforward:

  •  \frac{\partial SSE}{\partial z_k}  = -2 (t_k - z_k)
  •  \frac{\partial z_k}{\partial w_{jk}} = y_j

(the other elements of the sums drop out: the first partial derivative is with respect to z_k, so the derivative for all other z's is zero, and similarly, the second partial derivative is with respect to w_{jk}, so every term in the net input sum other than y_j w_{jk} drops out.)

Thus, the negative of  \frac{\partial SSE}{\partial w_{jk}}  is  2 (t_k - z_k) y_j , and since the 2 is a constant, we can just absorb it into the learning rate parameter.

Breaking down the error-minimization in this way, it becomes apparent that the weight change should be adjusted in proportion to both the error (difference between the target and the output) and the extent to which the sending unit y was active. This modulation of weight change by activity of the sending unit achieves a critical credit assignment function (or rather blame assignment in this case), so that when an error is made at the output, weights should only change for the sending units that contributed to that error. Sending units that were not active did not cause the error, and their weights are not adjusted.
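
To see the delta rule and this credit assignment in action, here is a small self-contained training loop (a sketch with made-up data; all names are illustrative) that performs gradient descent on SSE for a two-layer linear network:

```python
import numpy as np

rng = np.random.default_rng(1)

# a made-up two-layer problem: 4 sending units -> 2 linear output units
Y = rng.uniform(size=(20, 4))        # sending-unit activations y_j, one row per training pattern
W_true = rng.normal(size=(2, 4))     # mapping used only to generate consistent targets
T = Y @ W_true.T                     # target activations t_k for each pattern

W = np.zeros((2, 4))                 # learned weights w_jk (rows = outputs k, cols = senders j)
lrate = 0.1

for epoch in range(200):
    for y, t in zip(Y, T):
        z = W @ y                        # linear outputs: z_k = sum_j y_j w_jk
        delta = t - z                    # error at each output unit
        W += lrate * np.outer(delta, y)  # delta rule: dw_jk = lrate * (t_k - z_k) * y_j

print(np.sum((T - Y @ W.T) ** 2))        # SSE should be near zero after training
```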

As noted above, the original delta rule was published by Widrow and Hoff in 1960, followed in 1969 by the critique by Minsky & Papert, showing that such models could not learn a large class of basic but nonlinear logical functions, for example the XOR function. XOR states that the output should be true (active) if either one of the two inputs is true, but not both. This requires a strong form of nonlinearity that simply could not be represented by such models. In retrospect, it should have been obvious that the problem was the use of a two-layer network, but as often happens, this critique left a bad "odor" over the field, and people simply pursued other approaches (mainly symbolic AI, which Minsky was an advocate for).

Error Backpropagation

Then, roughly 26 years later, David Rumelhart and colleagues published a paper on the backpropagation learning algorithm, which extended the delta-rule style of error-driven learning to networks with three or more layers. The addition of the extra layer(s) now allows such networks to solve XOR and, in principle, any other kind of problem (there are formal proofs that networks with hidden layers can approximate essentially any function). The problem is that above, we only considered how to change the weights from a sending unit y to an output unit z, based on the error between the target t and the actual output activity. For networks with one or more hidden layers, how do we adjust the weights from the inputs to the hidden units? Interestingly, the mathematics of this involves simply adding a few more steps to the chain rule.

The overall derivation is as follows. The goal is to again minimize the error (SSE) as a function of the weights,

  •  \Delta w_{ij} = -\frac{\partial SSE}{\partial w_{ij}} = -\sum_k \frac{\partial SSE}{\partial z_k} \frac{\partial z_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial y_j} \frac{\partial y_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial w_{ij}}

Although this looks like a lot, it is really just applying the same chain rule as above repeatedly. To know how to change the weights from input unit x_i to hidden unit y_j, we have to know how changes in this weight  w_{ij} are related to changes in the SSE. This involves computing how the SSE changes with output activity, how output activity changes with its net input, how this net input changes with hidden unit activity y_j, how in turn this activity changes with its net input \eta_j, and finally, how this net input changes with the weights from sending unit x_i to hidden unit y_j. Once all of these factors are computed, they can be multiplied together to determine how the weight  w_{ij} should be adjusted to minimize error, and this can be done for all sending units to all hidden units (and also as derived earlier, for all hidden units to all output units).

We again assume a linear activation function at the output for simplicity, so that  \frac{\partial z_k}{\partial \eta_k} = 1 . We allow for non-linear activation functions in the hidden units y, and simply refer to the derivative of this activation function as y' (which for the common sigmoidal activation function turns out to be y (1-y)), but we leave it in generic form here so that it can be applied to any differentiable activation function. The solution to the above equation is then, applying each step in order:

  •  -\frac{\partial SSE}{\partial w_{ij}} = \sum_k (t_k - z_k) \cdot 1 \cdot w_{jk} \cdot y' \cdot x_i = x_i \left( \sum_k \delta_k w_{jk} \right) y'

which matches the delta backpropagation equation specified earlier.

You can see that this weight change occurs not only in proportion to the error at the output and the 'backpropagated' error at the hidden unit activity y, but also in proportion to the activity of the sending unit  x_i . So, once again, the learning rule assigns credit/blame by changing weights based on the active input units that contributed to the error, weighted by the degree to which the hidden unit's activity contributed to the errors at the output (which is captured by the weight  w_{jk} ). At every step along the way, the appropriate units and weights are factored in to minimize errors. This procedure can be repeated for any arbitrary number of layers, with repeated application of the chain rule.
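
Putting the whole chain together, here is a compact sketch of a single backpropagation weight update for a three-layer network with sigmoidal hidden units and linear outputs (variable names are illustrative; this is not the textbook's own implementation):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, t, W_ij, W_jk, lrate=0.1):
    """One backpropagation update for input x -> hidden y -> linear output z.

    W_ij: input-to-hidden weights, shape (n_hidden, n_input)
    W_jk: hidden-to-output weights, shape (n_output, n_hidden)
    Returns the weight changes (dW_ij, dW_jk).
    """
    # feedforward activation pass
    eta_j = W_ij @ x             # net inputs to hidden units
    y = sigmoid(eta_j)           # hidden activations
    z = W_jk @ y                 # linear output activations

    # output layer: the delta rule
    delta_k = t - z                          # delta_k = t_k - z_k
    dW_jk = lrate * np.outer(delta_k, y)     # dw_jk = lrate * delta_k * y_j

    # hidden layer: backpropagate deltas across W_jk, times the activation derivative
    y_prime = y * (1.0 - y)                  # sigmoid derivative y' = y (1 - y)
    delta_j = (W_jk.T @ delta_k) * y_prime   # delta_j = (sum_k delta_k w_jk) * y'
    dW_ij = lrate * np.outer(delta_j, x)     # dw_ij = lrate * delta_j * x_i

    return dW_ij, dW_jk
```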

Biological Implausibility

todo

Generalized Recirculation and the Contrastive Hebbian Learning (CHL) Function