CCNBook/Learning/Backpropagation
Back to Learning Chapter
In this subtopic, we trace the mathematical progression of error-driven learning from a simple two-layer network, which was developed in 1960, to a network with three or more layers, which took 26 years to be invented for the last time (several others invented it earlier, but it didn't really catch on). In the process, we develop a much more rigorous understanding of what error-driven learning is, which can also be applied directly to understanding what the XCAL learning function in its error-driven mode is doing. We start off with a high-level conceptual summary, working backward from XCAL, that should be accessible to those with a basic mathematical background (requiring only basic algebra), and then get progressively more into the math, where we take advantage of concepts from calculus (namely, the notion of a partial derivative).
The highest-level summary is that XCAL provides a very good approximation to an optimal form of error-driven learning, called error backpropagation, which works by directly minimizing a computed error statistic through steepest gradient descent. In other words, backpropagation is mathematically designed to learn whatever you throw at it in the most direct way possible, and XCAL basically does the same thing. If you want to first understand the principled math behind backpropagation, skip down to read the Gradient Descent on Error and the Delta Rule section below, and then return here to see how XCAL approximates this function.
The critical difference is that XCAL uses bidirectional activation dynamics to communicate error signals throughout the network, whereas backpropagation uses a biologically implausible procedure that propagates error signals backward across weight values, in the opposite direction of the way that activation typically flows (hence the name). As discussed in the main chapter, the XCAL network experiences a sequence of activation states, going from an expectation to experiencing a subsequent outcome, and learns on the difference between these two states. In contrast, backpropagation computes a single error delta value that is effectively the difference between the outcome and the expectation, and then sends this single value backwards across the weights. In the following math, we show how these two ways of doing error-driven learning are approximately equivalent.
The primary value of this exercise is to first establish that XCAL can perform a powerful, effective form of error-driven learning, and also to obtain further insights into the essential character of this error-driven learning by understanding how it is derived from first principles. One of the most important intuitive ideas that emerges from this analysis is the notion of credit assignment -- you are encouraged to read up through that section (or just skip ahead to it if you really can't stand the following math, which again only requires basic algebra).
Contents
From XCAL to Backpropagation via GeneRec
To begin, the error-driven aspect of XCAL is effectively:
which reflects a contrast between the average firing rate during the outcome, represented by the first term, and that over the expectation, represented by the second term.
We will see how this XCAL rule is related to the backpropagation error-minimizing rule, but achieves this function in a more biologically constrained way. This was the same goal of previous attempts including the GeneRec (generalized recirculation) algorithm (O'Reilly, 1996), which is equivalent to the Contrastive Hebbian Learning (CHL) equation (Movellan etc):
- .
Here, the first term is the activity of the sending and receiving units during the outcome (in the plus phase), while the second term is the activity during the expectation (in the minus phase). CHL is so-named because it involves the contrast or difference between two Hebbian-like terms. As you can see, XCAL is essentially equivalent to CHL, despite a few differences:
- XCAL actually uses the XCAL dWt function instead of a direct subtraction, which causes weight changes to go to 0 at when short term activity is 0 (as dictated by the biology).
- XCAL is based on average activations across the entire evolution of attractors (reflected by accumulated Ca++ levels), instead of based on single points of activation (i.e., the final attractor state in each of two phases, as used somewhat unrealistically in CHL -- how would the plasticity rules 'know' exactly what counts as the final state of each phase?).
Both of these factors are discussed more in the subtopic on Implementational Details. But for the present purposes, we can safely ignore them, which allows us to leverage all of the analysis that went into understanding GeneRec -- itself a large step towards biological plausibility relative to backpropagation.
The core of this analysis revolves around the following simpler version of the GeneRec equation, which we call the GeneRec delta equation:
where the weight change is driven only by the delta in activity on the receiving unit y between the plus (outcome) and minus (expectation) phases, multiplied by the sending unit activation x. One can derive the full CHL equation from this simpler GeneRec delta equation by adding a constraint that the weight changes computed by the sending unit to the receiving unit be the same as those of the receiving unit to the sending unit (i.e., a symmetry constraint based on bidirectional connectivity), and by replacing the minus phase activation for the sending unit with the average of the minus and plus phase activations (which ends up being equivalent to the midpoint method for integrating a differential equation). You can find the actual mathematics of this derivation later in this subtopic, but you can take our word for it for the time being.
Interestingly, the GeneRec delta equation is equivalent in form to the delta rule, which we derive below as the optimal way to reduce error in a two layer network (input units sending to output units, with no hidden units in between). The delta rule was originally derived in 1960 by Widrow and Hoff, and it is also basically equivalent to a gradient descent solution to linear regression. This is very basic old-school math.
But two-layer networks are very limited in what they can compute. As we discussed in the Networks Chapter, you really need those hidden layers to form higher-level ways of re-categorizing the input, to solve challenging problems (you will also see this directly in the simulation explorations in this chapter). As we discuss more below, the limitations of the delta rule and two-layer networks were highlighted in a very critical paper by Minsky and Papert in 1969, which brought research in the field of neural network models nearly to a standstill for nearly 20 years.
In 1986, David Rumelhart and colleagues published a landmark paper on the backpropagation learning algorithm, which essentially extended the delta rule to networks with three or more layers (Figure 1). These models have no limitations on what they can learn, and they opened up a huge revival in neural network research, with backpropagation neural networks providing practical and theoretically interesting solutions to a very wide range of problems.
The essence of the backpropagation (also called "backprop") algorithm is captured in this delta backpropagation equation:
where x is again the sending activity value, is the error derivative for the units in the next layer above the layer containing the current receiving unit y (with each such unit indexed by the subscript k), and is the weight from the receiving unit y to the k'th such unit in the next layer above (see Figure 1). Ignore the term for the time being -- it is the derivative of the receiving units activation function, and it will come in handy in a bit.
So we're propagating this "delta" (error) value backward across the weights, in the opposite direction that the activation typically flows in the "feedforward" direction, which is from the input to the hidden to the output (backprop networks are typically feedforward, though bidirectional versions have been developed as discussed below). This is the origin of the "backpropagation" name.
Before we unpack this equation a bit more, let's consider what happens at the output layer in a standard three-layer backprop network like that pictured in the Figure. In these networks, there is no outcome/plus phase, but instead we just compare the output activity of units in the output layer (effectively the expectation) and compute externally the difference between these activities and the target activity values t. The difference is the delta value:
and is used to drive learning by changing the weight from sending unit y in the hidden layer to a given output unit z is:
You should recognize that this is exactly the delta rule as described above (where we keep in mind that y is now a sending activation to the output units). The delta rule is really the essence of all error-driven learning methods.
Now let's get back to the delta backpropagation equation, and see how we can get from it to GeneRec (and thus to XCAL). We just need to replace the term with the value for the output units, and then do some basic rearranging of terms, and we get very close to the GeneRec delta equation:
If you compare this last equation with the GeneRec delta equation, they would be equivalent (except for the y' term that we're still ignoring) if we made the following definitions:
Interestingly, these sum terms are identical to the net input that unit y would receive from unit z if the weight went the other way, or, critically, if y also received a symmetric, bidirectional connection from z, in addition to sending activity to z. Thus, we arrive at the critical insight behind the GeneRec algorithm relative to the backpropagation algorithm:
- Symmetric bidirectional connectivity can convey error signals as the difference between two activity states (plus/outcome vs. minus/expectation), instead of sending a single "delta" error value backward down a single weight in the opposite (backpropagation) direction.
The only wrinkle in this argument at this point is that we had to assign the activation states of the receiving unit to be equal to those net-input like terms (even though we use non-linear thresholded activation functions), and also those net input terms ignore the other inputs that the receiving unit should also receive from the sending units in the input layer. The second problem is easily dispensed with, because those inputs from the input layer would be common to both "phases" of activation, and thus they cancel out when we subtract . The first problem can be solved by finally no longer ignoring the y' term -- it turns out that the difference between a function evaluated at two different points can be approximated as the difference between the two points, times the derivative of the function:
So we can now say that the activations states of y are a function of these net input terms:
and thus their difference can be approximated by the difference in net inputs times the activation function derivative:
Which gets us right back to the GeneRec delta equation as being a good approximation to the delta backpropagation equation:
So if you've followed along to this point, you can now rest easy by knowing that the GeneRec (and thus XCAL) learning functions are actually very good approximations to error backpropagation. As we noted at the outset, XCAL uses bidirectional activation dynamics to communicate error signals throughout the network, in terms of averaged activity over two distinct states of activation (expectation followed by outcome), whereas backpropagation uses a biologically implausible procedure that propagates a single error value (outcome - expectation) backward across weight values, in the opposite direction of the way that activation typically flows.
Gradient Descent on Error and the Delta Rule
Now, we'll back up a bit and trace more of a historical trajectory through error-driven learning, starting by deriving the delta rule through the principle of steepest gradient descent on an error function. To really understand the mathematics here, you'll need to understand calculus and the notion of a derivative. Interestingly, we only need the most basic forms of derivatives to do this math -- it really isn't very fancy. The basic strategy is to define an error function which tells you how poorly your network is doing at a task, and then take the negative of the derivative of this error function relative to the synaptic weights in the network, which then tells you how to adjust the synaptic weights so as to minimize error. This is what error-driven learning does, and mathematically, we take the simplest, most direct approach.
A very standard error function, commonly used in statistics, is the sum squared error (SSE):
which is the sum over output units (indexed by k) of the target activation t minus the actual output activation that the network produced (z) , squared. There is typically an extra sum here too, over all the different input/output patterns that the network is being trained on, but it cancels out for all of the following math, so we can safely ignore it.
In the context of the expectation and outcome framework of the main chapter, the outcomes are the targets, and the expectations are the output activity of the network.
For the time being, we assume a linear activation function of activations from sending units y, and that we just have a simple two-layer network with these sending units projecting directly to the output units:
Taking the negative of the derivative of SSE with respect to the weight w, which is easier computed by breaking it down into two parts using the 'chain rule to first get the derivative of SSE with respect to the output activation z, and multiply that by the derivative of z with respect to the weight:
When you break down each step separately, it is all very straightforward:
(the other elements of the sums drop out because the first partial derivative is with respect to z_k so derivative for all other z's is zero, and similarly the second partial derivative is with respect to y_j so the derivative for the other y's is zero.)
Thus, the negative of is and since 2 is a constant, we can just absorb it into the learning rate parameter.
Breaking down the error-minimization in this way, it becomes apparent that the weight change should be adjusted in proportion to both the error (difference between the target and the output) and the extent to which the sending unit y was active. This modulation of weight change by activity of the sending unit achieves a critical credit assignment function (or rather blame assignment in this case), so that when an error is made at the output, weights should only change for the sending units that contributed to that error. Sending units that were not active did not cause the error, and their weights are not adjusted.
As noted above, the original delta rule was published by Widrow and Hoff in 1960, followed in 1969 by the critique by Minsky & Papert, showing that such models could not learn a large class of basic but nonlinear logical functions, for example the XOR function. XOR states that the output should be true (active) if either one of two inputs are true, but not both. This requires a strong form of nonlinearity that simply could not be represented by such models. In retrospect, it should have been obvious that the problem was the use of a two-layer network, but as often happens, this critique left a bad "odor" over the field, and people simply pursued other approaches (mainly symbolic AI, which Minsky was an advocate for).
Error Backpropagation
Then, roughly 26 years later, David Rumelhart and colleagues published a paper on the backpropagation learning algorithm, which extended the delta-rule style error-driven learning to networks with three or more layers. The addition of the extra layer(s) now allows such networks to solve XOR and any other kind of problem (there are proofs about the universality of the learning procedure). The problem is that above, we only considered how to change weights from a sending unit y to an output unit z, based on the error between the target t and actual output activity. But for multiple stages of hidden layers, how do we adjust the weights from the inputs to the hidden units? Interestingly, the mathematics of this involves simply adding a few more steps to the chain rule.
The overall derivation is as follows. The goal is to again minimize the error (SSE) as a function of the weights,
Although this looks like a lot, it is really just applying the same chain rule as above repeatedly. To know how to change the weights from input unit to hidden unit , we have to know how changes in this weight are related to changes in the SSE. This involves computing how the SSE changes with output activity, how output activity changes with its net input, how this net input changes with hidden unit activity , how in turn this activity changes with its net input , and finally, how this net input changes with the weights from sending unit to hidden unit . Once all of these factors are computed, they can be multiplied together to determine how the weight should be adjusted to minimize error, and this can be done for all sending units to all hidden units (and also as derived earlier, for all hidden units to all output units).
We again assume a linear activation function at the output for simplicity, so that . We allow for non-linear activation functions in the hidden units y, and simply refer to the derivative of this activation function as (which for the common sigmoidal activation functions turns out to be but we leave it in generic form here so that it can be applied to any differentiable activation function. The solution to the above equation is then, applying each step in order,
as specified earlier.
You can see that this weight change occurs not only in proportion to the error at the output, and the 'backpropagated' error at the hidden unit activity y, but also to the activity of the sending unit . So, once again, the learning rule assigns credit/blame to change weights based on active input units that contributed to the error, weighted by the degree to the error at the level of hidden unit activities contribute to errors at the output (which is the weight ). At all steps along the process, the appropriate units and weights are factored in to minimize errors. This procedure can be repeated for any arbitrary number of layers, with repeated application of the chain rule.
Biological Implausibility
todo
== Generalized Recirculation and the Contrastive Hebbian Learning (CHL) Function ==