|Author||Randall C. O'Reilly|
|First Published||Aug 6 2016|
|Tags||Learning, Pattern Association, Hebbian, Error-Driven|
|Description||Network learns to associate input and output patterns -- Hebbian learning can learn easy mappings, but not more difficult ones -- Error-driven learning is necessary for those.|
|Updated||6 August 2016, 7 August 2016, 6 September 2016, 13 January 2017, 11 January 2018, 6 February 2018|
|Versions||8.0.0, 8.0.2, 8.0.3, 8.0.4, 8.0.5|
|Emergent Versions||8.0.0, 8.0.4, 8.5.0, 8.5.1|
This simulation illustrates how error-driven and Hebbian learning can operate within a simple task-driven learning context, with no hidden layers. The situation is reduced to the simplest case, where a set of 4 input units project to 2 output units. The "task" is specified in terms of the relationships between patterns of activation over the input units, and the corresponding desired or target values of the output units. This type of network is often called a pattern associator because the objective is to associate patterns of activity on the input with those on the output.
You should see the network in the PatAssocNet tab in the far right 3d view frame. Note that there are 2 output units receiving inputs from 4 input units through a set of feedforward weights.
| ⇒ Click the tab at the top of 3D view panel (far right) to view the events in the environment. |
As you can see, the input-output relationships to be learned in this "task" are simply that the leftmost two input units should make the left output unit active, while the rightmost units should make the right output unit active. This can be thought of as categorizing the first two inputs as "left" with the left output unit, and the next two as "right" with the right output unit.
This is a relatively easy task to learn because the left output unit just has to develop strong weights to the leftmost input units and ignore the ones to the right, while the right output unit does the opposite. Note that we are using FFFB inhibition, which tends to result in one active output unit (though not strictly).
The network is trained on this task by simply clamping both the input and output units to their corresponding values from the events in the environment, and performing pure BCM learning.
You should see all 4 events from the environment presented in a random order.
| ⇒ Now press then (Testing Trials) 4 times. |
You will see the activations in the output units are different this time. This is because it was the testing phase, which is run after every epoch of training. During this testing phase, all 4 events are presented to the network, except this time the output units are not clamped to the correct answer, but are instead updated solely according to their current weights from the input units (which are clamped as before). Thus, the testing phase records the current actual performance of the network on this task, when it is not being "coached" (that is why it's a test). This is equivalent to the minus phase activations during training.
| ⇒ Now click on the tab in the far right 3D view panel. |
The results of the test run you just ran are displayed. Each row represents one of the four events, with the input pattern and the actual output activations shown on the right. The sse column reports the summed squared error (SSE), which is simply the summed squared difference between the actual output activation during testing (o_k) and the target value (t_k) that was clamped during training:
$$ SSE = \sum_k (t_k - o_k)^2 $$
where the sum is over the 2 output units. We are actually computing the thresholded SSE, where absolute differences of less than 0.5 are treated as zero, so the unit just has to get the activation on the correct side of 0.5 to get zero error. We thus treat the units as representing underlying binary quantities (i.e., whether the pattern that the unit detects is present or not), with the graded activation value expressing something like the likelihood of the underlying binary hypothesis being true. All of our tasks specify binary input/output patterns.
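The thresholded SSE measure described here can be sketched in a few lines of Python. This is an illustrative sketch, not the simulator's own code: the function name `thresholded_sse` and the `tolerance` parameter are hypothetical names chosen for this example.

```python
def thresholded_sse(outputs, targets, tolerance=0.5):
    """Sum the squared errors over output units, treating any absolute
    difference smaller than `tolerance` as zero error."""
    total = 0.0
    for o_k, t_k in zip(outputs, targets):
        diff = t_k - o_k
        if abs(diff) >= tolerance:  # only sufficiently large errors count
            total += diff * diff
    return total


# Both units are within 0.5 of their targets, so no error is counted.
print(thresholded_sse([0.6, 0.3], [0.95, 0.0]))

# The left unit is more than 0.5 away from its target, so only it contributes.
print(thresholded_sse([0.4, 0.3], [0.95, 0.0]))
```

With a tolerance of 0.5, a unit contributes error only when its activation is far enough from the target to count as the wrong binary answer.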
With only a single training epoch, the output unit is likely making some errors.
| ⇒ Click on the tab in the far right panel if it's not already active and then the master tab in the middle panel. Press the and buttons while you watch the grid in the right frame. |
You will see the grid view update after each epoch of training, showing the pattern of outputs and the individual sse errors.
| ⇒ Next, click on the tab in the far right panel. |
Now you will see a summary plot across epochs of the sum of the thresholded SSE measure across all the events in the epoch. This shows what is often referred to as the learning curve for the network, and it should have decreased steadily down to zero, indicating that the network has learned the task. Training will stop automatically after the network has exhibited 5 correct epochs in a row (just to make sure it has really learned the problem), or it stops after 30 epochs if it fails to learn.
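The stopping rule described above (5 error-free epochs in a row, or a cap of 30 epochs) can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the function name `stopping_epoch` and its parameter names are hypothetical, and the per-epoch SSE values are assumed to be supplied as a plain list.

```python
def stopping_epoch(epoch_sse, n_correct=5, max_epochs=30):
    """Return (epoch, learned): the epoch at which training would stop, and
    whether it stopped because of `n_correct` zero-error epochs in a row."""
    streak = 0
    for epoch, sse in enumerate(epoch_sse[:max_epochs], start=1):
        streak = streak + 1 if sse == 0 else 0  # any error resets the streak
        if streak >= n_correct:
            return epoch, True
    return min(len(epoch_sse), max_epochs), False


# Errors for two epochs, then five clean epochs in a row.
print(stopping_epoch([2, 1, 0, 0, 0, 0, 0, 0]))

# Never reaches zero error, so training stops at the epoch cap.
print(stopping_epoch([1] * 40))
```

Requiring a streak rather than a single clean epoch guards against counting a run as learned when it only passes the test once.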
Let's see what the network has learned.
| ⇒ Click the tab in the (far right) panel to display the network. Press the button in the master 4 times. |
This will step through each of the training patterns -- you should see that it is producing the correct output units for each input pattern. This also updated the , which you can click on to display the entire behavior of the network across all four trials, all at once. You should see that the network has learned this easy task, turning on the left output for the first two patterns, and the right one for the next two. Now, let's take a look at the weights for the output unit to see exactly how this happened.
| ⇒ Click on the tab in the right frame and then on the r.wt button along the top left margin. Now select the left output unit in the network (it should be in the "red arrow" select mode). |
|Question 4.3: Describe the pattern of weights in qualitative terms for each of the two output units (e.g., left output has strong weights from the ?? input units, and weaker weights from the ?? input units).|
|Question 4.4: Why would a Hebbian-style learning mechanism, which increases weights for units that are active together at the same time, produce the pattern of weights you just observed? This should be simple qualitative answer, referring to the specific patterns of activity in the input and output of the EasyEnv patterns.|
The Hard Task
Now, let's try a more difficult task.
| ⇒ Set env_type on the master tab to HARD. Click the tab at the top of the far right panel to view the events in the HARD environment. |
In this harder environment, there is overlap among the input patterns for cases where the left output should be on, and where it should be off (and the right output on). This overlap makes the task hard because the unit has to somehow figure out what the most distinguishing or task relevant input units are, and set its weights accordingly.
This task reveals a problem with Hebbian learning: it is only driven by the correlation between the output and input units, so it cannot learn to be sensitive to which inputs are more task relevant than others (unless this happens to be the same as the input-output correlations, as in the easy task). This hard task has a complicated pattern of overlap among the different input patterns. For the two cases where the left output should be on, the middle two input units are very strongly correlated with the output activity, while the outside two inputs are half-correlated. The two cases where the left output should be off (and the right one on) overlap considerably with those where it should be on, with the last event containing both of the highly correlated inputs. Thus, if the network just pays attention to correlations, it will tend to respond incorrectly to this last case.
Let's see what happens when we run the network on this task.
You should see that the weights into the left output unit increase, often with the two middle ones increasing more strongly due to their higher correlation with the output. The right output tends to have a strong weight from the 2nd input unit, and somewhat weaker weights from the right two inputs, again reflecting the input correlations. Note that in contrast to a purely Hebbian learning mechanism, the BCM learning does not strictly follow the input correlations, as it depends significantly on the output unit activations over time as well, which determine the floating threshold for weight increase vs. decrease.
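To make the idea of a floating threshold concrete, here is a simplified sketch of a BCM-style update using the piecewise-linear XCAL function as it is commonly described for Leabra. Everything here is an assumption for illustration: the constant `THETA_D = 0.1`, the use of a plain running average of the output activation as the threshold, the time constant `tau`, and the function names `xcal` and `bcm_step`. The simulator's actual computation is more involved.

```python
THETA_D = 0.1  # assumed reversal point, as a fraction of the threshold


def xcal(xy, theta):
    """Piecewise-linear XCAL function: positive above the threshold `theta`,
    negative below it, and returning to zero as `xy` goes to zero."""
    if xy > theta * THETA_D:
        return xy - theta
    return -xy * (1.0 - THETA_D) / THETA_D


def bcm_step(w, x, y, avg_l, lrate=0.04, tau=10.0):
    """One hypothetical BCM-style update for a single weight.

    `avg_l` is a long-term running average of the output activation `y`,
    which acts as the floating threshold for increase vs. decrease."""
    new_w = w + lrate * xcal(x * y, avg_l)
    new_avg_l = avg_l + (y - avg_l) / tau  # threshold drifts toward y
    return new_w, new_avg_l


# Output more active than its long-term average: the weight goes up.
print(bcm_step(0.5, 1.0, 0.8, 0.3))

# Output less active than its long-term average: the weight goes down.
print(bcm_step(0.5, 1.0, 0.3, 0.8))
```

The same input-output coactivity can therefore produce either an increase or a decrease in the weight, depending on how active the output unit has been over time.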
| ⇒ Return to viewing the act variable in and then do 4 times. |
| ⇒ Do several more runs on this HARD task. You can try increasing the max_epochs parameter to 50, or even 100, in the master if you wish. |
|Question 4.5: Does the network ever solve the task? Run the network several times, setting the max epochs parameter to 30 (the default value), 50 and 100. Report the final sse at the end of training for each run.|
Hebbian learning does not seem to be able to solve tasks where the correlations do not provide the appropriate weight values. In the broad space of tasks that people learn (e.g., naming objects, reading words, etc.) it seems unlikely that there will always be a coincidence between correlational structure and the task solution. Thus, we must conclude that Hebbian learning by itself is of limited use for task learning. In contrast, we will see in the next section that error-driven learning, which adapts the weights specifically to solve input/output mappings, can handle this HARD task without much difficulty.
Exploration of Error-Driven Task Learning
| ⇒ First, reset the parameters to their default values using the button in the master -- this also resets the env_type back to EASY. |
| ⇒ Select ERR_DRIVEN instead of HEBB for the learn_rule value in the ControlPanel, and then, while watching the Learning Parameters fields, click . |
This will switch weight updating from the purely Hebbian (BCM) form of XCAL learning, to the form that is purely error driven, in terms of the contrast between plus (short term average) and minus (medium term) phases of activation. In this simple two-layer network, this form of learning is effectively equivalent to the Delta rule error-driven learning algorithm. The effects of this switch can be seen in the Learning Parameters group, which shows the learning rate for the weights (lrate, always .04) and for the bias weights (bias_lrate, which is 0 for Hebbian learning because it has no way of training the bias weights, and is equal to lrate for error driven), and the proportion of Hebbian (BCM) learning, which amounts to the proportion of learning driven by the medium-term floating threshold (xcal.m_lrn which is error-driven learning) versus the long-term average (xcal.l_lrn which is Hebbian learning). IMPORTANT: Note that you have to hit the button to actually set these Learning Parameters values according to the learn_rule setting.
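For comparison, a single Delta rule update for a two-layer network can be sketched as below. This is a simplified stand-in, not the simulator's XCAL code: it assumes a logistic (sigmoid) output unit in place of the rate code activation function and FFFB inhibition, and the names `sigmoid` and `delta_rule_step` are hypothetical. The minus phase is the activation the network produces on its own, the plus phase is the clamped target, and each weight changes in proportion to their difference times the input activation.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def delta_rule_step(weights, biases, x, targets, lrate=0.04, bias_lrate=0.04):
    """Apply one Delta rule update in place and return the minus-phase outputs.

    weights[k][i] is the weight from input unit i to output unit k."""
    minus = [sigmoid(sum(w * x_i for w, x_i in zip(weights[k], x)) + biases[k])
             for k in range(len(targets))]
    for k, (t_k, o_k) in enumerate(zip(targets, minus)):
        err = t_k - o_k  # plus phase (target) minus minus phase (actual)
        for i, x_i in enumerate(x):
            weights[k][i] += lrate * err * x_i
        biases[k] += bias_lrate * err
    return minus


weights = [[0.0] * 4 for _ in range(2)]
biases = [0.0, 0.0]
minus = delta_rule_step(weights, biases, [1, 0, 0, 0], [0.95, 0.0])
print(minus, weights, biases)
```

Only the weights from active inputs change, and the change vanishes once the minus phase output matches the target, which is what ties this form of learning directly to task performance.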
Before training the network, we will explore how the minus-plus activation phases work in the simulator.
| ⇒ Make sure that you are monitoring activations in the network by selecting the act button along the top left margin. Also make sure the quarter_update_net_view is checked. Then, hit in the ControlPanel three times to present the first minus phase of training. |
You will see in the network the actual activation produced in response to the input pattern (also known as the expectation or minus phase activation). Each Quarter represents 25 msec of time, and the first 75 msec (3 Quarters) of a 100 msec trial period constitutes the minus phase.
| ⇒ Hit again. |
You will see the target (also known as the outcome or plus phase) activation. Learning occurs after this second, plus phase of activation. You can recognize targets, like all external inputs, because their activations are exactly .95 or 0 -- note that we are clamping activations to .95 (not 1.0) because units cannot easily produce activations above .95 with typical net input values due to the saturating nonlinearity of the rate code activation function. You can also switch to viewing the targ value, which will show you the target inputs prior to the activation clamping. In addition, the minus phase activation is always viewable as act_m and the plus phase as act_p.
| ⇒ Go ahead and the network to complete the training on the EASY task. |
The network has no trouble learning this task, as you can see in the EpochOutputData graph shown in the network window (or by itself in the tab). You can run multiple runs to see how reliably and rapidly it learns this problem. But the real challenge is whether it can learn the HARD task.
You should see that the network learns this task without much difficulty, because error-driven learning is directly a function of how well the network is actually doing, driving the weights specifically to solve the task, instead of doing something else like encoding correlational structure. Now we'll push the limits of even this powerful error-driven learning.
| ⇒ Set env_type to IMPOSSIBLE. Then, click on the tab in the far right panel. |
Notice that each input unit in this environment is active equally often when the output is active as when it is inactive. That is, there is complete overlap among the patterns that activate the different output units. These kinds of problems are called ambiguous cue problems, or nonlinear discrimination problems (Sutherland & Rudy, 1989; O'Reilly & Rudy, 2000). This kind of problem might prove difficult, because every input unit will end up being equivocal about what the output should do. Nevertheless, the input patterns are not all the same -- people could learn to solve this task fairly trivially by just paying attention to the overall patterns of activation. Let's see if the network can do this.
| ⇒ Press and on the ControlPanel. Do it again, and again. Increase the max_epochs to 100 or even higher. |
|Question 4.6: Does the network ever learn to solve this "Impossible" problem? Report the final sse values for your runs.|
Because error-driven learning cannot learn what appears to be a relatively simple task, we conclude that something is missing. Unfortunately, that is not the conclusion that Minsky & Papert (1969) reached in their highly influential book, Perceptrons. Instead, they concluded that neural networks were hopelessly inadequate because they could not solve problems like the one we just explored. This conclusion played a large role in the waning of the early interest in neural network models of the 1960s. As we'll see, all that was required was the addition of a hidden layer interposed between the input and output layers (and the necessary math to make learning work with this hidden layer, which is really just an extension of the chain rule used to derive the delta rule for two layers in the first place).
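The limitation of a network without a hidden layer can be checked with a small Delta rule sketch. The event patterns below are hypothetical and are not copied from the simulator's environments: `EASY_EVENTS` uses one active input per event, and `AMBIGUOUS_EVENTS` is constructed so that every input unit is active once with each output. The sketch also assumes sigmoid output units, binary 1/0 targets, and a fixed presentation order, and it counts an output as wrong when it is on the wrong side of 0.5.

```python
import math

# Hypothetical events: (input pattern, target pattern for [left, right]).
EASY_EVENTS = [([1, 0, 0, 0], [1, 0]), ([0, 1, 0, 0], [1, 0]),
               ([0, 0, 1, 0], [0, 1]), ([0, 0, 0, 1], [0, 1])]
AMBIGUOUS_EVENTS = [([1, 1, 0, 0], [1, 0]), ([0, 0, 1, 1], [1, 0]),
                    ([1, 0, 0, 1], [0, 1]), ([0, 1, 1, 0], [0, 1])]


def forward(weights, biases, x):
    """Sigmoid activation of each output unit given input pattern x."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x_i for w, x_i in zip(row, x)) + b)))
            for row, b in zip(weights, biases)]


def train_and_count_errors(events, epochs=200, lrate=0.04):
    """Train with the Delta rule, then count outputs on the wrong side of 0.5."""
    weights = [[0.0] * 4 for _ in range(2)]
    biases = [0.0, 0.0]
    for _ in range(epochs):
        for x, t in events:
            o = forward(weights, biases, x)
            for k in range(2):
                err = t[k] - o[k]
                biases[k] += lrate * err
                for i in range(4):
                    weights[k][i] += lrate * err * x[i]
    wrong = 0
    for x, t in events:
        o = forward(weights, biases, x)
        wrong += sum(1 for k in range(2) if (o[k] > 0.5) != (t[k] > 0.5))
    return wrong


print("easy task errors:", train_and_count_errors(EASY_EVENTS))
print("ambiguous task errors:", train_and_count_errors(AMBIGUOUS_EVENTS))
```

For the ambiguous patterns, no amount of training helps: the left unit would need w0 + w1 and w2 + w3 both above threshold while w0 + w3 and w1 + w2 are both below it, yet the two pairs have the same total, so no single layer of weights can satisfy all four constraints.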
| ⇒ You may now close the project (use the window manager close button on the project window or menu item) and then open a new one, or just quit emergent entirely by doing menu option or clicking the close button on the root window. |