CECN1 Self Organizing
From Computational Cognitive Neuroscience Wiki
Self-Organizing Learning
- The project file: self_org.proj (click and Save As to download, then open in Emergent)
Project Documentation
(note: this is a literal copy from the simulation documentation -- it contains links that will not work within the wiki)
We will continue with the "lines" theme in this exploration, by exposing a set of hidden units to an environment consisting of horizontal and vertical lines on a 5x5 input "retina."
We focus first on the network. The 5x5 input projects to a hidden layer of 20 units, which are all fully connected to the input with random initial weights.
- As usual, select r.wt and view the weights for these units.
Because viewing the pattern of weights over all the hidden units will be of primary concern as the network learns, we have a special grid view showing on the upper left of the network display, which displays the weights for all hidden units. In addition, there is a graph view in the upper right, which will display key information as the network learns.
Let's see the environment the network will be experiencing.
- Click the .T3Tab.Lines_Input_Data tab in the right 3D view window.
This will bring up a display showing the first 10 training items on the left, which are composed of the elemental horizontal and vertical lines shown in the grid view display on the right. You can use the narrow violet scrollbar for the left grid view to scroll through all 45 of the patterns (you have to click the red arrow first to be able to grab the scrollbar). These 45 input patterns represent all unique pairwise combinations of vertical and horizontal lines. Thus, there are no real correlations between the lines, with the only reliable correlations being between the pixels that make up a particular line. To put this another way, each line can be thought of as appearing in a number of different randomly related contexts (i.e., with other lines).
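As a rough illustration of how such an environment can be constructed, the following Python sketch enumerates the 10 elemental lines on a 5x5 grid and forms every unordered pair of distinct lines. It is not part of the simulation; the function names make_lines and make_patterns are hypothetical, and the grid encoding (nested lists of 0/1) is an assumption chosen for readability.

```python
from itertools import combinations

SIZE = 5  # 5x5 input "retina", as described in the text


def make_lines(size=SIZE):
    """Return the 2*size elemental lines, each as a frozenset of (row, col) pixels."""
    horizontal = [frozenset((r, c) for c in range(size)) for r in range(size)]
    vertical = [frozenset((r, c) for r in range(size)) for c in range(size)]
    return horizontal + vertical


def make_patterns(size=SIZE):
    """Return every unordered pair of distinct lines as a size x size grid of 0/1."""
    patterns = []
    for line_a, line_b in combinations(make_lines(size), 2):
        on_pixels = line_a | line_b
        grid = [[1 if (r, c) in on_pixels else 0 for c in range(size)]
                for r in range(size)]
        patterns.append(grid)
    return patterns


patterns = make_patterns()
print(len(patterns))  # 45
```

In this sketch, pairs of parallel lines turn on 10 pixels, while a horizontal line combined with a vertical line turns on 9, because the two lines share one pixel.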
It should be clear that if we computed the correlations between individual pixels across all of these images, everything would be equally (weakly) correlated with everything else. Thus, learning must be conditional on the particular type of line for any meaningful correlations to be extracted. We will see that this conditionality will simply self-organize through the interactions of the learning rule and the kWTA inhibitory competition. Note also that because two lines are present in every image, the network will require at least two active hidden units per input, assuming each unit is representing a particular line.
- Click back to the .T3Tab.Network display, and return to viewing act in the network window. Then, hit Init (say Yes to initializing the weights) and Step in the control panel, to present a single pattern to the network.
You should see one of the event patterns containing two lines in the input of the network, and a pattern of roughly two active hidden units.
The hidden layer is using the average-based kWTA inhibition function, with the k parameter set to 2 as you can see in the kwta.k parameter in the control panel. This function allows some variability in the overall activation level, depending on the distribution of excitation across units in the hidden layer. Thus, when more than two units are active, these units are being fairly equally activated by the input pattern due to the random initial weights not being very selective. This is an important effect, because these weaker additional activations may enable these units to bootstrap into stronger activations through gradual learning, should they end up being reliably active in conjunction with a particular input feature (i.e., a particular line in this case).
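The flexibility of the average-based function can be sketched in a few lines of Python. This is a simplified illustration, not the simulator's implementation: it assumes the threshold is placed a fraction q of the way between the mean excitation of the top k units and the mean excitation of the remaining units, it operates directly on net inputs rather than on inhibitory conductances, and the value q = 0.6 and both function names are assumptions.

```python
def kwta_avg_threshold(net_inputs, k=2, q=0.6):
    """Threshold between the mean of the top-k inputs and the mean of the rest.

    Simplified sketch: q (assumed 0.6 here) sets how far the threshold sits
    from the mean of the remaining units toward the mean of the top k.
    """
    if len(net_inputs) <= k:
        raise ValueError("need more than k units")
    ranked = sorted(net_inputs, reverse=True)
    top_mean = sum(ranked[:k]) / k
    rest_mean = sum(ranked[k:]) / (len(ranked) - k)
    return rest_mean + q * (top_mean - rest_mean)


def active_units(net_inputs, k=2, q=0.6):
    """Indices of the units whose net input exceeds the kWTA threshold."""
    threshold = kwta_avg_threshold(net_inputs, k, q)
    return [i for i, g in enumerate(net_inputs) if g > threshold]


# Selective weights: two units clearly win.
print(active_units([0.9, 0.8, 0.1, 0.1, 0.1]))  # [0, 1]

# Unselective weights: several units are about equally excited,
# so more than k of them end up above the threshold.
print(active_units([0.52, 0.5, 0.5, 0.48, 0.2, 0.2]))  # [0, 1, 2, 3]
```

The second call shows the case discussed above: when the excitation is spread fairly evenly, the threshold falls below several units at once, so more than k units can be active.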
- You can Step some more. When you tire of single stepping, just press the Run button on the control panel. You may want to turn off the Display in the network (in the network view panel, .PanelTab.Network).
After 30 epochs (passes through all 45 different events in the environment) of learning, the network will stop. You should have noticed that the grid weights view was updated as the training proceeded. This grid view display shows all of the network's weights (figure 4.13 in the textbook shows an example). The larger-scale 5x4 grid is topographically arranged in the same layout as the network. Within each of these 20 grid elements is a smaller 5x5 grid representing the input units, showing the weights for each unit. By clicking on the hidden units in the network window with the r.wt variable selected, you should be able to verify this correspondence.
As training proceeded, the weights came to more and more clearly reflect the lines present in the environment (figure 4.13). Thus, individual units developed selective representations of the correlations present within individual lines, while ignoring the random context of the other lines.
These individual line representations developed as a result of the interaction between learning and inhibitory competition as follows. Early on, the units that won the inhibitory competition were those that happened to have larger random weights for the input pattern. CPCA learning then tuned these weights to be more selective for that input pattern, causing them to be more likely to respond to that pattern and others that overlap with it (i.e., other patterns sharing one of the two lines). To the extent that the weights are stronger for one of the two lines in the input, the unit will be more likely to respond to inputs having this line, and thus the conditional probability for the input units in this line will be stronger than for the other units, and the weights will continue to increase. This is where the contrast enhancement bias plays an important role, because it emphasizes the strongest of the unit's correlations and deemphasizes the weaker ones. This will make it much more likely that the strongest correlation in the environment -- the single lines -- end up getting represented.
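The two ingredients in this account, CPCA learning and contrast enhancement, can be sketched as follows. This is a minimal illustration under stated assumptions: the CPCA update is taken to be lrate * y * (x - w), the contrast enhancement is taken to be a sigmoidal function of the linear weight with gain and offset parameters corresponding to wt_sig.gain and wt_sig.off, and the function names are hypothetical. The simulator's actual implementation may differ in detail.

```python
def cpca_update(w, x, y, lrate=0.01):
    """One CPCA step: move weight w toward input x, gated by receiver activity y."""
    return w + lrate * y * (x - w)


def contrast_enhance(w, gain=6.0, off=1.25):
    """Sigmoidal contrast enhancement of a linear weight in [0, 1] (assumed form)."""
    if w <= 0.0:
        return 0.0
    if w >= 1.0:
        return 1.0
    return 1.0 / (1.0 + (off * (1.0 - w) / w) ** gain)


# Hypothetical case: whenever the hidden unit is active (y = 1),
# a given input pixel is on in 7 out of every 10 events.
events = [1] * 7 + [0] * 3
w = 0.25
for _ in range(300):
    for x in events:
        w = cpca_update(w, x, y=1.0)
print(w)  # approximately 0.69, close to the conditional probability 0.7

# Contrast enhancement exaggerates the difference between
# stronger and weaker linear weights.
print(contrast_enhance(0.7))  # roughly 0.98
print(contrast_enhance(0.4))  # roughly 0.02
```

Under these assumptions the linear weight settles near the conditional probability that the input is on given that the unit is active, and the sigmoid then pushes strong correlations toward 1 and weak ones toward 0.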
You might have noticed in the weights displayed in the grid view during learning that some units initially seemed to be becoming selective for multiple lines, but then as other units were better able to represent one of those lines, they started to lose that competition and fall back to representing only one line. Thus, the dynamics of the inhibitory competition are critical for the self-organizing effect, and it should be clear that a firm inhibitory constraint is important for this kind of learning (otherwise units will just end up being active a lot, and representing a mish-mash of line features). Nevertheless, the average-based kWTA function is sufficiently flexible that it can allow more than two units to become active, so you will probably see that sometimes multiple hidden units end up encoding the same line feature.
The net result of this self-organizing learning is a nice combinatorial distributed representation, where each input pattern is represented as the combination of the two line features present therein. This is the "obvious" way to represent such inputs, but you should appreciate that the network nevertheless had to discover this representation through the somewhat complex self-organizing learning procedure.
- To see this representation in action, turn the network Display back on, and Step through a few more events.
Notice that in general two or more units are strongly activated by each input pattern, with the extra activation reflecting the fact that some lines are coded by multiple units.
Another thing to notice in the weights shown in the grid view (figure 4.13) is that some units are obviously not selective for anything. These "loser" units (also known as "dead" units) were never reliably activated by any input feature, and thus did not experience much learning. It is typically quite important to have such units lying around, because self-organization requires some "elbow room" during learning to sort out the allocation of units to stable correlational features. Having more hidden units also increases the chances of having a large enough range of initial random selectivities to seed the self-organization process. The consequence is that you need to have more units than is minimally necessary, and that you will often end up with leftovers (plus the redundant units mentioned previously).
From a biological perspective, we know that the cortex does not produce new neurons in adults, so we conclude that in general there is probably an excess of neural capacity relative to the demands of any given learning context. Thus, it is useful to have these leftover and redundant units, because they constitute a "reserve" that could presumably get activated if new features were later presented to the network (e.g., diagonal lines). We are much more suspicious of algorithms that require precisely tuned quantities of hidden units to work properly (more on this later).
Unique Pattern Statistic
Although looking at the weights is informative, we could use a more concise measure of how well the network's internal model matches the underlying structure of the environment. One such measure is plotted in the graph view as the network learns.
This log shows the results of the unique pattern statistic (shown as uniq_pats in the graph), which records the number of unique hidden unit activity patterns that were produced as a result of probing the network with all 10 different types of horizontal and vertical lines (presented individually). Thus, there is a separate testing process which, after each epoch of learning, tests the network on all 10 lines, records the resulting hidden unit activity patterns (with the kWTA parameter set to 1, though this is not critical due to the flexibility of the average-based kWTA function), and then counts up the number of unique such patterns (subject to thresholding so we only care about binary patterns of activity).
The logic behind this measure is that if each line is encoded by (at least) one distinct hidden unit, then this will show up as a unique pattern. If, however, there are units that encode two or more lines together (which is not a good model of this environment), then this will not result in a unique representation for these lines, and the resulting measure will be lower. Thus, to the extent that this statistic is less than 10, the internal model produced by the network does not fully capture the underlying independence of each line from the other lines. Note, however, that the unique pattern statistic does not care if multiple hidden units encode the same line (i.e., if there is redundancy across different hidden units) -- it only cares that the same hidden unit not encode two different lines.
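The counting step of this statistic can be sketched in Python as follows. The function name, the threshold value of 0.5, and the example activity values are all assumptions made for illustration; only the logic (binarize each probe's hidden pattern, then count distinct patterns) follows the description above.

```python
def unique_pattern_count(hidden_activities, threshold=0.5):
    """Count distinct binarized hidden patterns across a set of probe inputs.

    hidden_activities: one list of hidden unit activations per probed line.
    threshold: assumed cutoff for treating a unit as "on".
    """
    binary_patterns = {
        tuple(1 if act > threshold else 0 for act in pattern)
        for pattern in hidden_activities
    }
    return len(binary_patterns)


# Hypothetical probe results for 3 lines and 4 hidden units.
# Each line drives a different unit; the third line is coded redundantly.
distinct = [[0.9, 0.0, 0.0, 0.0],
            [0.0, 0.8, 0.0, 0.0],
            [0.0, 0.0, 0.9, 0.7]]
# One unit responds to both of the first two lines.
conjunctive = [[0.9, 0.0, 0.0, 0.0],
               [0.9, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.9, 0.0]]

print(unique_pattern_count(distinct))     # 3
print(unique_pattern_count(conjunctive))  # 2
```

The first example scores perfectly despite the redundant coding of the third line, while the second loses a point because a single unit encodes two different lines.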
You should have seen on this run that the model produced a perfect internal model according to this statistic, which accords well with our analysis of the weight patterns. To get a better sense of how well the network learns in general, you can run a batch of 8 training runs starting with a different set of random initial weights each time.
- Press Batch Init and Batch Run in the control panel to run. You probably want to turn off the network Display again. When it is done, you can click on the .T3Tab.BatchOutputData tab to view the results.
After the 8 training runs, the batch view shows summary statistics about the average (mean), maximum, and minimum of the unique pattern statistic at the end of each network training run. The last column contains a count of the number of times that a "perfect 10" on the unique pattern statistic was recorded. You should get a perfect score for all 8 runs.
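For readers who want to reproduce this kind of summary outside the simulator, a minimal Python sketch is given below. The function name and dictionary keys are hypothetical, and the list of final values is made up purely to exercise the code; it is not output from the model.

```python
from statistics import mean


def summarize_batch(final_uniq_pats, perfect=10):
    """Summarize the final unique pattern statistic across training runs."""
    return {
        "mean": mean(final_uniq_pats),
        "max": max(final_uniq_pats),
        "min": min(final_uniq_pats),
        "n_perfect": sum(1 for value in final_uniq_pats if value == perfect),
    }


# Made-up values for 8 runs, for illustration only.
example_runs = [10, 10, 9, 10, 10, 8, 10, 10]
summary = summarize_batch(example_runs)
print(summary)
```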
Parameter Manipulations
Now, let's explore the effects of some of the parameters in the control panel. First, let's manipulate the wt_sig.gain parameter, which should affect the contrast (and therefore selectivity) of the unit's weights.
- Set wt_sig.gain to 1 instead of 6, press Apply, then Batch Init and Batch Run the network.
Question 4.6 (a) What statistics for the number of uniquely represented lines did you obtain? (b) In what ways were the final weight patterns shown in the weight grid log different from the default case? (c) Explain how these two findings of hidden unit activity and weight patterns are related, making specific reference to the role of selectivity in self-organizing learning.
- Set wt_sig.gain back to 6, change wt_sig.off from 1.25 to 1, Apply, and run a Batch. To make the effects of this parameter more dramatic, lower wt_sig.off to .75 and Batch again.
Question 4.7 (a) What statistics did you obtain for these two cases (1 and .75)? (b) Was there a noticeable change in the weight patterns compared to the default case? (Hint: Consider what the unique pattern statistic is looking for). (c) Explain these results in terms of the effects of wt_sig.off as adjusting the threshold for where correlations are enhanced or decreased as a function of the wt_sig.gain contrast enhancement mechanism. (d) Again, explain why this is important for self-organizing learning.
- Set wt_sig.off back to 1.25 (or hit Defaults).
Now, let's consider the savg_cor.cor parameter, which controls the amount of renormalization of the weight values based on the expected activity level in the sending layer. A value of 1 in this parameter will make the weights increase more rapidly, as they are driven to a larger maximum value (equation 4.18). A value of 0 will result in smaller weight increases. As described before, smaller values of savg_cor.cor are appropriate when we want the units to have more selective representations, while larger values are more appropriate for more general or categorical representations. Thus, using a smaller value of this parameter should help to prevent units from developing less selective representations of multiple lines. This is why we have a default value of .5 for this parameter.
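The effect of savg_cor.cor on the maximum weight value can be sketched as follows. This assumes the renormalization takes the form alpha_m = 0.5 - cor * (0.5 - savg) with maximum weight m = 0.5 / alpha_m, where savg is the expected sending layer activity, and that the learning rule drives weights toward m when the input is active. The function names and the example value savg = 0.4 (roughly 10 of 25 input pixels active) are assumptions.

```python
def max_weight(savg, cor=0.5):
    """Maximum weight value under the assumed renormalization.

    savg: expected activity level of the sending layer (0 < savg <= 0.5 here).
    cor:  degree of correction, corresponding to savg_cor.cor.
    """
    alpha_m = 0.5 - cor * (0.5 - savg)
    return 0.5 / alpha_m


def cpca_update_renorm(w, x, y, savg, cor=0.5, lrate=0.01):
    """CPCA step in which active inputs drive w toward max_weight instead of 1."""
    m = max_weight(savg, cor)
    return w + lrate * (y * x * (m - w) + y * (1.0 - x) * (0.0 - w))


for cor in (0.0, 0.5, 1.0):
    print(cor, max_weight(0.4, cor))
```

With these assumptions, cor = 0 leaves the maximum at 1, while cor = 1 raises it to 0.5 / savg, so the same active input produces a larger weight increase.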
- Switch to using a savg_cor.cor value of 1, and then Batch Run the network.
You should observe results very similar to those when you decreased wt_sig.off -- both of these manipulations reduce the level of correlation that is necessary to produce strong weights.
- Set savg_cor.cor back to .5, and then set rnd.mean to .5.
This sets the initial random weight values to have a mean value of .5 instead of .25.
- Batch run the network, and pay particular attention to the weights.
You should see that this ended up eliminating the loser units, so that every unit now codes for a line. This result illustrates one of the interesting details about self-organizing learning. In general, the CPCA rule causes weights to increase for those input units that are active, and decrease for those that are not. However, this qualitative pattern is modulated by the soft weight bounding property discussed earlier -- larger weights increase less rapidly and decrease more rapidly, and vice versa for smaller weights.
When we start off with larger weight values, the amount of weight decrease will be large relative to the amount of increase. Thus, hidden units that were active for a given pattern will subsequently receive less net input for a similar but not identical pattern (i.e., a pattern containing 1 of the 2 lines in common with the previous pattern), because the weights will have decreased substantially to those units that were off in the original pattern but on in the subsequent one. This decreased probability of reactivation means that other, previously inactive units will be more likely to be activated, with the result that all of the units end up participating. This can sometimes be a useful effect if the network is not drawing in sufficient numbers of units, and just a few units are "hogging" all of the input patterns.
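The soft weight bounding argument can be checked with a small calculation. The sketch below assumes the basic CPCA update lrate * y * (x - w) for an active receiving unit and compares the size of the weight decrease (input off) with the size of the increase (input on) at the two initial weight means discussed here. The function names are hypothetical.

```python
def cpca_delta(w, x, y=1.0, lrate=0.01):
    """Weight change for one CPCA step (assumed form: lrate * y * (x - w))."""
    return lrate * y * (x - w)


def decrease_to_increase_ratio(w):
    """Size of the decrease for an inactive input relative to the increase for an active one."""
    increase = cpca_delta(w, x=1.0)
    decrease = -cpca_delta(w, x=0.0)
    return decrease / increase


for w in (0.25, 0.5):
    print(w, cpca_delta(w, x=1.0), cpca_delta(w, x=0.0), decrease_to_increase_ratio(w))
```

At a weight of .25 the decrease is about one third the size of the increase, whereas at .5 the two are equal, which is the sense in which larger initial weights make the decreases relatively large.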
Finally, let's manipulate the learning rate parameter lrate.
- First, set rnd.mean back to .25, and then set lrate to .1 instead of .01 and do a Batch run.
Question 4.8 (a) Does this tenfold increase in learning rate have any noticeable effect on the network, as measured by the unique pattern statistics and the weight patterns shown in the grid log? (b) Explain why this might be the case, comparing these results to the effects of learning rate that you observed in question 4.1.
This exercise should give you a feel for the dynamics that underlie self-organizing learning, and also for the importance of contrast enhancement for the CPCA algorithm to be effective. More generally, you should now appreciate the extent to which various parameters can provide appropriate (or not) a priori biases on the learning process, and the benefit (or harm) that this can produce.
