From Computational Cognitive Neuroscience Wiki
Project Name objrec
Filename File:objrec.proj Open Project in emergent
Author Randall C. O'Reilly
Publication OReillyMunakataFrankEtAl12
First Published Aug 13 2016
Tags Object Recognition, Invariance, Binding, Hierarchy, Categorization
Description This simulation explores how a hierarchy of areas in the ventral stream of visual processing (up to inferotemporal (IT) cortex) can produce robust object recognition that is invariant to changes in the position, size, etc. of retinal input images.
Updated 13 August 2016, 14 January 2017, 14 January 2018
Versions 8.0.0, 8.0.2, 8.0.3
Emergent Versions 8.0.0, 8.0.4, 8.5.0
Other Files File:objrec actrf it.dat, File:objrec actrf v4.dat, File:objrec test1.tst.dat, File:objrec test2.tst.dat, File:objrec train1.epc.dat, File:objrec train1.wts.gz, File:objrec train2.epc.dat, File:objrec train2.wts.gz


IMPORTANT: to skip training and testing, you need to grab the .wts.gz and .dat files from the Other Files section.


This simulation explores how a hierarchy of areas in the ventral stream of visual processing (up to inferotemporal (IT) cortex) can produce robust object recognition that is invariant to changes in the position, size, etc. of retinal input images.

It is recommended that you click here to undock this document from the main project window. Use the Window menu to find this window if you lose it, and you can always return to this document by browsing to this document from the docs section in the left browser panel of the project's main window.

Network Structure

Figure OR.1: V1 filtering steps, simulating simple and complex cell firing properties, including length-sum and end-stop cells. The top shows the organization of these filters in each 4x5 V1 hypercolumn.

We begin by looking at the network structure, which goes from V1 to V4 to IT and then Output, where the name of the object is represented (area V2 is not represented in this model, because it is thought to be important for depth and figure-ground encoding, which is not relevant here). The V1 layer has a 10x10 large-scale grid structure, where each grid element represents one hypercolumn of units, capturing in a very compact and efficient manner the kinds of representations we observed developing in the previous V1Rf simulation. Each hypercolumn contains a group of 20 (4x5) units, which process a localized patch of the input image. These units encode oriented edges at 4 angles (along the X axis), and the rows represent simple and complex cells as follows (Figure OR.1): simple cells are represented by the last 2 rows, encoding different polarities (bright below dark and vice versa); the first row represents complex length-sum cells that integrate over polarity and neighboring simple cells; and the middle 2 rows are end-stop units that are excited by a given length-sum orientation and inhibited by surrounding simple cells at one end. Neighboring groups process half-overlapping regions of the image. In addition to connectivity, these groups organize the inhibition within the layer. There is inhibitory competition across the whole V1 layer, but a greater degree of competition within a single hypercolumn, reflecting the fact that inhibitory neurons within a local region of cortex are more likely to receive input from neighboring excitatory neurons. This effect is approximated by having the FFFB inhibition operate at two scales at the same time: a stronger level of inhibition within the unit group (hypercolumn), and a lower level of inhibition across all units in the layer. This ensures that columns not receiving a sufficiently strong input will not be active at all (because they are squashed by the layer-level inhibition generated by other columns with much more excitation), while there is also a higher level of competition to select the most appropriate features within the hypercolumn.
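
The two-scale inhibition described above can be illustrated with a small Python sketch. It is not the emergent implementation: it uses a simplified, single-step feedforward/feedback form with no time integration, the gain values (group_gi, layer_gi) and other constants are illustrative assumptions, and combining the two scales by taking the larger value is also an assumption made for this example.

```python
def fffb(netins, acts, gi, ff=1.0, fb=1.0, ff0=0.1):
    """Simplified one-step feedforward/feedback inhibition for one pool of units.

    All constants are illustrative, not the project's actual settings.
    """
    avg_net = sum(netins) / len(netins)
    avg_act = sum(acts) / len(acts)
    # Feedforward term reacts to average net input above a floor (ff0);
    # feedback term reacts to average activation.
    return gi * (ff * max(avg_net - ff0, 0.0) + fb * avg_act)


def two_scale_inhibition(group_netins, group_acts, group_gi=2.0, layer_gi=1.0):
    """Inhibition applied to each unit group (hypercolumn).

    Assumption: each group receives the larger of its own group-level
    inhibition (stronger gain) and the layer-wide inhibition (weaker gain).
    """
    all_net = [n for group in group_netins for n in group]
    all_act = [a for group in group_acts for a in group]
    layer_inhib = fffb(all_net, all_act, layer_gi)
    return [
        max(fffb(nets, acts, group_gi), layer_inhib)
        for nets, acts in zip(group_netins, group_acts)
    ]


# Two toy hypercolumns: one strongly driven, one weakly driven.
netins = [[0.8] * 4, [0.1] * 4]
acts = [[0.5] * 4, [0.0] * 4]
inhib = two_scale_inhibition(netins, acts)
```

In this toy case the weakly driven group generates no inhibition of its own, so it inherits the layer-level value, which exceeds its net input. That is the sense in which weakly driven columns are squashed by activity elsewhere in the layer.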

The V4 layer is also organized into a grid of hypercolumns, this time 5x5 in size, with each hypercolumn having 49 units (7x7). As with V1, inhibition operates at both the hypercolumn and entire layer scales here. Each hypercolumn of V4 units receives from 4x4 V1 hypercolumns, with neighboring columns again having half-overlapping receptive fields. Next, the IT layer represents just a single hypercolumn of units (10x10 or 100 units) within a single inhibitory group, and receives from the entire V4 layer. Finally, the Output layer has 20 units, one for each of the different objects.
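
As a quick check on the layer sizes just described, the following Python sketch tallies the unit counts and shows one way that 5 half-overlapping receptive fields, each 4 V1 hypercolumns wide, could tile a row of 10 V1 hypercolumns. The stride of 2 follows from "half-overlapping"; the starting offset and the wrap-around at the edge are assumptions made here for illustration, not details confirmed by the project.

```python
# Layer geometry from the text: (hypercolumn grid, units per hypercolumn).
LAYERS = {
    "V1": ((10, 10), (4, 5)),
    "V4": ((5, 5), (7, 7)),
    "IT": ((1, 1), (10, 10)),
}
OUTPUT_UNITS = 20  # one unit per object


def unit_count(grid, per_group):
    """Total units in a layer organized as a grid of unit groups."""
    return grid[0] * grid[1] * per_group[0] * per_group[1]


def v1_columns_for_v4(v4_index, rf_width=4, v1_size=10, wrap=True):
    """V1 hypercolumn indices (along one axis) feeding one V4 hypercolumn.

    Hypothetical helper: half-overlap means a stride of rf_width // 2.
    Starting at index 0 and wrapping at the edge are assumptions.
    """
    stride = rf_width // 2
    cols = [v4_index * stride + i for i in range(rf_width)]
    return [c % v1_size for c in cols] if wrap else cols


counts = {name: unit_count(grid, per) for name, (grid, per) in LAYERS.items()}
fields = [v1_columns_for_v4(i) for i in range(5)]
```

Under these assumptions each V1 hypercolumn along an axis is covered by exactly two V4 receptive fields, which is what half-overlap implies.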

You can view the patterns of connectivity described above by clicking on r.wt, and then on units in the various layers.


Figure OR.2: Set of 20 objects composed from horizontal and vertical line elements used for the object recognition simulation. By using a restricted set of visual feature elements, we can more easily understand how the model works, and also test for generalization to novel objects (objects 18 and 19 are not trained initially, and are subsequently trained in only a relatively small number of locations -- learning there generalizes well to other locations).

Now, let's see how the network is trained.

First, go back to viewing act in the Network display. Then, in the ControlPanel, do Init and then Step Quarter 3 times to step through the minus phase.
You will see the minus phase of settling for the input image, which is one of the shapes shown in Figure OR.2, at a random location, size and slight rotation in the plane. The full bitmap image is shown in the display on the upper right of the network, and the patterns on the V1 input layer are the result of processing with the V1 oriented edge detectors shown in Figure OR.1.

Press Step Quarter again to see the plus phase. You can then continue to Step Trial through a series of inputs to get a feel for what some of the different input patterns look like.

Because it takes a while for this network to be trained, we will just load the weights from a trained network. The network was trained for 50 epochs of 100 object inputs per epoch, or 5,000 object presentations. However, it took only roughly 25 epochs (2,500 object presentations) for performance to approach asymptote. With all of the variation in the way a given input can be presented, this does not represent all that much sampling of the space of variability.

Load the weights using LoadWeights on the ControlPanel, and select objrec_train1.wts.gz. Then, Step Quarter a couple of times to see the minus and plus phases of the trained network as it performs the object recognition task.

You should see that the plus and minus phase output states are usually the same, meaning that the network is correctly recognizing most of the objects being presented.

To provide a more comprehensive test of its performance, you can run the testing program, which runs through 1000 presentations of the objects and records the overall level of error. Because this may take a while, you can also just load the resulting log file.

To run the test, do: Test Init and Test Run. To load the log file, do LoadTestData and select objrec_test1.tst.dat, and then click on the TestErrorData to see the resulting graph.

You will see that error rates are generally below 5% (and often zero) except for the two final objects which the network was never trained on (which it always gets wrong). Thus, the network shows quite good performance at this challenging task of recognizing objects in a location-invariant and size-invariant manner.
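
The per-object error rates shown in the graph are a simple tally over test trials. The sketch below shows that tally in Python on made-up trial records; the (target, response) record format and the toy data are assumptions for illustration and do not reflect the layout of the .tst.dat log file.

```python
from collections import defaultdict


def error_rate_by_object(trials):
    """Fraction of trials answered incorrectly, per target object.

    `trials` is a list of (target_object, network_response) pairs;
    this record format is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # object -> [errors, total]
    for target, response in trials:
        counts[target][1] += 1
        if response != target:
            counts[target][0] += 1
    return {obj: errs / total for obj, (errs, total) in sorted(counts.items())}


# Toy data only: object 18 stands in for an untrained object.
toy_trials = [(0, 0), (0, 0), (1, 1), (1, 3), (18, 5), (18, 2)]
rates = error_rate_by_object(toy_trials)
```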

Receptive Field Analysis

Having seen that the network is solving this difficult problem, the obvious next question is, "how?" To answer this, we need to examine how input patterns are transformed over the successive layers of the network. We do this by computing the receptive fields of units in the V4 and IT layers. The receptive field essentially means the range of different stimuli that a given unit in the network responds to -- what it is tuned to detect. During the Test process, the system computes an activation-based receptive field for the layer listed in the control panel (ActBasedRField trg_lay_name), which should be V4 to start with.

In this procedure, we present all the input patterns to the network and record how units respond to them. If we are interested in which patterns activate, e.g., a given V4 unit, then we aggregate activity over other layers every time that V4 unit is active (and weighted by how much it is active). If for a given input pattern the target V4 unit is not active, then the current activity pattern across all the other layers doesn't count toward that unit's overall receptive field. When the unit is active, the activity patterns do count, and do so in proportion to the unit's activity. This weighted-average computation ends up producing a useful aggregate picture of what tends to activate that unit. Of particular interest is activity in the Image layer, which is just a copy of the input image, not directly connected to anything, and used only for this statistic.
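
The weighted-average computation just described can be sketched in a few lines of Python. This is a minimal illustration rather than the code emergent runs: the function name is hypothetical, and normalizing by the summed activation of the target unit is an assumption about how the average is formed.

```python
def act_based_rf(target_acts, source_patterns):
    """Activation-based receptive field of one target unit.

    target_acts[i]     : activation of the target unit on pattern i
    source_patterns[i] : activity vector of another layer on pattern i

    Patterns on which the target unit is inactive contribute nothing;
    the others contribute in proportion to the unit's activation.
    """
    n = len(source_patterns[0])
    weighted_sum = [0.0] * n
    total_act = 0.0
    for act, pattern in zip(target_acts, source_patterns):
        if act <= 0.0:
            continue  # unit not active: this pattern does not count
        total_act += act
        for i, value in enumerate(pattern):
            weighted_sum[i] += act * value
    if total_act == 0.0:
        return [0.0] * n  # unit never active: empty receptive field
    return [s / total_act for s in weighted_sum]


# Three toy "images"; the unit ignores the first and prefers the third.
patterns = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rf = act_based_rf([0.0, 0.5, 1.0], patterns)
```

Here the first image leaves no trace in the receptive field, and the third image contributes twice as much as the second, matching the unit's relative activation.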

Click on the ActRFData tab -- if you ran the test above, then the results for V4 should be there. Otherwise, in the ActRFData middle panel tab, at the bottom, click on Load Any Data and select objrec_actrf_v4.dat, which should then populate the display with lots of colorful data (the V1 layer data is not saved in this file to reduce size -- it is difficult to interpret in any case). You can also use the Load ActRFData button in the ControlPanel to do the same thing.

The columns show the different layers of the network, with the right-most one being the input Image column, which we will focus on first. Change to the red arrow (interactive) mode (you can also just press the ESC key after clicking in the right panel), and scroll the right scroll bar down, while noting the kinds of patterns you observe in the Image column. Each row of the table corresponds to a different V4 unit, starting with the lower left unit and proceeding within each hypercolumn first, up to the upper right. (The unit for a given row will often be the brightest yellow unit in the V4 layer column of that row, because every time that unit is active, it is active! But not always: sometimes when a V4 unit is active, it might be part of an attractor whereby another unit is always active with it, and maybe even more so.)

You should see that these V4 units are encoding simple conjunctions of line elements, in a relatively small range of locations within the retinal input. The fact that they respond across multiple locations makes the receptive field patterns seem somewhat smeared out, but that is a good indication that they are performing a critical invariance role.

6.3: Explain the significance of the level of conjunctive representations and spatial invariance observed in the V4 receptive fields, in terms of the overall computation performed by the network.

Continue to scroll through the V4 units, but now notice the activation based receptive field for the Output units.

You should see that there are typically a handful of output units (i.e., objects) that each V4 unit is strongly co-activated with. This indicates a distributed representation, where each V4 unit participates in encoding multiple different objects.

6.4: Using the images of the objects shown above (which are in the same configuration as the output units), explain one V4 unit's participation in a particular output representation based on the features shown in its input receptive fields. (Hint: Pick a unit that is particularly selective for specific input patterns and specific output units, because this makes things easier to see.)

Next, use Load ActRFData to load the objrec_actrf_it.dat data for the IT layer. Scroll through to observe the activation-based receptive fields for the Image inputs and the Output layer.

You should observe much more complex patterns of line orientations, distributed over more of the input, and fewer, more strongly-defined Output receptive fields.

6.5: Based on your probing of the IT units, do they appear to code for entire objects, or just parts of different objects? Explain.

One can also compare the relative selectivity of these IT units for particular output units (objects) as compared to the V4 units. By focusing specifically on the number of objects a given unit clearly doesn't participate in, it should be clear that the IT units are more selective than the V4 units, which substantiates the idea that the IT units are encoding more complex combinations of features that are shared by fewer objects (thus making them more selective to particular subsets of objects). Thus, we see evidence here of the hierarchical increase in featural complexity required to encode featural relationships while also producing spatial invariance.
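
One rough way to make this comparison concrete is to count, for each unit, how many output units have a negligible value in its activation-based receptive field. The Python sketch below does this on made-up receptive field values; the function name, the 10% threshold relative to the peak, and the toy numbers are all assumptions for illustration, not measurements from the model.

```python
def count_excluded_objects(output_rf, threshold=0.1):
    """Number of objects a unit clearly does not participate in.

    An object counts as excluded when its receptive field value is
    below `threshold` times the unit's peak value (hypothetical cutoff).
    """
    peak = max(output_rf)
    if peak <= 0.0:
        return len(output_rf)
    return sum(1 for value in output_rf if value / peak < threshold)


# Toy Output receptive fields over 6 objects (illustrative values only).
v4_like = [0.9, 0.6, 0.5, 0.4, 0.05, 0.0]
it_like = [1.0, 0.02, 0.0, 0.03, 0.0, 0.7]
v4_excluded = count_excluded_objects(v4_like)
it_excluded = count_excluded_objects(it_like)
```

A unit that excludes more objects is more selective in this sense, so a higher count for IT-like units than for V4-like units would correspond to the pattern described above.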

Summary and Discussion of Receptive Field Analyses

Using the activation-based receptive field technique, we have obtained some insight into the way this network performs spatially invariant object recognition, gradually over multiple levels of processing. Similarly, the complexity of the featural representations increases with increasing levels in the hierarchy. By doing both of these simultaneously and in stages over multiple levels, the network is able to recognize objects in an environment that depends critically on the detailed spatial arrangement of the constituent features, thereby apparently avoiding the binding problem described previously.

You may be wondering why the V4 and IT representations have their respective properties -- why did the network develop in this way? In terms of the degree of spatial invariance, it should be clear that the patterns of connectivity restrict the degree of invariance possible in V4, whereas the IT neurons receive from the entire visual field (in this small-scale model), and so are in a position to have fully invariant representations. Also, the IT representations can be more invariant, and more complex because they build off of limited invariance and featural complexity in the V4 layer. This ability for subsequent layers to build off of the transformations performed in earlier layers is a central general principle of cognition.

The representational properties you observed here can have important functional implications. For example, in the next section, we will see that the nature of the IT representations can play an important role in enabling the network to generalize effectively. To the extent that IT representations encode complex object features, and not objects themselves, these representations can be reused for novel objects. Because the network can already form relatively invariant versions of these IT representations, their reuse for novel objects will mean that the invariance transformation itself will generalize to novel objects.

Generalization Test

In addition to all of the above receptive field measures of the network's performance, we can perform a behavioral test of its ability to generalize in a spatially invariant manner, using the two objects (numbers 18 and 19 in Figure OR.2) that were not presented to the network during training. We can now train on these two objects in a restricted set of spatial locations and sizes, and assess the network's ability to respond to these items in novel locations and sizes. Presumably, the bulk of what the network needs to do is learn an association between the IT representations and the appropriate output units, and good generalization should result to all other spatial locations.

In addition to presenting the novel objects during training, we also need to present familiar objects; otherwise the network will suffer from catastrophic interference. The following procedure was used. On each trial, there was a 50% chance that a novel object would be presented. If a novel object was presented, its location, scaling and rotation parameters were chosen using 0.5 of the maximum range of these values in the original training. Given that these 4 factors (translation in x, translation in y, size, and rotation) are combinatorial, that means that roughly 0.5^4 or 0.0625 of the total combinatorial space was explored. If a familiar object was presented, then its size and position were chosen completely at random from all the possibilities. This procedure was repeated for just 10 epochs of 100 objects per epoch. Importantly, the learning rate in everything but the IT and Output connections was set to zero, to ensure that the results were due to IT-level learning and not to learning in earlier pathways. In the brain, it is very likely that these earlier areas of the visual system experience less plasticity than higher areas as the system matures.
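
The sampling procedure and the 0.5^4 arithmetic can be sketched in Python as follows. This is only an illustration of the description above: the parameter names, the normalized range of -0.5 to 0.5 for each factor, and the choice to center the restricted half-range are assumptions, and the real environment generation in the project may differ.

```python
import random

FACTORS = ("x", "y", "size", "rotation")  # the 4 combinatorial factors
NOVEL_IDS = (18, 19)
N_OBJECTS = 20


def covered_fraction(range_fraction=0.5, n_factors=len(FACTORS)):
    """Fraction of the combinatorial space covered when each factor
    is restricted to `range_fraction` of its full range."""
    return range_fraction ** n_factors


def sample_trial(rng, p_novel=0.5, novel_range=0.5):
    """Draw one training trial: object id plus normalized parameters.

    Each factor is expressed on a normalized full range of [-0.5, 0.5];
    novel objects use a centered sub-range (an assumption of this sketch).
    """
    if rng.random() < p_novel:
        obj = rng.choice(NOVEL_IDS)
        span = novel_range
    else:
        obj = rng.randrange(N_OBJECTS - len(NOVEL_IDS))  # familiar: 0..17
        span = 1.0
    params = {name: rng.uniform(-span / 2, span / 2) for name in FACTORS}
    return obj, params


rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(1000)]  # 10 epochs x 100 objects
fraction = covered_fraction()
```

With four factors each restricted to half of its range, the covered fraction is 0.5 x 0.5 x 0.5 x 0.5 = 0.0625, which is the roughly 6% figure used when interpreting the generalization results.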

To set up the system for this form of generalization training, click the GenTrain button in the ControlPanel. This loads the objrec_train1.wts.gz weights, sets the epoch counter to 40 so that it trains for 10 epochs up to the 50 epoch stopping point, and sets the environment generation to be of the form described above. Once you do this, you can just do Train Init (and do NOT initialize the weights), followed by Train Run. This should take just a few minutes, depending on your computer, but you can bypass this step by doing LoadWeights and selecting the objrec_train2.wts.gz file. After the network is trained, you can then run the testing (Test Init, Test Run) as before, or just load the test data from objrec_test2.tst.dat.

The results show that the network got around 80% correct (20% error) on average across the new 18 and 19 patterns. This was achieved with training on only about 6% of the space, suggesting that the network has learned generalized invariance transforms that can be applied to novel objects. Given the restriction of learning to the IT and Output pathways, we can be certain that no additional learning in lower pathways had to be done to encode these novel objects.

To summarize, these generalization results demonstrate that the hierarchical series of representations can operate effectively on novel stimuli, as long as these stimuli possess structural features in common with other familiar objects. The network has learned to represent combinations of these features in terms of increasingly complex combinations that are also increasingly spatially invariant. In the present case, we have facilitated generalization by ensuring that the novel objects are built out of the same line features as the other objects. Although we expect that natural objects also share a vocabulary of complex features, and that learning would discover and exploit them to achieve a similarly generalizable invariance mapping, this remains to be demonstrated for more realistic kinds of objects. One prediction that this model makes is that the generalization of the invariance mapping will likely be a function of featural similarity with known objects, so one might expect a continuum of generalization performance in people (and in a more elaborate model).

You may now close the project (use the window manager close button on the project window or the File/Close Project menu item) and then open a new one, or just quit emergent entirely by choosing the Quit emergent menu option or clicking the close button on the root window.