CECN1 Semantics
From Computational Cognitive Neuroscience Wiki
Semantic Representations from Word Co-Occurrences and Hebbian Learning
- The project file: sem.proj (click and Save As to download, then open in Emergent -- NOTE: requires version 4.13 or higher for testing)
- Additional file for pretrained weights (required):
Project Documentation
(note: this is a literal copy from the simulation documentation -- it contains links that will not work within the wiki)
- To start, it is usually a good idea to do Object/Edit Dialog in the menu just above this text, which will open this documentation in a separate window that you can more easily come back to. Alternatively, you can always return by clicking on the ProjectDocs tab at the top of this middle panel.
This network is trained using Hebbian learning on paragraphs from an earlier draft of the Computational Explorations... textbook, allowing it to learn the overall statistics of which words co-occur with which other words, and thereby to acquire a surprisingly capable (though clearly imperfect) level of semantic knowledge about the topics covered in the textbook. This replicates the key results from the Latent Semantic Analysis research by Landauer and Dumais (1997).
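As a rough illustration of how Hebbian learning can pick up co-occurrence statistics, the sketch below applies a simple Hebbian rule of the form dw = lrate * y * (x - w), where x marks which words are present in a paragraph and y is the hidden unit activation. The rule, the function name hebbian_update, the learning rate, and the toy data are all assumptions chosen for illustration; the simulation's actual learning rule and parameters may differ.

```python
def hebbian_update(weights, x, y, lrate=0.1):
    """Return new weights after one Hebbian step.

    weights[j][i] is the weight from input unit i to hidden unit j.
    Assumed rule (illustrative only): dw = lrate * y_j * (x_i - w_ji).
    """
    return [
        [w + lrate * y_j * (x_i - w) for w, x_i in zip(row, x)]
        for row, y_j in zip(weights, y)
    ]


# Hypothetical toy setup: 3 input words, 2 hidden units.
weights = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
paragraph = [1.0, 1.0, 0.0]  # words 0 and 1 co-occur; word 2 is absent
hidden = [1.0, 0.0]          # only hidden unit 0 is active

updated = hebbian_update(weights, paragraph, hidden)
print(updated)
```

In this toy case the active hidden unit's weights move toward the words that are present and away from the absent word, while the inactive unit's weights do not change. Repeated over many paragraphs, a rule of this kind makes each unit's weights reflect which words tend to appear together when that unit is active.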
This network takes roughly a day to train, so we will start by loading in pre-trained weights.
- Do LoadWeights in the overall .PanelTab.ControlPanel.
Individual Unit Representations
To start, let's examine the weights of individual units in the network.
- Select r.wt, and then select various hidden units at random to view.
You should observe sparse patterns of weights, with different units picking up on different patterns of words in the input. However, because the input units are too small to be labeled, you can't really tell which words a given unit is activated by. The GetWordRF button (get receptive field in terms of words) on the control panel provides a solution to this problem.
- View the weights for the lower-leftmost hidden unit, and then hit the GetWordRF button. Then click on the .T3Tab.WeightWords -- you might need to zoom in using the Dolly wheel on the lower right of the display to be able to read these words. Alternatively, you can click on the data/OutputData subgroup/WeightWords object in the left tree browser, and then click on the (matrix) cell for the Para column, and you'll see the full list of words in a more legible format.
You should see an alphabetized list of the words from which this hidden unit receives a weight above a threshold of .5 (see Table 10.11 in the text for an example from the old model). One of the most interesting things to notice here is that the unit represents multiple roughly synonymous terms. For example, you might see the words "act," "activation," and "activations."
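The idea behind this word receptive field can be sketched in a few lines: keep the words whose weight into the selected hidden unit exceeds the threshold, then sort them alphabetically. The function name word_receptive_field, the vocabulary, and the weight values below are hypothetical, and this is not the code behind the GetWordRF button.

```python
def word_receptive_field(weights, words, threshold=0.5):
    """Return the alphabetized words whose weight exceeds the threshold.

    weights[i] is the (hypothetical) weight from the input unit for
    words[i] into a single hidden unit.
    """
    if len(weights) != len(words):
        raise ValueError("weights and words must have the same length")
    return sorted(word for word, w in zip(words, weights) if w > threshold)


# Hypothetical toy vocabulary and weights for one hidden unit.
vocab = ["unit", "act", "binding", "activation", "dyslexia", "activations"]
unit_weights = [0.31, 0.72, 0.05, 0.88, 0.12, 0.64]

rf = word_receptive_field(unit_weights, vocab)
print(rf)  # ['act', 'activation', 'activations']
```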
Question 10.10 List some other examples of roughly synonymous terms represented by this unit.
This property of the representation is interesting for two reasons. First, it indicates that the representations are doing something sensible, in that semantically related words are represented by the same unit. Second, these synonyms probably do not occur together in the same paragraph very often. Typically, only one version of a given word is used in a given context. For example, "The activity of the unit is..." may appear in one paragraph, while "The unit's activation was..." may appear in another. Thus, for such representations to develop, they must be based on the similarity in the general contexts in which similar words appear (e.g., the co-occurrence of "activity" and "activation" with "unit" in the previous example). This generalization of the semantic similarity structure across paragraphs is essential to enable the network to transcend rote memorization of the text itself, and produce representations that will be effective for processing novel text items.
- View the GetWordRF representations for several other units to get a sense of the general patterns across units.
Although there clearly is sensible semantic structure at a local level within the individual-unit representations, it should also be clear that there is no single, coherent theme relating all of the words represented by a given unit. Thus, individual units participate in representing many different clusters of semantic structure, and it is only in the aggregate patterns of activity across many units that more coherent representations emerge. This network thus provides an excellent example of distributed representation.
Summarizing Similarity with Cosines
To probe these distributed representations further, we can present words to the input and measure the hidden layer activation patterns that result. Specifically, we are interested in the extent to which the hidden representations overlap for different sets of words, which tells us how similar the internal semantic representations are overall. Instead of just eyeballing the pattern overlap, we can compute a numerical measure of similarity using normalized inner products or cosines between pairs of sending weight patterns (see Equation 10.1 from the dyslexia model in Section 10.3 in the text).
- Enter "attention" into words1 in the control panel, and "binding" into words2 (do not enter the quotes -- just the word). Then press Test: Run. If you get an error message about the word not being found in the list of valid words, then check for errors and correct. Then press Test: Init, Run to re-run.
The .T3Tab.DistanceMatrix tab will activate and show you the cosines between the hidden representations of these two words. You should get a number around .40 (note that the similarity of attention with itself and binding with itself is 1 -- the cosine scale goes from -1 to 1, with 1 being the most similar and -1 being the most dissimilar).
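For reference, the cosine (normalized inner product) between two activation patterns can be computed as in the minimal sketch below. The function name and the toy vectors are hypothetical, and the patterns are not taken from the network.

```python
import math


def cosine(a, b):
    """Cosine (normalized inner product) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical hidden activation patterns for two words.
pattern1 = [0.9, 0.0, 0.7, 0.0, 0.3]
pattern2 = [0.8, 0.5, 0.0, 0.0, 0.4]

print(cosine(pattern1, pattern1))  # a pattern compared with itself gives 1 (up to rounding)
print(cosine(pattern1, pattern2))  # partial overlap gives a value between 0 and 1
```

Note that when all activations are nonnegative, as in these toy patterns, the cosine cannot fall below 0, so the negative half of the scale is only reached by patterns with mixed signs.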
- Select the .T3Tab.SemNet network view tab and go back to the control panel and do Test: Init, Step to see each set of words being presented individually (make sure you are viewing act in the network) -- you can see the actual distributed representations for these words, which is what the cosine is based on. You should observe that the two patterns overlap roughly 50%.
- Now replace words2 with "dyslexia" and do Run again.
You should see that attention and dyslexia are only related by around .07 -- not very close. This should match your overall intuition: we talk about attention as being critical for solving the binding problem in several different situations, but we don't talk much about the role of attention in dyslexia (though perhaps we should, but that is another matter).
- Compare several other words that the network should know about from "reading" this textbook (tip: click open the data/InputData/WordList data table in the left browser and scroll through it to see which words are in the valid list: frequency greater than 5, not purely syntactic).
Question 10.11 (a) Report the cosine values from the DistanceMatrix. (b) Comment on how well this matches your intuitive semantics from having read this textbook yourself.
Distributed Representations of Multiple Words
We now present multiple word inputs at the same time, and see how the network chooses a hidden layer representation that best fits this combination of words. Thus, novel semantic representations can be produced as combinations of semantic representations for individual words. This ability is critical for some of the more interesting and powerful applications of these semantic representations (e.g., multiple choice question answering, essay grading, etc.).
One interesting question we can explore is to what extent we can sway an otherwise somewhat ambiguous term to be interpreted in a particular way. For example, the term "attention" can be used in two somewhat different contexts. One context concerns the implementational aspects of attention, most closely associated with "competition." Another context concerns the use of attention to solve the binding problem, which is associated with "invariant object recognition." Let's begin this exploration by first establishing the baseline association between "attention" and "invariant object recognition."
- Enter "attention" for words1 and "invariant object recognition" for words2, and do a Run.
You should get a cosine of around .37. Now, let's see if adding "binding" in addition to "attention" increases the hidden layer similarity.
- Add "binding" to words1 and do a Run.
The similarity does indeed increase, producing a cosine of around .49. To make sure that there is an interaction between "attention" and "binding" producing this increase, we need to test with just "binding" alone.
- Cut out "attention" from words1, so it just has "binding", and do a Run.
The similarity drops back to .47. Thus, there is something special about the combination of "attention" and "binding" together that is not present when using each of them alone (although this effect is smaller in this network than the original one described in the textbook, due to the strength of the binding association). Note also that the direct overlap between attention and binding alone is only .40. Now if we instead probe with "attention competition" (still against "invariant object recognition"), we should activate a different sense of attention, and get a smaller cosine.
- Set words1 to "attention competition" and Run.
The similarity does now decrease, with a cosine of only around .16. Thus, we can see that the network's activation dynamics can be influenced to emphasize different senses of a word. This is potentially a very powerful and flexible form of semantic representation that combines rich, overlapping distributed representations and activation dynamics that can magnify or diminish the similarities of different word combinations.
Question 10.12 Think of another example of a word that has different senses (that is well represented in this textbook), and perform an experiment similar to the one we just performed to manipulate these different senses. Document and discuss your results.
A Multiple-Choice Quiz
Based on your knowledge of the textbook, which of the options following each "question" provides the best match to the meaning?
- 0. neural activation function
- A spiking rate code membrane potential point
- B interactive bidirectional feedforward
- C language generalization nonwords
- 1. transformation
- A emphasizing distinctions collapsing differences
- B error driven hebbian task model based
- C spiking rate code membrane potential point
- 2. bidirectional connectivity
- A amplification pattern completion
- B competition inhibition selection binding
- C language generalization nonwords
- 3. cortex learning
- A error driven task based hebbian model
- B error driven task based
- C gradual feature conjunction spatial invariance
- 4. object recognition
- A gradual feature conjunction spatial invariance
- B error driven task based hebbian model
- C amplification pattern completion
- 5. attention
- A competition inhibition selection binding
- B gradual feature conjunction spatial invariance
- C spiking rate code membrane potential point
- 6. weight based priming
- A long term changes learning
- B active maintenance short term residual
- C fast arbitrary details conjunctive
- 7. hippocampus learning
- A fast arbitrary details conjunctive
- B slow integration general structure
- C error driven hebbian task model based
- 8. dyslexia
- A surface deep phonological reading problem damage
- B speech output hearing language nonwords
- C competition inhibition selection binding
- 9. past tense
- A overregularization shaped curve
- B speech output hearing language nonwords
- C fast arbitrary details conjunctive
We can present this same quiz to the network, and determine how well it does relative to students in the class! The telegraphic form of the quiz is because it contains only the content words that the network was actually trained on. The best answer is always A, and B was designed to be a plausible foil, while C is obviously unrelated (unlike people, the network can't pick up on these regularities across test items). The quiz is presented to the network by first presenting the "question," recording the resulting hidden activation pattern, and then presenting each possible answer and computing the cosine of the resulting hidden activation with that of the question. The answer that has the closest cosine is chosen as the network's answer.
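The scoring procedure just described can be sketched as follows: compute the cosine between the question's hidden pattern and each answer's hidden pattern, pick the answer with the highest cosine, and count any choice other than A as an error. The hidden patterns, the function names, and the two-item quiz below are hypothetical stand-ins; in the simulation the patterns come from the trained network.

```python
import math


def cosine(a, b):
    """Cosine (normalized inner product) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_answer(question_vec, answer_vecs):
    """Return the label of the answer pattern with the highest cosine."""
    return max(answer_vecs, key=lambda label: cosine(question_vec, answer_vecs[label]))


# Hypothetical hidden patterns: (question pattern, {answer label: answer pattern}).
quiz = [
    ([1.0, 0.8, 0.0, 0.1],
     {"A": [0.9, 0.7, 0.1, 0.0], "B": [0.5, 0.0, 0.6, 0.2], "C": [0.0, 0.1, 0.9, 0.8]}),
    ([0.0, 0.9, 0.7, 0.0],
     {"A": [0.6, 0.3, 0.2, 0.5], "B": [0.1, 0.8, 0.9, 0.0], "C": [0.9, 0.0, 0.0, 0.7]}),
]

choices = [pick_answer(question, answers) for question, answers in quiz]
# The best answer is always A, so any other choice counts as an error.
error_mean = sum(choice != "A" for choice in choices) / len(choices)

print(choices)     # ['A', 'B']
print(error_mean)  # 0.5
```

In this toy quiz the second item is deliberately built so that the foil B overlaps the question more than A does, which is the kind of error the network can make when a foil phrase frequently co-occurs with the question phrase.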
- Press the Run Quiz button, and then click on the .T3Tab.QuizOutputData tab to see the overall results. You will see a listing of the response to each question, along with the scoring of that response as an error or not. At the top you will see summary statistics for overall performance.
You should observe that the network does OK, but not exceptionally -- 60-80 percent performance is typical (i.e., .2 to .4 error mean). Usually, the network does a pretty good job of rejecting the obviously unrelated answer C, but it does not always match our sense of A being better than B. In question 6, the B phrase was often mentioned in the context of the question phrase, but as a contrast to it, not a similarity. Because the network does not have the syntactic knowledge to pick up on this kind of distinction, it considers them to be closely related because they appear together. This probably reflects at least some of what goes on in humans -- we have a strong association between "black" and "white" even though they are opposites. However, we can also use syntactic information to further refine our semantic representations -- a skill that is lacking in this network. The next section describes a model that begins to address this skill.
