CECN1 FSA

From Computational Cognitive Neuroscience Wiki


Finite State Automaton (FSA): Temporal Context and Sequences

  • The project file: fsa.proj (click and Save As to download, then open in Emergent -- NOTE: requires version 4.13)
  • Additional files for pre-trained weights and epoch data (optional):

Back to CECN1 Projects

Project Documentation

(note: this is a literal copy from the simulation documentation -- it contains links that will not work within the wiki)

  • To start, it is usually a good idea to do Object/Edit Dialog in the menu just above this text, which will open this documentation in a separate window that you can more easily come back to. Alternatively, you can always return by clicking on the ProjectDocs tab at the top of this middle panel.

We begin by exploring the network .T3Tab.CopyContext.

Click on r.wt and observe the connectivity.

Note in particular that each context layer unit has a single receiving weight from the hidden units. The context unit uses this connection only to determine which hidden unit to copy its activation from; the weight value itself is random and is never used. Otherwise, the network has standard full connectivity. Notice also that there is a seemingly extraneous Targets layer, which is not connected to anything. This is simply for display purposes -- it shows the two possible valid outputs, which can be compared to the actual output.

Let's view the activations, and see a few trials of learning.

  • Click on act in the network. Make sure the network has been initialized (Init), and then you can Step through a trial.

This is the minus phase for the beginning of a sequence (one pass through the FSA grammar), which always starts with the letter B, and the context units zeroed. The network will produce some random expectation of which letters are coming next. Note that there is some noise in the unit activations -- this helps them pick one unit out of the two possible ones at random.

  • Then, Step again to see the plus phase.

You should see that one of the two possible subsequent letters (T or P) is strongly activated -- this unit indicates which letter actually came next in the sequence. Thus, the network only ever learns about one of the two possible subsequent letters on each trial (because they are chosen at random). It has to learn that a given node has two possible outputs by integrating experience over different trials, which is one of the things that makes this a somewhat challenging task to learn.

An interesting aspect of this task is that even when the network has done as well as it possibly could, it should still make roughly 50 percent "errors," because it ends up making a discrete guess as to which output will come next, which can only be right 50 percent of the time. This could cause problems for learning if it introduced a systematic error signal that would constantly increase or decrease the bias weights. This is not a problem because a unit will be correctly active about as often as it will be incorrectly inactive, so the overall net error will be zero. Note that if we allowed both units to become active this would not be the case, because one of the units would always be incorrectly active, and this would introduce a net negative error and large negative bias weights (which would eventually shut down the activation of the output units).
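The net-error argument above can be checked with a small exact calculation. This is only a sketch, assuming the two successor letters are equally likely, that the network's guess is independent of the letter that actually comes next, and that error is simply target minus output for a single output unit (here the T unit). The function name `mean_error` and the policy lambdas are illustrative, not part of the simulation.

```python
from fractions import Fraction
from itertools import product

LETTERS = ("T", "P")  # the two valid successors of the start node


def mean_error(output_policy):
    """Average (target - output) for the T unit over all equally likely cases.

    output_policy maps the network's guessed letter to the set of
    output units that end up active in the minus phase.
    """
    cases = list(product(LETTERS, LETTERS))  # (actual next letter, guess)
    total = Fraction(0)
    for actual, guess in cases:
        target = 1 if actual == "T" else 0
        output = 1 if "T" in output_policy(guess) else 0
        total += target - output
    return total / len(cases)


# Policy 1: the network commits to a single guess.
single_guess = mean_error(lambda guess: {guess})
# Policy 2: both candidate units are always on (assumed alternative).
both_on = mean_error(lambda guess: {"T", "P"})

print(single_guess)  # 0
print(both_on)       # -1/2
```

Under these assumptions the single-guess policy has zero mean error, whereas the both-on policy carries a systematic negative error of the kind that would drive the bias weights downward.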

One possible objection to having the network pick one output at random, instead of allowing both to be on, is that the network will somehow be "surprised" by the actual response when it differs from the guess (i.e., about 50 percent of the time). This is actually not the case, because the hidden layer representation remains essentially the same for both outputs (reflecting the node identity, more or less), and thus does not change when the actual output is presented in the plus phase. Thus, the "higher level" internal representation encompasses both possible outputs, while the lower-level output representation randomly chooses one of them. This situation will be important later as we consider how networks can efficiently represent multiple items (see chapter 7 for further discussion).

To monitor the network's performance over learning, we need an error statistic that converges to zero when the network has learned the task perfectly (which is not the case with the standard SSE, due to the randomness of the task). Thus, we have a new statistic that reports an error (of 1) if the output unit was not one of the two possible outputs (i.e., as shown in the Targets layer). This is labeled as fsa_err_sum in the log displays.
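One way to picture this statistic is the sketch below. It assumes that "the output unit" means the most active output unit and that the output layer has one unit per letter; the unit ordering in `LETTERS` and the helper names `fsa_err` and `fsa_err_sum_over` are hypothetical, and the simulator's own computation may differ in detail.

```python
LETTERS = ["B", "T", "S", "X", "V", "P", "E"]  # hypothetical unit ordering


def fsa_err(output_acts, valid_letters):
    """Return 0 if the most active output unit is a valid successor, else 1."""
    winner_index = max(range(len(output_acts)), key=output_acts.__getitem__)
    return 0 if LETTERS[winner_index] in valid_letters else 1


def fsa_err_sum_over(trials):
    """Sum the per-trial errors, as a log would accumulate them over an epoch."""
    return sum(fsa_err(acts, valid) for acts, valid in trials)


# Made-up activations for two trials that both follow the letter B,
# where the valid successors are T and P.
trials = [
    ([0.0, 0.9, 0.1, 0.0, 0.0, 0.2, 0.0], {"T", "P"}),  # guesses T: valid
    ([0.0, 0.1, 0.8, 0.0, 0.0, 0.2, 0.0], {"T", "P"}),  # guesses S: invalid
]
print(fsa_err_sum_over(trials))  # 1
```

Unlike SSE, this measure scores a guess of P as correct even when T is what actually followed, so it can reach zero once the network only ever produces valid successors.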

  • Now, continue to Step into the minus phase of the next event in the sequence.

You should see now that the Context units are updated with a copy of the prior hidden unit activations.

  • To verify this, click on act_p.

This will show the plus phase activations from the previous event.
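The copy operation can be sketched as follows. This is a minimal illustration, assuming a pure one-to-one copy of the hidden layer's plus-phase activations with the context zeroed at the start of each sequence; the class name `CopyContextLayer`, the layer size, and the activation values are all made up for the example.

```python
class CopyContextLayer:
    """Context layer holding a copy of the previous trial's hidden activations."""

    def __init__(self, n_units):
        self.act = [0.0] * n_units

    def reset(self):
        """Zero the context, as at the start of a new sequence."""
        self.act = [0.0] * len(self.act)

    def update_from(self, hidden_act_p):
        """Copy the hidden layer's plus-phase activations, unit for unit."""
        if len(hidden_act_p) != len(self.act):
            raise ValueError("context and hidden layers must be the same size")
        self.act = list(hidden_act_p)


context = CopyContextLayer(n_units=3)
hidden_plus_phase = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]  # made-up values

seen_by_network = []
for hidden_act_p in hidden_plus_phase:
    # The context available during this trial is the copy from the prior trial.
    seen_by_network.append(list(context.act))
    context.update_from(hidden_act_p)

print(seen_by_network)  # [[0.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
```

On the first trial the context is all zeros, and on each later trial it matches what act_p showed for the hidden layer on the preceding event.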

  • Now you can continue to Step through the rest of the sequence. Once you get a feel for the activity states, you can Run and watch the training log in .T3Tab.CopyContext. (You may want to click off the network display in .T3Tab.CopyContext to speed up processing.)

As the network runs, a special program (ReberGenData) dynamically creates 25 new sequences of events every other epoch (to speed the computation, because the program is relatively slow). Thus, instead of creating a whole bunch of training examples from the underlying FSA in advance, they are created on-line with a program that implements the Reber grammar FSA.
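On-line generation of this kind can be sketched as below. The transition table is the commonly cited form of the Reber grammar, given here as an assumption; the ReberGenData program in the project may number its nodes or structure its data differently. The names `REBER`, `gen_sequence`, and `gen_epoch_data` are illustrative.

```python
import random

# Node -> list of (letter emitted, next node). Reaching node 5 ends the
# sequence with E. Assumed to match the commonly cited Reber grammar.
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}


def gen_sequence(rng):
    """Walk the FSA from the start node, choosing each arc at random."""
    letters = ["B"]
    node = 0
    while node != 5:
        letter, node = rng.choice(REBER[node])
        letters.append(letter)
    letters.append("E")
    return "".join(letters)


def gen_epoch_data(rng, n_sequences=25):
    """Create a fresh batch of sequences, as is done every other epoch."""
    return [gen_sequence(rng) for _ in range(n_sequences)]


rng = random.Random(0)
batch = gen_epoch_data(rng)
print(len(batch))  # 25
```

Because every non-final node offers exactly two arcs, each position in a sequence has two valid successors, which is what the Targets layer displays.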

Because it may take a while to train, you can opt to load a fully trained network and its training log.

  • To do so, Stop the network at any time via .PanelTab.ControlPanel.
  • To load the network, click on the CopyContext network in the left browser under the networks section and select LoadWeights from the Object menu, and choose fsa.trained.wts.
  • To load the log file, click on the EpochOutputData object under data/OutputData in the left browser panel, then do LoadData from its Object menu, and select fsa.epc.dat (and select reset_first to clear out any existing data).

The network should take anywhere between 12 and 1500 epochs to learn the problem to the point where it gets zero errors in one epoch (this was the range for ten random networks we ran). The pre-trained network took 12 epochs to get to this first zero, but we trained it longer (622 epochs total) to get it to the point where it got 4 zeros in a row. This stamping in of the representations makes them more robust to the noise, but the network still makes occasional errors even with this extra training. The 12 epochs amounts to only 150 different sequences and the 622 epochs amounts to 7775 sequences (each set of 25 sequences lasts for 2 epochs).

In either case, the Leabra network is much faster than the backpropagation network used by Cleeremans, Servan-Schreiber, and McClelland (1989), which took 60,000 sequences (i.e., 4,800 epochs under our scheme). However, we were able to train backpropagation networks with larger hidden layers (30 units instead of 3) to learn in between 136 and 406 epochs. Thus, there is some evidence of an advantage for the additional constraints of model learning and inhibitory competition in this task, given that the Leabra networks generally learned much faster (and backpropagation required a much larger learning rate).
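The conversions between epochs and sequences quoted above follow directly from the scheme of 25 new sequences every 2 epochs. A short check, with helper names chosen only for this example:

```python
SEQS_PER_SET = 25    # new sequences generated per set
EPOCHS_PER_SET = 2   # each set is used for two epochs


def epochs_to_sequences(epochs):
    """Number of distinct sequences seen after the given number of epochs."""
    return (epochs // EPOCHS_PER_SET) * SEQS_PER_SET


def sequences_to_epochs(sequences):
    """Number of epochs needed to present the given number of sequences."""
    return (sequences // SEQS_PER_SET) * EPOCHS_PER_SET


print(epochs_to_sequences(12))     # 150
print(epochs_to_sequences(622))    # 7775
print(sequences_to_epochs(60000))  # 4800
```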

Now we can test the trained network to see how it has solved the problem, and also to see how well it distinguishes grammatical from ungrammatical letter strings.

  • Click on the .T3Tab.TrialOutputDataTest tab in the right panel to display the test results. Then, Init and Step through several inputs in order to see what the hidden states look like.

This will test the network with several sequences of letters, with the results shown in the grid view on the right. Note that the network display is being updated every cycle, so you can see the stochastic choosing of one of the two possible outputs. The network should be producing the correct outputs, as indicated both by the fsa_err column and by the fact that the Output pattern matches the Targets pattern, though it might make an occasional mistake due to the noise.

To better understand the hidden unit representations, we need a sequence of reasonable length (i.e., more than ten or so events). In these longer sequences, the FSA has revisited various nodes due to selecting the looping path, and this revisiting will tell us about the representation of the individual nodes. Thus, if the total number of events in the sequence is below ten (events are counted in the trial column of the grid view), we need to keep testing to find a suitable sequence.

  • To do so, turn the network Display toggle off (to speed things up), and press Test again until you find a sequence with ten or more events. After running the sequence with ten or more events, press Cluster, and view the .T3Tab.clust_data.

This will bring up a cluster plot of the hidden unit states for each event (e.g., figure 6.15). Figure 6.15 provides a decoding of the cluster plot elements.


Question 6.4. Interpret the cluster plot you obtained (especially the clusters with events at zero distance) in terms of the correspondence between hidden states and the current node versus the current letter. Remember that current node and current letter information is reflected in the letter and number before the arrow.


This produces a random sequence of letters. Obviously, the network is not capable of predicting which letter will come next, and so it makes lots of errors. Thus, one could use this network as a grammaticality detector, to determine if a given string fits the grammar. In this sense, the network has incorporated the FSA structure itself into its own representations.
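For comparison, a symbolic grammaticality check against the FSA itself might look like the sketch below. It is not how the network makes the judgment (the network signals ungrammatical strings through its prediction errors); it only shows the structure the network is said to have incorporated. The transition table is the commonly cited form of the Reber grammar, given as an assumption, and `is_grammatical` is an illustrative name.

```python
# State -> {letter: next state}. Assumed to match the commonly cited
# Reber grammar; "start" and "end" are bookkeeping states for B and E.
REBER = {
    "start": {"B": 0},
    0: {"T": 1, "P": 2},
    1: {"S": 1, "X": 3},
    2: {"T": 2, "V": 4},
    3: {"X": 2, "S": 5},
    4: {"P": 3, "V": 5},
    5: {"E": "end"},
    "end": {},
}


def is_grammatical(string):
    """Return True if the string traces a complete path through the FSA."""
    state = "start"
    for letter in string:
        if letter not in REBER[state]:
            return False  # this letter is not a valid successor here
        state = REBER[state][letter]
    return state == "end"


print(is_grammatical("BTSSXXTVVE"))  # True
print(is_grammatical("BTVXE"))       # False
```

A random letter string almost always hits an invalid successor within a few steps, which is why the network's error count rises so sharply on such input.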

When you are done with this simulation, you can either close this project in preparation for loading the next project, or you can quit completely from the simulator.
