CECN1 Reinforcement Learning

From Computational Cognitive Neuroscience Wiki

Reinforcement Learning and Classical Conditioning

  • The project file: rl_cond.proj (click and Save As to download, then open in Emergent -- NOTE: requires version 4.13 or higher)

Project Documentation

(note: this is a literal copy from the simulation documentation -- it contains links that will not work within the wiki)

  • To start, it is usually a good idea to do Object/Edit Dialog in the menu just above this text, which will open this documentation in a separate window that you can more easily come back to. Alternatively, you can always return by clicking on the ProjectDocs tab at the top of this middle panel.

This replaces section 6.7.3 in the Computational Explorations.. textbook

To explore the TD learning rule (using a new version of the phase-based implementation described in the textbook in Chapter 6 -- see next section for details), we use the simple classical conditioning task discussed in the textbook section XXX. Thus, the network will learn that a stimulus (tone) reliably predicts the reward (and then that another stimulus reliably predicts that tone). First, we need to justify the use of the TD algorithm in this context, and motivate the nature of the stimulus representations used in the network.

You might recall that we said that the delta rule (aka the Rescorla-Wagner rule) provides a good model of classical conditioning, and thus wonder why TD is needed. It all has to do with the issue of timing. If one ignores the timing of the stimulus relative to the response, then in fact the TD rule becomes equivalent to the delta rule when everything happens at one time step (it just trains V(t) to match r(t)). However, animals are sensitive to the timing relationship, and, more importantly for our purposes, modeling this timing provides a particularly clear and simple demonstration of the basic properties of TD learning.
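The single-time-step equivalence can be written out directly. This is a sketch in standard TD notation, assuming no temporal discounting (a discount factor of 1, which this text does not discuss). Here \delta(t) is the TD error, r(t) the reward, and V(t) the reward expectation at time t.

```latex
\begin{align}
\delta(t) &= r(t) + V(t+1) - V(t) \\
\delta(t) &= r(t) - V(t) \quad \text{when } V(t+1) = 0 \text{ (a single time step, nothing follows)}
\end{align}
```

The second line is the delta rule error: learning simply moves V(t) toward r(t).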

The only problem is that this simple demonstration involves a somewhat unrealistic representation of timing. Basically, the stimulus representation has a distinct unit for each stimulus for each point in time, so that there is something unique for the TD system to learn from. This representation is the complete serial compound (CSC) proposed by Sutton & Barto (1990), and we will see exactly how it works when we look at the model. As we have noted, we will explore a more plausible alternative in chapter 9 where the TD error signal controls the updating of a context representation that maintains the stimulus over time.

Important Differences From Textbook

This model employs a "new and improved" implementation of the TD (temporal differences) learning algorithm within the Leabra framework. Instead of the single adaptive critic (AC) unit as described in the text, there are now 4 separate TD layers that each compute a part of the overall TD algorithm, as described below. Separating out the computations like this fixes a limitation with the previous version, and is computationally necessary to do a fully accurate implementation of TD using phase-based activation states in Leabra.

Each of these layers, except the one labeled TD, uses a distributed Scalar Value representation to encode a scalar value as a pattern of activity across the units. Specifically, each unit has a preferred value for which it fires maximally -- this value is shown above the unit in the network. It also fires less strongly for nearby values, in an overall Gaussian fashion. The actual value represented across the whole layer is shown in the first unit in the layer, strictly for display purposes. This unit does not participate in the distributed code, and it sends no net input to other layers because its act value is always 0 -- its value is only visible when viewing act_eq, act_m, or act_p. The displayed value is computed as a weighted average over the units in the layer, with each unit's preferred value weighted by its activation.
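The encoding and readout just described can be sketched in a few lines of Python. This is only an illustration, not the Leabra code: the list of preferred values (PREFERRED) and the tuning width (SIGMA) are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

# Hypothetical preferred values, one per unit. In the simulation the real
# values are shown above each unit in the network view.
PREFERRED = [round(-0.5 + 0.2 * i, 1) for i in range(11)]  # -0.5, -0.3, ..., 1.5
SIGMA = 0.2  # assumed width of the Gaussian tuning curve


def encode(value):
    """Return a Gaussian bump of activation centered on the encoded value."""
    return [math.exp(-0.5 * ((value - p) / SIGMA) ** 2) for p in PREFERRED]


def decode(acts):
    """Read the value back out as an activation-weighted average."""
    total = sum(acts)
    return sum(p * a for p, a in zip(PREFERRED, acts)) / total


acts = encode(1.0)
print([round(a, 2) for a in acts])  # bump of activity around the 0.9 and 1.1 units
print(round(decode(acts), 3))       # decoded value, close to 1.0
```

Because the readout is an average, the decoded value is only approximately equal to the encoded one near the ends of the range, where the bump is cut off on one side.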

The actual TD computations are performed by the simulation code using these scalar values extracted from the associated layers, and are not directly computed by the network-level interactions among the layers themselves. Although it is possible to get units to compute the necessary additions and subtractions required by the TD algorithm, it is much simpler and more robust to perform these calculations directly using the values represented by the layers. The critical network-level computation is learning about the reward value of stimuli, and this is done using standard learning mechanisms in the TDRewPred layer.

Here is a description of what each layer does:

  • ExtRew -- just represents the external reward input -- it doesn't do any computation, but just provides a way to display in the network what the current reward value is. It gets input from the input data table, representing the US (unconditioned stimulus).
  • TDRewPred -- this is the key learning layer, which learns to predict the reward value on the next time step based on the current stimulus inputs: V(t+1). This prediction is generated in the plus phase of Leabra settling based on its current weights from the input layer, whereas in the minus phase the layer's state is clamped to the prediction made on the previous trial: V(t).
  • TDRewInteg -- this layer integrates the reward prediction and external reward layer values, and the difference between its plus and minus phase activation states is what drives the TD delta (dopamine-like) signal. Thus, it is most similar to the AC unit in the textbook version. Specifically, its minus phase activation is V(t) -- the expectation of reward computed by the TDRewPred layer on the previous trial, and its plus phase activation is the expectation of reward on the next trial plus any actual rewards being received at the present time. Thus, its plus phase state is the sum of the ExtRew and TDRewPred values, and this sum is directly clamped as a Gaussian activation state on the layer.
  • TD -- this unit computes the plus phase minus the minus phase value of the TDRewInteg layer. This difference is the TD delta value, which is thought to act like the dopamine signal in the brain.
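The arithmetic carried out across these layers can be summarized in a short Python sketch. It works on plain scalar values rather than layer activations, and the function and argument names are hypothetical labels for the quantities described in the list above.

```python
def td_delta(ext_rew, rew_pred, prev_rew_pred):
    """TD delta computed from the scalar values the layers represent.

    ext_rew:       r(t), the value shown on ExtRew
    rew_pred:      V(t+1), the current TDRewPred prediction
    prev_rew_pred: V(t), the TDRewPred prediction from the previous trial
    """
    minus_phase = prev_rew_pred       # TDRewInteg minus phase
    plus_phase = ext_rew + rew_pred   # TDRewInteg plus phase
    return plus_phase - minus_phase   # value shown on the TD unit


# Reward arrives with no prior prediction.
print(td_delta(ext_rew=1.0, rew_pred=0.0, prev_rew_pred=0.0))
# A new prediction appears, but no reward has been delivered yet.
print(td_delta(ext_rew=0.0, rew_pred=1.0, prev_rew_pred=0.0))
# Reward arrives exactly as predicted on the previous trial.
print(td_delta(ext_rew=1.0, rew_pred=0.0, prev_rew_pred=1.0))
```

The first two cases give a positive delta, and the third gives zero because the reward was fully predicted.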

The Network

Let's start by examining the network (which differs from figure 6.22 in the text). The input layer contains three rows of 20 units each. It is located at the top to reflect the anatomical position of this cortical area relative to the midbrain dopamine system, which is represented by the TD layers below it. This is the CSC, where the rows each represent a different stimulus (A, B, C), and the columns represent points in time: each unit has a stimulus and a time label (e.g., A_10 = time step 10 of the A stimulus). The TD layers are as described above.
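As a rough illustration of the CSC layout, the following Python sketch builds the 3 x 20 input pattern for a given time step. The function names and the `windows` argument (stimulus name mapped to its first and last active time step) are hypothetical. The example windows are taken from the timings described later in this documentation.

```python
STIMULI = ["A", "B", "C"]  # one row per stimulus
N_STEPS = 20               # one column per time step


def csc_pattern(t, windows):
    """Return the 3 x 20 grid of 0/1 input activations at time step t.

    windows maps a stimulus name to its (onset, offset) time steps, inclusive.
    Each present stimulus activates exactly one unit: its own row, column t.
    """
    grid = [[0] * N_STEPS for _ in STIMULI]
    for row, name in enumerate(STIMULI):
        if name in windows:
            onset, offset = windows[name]
            if onset <= t <= offset:
                grid[row][t] = 1
    return grid


def active_labels(grid):
    """List the labels (e.g. A_10) of the active units."""
    return [f"{STIMULI[r]}_{c}"
            for r in range(len(STIMULI))
            for c in range(N_STEPS)
            if grid[r][c]]


windows = {"A": (10, 15)}
for t in (0, 10, 12):
    print(t, active_labels(csc_pattern(t, windows)))
```

Because every stimulus and time step pair has its own unit, each moment of the trial gives the learning layer a distinct input pattern.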

  • Click on r.wt in the .PanelTab.RlCondNet netview control panel tab and then on the TDRewPred units -- you will see that they all start with a uniform weight of .1. Then, click back to viewing act.

The Basic TD Learning Mechanism

Let's see how the CSC works in action.

  • Do Step on the control panel.

Nothing should happen in the Input layer, because no stimulus is present at time step 0. The various TD layers will exhibit their Gaussian bumps of activation representing 0 values, and the TD layer itself has a zero activation. Thus, no reward was either expected or obtained, and there is no deviation from expectation. Note the trial_name: field shown below the network -- it indicates the time step (e.g., t=0).

  • Continue to Step until you see an activation in the Input layer (should be 10 more steps).

This input activation represents the fact that the conditioned stimulus (CS) A (i.e., the "tone" stimulus) came on at t=10. There should be no effect of this on the TD layers, because they have not associated this CS with reward yet.

  • Continue to Step some more.

You will see that this stimulus remains active for 6 more time steps (through t=15), and at the end of this time period, the ExtRew layer represents a value of 1 instead of 0, indicating that an external reward was delivered to the network. Because the TDRewPred layer has not learned to expect this reward, the TD delta value is positive, as reflected by the activity of the TD unit, and as plotted on the graph above the network, which shows a spike at this time step 15. This TD spike is also associated with learning in the TDRewPred layer, as we'll see the next time we go through this sequence of trials.

  • Continue to Step through the end of this sequence of inputs, and through the next set (epoch) until time step 14.

You should see that the TDRewPred layer now gets activated at time step 14, signaling a prediction of reward that will come on the next time step. This expectation of reward, even in the absence of a delivered reward on the ExtRew layer (which still shows a 0 value representation), is sufficient to drive the TD "dopamine spike" as shown on the graph.

  • Step one more time (t=15).

Now the situation is reversed: the ExtRew layer shows that the reward has been presented, but the TD value is 0. This is because the TDRewPred layer accurately predicted this reward on the prior time step, and thus the reward prediction error, which TD signals, is zero! In terms of the overall graph display, you can see that the "dopamine spike" of TD delta has moved forward one step in time. This is the critical feature of the TD algorithm: by learning to anticipate rewards that arrive one time step later, it ends up moving the dopamine spike earlier in time.

  • Now Run the model and see what happens with more training.

You should see that the spike moves "forward" in time with each training step, but can't move any further than the onset of the CS at time step 10. This is the same process that was shown in figure 6.20.
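The backward march of the spike can be reproduced with a very small TD simulation in Python. This is a sketch under stated assumptions, not the Leabra model: the prediction is a plain sum of the weights from the active CSC units, weights start at 0, the learning rate is 1, and there is no discounting. The name `run_epoch` and its arguments are hypothetical.

```python
def run_epoch(weights, stimuli, reward_time, n_steps=20, lrate=1.0):
    """Run one pass through the sequence and return the TD delta per step.

    weights: dict mapping a CSC unit (stimulus, t) to its weight
    stimuli: dict mapping stimulus name to (onset, offset), inclusive
    """
    deltas = []
    prev_pred = 0.0     # V(t): prediction made on the previous step
    prev_active = []    # CSC units that produced that prediction
    for t in range(n_steps):
        active = [(s, t) for s, (on, off) in stimuli.items() if on <= t <= off]
        pred = sum(weights.get(u, 0.0) for u in active)  # V(t+1)
        reward = 1.0 if t == reward_time else 0.0        # r(t)
        delta = reward + pred - prev_pred
        # Credit goes to the units that were active on the previous step.
        for u in prev_active:
            weights[u] = weights.get(u, 0.0) + lrate * delta
        deltas.append(delta)
        prev_pred, prev_active = pred, active
    return deltas


weights = {}
spike_times = []
for epoch in range(8):
    deltas = run_epoch(weights, {"A": (10, 15)}, reward_time=15)
    spike_times.append(deltas.index(max(deltas)))
print(spike_times)
```

Under these assumptions the largest delta sits at t=15 on the first pass and then shifts one step earlier on each pass until it reaches the CS onset at t=10. It stays there because no input unit is active at t=9 to carry the prediction further back.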

We can also examine the weights to see what the network has learned.

  • Click on r.wt and then on the TDRewPred layer units -- you should see that right around the 1.0 value (units labeled 0.9 and 1.1) there are increased weights from the A stimulus for time steps 10-14.

Extinction and Second Order Conditioning

At this point, there are many standard phenomena in classical conditioning that can be explored with this model. We will look at two: extinction and second order conditioning. Extinction occurs when the stimulus is no longer predictive of reward -- it then loses its ability to predict this reward (which is appropriate). Second order conditioning, as we discussed earlier, is where a conditioned stimulus can serve as the unconditioned stimulus for another stimulus -- in other words, one can extend the prediction of reward backward across two separate stimuli.

We can simulate extinction by simply turning off the reward that appears at t=15. To do this, we need to alter the parameters on the control panel that determine the nature of the stimulus input and reward. The first parameter is the env_type, which determines which stimuli are being presented (CS A, CS B, and US). Currently, we are presenting the CSA and US. To turn off the US, select CSA_NO. For this to take effect, you need to hit the Gen Inputs button at the bottom of the control panel, which uses the parameters to generate an environment. The other parameters shown determine when the stimuli are presented, if they are selected by the env_type parameter. We can leave these in their current state.

  • Now, hit Reset Trial Data to clear out the trial-level data plotted in the graph view, and then Step through the sequence of inputs.

Question 6.5 (a) What happened at the point where the reward was supposed to occur? (b) Explain why this happened using the TD equations. (c) Then, Run the network and describe what occurs next in terms of the TD error signals plotted in the graph view, and explain why TD does this. (d) After the network is done learning again, does the stimulus still evoke an expectation of reward?


If you look at the weights into the TDRewPred layer, you'll notice something interesting. The weights into the units around a value of 1 have not decreased very much, but the weights into the units around the 0 value have increased. Thus, this model shows that extinction learning may be more about learning new counteracting associations, instead of unlearning the previous associations. There is now considerable evidence for this in the brain, with some very important implications. For example, the original associations are always "latent" in the system, so they can be reactivated more quickly later on.

Now, let's explore second order conditioning. We must first retrain the network on the original association between the first stimulus (CS A) and the reward.

  • Select CSA_US for the env_type, do Gen Inputs, then do Init, answer Yes to Initializing the network weights, and then do Run to set up the initial association as last time.

Now, we will turn on the CS B stimulus, which starts at t=2 and lasts until time step 10.

  • Select CSA_CSB_US as the env_type (if the full text is not shown, it is the first one listed), then do Gen Inputs, and go back to viewing act if you aren't already. Hit Reset Trial Data to clear the graph view. Then, Step through the trial (you might need to go through twice to get a full trial, depending on where it stopped last time).

Essentially, the first stimulus acts just like a reward by triggering a positive delta value, and thus allows the second stimulus to learn to predict this first stimulus.

  • Push Run, and then Stop when the graph view stops changing.

You will see that the early anticipation of reward gets carried out to the onset of the CS B stimulus (which comes first in time).
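Second order conditioning can be sketched with a simplified standalone TD simulation in Python. The assumptions are a summed-weight prediction over active CSC units, weights starting at 0, a learning rate of 1, and no discounting, and the function name `run_epoch` is hypothetical. CS A is trained first with the reward, and then CS B (t=2 through t=10) is added.

```python
def run_epoch(weights, stimuli, reward_time, n_steps=20, lrate=1.0):
    """Run one pass through the sequence and return the TD delta per step."""
    deltas = []
    prev_pred, prev_active = 0.0, []
    for t in range(n_steps):
        active = [(s, t) for s, (on, off) in stimuli.items() if on <= t <= off]
        pred = sum(weights.get(u, 0.0) for u in active)  # V(t+1)
        reward = 1.0 if t == reward_time else 0.0        # r(t)
        delta = reward + pred - prev_pred
        for u in prev_active:  # update units active on the previous step
            weights[u] = weights.get(u, 0.0) + lrate * delta
        deltas.append(delta)
        prev_pred, prev_active = pred, active
    return deltas


weights = {}

# Phase 1: CS A alone predicts the reward at t=15.
for _ in range(6):
    run_epoch(weights, {"A": (10, 15)}, reward_time=15)

# Phase 2: CS B now precedes CS A.
spikes = []
for _ in range(10):
    deltas = run_epoch(weights, {"A": (10, 15), "B": (2, 10)}, reward_time=15)
    spikes.append(deltas.index(max(deltas)))
print(spikes)
```

In this sketch the onset of CS A at t=10 produces the positive delta on the first pass of the second phase, which trains the CS B unit active just before it. The spike then steps back through the CS B units until it reaches the CS B onset at t=2.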

Finally, we can present some of the limitations of the CSC representation. One obvious problem is capacity -- each stimulus requires a different set of units for all possible time intervals that can be represented. Also, the CSC begs the question of how time is initialized to zero at the right point so every trial is properly synchronized. Finally, the CSC requires that the stimulus stay on (or some trace of it) up to the point of reward, which is unrealistic. This last problem points to an important issue with the TD algorithm, which is that although it can learn to bridge temporal gaps, it requires some suitable representation to support this bridging. We will see in chapters 9 and 11 that this and the other problems can be resolved by allowing the TD system to control the updating of context-like representations associated with the prefrontal cortex.

Advanced Explorations

More advanced explorations can be performed by manipulating the extra input patterns in the GenCondInputs program found under .programs in the left browser panel. Here you can manipulate the probabilities of stimuli being presented, and introduce randomness in the timings. Generally speaking, these manipulations tend to highlight the limitations of the CSC input representation, and of TD more generally. See http://grey.colorado.edu/emergent/index.php/Leabra_PVLV for information about an alternative, biologically-based approach, which is also supported by this simulation software.
