Leabra TD
This page describes the Leabra implementation of the temporal differences (TD) reinforcement learning algorithm, which uses a set of specialized layer spec types. Each of the TD layers, except the one labeled TD, uses a distributed ScalarValLayerSpec representation to encode a scalar value as a pattern of activity across its units. Specifically, each unit has a preferred value at which it fires maximally; this value is shown above the unit in the network. The unit also fires less strongly for nearby values, with activation falling off in an overall Gaussian fashion. The value represented across the whole layer is computed as a weighted average over the units in the layer, with each unit's preferred (target) value weighted by its activation. This value is shown in the first unit in the layer strictly for display purposes. That first unit does not participate in the distributed code and sends no net input to other layers, because its act value is always 0; the represented value is visible only when viewing the act_eq, act_m, or act_p values.
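The encoding and decoding described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ScalarValLayerSpec code: the function names, the 11 evenly spaced preferred values, and the Gaussian width `sigma` are all assumptions made for the example.

```python
import math

def encode_scalar(value, preferred, sigma=0.1):
    """Gaussian activation of each unit around its preferred value.

    sigma is a hypothetical tuning width, not a documented default.
    """
    return [math.exp(-((value - p) ** 2) / (2.0 * sigma ** 2)) for p in preferred]

def decode_scalar(acts, preferred):
    """Activation-weighted average of the units' preferred values."""
    total = sum(acts)
    if total == 0.0:
        return 0.0  # no activity: nothing to decode
    return sum(a * p for a, p in zip(acts, preferred)) / total

# Hypothetical layer with 11 units whose preferred values span 0.0 .. 1.0
preferred = [i / 10 for i in range(11)]
acts = encode_scalar(0.5, preferred)
decoded = decode_scalar(acts, preferred)
```

The unit whose preferred value matches the encoded value is the most active, its neighbors are partially active, and the weighted average recovers the original scalar. Values near the ends of the range decode with a small bias in this sketch, because the Gaussian bump is cut off by the edge of the layer.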
The actual TD computations are performed by the simulation code using these scalar values extracted from the associated layers, and are not directly computed by the network-level interactions among the layers themselves. Although it is possible to get units to compute the necessary additions and subtractions required by the TD algorithm, it is much simpler and more robust to perform these calculations directly using the values represented by the layers. The critical network-level computation is learning about the reward value of stimuli, and this is done using standard learning mechanisms in the TDRewPred layer.
Here is a description of what each layer does:
- ExtRew -- just represents the external reward input -- it doesn't do any computation, but just provides a way to display in the network what the current reward value is. It gets input from the input data table, representing the US (unconditioned stimulus).
- TDRewPred -- this is the key learning layer, which learns to predict the reward value on the next time step based on the current stimulus inputs: V(t+1). This prediction is generated in the plus phase of Leabra settling based on its current weights from the input layer, whereas in the minus phase the layer's state is clamped to the prediction made on the previous trial: V(t).
- TDRewInteg -- this layer integrates the reward prediction and external reward layer values, and the difference between its plus and minus phase activation states is what drives the TD delta (dopamine-like) signal. Thus, it is most similar to the AC unit in the textbook version. Specifically, its minus phase activation is V(t) -- the expectation of reward computed by the rew pred layer on the previous trial, and its plus phase activation is the expectation of reward on the next trial plus any actual rewards being received at the present time. Thus, its plus phase state is the sum of the ExtRew and TDRewPred values, and this sum is directly clamped as a Gaussian activation state on the layer.
- TD -- this unit computes the plus - minus values from the rew integ layer, which reflects the TD delta value and is thought to act like the dopamine signal in the brain.
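The arithmetic carried out across these layers can be sketched as follows, working directly on the scalar values the layers represent. This is a minimal illustration, not the simulator's code: the function name `td_deltas`, the starting prediction of 0 before the first trial, and the example values are assumptions. No discounting is applied, since the description above specifies a plain sum.

```python
def td_deltas(predictions, rewards):
    """Compute the TD delta on each trial from scalar layer values.

    predictions[t] -- TDRewPred plus-phase value on trial t, i.e. V(t+1)
    rewards[t]     -- ExtRew value on trial t
    Assumes (hypothetically) that the prediction before the first trial is 0.
    """
    deltas = []
    v_prev = 0.0  # prediction carried over from the previous trial, V(t)
    for v_next, rew in zip(predictions, rewards):
        minus = v_prev           # TDRewInteg minus phase: V(t)
        plus = rew + v_next      # TDRewInteg plus phase: ExtRew + TDRewPred
        deltas.append(plus - minus)  # TD layer: plus - minus
        v_prev = v_next          # becomes the minus-phase value on the next trial
    return deltas

# Hypothetical, fully learned case: a stimulus on trial 1 predicts reward on trial 2
predictions = [0.0, 1.0, 0.0]
rewards = [0.0, 0.0, 1.0]
deltas = td_deltas(predictions, rewards)
```

In this example the delta is positive when the predictive stimulus appears (trial 1) and zero when the fully predicted reward arrives (trial 2), because the reward in the plus phase is cancelled by the prior prediction clamped in the minus phase.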
Implementational Details
- DaModUnit and DaModUnitSpec
- ExtRewLayerSpec -- represents external rewards
- TDRewPredLayerSpec -- reward prediction layer spec -- predicts future reward based on current sensory inputs
- TDRewIntegLayerSpec -- integrates predicted and external rewards -- time derivative of this layer is the TD signal that simulates dopamine
- TdLayerSpec -- just computes temporal derivative of the rew integ layerspec.
