# Backpropagation

## Introduction

Backpropagation is perhaps the most commonly used neural network learning algorithm. Several different "flavors" of backpropagation have been developed over the years, and a number of them are implemented in the software. These include variants that use different error functions, such as cross-entropy, and recurrent backprop, ranging from the simple recurrent network through the Almeida-Pineda algorithm to real-time continuous recurrent backprop. The implementation also allows the user to extend the unit types to use different activation and error functions in a straightforward manner.

Note that the simple recurrent networks (SRN, a.k.a. Elman networks) are described in the feedforward backprop section, as they are more like feedforward networks than the fully recurrent ones.

The basic structure of the backpropagation algorithm consists of two phases: an activation propagation phase and an error backpropagation phase. In the simplest version of Bp, these phases are strictly feed-forward and feed-back, respectively, and are computed sequentially, layer by layer. Thus, the implementation assumes that the layers are organized sequentially in the order that activation flows.
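The two phases can be illustrated with a minimal sketch. It is not the software's own code: the function names (`forward`, `backward`), the sigmoid activation, and the squared-error function are assumptions chosen for illustration, and `dEdA` here is the true derivative of the error, which may differ in sign from the convention the implementation uses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Activation phase: propagate activations layer by layer, in order.

    layers is a list of (weights, biases); weights[r][s] connects
    sending unit s to receiving unit r (hypothetical layout).
    """
    acts = [list(inputs)]
    for weights, biases in layers:
        prev = acts[-1]
        nets = [sum(w * a for w, a in zip(row, prev)) + b
                for row, b in zip(weights, biases)]
        acts.append([sigmoid(n) for n in nets])
    return acts

def backward(layers, acts, targets):
    """Error phase: propagate dE/dNet from the last layer back to the first."""
    # Assumed error function: E = 0.5 * sum((a - t)^2), so dE/dA = a - t.
    dEdA = [a - t for a, t in zip(acts[-1], targets)]
    dEdNets = []
    for i in range(len(layers) - 1, -1, -1):
        weights, _ = layers[i]
        out = acts[i + 1]
        # Sigmoid derivative is a * (1 - a).
        dEdNet = [d * a * (1.0 - a) for d, a in zip(dEdA, out)]
        dEdNets.insert(0, dEdNet)
        # Send the error back through the weights to the sending layer.
        dEdA = [sum(weights[r][s] * dEdNet[r] for r in range(len(dEdNet)))
                for s in range(len(acts[i]))]
    return dEdNets
```

Because each layer's computation depends only on the layer before it (forward) or after it (backward), a single ordered sweep in each direction is sufficient, which is why the layer ordering matters.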

In the recurrent versions, both the activation and the error propagation are computed in two steps, so that each unit is effectively updated simultaneously with the other units. In the activation phase, this is done by first computing the net input to each unit based on the other units' current activation values, and then updating the activation values based on this net input. Similarly, in the error phase, the derivative of the error with respect to the activation (dEdA) of each unit is first computed based on the current dEdNet values, and then the dEdNet values are updated based on the new dEdA.
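The two-step scheme can be sketched as follows. This is a simplified discrete-time illustration, not the Almeida-Pineda or real-time algorithms themselves. The function names, the sigmoid activation, and the `ext_input` and `dEdA_ext` arguments (external input and externally supplied error) are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sync_activation_step(weights, acts, ext_input):
    """weights[r][s] connects sending unit s to receiving unit r."""
    # Step 1: compute every net input from the *current* activations.
    nets = [sum(w * a for w, a in zip(row, acts)) + e
            for row, e in zip(weights, ext_input)]
    # Step 2: only now produce the new activations, so no unit sees
    # a value that was already updated during this step.
    return [sigmoid(n) for n in nets]

def sync_error_step(weights, acts, dEdNet, dEdA_ext):
    n = len(acts)
    # Step 1: dEdA of each unit from the *current* dEdNet values of
    # the units it sends to, plus any external error.
    dEdA = [dEdA_ext[s] + sum(weights[r][s] * dEdNet[r] for r in range(n))
            for s in range(n)]
    # Step 2: new dEdNet values from the new dEdA (sigmoid derivative).
    return [d * a * (1.0 - a) for d, a in zip(dEdA, acts)]
```

Returning new lists instead of overwriting `acts` or `dEdNet` in place is what makes the update effectively simultaneous.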

## Feedforward Bp Reference

The classes defined in the basic feedforward Bp implementation include:

- BpLayer

Bias weights are implemented by adding a BpCon object directly to the BpUnit, rather than by allocating a self-projection or using some similar scheme. In addition, the BpUnitSpec has a pointer to a BpConSpec that controls the updating of the bias weight. Thus, although some code was written to support the special bias weights on units, it amounts to simply calling the appropriate function on the BpConSpec.

### Variations on the Standard

- implements a linear activation function
- implements a threshold linear activation function with the threshold set by the parameter `threshold`. Activation is zero when net is below threshold, and net - threshold above that.

- adds noise to the activations of units. The noise is specified by the noise member.

- computes a binary activation, with the probability of being active a sigmoidal function of the net input (e.g., like a Boltzmann Machine unit).

- computes activation as a Gaussian function of the distance between the weights and the activations. The variance of the Gaussian is spherical (the same for all weights), and is given by the parameter var.

- computes activation as a Gaussian function of the standard dot-product net input (not the distance, as in the RBF). The mean of the effectively uni-dimensional Gaussian is specified by the mean parameter, with a standard deviation of std_dev.

- computes activation as an exponential function of the net input (e^net). This is useful for implementing SoftMax units, among other things.

- takes one-to-one input from a corresponding exponential unit, and another input from a LinearBpUnitSpec unit that computes the sum over all the exponential units, and computes the division between these two. This results in a SoftMax unit. Note that the LinearBpUnitSpec must have fixed weights all of value 1, and that the SoftMax units must have the one-to-one projection from the exp units first, followed by the projection from the sum unit. See `demo/bp_misc/bp_softmax.proj` for a demonstration of how to configure a SoftMax network.
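Several of the unit activation functions described above can be written out as a rough sketch. The function names are hypothetical, and the exact scaling of the two Gaussian functions (for example, whether the exponent is divided by `var` or by `2 * var`) is an assumption, not taken from the implementation.

```python
import math

def thresh_linear(net, threshold):
    # Zero below the threshold, net - threshold above it.
    return net - threshold if net > threshold else 0.0

def rbf(weights, acts, var):
    # Gaussian of the squared distance between weights and activations,
    # with one shared (spherical) variance. Scaling is an assumption.
    dist_sq = sum((w - a) ** 2 for w, a in zip(weights, acts))
    return math.exp(-dist_sq / (2.0 * var))

def bump(net, mean, std_dev):
    # Gaussian of the ordinary dot-product net input. Scaling is an assumption.
    z = (net - mean) / std_dev
    return math.exp(-0.5 * z * z)

def softmax_via_sum_unit(nets):
    # Exponential units: one per net input.
    exps = [math.exp(n) for n in nets]
    # Linear sum unit with fixed weights of 1 on every exponential unit.
    total = sum(1.0 * e for e in exps)
    # Each SoftMax unit divides its one-to-one exp input by the sum.
    return [e / total for e in exps]
```

Splitting SoftMax into exponential units, a summing unit, and a division mirrors the three-part network arrangement described above, so the resulting activations always sum to 1.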

- computes very simple Hebbian learning instead of backpropagation. It is useful for making comparisons between delta-rule and Hebbian learning. The rule is simply `dwt = ru->act * su->act`, where `ru->act` is the target value if present.

- scales the error sent back to the sending units by the factor `err_scale`. This can be used in cases where there are multiple output layers, some of which are not supposed to influence learning in the hidden layer, for example.

- implements the delta-bar-delta learning rate adaptation scheme (Jacobs, 1988). It should only be used in batch mode weight updating. It requires a dedicated connection type that contains a connection-wise learning rate parameter. This learning rate is incremented additively by lrate_incr when the signs of the current and previous weight changes agree, and decremented multiplicatively by lrate_decr when they do not. The demo project `demo/bp_misc/bp_ft_dbd.proj` provides an example of how to set up delta-bar-delta learning.
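The delta-bar-delta learning rate adjustment described above can be sketched as follows. The function name `dbd_update` is hypothetical, and the multiplicative decrement is assumed to take the form `lrate * lrate_decr` with `lrate_decr` between 0 and 1. The actual implementation may differ.

```python
def dbd_update(lrate, dwt, prev_dwt, lrate_incr, lrate_decr):
    """Return the new connection-wise learning rate after one batch."""
    agreement = dwt * prev_dwt
    if agreement > 0.0:
        # Same sign as the previous weight change: additive increase.
        return lrate + lrate_incr
    if agreement < 0.0:
        # Sign flipped: multiplicative decrease (assumed form).
        return lrate * lrate_decr
    # One of the changes was zero: leave the rate unchanged.
    return lrate
```

The asymmetry between the two branches means the rate grows slowly while the weight keeps moving in one direction and shrinks quickly once it starts to oscillate.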