# Difference between revisions of "Backpropagation"

(New page: {{gendoc}}) |
|||

(13 intermediate revisions by 5 users not shown) | |||

Line 1: | Line 1: | ||

− | {{gendoc}} | + | = Introduction = |

+ | |||

+ | Back to [[Network Algorithm Guide]] | ||

+ | |||

+ | '''IMPORTANT:''' See [[Backprop_8.0_Update]] for important changes in version 8.0 of emergent. | ||

+ | |||

+ | Backpropagation is perhaps the most commonly used neural network | ||

+ | learning algorithm. Several different "flavors" of backpropagation have | ||

+ | been developed over the years, several of which have been implemented in | ||

+ | the software, including the use of different error functions such | ||

+ | as cross-entropy, and recurrent backprop, from the simple recurrent | ||

+ | network to the Almeida-Pineda algorithm up to the real-time continuous | ||

+ | recurrent backprop. The implementation allows the user to extend the | ||

+ | unit types to use different activation and error functions in a | ||

+ | straightforward manner. | ||

+ | |||

+ | Note that the simple recurrent networks (SRN, a.k.a. Elman networks) are | ||

+ | described in the feedforward backprop section, as they are more like | ||

+ | feedforward networks than the fully recurrent ones. | ||

+ | |||

+ | The basic structure of the backpropagation algorithm consists of two phases, an | ||

+ | activation propagation phase, and an error backpropagation phase. In | ||

+ | the simplest version of Bp, both of these phases are strictly | ||

+ | feed-forward and feed-back, and are computed sequentially | ||

+ | layer-by-layer. Thus, the implementation assumes that the layers are | ||

+ | organized sequentially in the order that activation flows. | ||

+ | |||

+ | In the recurrent versions, both the activation and the error propagation | ||

+ | are computed in two steps so that each unit is effectively being updated | ||

+ | simultaneously with the other units. This is done in the activation | ||

+ | phase by first computing the net input to each unit based on the other | ||

+ | units current activation values, and then updating the activation values | ||

+ | based on this net input. Similarly, in the error phase, first the | ||

+ | derivative of the error with respect to the activation (dEdA) of each | ||

+ | unit is computed based on current dEdNet values, and then the dEdNet | ||

+ | values are updated based on the new dEdNet. | ||

+ | |||

+ | = Feedforward Bp Reference = | ||

+ | |||

+ | The classes defined in the basic feedforward Bp implementation include: | ||

+ | * {{gendoc|BpConSpec}}, {{gendoc|BpCon}} | ||

+ | * {{gendoc|BpUnit}}, {{gendoc|BpUnitSpec}} | ||

+ | * {{gendoc|BpLayer}} | ||

+ | * {{gendoc|BpNetwork}} | ||

+ | |||

+ | Bias weights are implemented by adding a BpCon object | ||

+ | to the BpUnit directly, and not by trying to allocate some kind of | ||

+ | self projection or some other scheme like that. In addition, the | ||

+ | BpUnitSpec has a pointer to a BpConSpec to control the updating | ||

+ | etc of the bias weight. Thus, while some code was written to support | ||

+ | the special bias weights on units, it amounts to simply calling the | ||

+ | appropriate function on the BpConSpec. | ||

+ | |||

+ | == Variations on the Standard == | ||

+ | |||

+ | * {{gendoc|LinearBpUnitSpec}} implements a linear activation function | ||

+ | * {{gendoc|ThreshLinBpUnitSpec}} implements a threshold linear activation | ||

+ | function with the threshold set by the parameter @code{threshold}. | ||

+ | Activation is zero when net is below threshold, net-threshold above | ||

+ | that. | ||

+ | * {{gendoc|NoisyBpUnitSpec}} adds noise to the activations of units. The noise | ||

+ | is specified by the noise member. | ||

+ | * {{gendoc|StochasticBpUnitSpec}} computes a binary activation, with the | ||

+ | probability of being active a sigmoidal function of the net input (e.g., | ||

+ | like a Boltzmann Machine unit). | ||

+ | * {{gendoc|RBFBpUnitSpec}} computes activation as a Gaussian function of the | ||

+ | distance between the weights and the activations. The variance of the | ||

+ | Gaussian is spherical (the same for all weights), and is given by the | ||

+ | parameter var. | ||

+ | * {{gendoc|BumpBpUnitSpec}} computes activation as a Gaussian function of the | ||

+ | standard dot-product net input (not the distance, as in the RBF). The | ||

+ | mean of the effectively uni-dimensional Gaussian is specified by the | ||

+ | mean parameter, with a standard deviation of std_dev. | ||

+ | * {{gendoc|ExpBpUnitSpec}} computes activation as an exponential function of the | ||

+ | net input (e^net). This is useful for implementing SoftMax units, among | ||

+ | other things. | ||

+ | * {{gendoc|SoftMaxBpUnitSpec}} takes one-to-one input from a corresponding | ||

+ | exponential unit, and another input from a LinearBpUnitSpec unit that | ||

+ | computes the sum over all the exponential units, and computes the | ||

+ | division between these two. This results in a SoftMax unit. Note that | ||

+ | the LinearBpUnitSpec must have fixed weights all of value 1, and that | ||

+ | the SoftMaxUnit's must have the one-to-one projection from exp units | ||

+ | first, followed by the projection from the sum units. See | ||

+ | <code>demo/bp_misc/bp_softmax.proj</code> for a demonstration of how to | ||

+ | configure a SoftMax network. | ||

+ | * {{gendoc|HebbBpConSpec}} computes very simple Hebbian learning instead of | ||

+ | backpropagation. It is useful for making comparisons between delta-rule | ||

+ | and Hebbian leanring. The rule is simply <code>dwt = ru->act * | ||

+ | su->act</code>, where <code>ru->act</code> is the target value if present. | ||

+ | * {{gendoc|ErrScaleBpConSpec}} scales the error sent back to the sending units by | ||

+ | the factor @code{err_scale}. This can be used in cases where there are | ||

+ | multiple output layers, some of which are not supposed to influence | ||

+ | learning in the hidden layer, for example. | ||

+ | * {{gendoc|DeltaBarDeltaBpConSpec}} implements the delta-bar-delta learning rate | ||

+ | adaptation scheme (Jacobs, 1988). It should only be used in batch | ||

+ | mode weight updating. The connection type must be | ||

+ | {{gendoc|DeltaBarDeltaBpCon}}, which contains a connection-wise learning rate | ||

+ | parameter. This learning rate is additively incremented by | ||

+ | lrate_incr when the sign of the current and previous weight | ||

+ | changes are in agreement, and decrements it multiplicatively by | ||

+ | lrate_decr when they are not. The demo project | ||

+ | <code>demo/bp_misc/bp_ft_dbd.proj</code> provides an example of how to set | ||

+ | up delta-bar-delta learning. | ||

+ | |||

+ | == Simple Recurrent Networks (SRN's) == | ||

+ | |||

+ | Simple recurrent networks (SRN) (Elman, 1988) involve the use of | ||

+ | a special set of context units which copy their values from the hidden | ||

+ | units, and from which the hidden units receive inputs. Thus, it | ||

+ | provides a simple form of recurrence that can be used to train | ||

+ | networks to perform sequential tasks over time. The | ||

+ | {{gendoc|BpWizard}} has a <code>Network/SRNContext</code> function that will | ||

+ | automatically build an SRN context layer as described below. | ||

+ | |||

+ | The implementation of SRN's uses a special version of the | ||

+ | BpUnitSpec called the {{gendoc|BpContextSpec}}. This spec overloads | ||

+ | the activation function to simply copy from a corresponding hidden unit. | ||

+ | The correspondence between hidden and context units is established by | ||

+ | creating a single one-to-one projection into the context units from the | ||

+ | hidden units. The context units look for the sending unit on the other | ||

+ | side of their first connection in their first connection group for the | ||

+ | activation to copy. This kind of connection should be created with a | ||

+ | {{gendoc|OneToOnePrjnSpec}}. | ||

+ | |||

+ | '''Important:''' The context units should be in a layer that | ||

+ | ''follows'' the hidden units they copy from. This is because the | ||

+ | context units should provide input to the hidden units before copying | ||

+ | their activation values. This means that the hidden units should update | ||

+ | themselves first. | ||

+ | |||

+ | The context units do not have to simply copy the activations directly | ||

+ | from the hidden units. Instead, they can perform a time-averaging of | ||

+ | information through the use of an updating equation as described below. | ||

+ | The parameters of the context spec are as follows: | ||

+ | |||

+ | The demo project <file>demo/bp_srn/srn_fsa.proj</file> is an example of a | ||

+ | SRN network that uses the sequence processes. | ||

+ | |||

+ | = Recurrent Backprop--NYI! = | ||

+ | |||

+ | '''NOTE: Recurrent backpropagation (RBp) is not currently supported (as of {{emergent}} 5.0.2)''' | ||

+ | |||

+ | Check the [[Comparison of Neural Network Simulators]] for existing simulators that support RBp. | ||

+ | |||

+ | |||

+ | Recurrent backpropagation (RBp) extends the basic functionality of | ||

+ | feed-forward backprop to networks with recurrently interconnected units, | ||

+ | which can exhibit interesting dynamical properties as activation | ||

+ | propagates through the network over time. | ||

+ | |||

+ | The recurrent backprop implementation (RBp) defines a new set of types | ||

+ | that are derived from the corresponding Bp versions: | ||

+ | * {{gendoc|RBpConSpec}} | ||

+ | * {{gendoc|RBpUnit}} | ||

+ | * {{gendoc|RBpUnitSpec}} | ||

+ | Note that RBp uses the same Connection type as Bp. In addition, support for the Almeida-Pineda algorithm is made possible by special ApBp Programs. | ||

+ | |||

+ | There are a couple of general features of the version of recurrent | ||

+ | backprop that the user should be aware of. First | ||

+ | of all, the model used is that of a discrete approximation to a | ||

+ | continuous dynamic system, which is defined by the sigmoidal activation | ||

+ | of the net input to the units. The granularity of the discrete | ||

+ | approximation is controlled by the dt parameter, which should be | ||

+ | in the range between 0 and 1, with smaller values corresponding to a | ||

+ | finer, closer to continuous approximation. Thus, the behavior of the | ||

+ | network should be roughly similar for different dt values, with | ||

+ | the main effect of dt being to make updating smoother or rougher. | ||

+ | |||

+ | Also, there are two ways in which the units can settle, one involves | ||

+ | making incremental changes to the activation values of units, and the | ||

+ | other involves making incremental changes to the net inputs. The latter | ||

+ | is generally preferred since it allows networks with large weights to | ||

+ | update activations quickly compared to activation-based updates, which | ||

+ | have a strict ceiling on the update rate since the maximum activation | ||

+ | value is 1, while the maximum net input value is unbounded. | ||

+ | |||

+ | As in standard backpropagation, recurrent backprop operates in two | ||

+ | phases: activation propagation and error backpropagation. The | ||

+ | difference in recurrent backprop is that both of these phases extend | ||

+ | over time. Thus, the network is run for some number of activation | ||

+ | update cycles, during which a record of the activation states is kept by | ||

+ | each unit, and then a backpropagation is performed that goes all the way | ||

+ | back in time through the record of these activation states. The | ||

+ | backpropagation happens between the receiving units at time t and the | ||

+ | sending units at the previous time step, time t-1. Another way of | ||

+ | thinking about this process is to unfold the network in time, which | ||

+ | would result in a large network with a new set of layers for each time | ||

+ | step, but with the same set of weights used repeatedly for each time | ||

+ | step unfolding. Doing this, it is clear that the sending units are in | ||

+ | the previous time step relative to the receiving units. | ||

+ | |||

+ | The exact length of the activation propagation phase and the timing and | ||

+ | frequency of the backpropagation phases can be controlled in different | ||

+ | ways that are appropriate for different kinds of tasks. In cases where | ||

+ | there is a clearly-defined notion of a set of distinct temporal | ||

+ | sequences, one can propagate activation all the way through each | ||

+ | sequence, and then backpropagate at the end of the sequence. This is | ||

+ | the default mode of operation for the processes. | ||

+ | |||

+ | There are other kinds of environments where there is no clear boundary | ||

+ | between one sequence and the next. This is known as "real time" mode, | ||

+ | and it works by periodically performing a backpropagation operation | ||

+ | after some number of activation updates have been performed. Thus, | ||

+ | there is a notion of a "time window" over which the network will be | ||

+ | sensitive to temporal contingencies through the weight updates driven by | ||

+ | a single backpropagation operation. In addition, these backpropagations | ||

+ | can occur with a period that is less than the length of the time window, | ||

+ | so that there is some overlap in the events covered by successive | ||

+ | backpropagation operations. This can enable longer-range temporal | ||

+ | contingencies to be bootstrapped from a series of overlapping | ||

+ | backpropagations, each with a smaller time window. | ||

+ | |||

+ | There is a simpler variation of a recurrent backpropagation algorithm | ||

+ | that was invented by Almeida and Pineda, and is named after them. In | ||

+ | this algorithm, the activation updating phase proceeds iteratively until | ||

+ | the maximum change between the previous and the current activation | ||

+ | values over all units is below some criterion. Thus, the network | ||

+ | settles into a stable attractor state. Then, the backpropagation phase | ||

+ | is performed repeatedly until it too settles on a stable set of error | ||

+ | derivative terms (i.e., the maximum difference between the derivative of | ||

+ | the error for each unit and the previously computed such derivative is | ||

+ | below some threshold). These asymptotic error derivatives are then used | ||

+ | to update the weights. Note that the backpropagation operates | ||

+ | repeatedly on the asymptotic or stable activation values computed during | ||

+ | the first stage of settling, and not on the trajectory of these | ||

+ | activation states as in the "standard" version of RBp. The | ||

+ | Almeida-Pineda form of the algorithm is enabled by using the APBp | ||

+ | Programs, which compute the two phases of settling over cycles of | ||

+ | either activation propagation or error backpropagation. |

## Latest revision as of 02:51, 31 August 2016

## Contents

# Introduction

Back to Network Algorithm Guide

**IMPORTANT:** See Backprop_8.0_Update for important changes in version 8.0 of emergent.

Backpropagation is perhaps the most commonly used neural network learning algorithm. Several different "flavors" of backpropagation have been developed over the years, several of which have been implemented in the software, including the use of different error functions such as cross-entropy, and recurrent backprop, from the simple recurrent network to the Almeida-Pineda algorithm up to the real-time continuous recurrent backprop. The implementation allows the user to extend the unit types to use different activation and error functions in a straightforward manner.

Note that the simple recurrent networks (SRN, a.k.a. Elman networks) are described in the feedforward backprop section, as they are more like feedforward networks than the fully recurrent ones.

The basic structure of the backpropagation algorithm consists of two phases, an activation propagation phase, and an error backpropagation phase. In the simplest version of Bp, both of these phases are strictly feed-forward and feed-back, and are computed sequentially layer-by-layer. Thus, the implementation assumes that the layers are organized sequentially in the order that activation flows.

In the recurrent versions, both the activation and the error propagation are computed in two steps so that each unit is effectively being updated simultaneously with the other units. This is done in the activation phase by first computing the net input to each unit based on the other units current activation values, and then updating the activation values based on this net input. Similarly, in the error phase, first the derivative of the error with respect to the activation (dEdA) of each unit is computed based on current dEdNet values, and then the dEdNet values are updated based on the new dEdNet.

# Feedforward Bp Reference

The classes defined in the basic feedforward Bp implementation include:

- ,
- ,

Bias weights are implemented by adding a BpCon object to the BpUnit directly, and not by trying to allocate some kind of self projection or some other scheme like that. In addition, the BpUnitSpec has a pointer to a BpConSpec to control the updating etc of the bias weight. Thus, while some code was written to support the special bias weights on units, it amounts to simply calling the appropriate function on the BpConSpec.

## Variations on the Standard

- implements a linear activation function
- implements a threshold linear activation

function with the threshold set by the parameter @code{threshold}. Activation is zero when net is below threshold, net-threshold above that.

- adds noise to the activations of units. The noise

is specified by the noise member.

- computes a binary activation, with the

probability of being active a sigmoidal function of the net input (e.g., like a Boltzmann Machine unit).

- computes activation as a Gaussian function of the

distance between the weights and the activations. The variance of the Gaussian is spherical (the same for all weights), and is given by the parameter var.

- computes activation as a Gaussian function of the

standard dot-product net input (not the distance, as in the RBF). The mean of the effectively uni-dimensional Gaussian is specified by the mean parameter, with a standard deviation of std_dev.

- computes activation as an exponential function of the

net input (e^net). This is useful for implementing SoftMax units, among other things.

- takes one-to-one input from a corresponding

exponential unit, and another input from a LinearBpUnitSpec unit that
computes the sum over all the exponential units, and computes the
division between these two. This results in a SoftMax unit. Note that
the LinearBpUnitSpec must have fixed weights all of value 1, and that
the SoftMaxUnit's must have the one-to-one projection from exp units
first, followed by the projection from the sum units. See
`demo/bp_misc/bp_softmax.proj`

for a demonstration of how to
configure a SoftMax network.

- computes very simple Hebbian learning instead of

backpropagation. It is useful for making comparisons between delta-rule
and Hebbian leanring. The rule is simply ```
dwt = ru->act *
su->act
```

, where `ru->act`

is the target value if present.

- scales the error sent back to the sending units by

the factor @code{err_scale}. This can be used in cases where there are multiple output layers, some of which are not supposed to influence learning in the hidden layer, for example.

- implements the delta-bar-delta learning rate

adaptation scheme (Jacobs, 1988). It should only be used in batch mode weight updating. The connection type must be

, which contains a connection-wise learning rateparameter. This learning rate is additively incremented by
lrate_incr when the sign of the current and previous weight
changes are in agreement, and decrements it multiplicatively by
lrate_decr when they are not. The demo project
`demo/bp_misc/bp_ft_dbd.proj`

provides an example of how to set
up delta-bar-delta learning.

## Simple Recurrent Networks (SRN's)

Simple recurrent networks (SRN) (Elman, 1988) involve the use of a special set of context units which copy their values from the hidden units, and from which the hidden units receive inputs. Thus, it provides a simple form of recurrence that can be used to train networks to perform sequential tasks over time. The

has a`Network/SRNContext`

function that will
automatically build an SRN context layer as described below.

The implementation of SRN's uses a special version of the

BpUnitSpec called the . This spec overloadsthe activation function to simply copy from a corresponding hidden unit. The correspondence between hidden and context units is established by creating a single one-to-one projection into the context units from the hidden units. The context units look for the sending unit on the other side of their first connection in their first connection group for the activation to copy. This kind of connection should be created with a

.**Important:** The context units should be in a layer that
*follows* the hidden units they copy from. This is because the
context units should provide input to the hidden units before copying
their activation values. This means that the hidden units should update
themselves first.

The context units do not have to simply copy the activations directly from the hidden units. Instead, they can perform a time-averaging of information through the use of an updating equation as described below. The parameters of the context spec are as follows:

The demo project <file>demo/bp_srn/srn_fsa.proj</file> is an example of a SRN network that uses the sequence processes.

# Recurrent Backprop--NYI!

**NOTE: Recurrent backpropagation (RBp) is not currently supported (as of emergent 5.0.2)**

Check the Comparison of Neural Network Simulators for existing simulators that support RBp.

Recurrent backpropagation (RBp) extends the basic functionality of
feed-forward backprop to networks with recurrently interconnected units,
which can exhibit interesting dynamical properties as activation
propagates through the network over time.

The recurrent backprop implementation (RBp) defines a new set of types that are derived from the corresponding Bp versions:

Note that RBp uses the same Connection type as Bp. In addition, support for the Almeida-Pineda algorithm is made possible by special ApBp Programs.

There are a couple of general features of the version of recurrent backprop that the user should be aware of. First of all, the model used is that of a discrete approximation to a continuous dynamic system, which is defined by the sigmoidal activation of the net input to the units. The granularity of the discrete approximation is controlled by the dt parameter, which should be in the range between 0 and 1, with smaller values corresponding to a finer, closer to continuous approximation. Thus, the behavior of the network should be roughly similar for different dt values, with the main effect of dt being to make updating smoother or rougher.

Also, there are two ways in which the units can settle, one involves making incremental changes to the activation values of units, and the other involves making incremental changes to the net inputs. The latter is generally preferred since it allows networks with large weights to update activations quickly compared to activation-based updates, which have a strict ceiling on the update rate since the maximum activation value is 1, while the maximum net input value is unbounded.

As in standard backpropagation, recurrent backprop operates in two phases: activation propagation and error backpropagation. The difference in recurrent backprop is that both of these phases extend over time. Thus, the network is run for some number of activation update cycles, during which a record of the activation states is kept by each unit, and then a backpropagation is performed that goes all the way back in time through the record of these activation states. The backpropagation happens between the receiving units at time t and the sending units at the previous time step, time t-1. Another way of thinking about this process is to unfold the network in time, which would result in a large network with a new set of layers for each time step, but with the same set of weights used repeatedly for each time step unfolding. Doing this, it is clear that the sending units are in the previous time step relative to the receiving units.

The exact length of the activation propagation phase and the timing and frequency of the backpropagation phases can be controlled in different ways that are appropriate for different kinds of tasks. In cases where there is a clearly-defined notion of a set of distinct temporal sequences, one can propagate activation all the way through each sequence, and then backpropagate at the end of the sequence. This is the default mode of operation for the processes.

There are other kinds of environments where there is no clear boundary between one sequence and the next. This is known as "real time" mode, and it works by periodically performing a backpropagation operation after some number of activation updates have been performed. Thus, there is a notion of a "time window" over which the network will be sensitive to temporal contingencies through the weight updates driven by a single backpropagation operation. In addition, these backpropagations can occur with a period that is less than the length of the time window, so that there is some overlap in the events covered by successive backpropagation operations. This can enable longer-range temporal contingencies to be bootstrapped from a series of overlapping backpropagations, each with a smaller time window.

There is a simpler variation of a recurrent backpropagation algorithm that was invented by Almeida and Pineda, and is named after them. In this algorithm, the activation updating phase proceeds iteratively until the maximum change between the previous and the current activation values over all units is below some criterion. Thus, the network settles into a stable attractor state. Then, the backpropagation phase is performed repeatedly until it too settles on a stable set of error derivative terms (i.e., the maximum difference between the derivative of the error for each unit and the previously computed such derivative is below some threshold). These asymptotic error derivatives are then used to update the weights. Note that the backpropagation operates repeatedly on the asymptotic or stable activation values computed during the first stage of settling, and not on the trajectory of these activation states as in the "standard" version of RBp. The Almeida-Pineda form of the algorithm is enabled by using the APBp Programs, which compute the two phases of settling over cycles of either activation propagation or error backpropagation.