Distributed Memory


Distributed Memory computation (DMem) refers to running an emergent Network using multiple processors in parallel.

Distributed Memory Computation in the Network

The Network supports parallel processing across connections, where different distributed memory (dmem) processes compute different subsets of connections and then share their results. For example, if 4 processors were working on a single network, each would hold the connections for approximately 1/4 of the units in the network. When the net input to the network is computed, each process computes it on its own subset of connections and then shares the results with all the other processes.
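As a rough illustration of this pattern (a minimal MPI sketch with hypothetical sizes and variable names, not emergent's actual source), each process computes net input over the connections of the receiving units it owns, and an MPI_Allreduce then shares the summed results so that every process holds the full set of net inputs:

 // Minimal sketch: each process computes net inputs for its own subset of
 // receiving units; an Allreduce then gives every process the full results.
 #include <mpi.h>
 #include <vector>
 
 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
   int rank, nprocs;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 
   const int n_units = 1000;                        // hypothetical network size
   std::vector<float> acts(n_units, 1.0f);          // sending activations
   std::vector<float> wts(n_units * n_units, .01f); // toy full weight matrix
   std::vector<float> net(n_units, 0.0f);           // partial net inputs
 
   // This process owns the connections of every nprocs-th receiving unit.
   for (int ru = rank; ru < n_units; ru += nprocs)
     for (int su = 0; su < n_units; su++)
       net[ru] += wts[ru * n_units + su] * acts[su];
 
   // Share the results: entries owned by other processes are zero here, so a
   // SUM reduction leaves every process with identical, complete net inputs.
   MPI_Allreduce(MPI_IN_PLACE, net.data(), n_units, MPI_FLOAT, MPI_SUM,
                 MPI_COMM_WORLD);
 
   MPI_Finalize();
   return 0;
 }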

Given the relatively large amount of communication required for synchronizing net inputs and other variables at each cycle of network computation, this is efficient only for relatively large networks (e.g., above 250 units per layer for 4 layers). In benchmarks on a Pentium 4 Xeon cluster connected by a fast Myrinet fiber-optic switched network, networks of 500 units per layer for 4 layers achieved better than 2x speedup when split across 2 processors, presumably because the split network fit within processor cache whereas the entire network did not. This did not scale as well for more than 2 processors, suggesting that cache is the biggest factor for this form of dmem processing.

In all dmem cases, each processor maintains its own copy of the entire simulation project, and each performs largely the exact same set of functions to remain identical throughout the computation process. Processing only diverges at carefully controlled points, and the results of this divergent processing are then shared across all processors so they can re-synchronize with each other. Therefore, 99.99% of the code runs exactly the same under dmem as it does in a single process, making the code extensions required to support this form of parallel processing minimal.

The main parameter for controlling dmem processing is the dmem_nprocs field, which determines how many of the available processors are allocated to processing network connections. Any processors left over after this network allocation are allocated to event-wise (trial-level) distributed memory computation (see the next section for details). The other parameter is dmem_sync_level, which is set automatically by most algorithms based on the type of synchronization they require (feedforward networks generally require layer-level synchronization, while recurrent, interactive networks require network-level synchronization).

Distributed Memory Computation Across Trials

The Epoch Program supports distributed memory (dmem) computation by farming out trials of processing individual input patterns across different distributed memory processors. For example, if you had 4 such processors available, and an input data table of 16 events, each processor could process 4 of these events, resulting in a theoretical speedup of 4x. This will happen automatically if you start a dmem simulation with more than Network.dmem_nprocs processors -- see below for details.
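A minimal sketch of that division of labor (hypothetical names, not the Epoch Program's actual code): each of the nprocs processes takes every nprocs-th event, so 16 events on 4 processes yields 4 events per process:

 #include <mpi.h>
 #include <cstdio>
 
 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
   int rank, nprocs;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 
   const int n_events = 16;  // hypothetical input data table size
   // Each process takes every nprocs-th event: with 4 processes, rank 0
   // handles events 0,4,8,12; rank 1 handles 1,5,9,13; and so on.
   for (int ev = rank; ev < n_events; ev += nprocs)
     printf("process %d handles event %d\n", rank, ev);
 
   MPI_Finalize();
   return 0;
 }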

In all dmem cases (see the previous section for Network-level dmem), each processor maintains its own copy of the entire simulation project, and each performs largely the exact same set of functions to remain identical throughout the computation process. Processing only diverges at carefully controlled points, and the results of this divergent processing are then shared across all processors so they can re-synchronize with each other. Therefore, 99.99% of the code runs exactly the same under dmem as it does in a single process, making the code extensions required to support this form of dmem minimal.

If learning is taking place, the weight changes produced by each of these different sets of events must be integrated back together. This means that weights must be updated in SMALL_BATCH or BATCH mode when using dmem (this parameter is set on the Network object).
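One way to picture this integration step (a hedged sketch, not the actual implementation): each process accumulates its own weight changes over its share of the events, and those changes are then summed across all processes before being applied, so every copy of the network applies the identical total update:

 #include <mpi.h>
 #include <vector>
 
 // Hypothetical sketch: sum locally accumulated weight changes across all
 // dmem processes so that every network copy applies the same update.
 void sync_dwts(std::vector<float>& dwt) {
   MPI_Allreduce(MPI_IN_PLACE, dwt.data(), (int)dwt.size(),
                 MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
 }
 
 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
   std::vector<float> dwt(100000, 0.0f);  // this process's accumulated changes
   // ... process this rank's share of the events, accumulating into dwt ...
   sync_dwts(dwt);                        // share changes before applying them
   // ... apply dwt to the weights, then clear it for the next batch ...
   MPI_Finalize();
   return 0;
 }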

Trial-level distributed memory computation can be combined with network-wise dmem. The Network-level dmem_nprocs parameter determines how many of the available processors are allocated to each network. If multiples of this number of processors are left over, they are allocated to the trial-level dmem computation. For example, if there were 8 processors available and each network was allocated 2 processors, then there would be 4 sets of networks available for dmem processing of trials. Each group of two processors, representing a complete network, would work together on a given set of events.
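This grouping can be pictured as an MPI communicator split (again a sketch with assumed variable names, not emergent's actual code): with 8 processes and dmem_nprocs = 2, the processes fall into 4 groups of 2; each group's communicator handles network-level synchronization within one network copy, while a cross-group communicator handles trial-level sharing:

 #include <mpi.h>
 
 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   const int dmem_nprocs = 2;       // processors per network (cf. Network.dmem_nprocs)
   int group = rank / dmem_nprocs;  // which network copy this process belongs to
   int net_rank = rank % dmem_nprocs;
 
   // Intra-group communicator: connection-level (net input) synchronization.
   MPI_Comm net_comm;
   MPI_Comm_split(MPI_COMM_WORLD, group, net_rank, &net_comm);
 
   // Cross-group communicator: processes holding the same position in each
   // group, used for trial-level sharing (e.g., summing weight changes).
   MPI_Comm trial_comm;
   MPI_Comm_split(MPI_COMM_WORLD, net_rank, group, &trial_comm);
 
   MPI_Comm_free(&net_comm);
   MPI_Comm_free(&trial_comm);
   MPI_Finalize();
   return 0;
 }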

If Network.wt_update is set to BATCH, then weights are synchronized across processors at the end of each epoch. Results should be identical to those produced by running on a single-processor system under BATCH mode.

If Network.wt_update is SMALL_BATCH, then the small_batch_n parameter is divided by the number of dmem processors processing events to determine how frequently to share weight changes among processors. If small_batch_n is an even multiple of the number of dmem processors processing events, then results will be identical to those obtained on a single processor. Otherwise, the effective small_batch_n value will be different. For example, if there are 4 dmem processors, then small_batch_n = 4 means that weight changes are applied after each processor processes one event. However, small_batch_n = 6 cannot be divided evenly this way: changes will occur as though small_batch_n = 4. Similarly, small_batch_n = 1 effectively means small_batch_n = 4. If small_batch_n = 8, then weight changes are applied after every 2 rounds of dmem event processing, and so on.
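The arithmetic behind these examples can be written out as follows (an illustrative sketch; the rounding behavior is assumed from the examples above): the sharing interval is small_batch_n divided by the number of event-processing dmem processes, rounded down and floored at 1, and the effective batch size is that interval times the number of processes:

 #include <cstdio>
 
 int main() {
   const int n_event_procs = 4;           // dmem processes handling events
   const int values[] = {1, 4, 6, 8};     // candidate small_batch_n settings
   for (int small_batch_n : values) {
     int interval = small_batch_n / n_event_procs;  // rounds between weight syncs
     if (interval < 1) interval = 1;                // sync at least every round
     int effective = interval * n_event_procs;      // effective batch size used
     printf("small_batch_n=%d -> effective batch_n=%d\n",
            small_batch_n, effective);
   }
   return 0;
 }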

Note that wt_update cannot be ONLINE in dmem mode, and will be set to SMALL_BATCH automatically by default.

Note that the event-wise model may not be sensible under dmem if there is any state information carried between events in a sequence (e.g., an SRN context layer or any other form of active memory), as is often the case when using sequences, because this state information is NOT shared between processes within a sequence (it cannot be -- events are processed in parallel, not in sequence).

Implementation

See files ta_dmem.h and ta_dmem.cpp.