This page describes GPU vectorization, focusing initially on CUDA and then perhaps OpenCL.
As of version 8.0, the CUDA implementation is well integrated with the Thread Optimization memory structures: it uses these as the host-side memory blocks, and simply allocates corresponding device-side blocks to match.
Static inline methods on the ConGroup_cuda and Network_cuda types handle the indexing arithmetic for accessing individual UnitVars and Connection variables from ConGroup structures. Backpropagation is currently the only supported architecture -- see the bp_cuda.* files in the bp source tree for the full implementation. The result is very clean (once all the bugs were worked out).
8/29/2016 Vision Model
We tested Backpropagation on a Dell performance desktop tower with 2x 10-core Intel(R) Xeon(R) E5-2650 v3 @ 2.30GHz CPUs (Haswell), and an NVIDIA GTX980 Maxwell-generation card (which cost about $670 in 8/2015), using our current "large" object recognition model with a 32x32 V1 input layer. This has 99,969,536 connections (call it 100 million) and 122,385 units. Full memory report below. Each epoch in the test is *50* trials:
- GPU: ~21 secs per epoch of 50 inputs, 16.5 of which is input processing (need to do something more about that for sure!) = 4.5 secs of actual network computation on the GPU.
- 1 CPU thread: 114 secs per epoch, 16.4 input = 97.6 secs for net
- 4 CPU threads: 47 secs per epoch, 16.2 input = 30.8 secs for net
- 8 CPU threads: 32.5 secs per epoch, 15.5 input = 17 secs for net
So the GPU is 21.68 times faster than 1 CPU, 6.84 x faster than 4 threads, and 3.77 times faster than 8 threads.
For comparison, we typically run this model on our compute cluster, which has much faster Intel(R) Xeon(R) E5-2667 v2 @ 3.30GHz CPUs (Ivy Bridge), 2x 8 cores per node, that scale much better with threading than the (newer!) Haswell chips (we had to work hard to get these older chips!). We join 8 such nodes (of 36 total in the cluster) using fast InfiniBand MPI interconnects, each processing different inputs (data parallelism) and sharing weight changes at various "small batch" sizes. This cluster configuration beats the single GPU by about 3x:
- Cluster: 8192 trials per epoch, 400 secs total per epoch; MPI weight sync is 30 secs at small-batch size 128, and applying inputs is 112 secs, so network = 258 secs / 8192 trials = .0315 secs/trial
- GPU: 4.5 secs / 50 trials = .09 secs/trial
The price and power usage of this compute cluster are considerably higher than $670, so overall the GPU performance is excellent. Combining multiple GPUs is certainly possible as well (immediately by using the same MPI trick, plus whatever techniques people currently use for multi-GPU performance).
Overall, however, our results are consistent with fair comparisons between HPC multi-core CPU cluster systems and GPUs, which show roughly comparable performance once each platform is carefully optimized. e.g.:
Finally, earlier results prior to Thread Optimization showed that Leabra was not as amenable to these kinds of speedups, because of the extensive optimizations already in place to speed up the netinput computation: only a very sparse set of units updates per cycle, scattered throughout the network, and this does not favor the brute-force power of GPU vectorization. Only a 2.5x speedup was achievable there, and netinput makes up roughly 1/3 of the compute time. The weight change and weight update computations, however, were around 10x faster.
LVisNet memory report:
- number of units: 122385
- bytes per unitvar: 72
- total unit memory: 8.4 MB
- number of recv con groups: 133176
- number of send con groups: 157248
- bytes per con group: 72
- total con group memory: 19.9 MB
- number of connections: 9.99695e+07
- bytes per con+idx: 16
- total con memory: 2.06 GB (recv cons: 1.31 GB, send cons: 763 MB)
- grand total memory: 2.09 GB
- owned connection statistics: max_size: 4096, avg_size: 187, pct_vector_chunked: 0.992273, total_size: 99969536, total_nonshared: 99969536, total_shared: 0
- http://gfx.io/ -- gfxCardStatus app for Mac that lets you toggle the discrete GPU on/off -- needed to reset RAM by toggling the discrete GPU, which allows the benchmark to run.
OLD benchmark data: 9/8/2013 -- first CUDA results
- Each run invoked with: x64 20000 3000
- cluster GPU ( NVIDIA Tesla M2070 GPU ): total time used: 19.5151 total flops: 8.4e+10 mflops/sec: 4304.37
- Laptop GPU ( Nvidia 650M ): total time used: 31.5547 total flops: 8.4e+10 mflops/sec: 2662.05
- Laptop CPU ( i7-3720QM Turbo to 3.6GHz): total time used: 57.4391 total flops: 8.4e+10 mflops/sec: 1462.42
Same hardware as Kai, but running under Mac OS instead of Linux(?)
- GeForce GT 650M is rated at roughly 600 gflops max performance
- M2070 is rated at roughly 1028 gflops max -- our perf seems to scale roughly according to these ratings.
- our computed gflops is WELL shy of the rated max!
Updated code to use same send unit selection mechanism (random number < cutoff) as in updated Vectorizing code (9/8/2013).
- ./build/x64cuda n_units 20000 n_per_un 3000 n_epochs 1 GPU 1
- total time used: 6.35328 total flops: 1.68e+10 mflops/sec: 2644.3
- ./build/x64cuda n_units 20000 n_per_un 3000 n_epochs 1 GPU 0
- total time used: 12.6772 total flops: 1.68e+10 mflops/sec: 1325.22
- ./build/x64cuda n_units 16384 n_per_un 4096 n_epochs 1 GPU 1
- total time used: 6.47911 total flops: 1.87905e+10 mflops/sec: 2900.16
- total time used: 6.39273 total flops: 1.87905e+10 mflops/sec: 2939.35
- ./build/x64cuda n_units 16384 n_per_un 4096 n_epochs 1 GPU 0
- total time used: 12.7204 total flops: 1.87905e+10 mflops/sec: 1477.19
- total time used: 12.8858 total flops: 1.87905e+10 mflops/sec: 1458.24
Clear 2x speedup for very large networks -- could get nearly another 1.5-2x from the M2070.
But more reasonable sizes benefit progressively less:
- ./build/x64cuda n_units 8192 n_per_un 2048 n_epochs 2 GPU 1
- total time used: 4.30427 total flops: 9.39524e+09 mflops/sec: 2182.77
- ./build/x64cuda n_units 8192 n_per_un 2048 n_epochs 2 GPU 0
- total time used: 7.08053 total flops: 9.39524e+09 mflops/sec: 1326.91
1.64 x speedup
- ./build/x64cuda n_units 4096 n_per_un 1024 n_epochs 10 GPU 1
- total time used: 10.3928 total flops: 1.17441e+10 mflops/sec: 1130.01
- ./build/x64cuda n_units 4096 n_per_un 1024 n_epochs 10 GPU 0
- total time used: 10.7546 total flops: 1.17441e+10 mflops/sec: 1092
3% speedup -- and everything below this size shows a significant slowdown.