This page is a developer notes section on vectorizing the code, focusing on SSE-style SIMD processing built into current CPUs.
See GPU for CUDA and OpenCL based GPU vectorization.
Contents:
- Links, tips, etc.
- 8/2016: Succinct summary of SSE / AVX vectorization issues: thread-separate memory conflict
- 5/2015: visual filtering in V1RegionSpec / RetinaProc
- 9/2014: sorting of connections, vectorized dWt, and optimized threading (cycle-level threading)
- Benchmark Data
- http://www.agner.org/optimize -- great resource on all manner of optimization techniques
- vectorclass.h in particular looks like a great way to go for using the "intrinsics" in a readable way
- http://llvm.org/docs/Vectorizers.html -- clang 3.4 auto vectorization
8/2016: Succinct summary of SSE / AVX vectorization issues: thread-separate memory conflict
A vector variable can be loaded either from directly contiguous RAM (you pass a starting address and it loads, e.g., 8 floats in a row from there) or via the "gather" method, from arbitrary indexes relative to a starting address. Loading each element of the vector in a separate step is sufficiently slow that it negates the benefits of vectorizing.
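To make the contrast concrete, here is a minimal portable sketch of the two load patterns. This is plain C++, not real intrinsics: `Vec8f`, `load_contiguous`, and `load_gather` are illustrative stand-ins for an 8-wide vector register and its load operations (e.g., AVX or vectorclass `Vec8f`), not the actual API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for an 8-wide float vector register (e.g., AVX / vectorclass Vec8f).
using Vec8f = std::array<float, 8>;

// Contiguous load: one wide load from a base address -- fast on real hardware,
// since it maps to a single vector load instruction.
Vec8f load_contiguous(const float* base) {
  Vec8f v;
  for (std::size_t i = 0; i < 8; ++i) v[i] = base[i];
  return v;
}

// Gather load: each lane comes from an arbitrary index relative to the base.
// On real hardware this is either a (slow) gather instruction or 8 separate
// scalar loads, which can negate the benefit of vectorizing.
Vec8f load_gather(const float* base, const int* idx) {
  Vec8f v;
  for (std::size_t i = 0; i < 8; ++i) v[i] = base[idx[i]];  // one load per lane
  return v;
}
```

Both produce the same kind of result; the difference is entirely in cost: the contiguous form is one memory transaction, the gather form is up to eight.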
The problem is that the memory for unit-level variables (e.g., act) is organized into separate thread-specific chunks, and units are allocated to chunks in an interleaved fashion to maximize the similarity of processing across threads. (Otherwise, with sequential batches of units per thread, one thread might get all the units from a small layer that doesn't need much processing, while another gets units with massive connectivity, etc.) The advantages of thread-specific memory far outweigh those of vectorization, from everything I've seen.
So, you can’t vectorize any connection-level code that involves unit-level variables, because even if those units are contiguously ordered within a layer (which they typically are), they actually live in different threads.
You CAN vectorize purely connection-level code, e.g., Compute_Weights.
And in Leabra, the sender-based netinput only references the sender’s activation (a single constant for all sending connections) and sends it to a temporary thread-specific buffer of netin values for each recv unit, which are then integrated in a separate step.
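A sketch of that sender-based pattern, with thread-separate netin buffers, might look like the following. All names here (`SendNetin`, `IntegrateNetin`, the buffer layout) are illustrative, not the actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each thread accumulates into its own netin buffer, so there is no locking and
// no cross-thread memory conflict; the sender's activation is a single constant
// for all of its connections, and recv unit variables are never touched here.
void SendNetin(float send_act,                    // sender's activation: one constant
               const std::vector<float>& wts,     // weights for this sender's connections
               const std::vector<int>& recv_idx,  // recv unit index per connection
               std::vector<float>& thr_netin) {   // this thread's netin buffer
  for (std::size_t i = 0; i < wts.size(); ++i)
    thr_netin[recv_idx[i]] += send_act * wts[i];  // vectorizable: no unit vars read
}

// Separate integration step (after a sync): sum the per-thread buffers into
// each unit's netin value.
void IntegrateNetin(const std::vector<std::vector<float>>& thr_netins,
                    std::vector<float>& unit_netin) {
  for (std::size_t u = 0; u < unit_netin.size(); ++u) {
    float sum = 0.0f;
    for (const auto& buf : thr_netins) sum += buf[u];
    unit_netin[u] = sum;
  }
}
```

The key design point is that the only per-connection memory traffic in the hot loop is to the thread's own buffer, so the conflict described above never arises.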
Likewise, it IS possible to create a single big “strip” of vectorized unit variables, as copies of those in the thread-specific memory, and vectorize over those, but typically the overhead involved in that exceeds the benefits.
Mac 10.8.4 issues with AVX support
- the Mac assembler is very old and does not support AVX
- clang theoretically does, but perhaps not from what gcc 4.8.1 produces -- couldn't get this to work (even with MacPorts clang-3.4)
- sysctl machdep.cpu.features -- gives you a list of what your cpu supports
- replace as with clang: http://mac-os-forge.2317878.n4.nabble.com/gcc-as-AVX-binutils-and-MacOS-X-10-7-td144472.html
5/2015 -- visual filtering in V1RegionSpec / RetinaProc
Just converted this code to use the new simple threading deployment and SIMD code for the most expensive filtering step. This makes about a 10% difference in overall performance.
Results below are for 100 iterations of bench/vis_bench/v1filter_bench.proj with 512x512 image.
Results for MacBook Pro 15" 2014, Haswell Chip
Overall, 1/2 the time is spent in the V1s filtering step, which is the most optimized and the optimizations give a 30% speedup. The rest of the stuff in V1c doesn't add too much over V1s when threading is in place. Initial image transform is relatively expensive and could also be optimized perhaps.
- Full Transform and Filter process, with full V1c Length Sum, End Stop, and V1SG passthrough:
- 4 threads: New: 15 s, Old: 17 s
- 2 threads: New: 17 s, Old: 21 s
- 1 thread: New: 23 s, Old: 30 s
- Just Filter per above:
- 4 threads: New: 9 s, Old: 10 s = 10%
- 2 threads: New: 11 s, Old: 14 s = 22%
- 1 thread: New: 17 s, Old: 25 s = 32%
- Just V1s filtering:
- 4 threads: New: 7 s, Old: 9.7 s = 30%
- 2 threads: New: 8 s, Old: 12 s = 33%
- 1 thread: New: 12 s, Old: 17 s = 30%
Results for Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz (Blanca)
Only have new-code results, on the V1s-only process -- getting a fair amount of thread saturation (and these are 512x512 images, so smaller ones will saturate quicker):
- 8 threads: 6 s
- 6 threads: 6.5 s
- 4 threads: 7 s
- 2 threads: 8.15 s
- 1 thread: 9.3 s
9/2014 -- sorting of connections, vectorized dWt, and optimized threading (cycle-level threading)
Connections are now organized into "vector chunks" such that, for Leabra (sender-based), all the receiving units within a chunk (currently 4 connections in a row) are sequentially ordered. This is done by a very efficient 2-pass algorithm, which assumes (as is almost always the case) that the connections are in order in the first place. Any leftover connections that don't fill a chunk are put at the end and processed separately. This allows fully parallel vec.load() and .store() operations on the recv unit variables, which greatly enhances performance.
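The chunking pass can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: `ChunkConnections`, the output layout, and the chunk size constant are all assumed names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Connections whose recv unit indices form a sequential run of kChunk become a
// chunk (eligible for a single vector load/store on the recv unit variables);
// anything that doesn't fit goes in the leftover list and is processed scalar.
const int kChunk = 4;

void ChunkConnections(const std::vector<int>& recv_idx,  // assumed mostly in order
                      std::vector<int>& chunk_starts,    // start position of each chunk
                      std::vector<int>& leftovers) {     // positions not in any chunk
  std::size_t i = 0;
  while (i < recv_idx.size()) {
    // check whether the next kChunk recv indices are strictly sequential
    bool seq = i + kChunk <= recv_idx.size();
    for (int j = 1; seq && j < kChunk; ++j)
      seq = (recv_idx[i + j] == recv_idx[i] + j);
    if (seq) {
      chunk_starts.push_back((int)i);  // whole chunk handled with vec load/store
      i += kChunk;
    } else {
      leftovers.push_back((int)i);     // processed separately, scalar
      ++i;
    }
  }
}
```

Because connections are almost always already sorted by recv index, most of the list collapses into chunks and the leftover tail stays short.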
For threading, the key innovation is to process the entire cycle's worth of computation in one thread run, with fast "busy" sync steps at each point where the threads need to sync (e.g., after sending netin, before integrating the new netins). In contrast, the previous approach was to deploy threads for each function separately (e.g., send netins). The overhead of starting and stopping the threads is the biggest cost there. Also, all the non-connection computations (unit level, e.g., Compute Activations) could not be profitably sped up due to the overhead, so we were losing out on a decent chunk of possible parallelization.
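The "busy" sync step can be sketched as a reusable spin barrier along these lines. This is a minimal illustration, not the actual implementation (`SpinBarrier` and its method names are assumed), and a production version would add pauses and memory-ordering tuning.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Threads stay alive for the whole cycle and just spin briefly at each sync
// point, instead of being launched and joined per function (where thread
// start/stop overhead dominates).
class SpinBarrier {
 public:
  explicit SpinBarrier(int n) : n_(n), count_(0), phase_(0) {}

  void Sync() {
    int my_phase = phase_.load();
    if (count_.fetch_add(1) + 1 == n_) {  // last thread to arrive at this point
      count_.store(0);
      phase_.fetch_add(1);                // release everyone into the next phase
    } else {
      while (phase_.load() == my_phase) {}  // busy-wait: no kernel sleep/wake
    }
  }

 private:
  int n_;
  std::atomic<int> count_;
  std::atomic<int> phase_;
};
```

The busy-wait trades a little CPU burn at each sync point for avoiding the much larger cost of putting threads to sleep and waking them between functions.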
The same "nibble" style thread allocation is used to allocate computation to threads -- this provides the best load balancing dynamics -- any fixed allocation scheme will suffer from the dramatically shifting nature of the load over cycles and over time -- this was plotted and indeed the cycle loads shift across units considerably.
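The nibble-style allocation amounts to each thread repeatedly grabbing the next small chunk of work from a shared atomic counter, so the load balances itself as per-unit cost shifts across cycles. A minimal sketch (the class name, method, and chunk size are illustrative assumptions, not the actual code):

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Dynamic "nibble" work allocation: no fixed per-thread split; whichever thread
// finishes its chunk first simply grabs the next one.
class NibbleAlloc {
 public:
  NibbleAlloc(int n_items, int chunk) : n_items_(n_items), chunk_(chunk) {}

  // Returns [start, end) of the next chunk for the calling thread;
  // start >= end means the work is exhausted.
  std::pair<int, int> Grab() {
    int start = next_.fetch_add(chunk_);  // atomic: safe from any thread
    int end = start + chunk_;
    if (end > n_items_) end = n_items_;
    return {start, end};
  }

 private:
  std::atomic<int> next_{0};
  int n_items_;
  int chunk_;
};
```

A fixed allocation would pin, say, units 0..N/4 to thread 0 for the whole run; with nibbling, a thread stuck on expensive units just grabs fewer chunks that cycle.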
Here are the final results for this round of optimization. Interestingly, the large vision model is so big that the threading optimizations do not actually make that much difference! Basically, you get a roughly 2x improvement over the old model regardless of the number of threads, which reflects the vectorization and, significantly, the reorganization of the connections, plus some significant reorganization of the code to combine previously separate functions, etc. However, it is still the case that you get a bit more speedup on this big net for the 4 vs. 1 thread comparison (3.06 vs. 2.78 before).
But where you see MASSIVE threading-specific advantages is on the smaller models, e.g., the "actual" leabra bench results shown 2nd below. Surprisingly, on my super-fast laptop, the thread overhead in the old code *completely swamps* any performance advantage -- getting virtually nothing from threading, and less than nothing for the 625-unit case. But with the new threading, we get a nice 2.5x-ish boost for 4 threads.
See Vectorizing Benchmarks for lots of raw data.