The SA-C compiler does a variety of optimizations, some traditional and some specifically designed to suit the language and its reconfigurable hardware targets. The compiler converts the entire SA-C program to an internal dataflow form called ``Data Dependence and Control Flow'' (DDCF) graphs, on which it performs most of its optimizations. The traditional optimizations include Common Subexpression Elimination, Constant Folding, Invariant Code Motion, and Dead Code Elimination. The compiler also does specialized variants of Loop Stripmining, Array Value Propagation, Loop Unrolling, Function Inlining, Lookup Tables, Loop Body Pipelining, and Array Blocking, along with loop and array Size Propagation Analysis. Along with these, four more FPGA specific optimizations are now described in some detail: Producer Consumer Loop Fusion, Temporal Common Subexpression Elimination, Window Narrowing, and Pipelining.
The performance of many systems today, both conventional and specialized, is often limited by the time required to move data to the processing units. The fusion of producer-consumer loops is often helpful, since it reduces data traffic and may eliminate intermediate data structures. In simple cases, where arrays are processed element-by-element, fusion is straightforward. However, the windowing behavior of many IP operators presents a challenge as producer and consumer loops may not iterate over the same index domain or even have the same number of iterations. Nevertheless, it is often possible to fuse such loop pairs by examining their data dependencies.
Common Subexpression Elimination (CSE) is an old and well known compiler optimization that eliminates redundancies by looking for identical subexpressions that compute the same value. The redundancies are removed by keeping just one of the subexpressions and using its result for all the computations that need it. This could be called ``spatial CSE'' since it looks for common subexpressions within a block of code.
The SA-C compiler performs conventional CSE, but it also performs Temporal CSE, looking for value in one loop iteration that were previously computed in other loop iterations. In such cases, the redundant computation can be eliminated by holding the values in registers so that they are available later and need not be recomputed.
A useful phenomenon often occurs with Temporal CSE: one or more columns in the left part of the window are unreferenced, making it possible to eliminate those columns. Narrowing the window lessens the FPGA space required to store the window's values.
After multiple applications of fusion and unrolling, the resulting loop often has a long critical path, resulting in a low clock frequency. Adding stages of pipeline registers can break up the critical path and thereby boost the frequency. The SA-C compiler uses propagation delay estimates, empirically gathered for each of the DFG node types, to determine the proper placement of pipeline registers. The user specifies to the compiler the number of register stages to place.