Three main optimization paths are available in PolyOpt/C. They are geared towards improving data locality (fewer data cache misses) and towards both coarse- and fine-grain shared-memory parallelization with OpenMP. They are applied to all Static Control Parts (SCoPs) automatically detected in the input program. Program transformations are generated via the PoCC polyhedral engine.
--polyopt-fixed-tiling
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel blocked execution
of the SCoP, when possible. The default tile size is 32 and can be
specified at compile time only. Parallel loops are marked with OpenMP
pragmas, and inner-most vectorizable loops are marked with ivdep
pragmas. Parallel or pipeline-parallel tile execution is achieved when
tiling is possible.
The Pluto module is used to compute the loop transformation sequence, in the form of a series of affine multidimensional schedules.
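As an illustration of the tiling shape this path produces, here is a minimal hand-written sketch of fixed-size rectangular tiling of a doubly nested loop (an illustration with a 32x32 tile, not actual PolyOpt output):

```c
#include <assert.h>

#define T 32  /* fixed tile size: a compile-time constant in this path */

/* Hand-written sketch of fixed-size rectangular tiling: the outer
   (tile) loops enumerate TxT blocks, the inner (point) loops scan one
   block; the min-style guards handle partial tiles when n % T != 0. */
static void scale_tiled(int n, double *A, double beta)
{
  for (int ti = 0; ti < n; ti += T)
    for (int tj = 0; tj < n; tj += T)
      for (int i = ti; i < (ti + T < n ? ti + T : n); ++i)
        for (int j = tj; j < (tj + T < n ? tj + T : n); ++j)
          A[i * n + j] *= beta;
}
```

Every iteration of the original nest is executed exactly once, only in a block-by-block order that improves cache reuse.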
Giving the flag --polyopt-fixed-tiling
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-tile
--polyopt-pluto-parallel
--polyopt-pluto-prevector
--polyopt-generate-pragmas
Given the input program:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    C[i][j] *= beta;
    for (k = 0; k < n; k++)
      C[i][j] += A[i][k] * B[k][j] * alpha;
  }
One can optionally specify a file to set the tile sizes, to
override the default 32 value. This file must be called
tile.sizes
, and be stored in the current working directory. It
must contain one tile size per dimension to be tiled. For example:
$> cat tile.sizes
16
64
1
The result of --polyopt-fixed-tiling on the above example, with the
specified tile.sizes file, is shown below. Note that if a tile.sizes
file exists in the current working directory, it will always be used.
{
  int c6;
  int c3;
  int c1;
  int c2;
  int c4;
  if (n >= 1) {
#pragma omp parallel for private(c4, c2, c6)
    for (c1 = 0; c1 <= (((n + -1) * 16 < 0? ((16 < 0?-((-(n + -1) + 16 + 1) / 16) : -((-(n + -1) + 16 - 1) / 16))) : (n + -1) / 16)); c1++)
      for (c2 = 0; c2 <= (((n + -1) * 64 < 0? ((64 < 0?-((-(n + -1) + 64 + 1) / 64) : -((-(n + -1) + 64 - 1) / 64))) : (n + -1) / 64)); c2++)
        for (c4 = 16 * c1; c4 <= ((16 * c1 + 15 < n + -1? 16 * c1 + 15 : n + -1)); c4++)
#pragma ivdep
#pragma vector always
          for (c6 = 64 * c2; c6 <= ((64 * c2 + 63 < n + -1? 64 * c2 + 63 : n + -1)); c6++)
            (C[c4])[c6] *= beta;
#pragma omp parallel for private(c4, c2, c3, c6)
    for (c1 = 0; c1 <= (((n + -1) * 16 < 0? ((16 < 0?-((-(n + -1) + 16 + 1) / 16) : -((-(n + -1) + 16 - 1) / 16))) : (n + -1) / 16)); c1++)
      for (c2 = 0; c2 <= (((n + -1) * 64 < 0? ((64 < 0?-((-(n + -1) + 64 + 1) / 64) : -((-(n + -1) + 64 - 1) / 64))) : (n + -1) / 64)); c2++)
        for (c3 = 0; c3 <= n + -1; c3++)
          for (c4 = 16 * c1; c4 <= ((16 * c1 + 15 < n + -1? 16 * c1 + 15 : n + -1)); c4++)
#pragma ivdep
#pragma vector always
            for (c6 = 64 * c2; c6 <= ((64 * c2 + 63 < n + -1? 64 * c2 + 63 : n + -1)); c6++)
              (C[c4])[c6] += ((((A[c4])[c3]) * ((B[c3])[c6])) * alpha);
  }
}
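The convoluted conditional expressions bounding c1 and c2 in the listing above are the code generator's open-coded form of floor division on a possibly negative numerator. A hand-written equivalent helper (an illustration only, not part of the generated code) is:

```c
#include <assert.h>

/* floord(a, b): integer floor division, matching the open-coded
   conditionals in the generated listing (b is assumed positive).
   Plain C division truncates toward zero, so a correction is needed
   when a is negative. */
static int floord(int a, int b)
{
  return a < 0 ? -((-a + b - 1) / b) : a / b;
}
```

For instance, the upper bound of the c1 tile loop in the listing is simply floord(n - 1, 16).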
--polyopt-parametric-tiling
NOTE: The parametric tiling path is still experimental, and the correctness of the generated code is not guaranteed in all cases. In particular, a known issue arises when parametric tiling is applied to a loop nest whose outer loop is sequential (so that wavefront creation is required) and whose inner loops are permutable but not fusable. We are working hard to fix this remaining problem.
To the best of our knowledge, the generated code is correct when all
statements in a (possibly imperfectly nested) loop nest can be
maximally fused. Remember that polyhedral transformations are
automatically computed before the parametric tiling pass to enforce
this property on the code when possible. The above issue impacts only
program parts where it is not possible to exhibit a polyhedral
transformation making either the outer loop parallel, or all loops
fusable in the loop nest. This is not a frequent pattern: for
instance, none of the 28 benchmarks of the PolyBench 2.0 test suite
exhibits this issue.
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel blocked execution
of the SCoP. The generated code is parametrically tiled when possible,
and the tile sizes can be specified at runtime via the
__pace_tile_sizes[]
array. By default, the tile sizes are set
to 32. Parallel loops are marked with OpenMP pragmas.
The Pluto module is used to compute a loop transformation sequence that makes tiling legal, when possible, and the PTile module performs parametric tiling. Parallel or pipeline-parallel tile execution is achieved if tiling is possible.
Giving the flag --polyopt-parametric-tiling
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-parallel
--polyopt-codegen-use-ptile
--polyopt-codegen-insert-ptile-api
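The effect of parametric tiling on a single loop can be sketched as follows (a hand-written illustration, not actual PTile output):

```c
#include <assert.h>

/* Hand-written sketch of parametric tiling: unlike fixed tiling, the
   tile size T is an ordinary runtime parameter, so the same binary
   can be run with different tile sizes without recompiling. */
static long sum_tiled(const long *a, int n, int T)
{
  long s = 0;
  for (int t = 0; t < n; t += T)                       /* tile loop  */
    for (int i = t; i < (t + T < n ? t + T : n); ++i)  /* point loop */
      s += a[i];
  return s;
}
```

The price of this flexibility is that loop bounds now involve the symbolic parameter T, which is why the generated code below manipulates the tile sizes through floating-point temporaries and round() calls rather than simple integer constants.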
The PACE tiling API requires the use of the function
PACETileSizeVectorInit(int*, int, int)
to fill in the tile
sizes. This function takes an array of integers, the number of tile
size parameters, and a unique identifier for the SCoP. It can be
defined in another compilation unit, inserted automatically by the
PACE compiler, or added manually by the user. It allows the tile
sizes to be selected at run time, before the computation starts.
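A possible user-supplied definition is sketched below. The function name and signature come from the PACE API above; the comma-separated TILE_SIZES environment variable is purely an assumption made for illustration.

```c
#include <stdlib.h>

/* Hypothetical user-supplied definition of the PACE tiling API entry
   point.  Defaults every tile size to 32 (PolyOpt/C's default), then
   overrides from an assumed TILE_SIZES environment variable, e.g.
   TILE_SIZES=16,64,1.  The scop_id could be used to select per-SCoP
   sizes; it is ignored in this sketch. */
void PACETileSizeVectorInit(int *sizes, int num_sizes, int scop_id)
{
  (void) scop_id;
  for (int i = 0; i < num_sizes; ++i)
    sizes[i] = 32;
  const char *p = getenv("TILE_SIZES");
  if (!p)
    return;
  for (int i = 0; i < num_sizes && *p; ++i) {
    sizes[i] = atoi(p);
    while (*p && *p != ',')   /* skip to the next comma, if any */
      ++p;
    if (*p == ',')
      ++p;
  }
}
```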
The result of --polyopt-parametric-tiling
on the above
dgemm
example is shown below.
{
  int ___pace_tile_sizes[3];
  PACETileSizeVectorInit(___pace_tile_sizes, 3, 2);
  int c2;
  int c2t1;
  int c1;
  int c3;
  int c1t1;
  float T1c3 = (float )(___pace_tile_sizes[0]);
  int c3t1;
  float T1c2 = (float )(___pace_tile_sizes[1]);
  float T1c1 = (float )(___pace_tile_sizes[2]);
  if (n >= 1) {
    {
      int tmpLb;
      int tmpUb;
      tmpLb = round(-1 + 1 / T1c1);
      tmpUb = round(n * (1 / T1c1) + 1 / T1c1 * -1);
#pragma omp parallel for private(c2t1, c1, c2)
      for (c1t1 = tmpLb; c1t1 <= tmpUb; ++c1t1)
        for (c2t1 = round(-1 + 1 / T1c2); c2t1 <= round(n * (1 / T1c2) + 1 / T1c2 * -1); ++c2t1)
          for (c1 = (c1t1 * T1c1 > 0?c1t1 * T1c1 : 0); c1 <= ((c1t1 * T1c1 + (T1c1 + -1) < n + -1? c1t1 * T1c1 + (T1c1 + -1) : n + -1)); c1++)
            for (c2 = (c2t1 * T1c2 > 0?c2t1 * T1c2 : 0); c2 <= ((c2t1 * T1c2 + (T1c2 + -1) < n + -1? c2t1 * T1c2 + (T1c2 + -1) : n + -1)); c2++)
              (C[c1])[c2] *= beta;
    }
    {
      int tmpLb;
      int tmpUb;
      tmpLb = round(-1 + 1 / T1c1);
      tmpUb = round(n * (1 / T1c1) + 1 / T1c1 * -1);
#pragma omp parallel for private(c2t1, c3t1, c1, c2, c3)
      for (c1t1 = tmpLb; c1t1 <= tmpUb; ++c1t1)
        for (c2t1 = round(-1 + 1 / T1c2); c2t1 <= round(n * (1 / T1c2) + 1 / T1c2 * -1); ++c2t1)
          for (c3t1 = round(-1 + 1 / T1c3); c3t1 <= round(n * (1 / T1c3) + 1 / T1c3 * -1); ++c3t1)
            for (c1 = (c1t1 * T1c1 > 0?c1t1 * T1c1 : 0); c1 <= ((c1t1 * T1c1 + (T1c1 + -1) < n + -1? c1t1 * T1c1 + (T1c1 + -1) : n + -1)); c1++)
              for (c2 = (c2t1 * T1c2 > 0?c2t1 * T1c2 : 0); c2 <= ((c2t1 * T1c2 + (T1c2 + -1) < n + -1? c2t1 * T1c2 + (T1c2 + -1) : n + -1)); c2++)
                for (c3 = (c3t1 * T1c3 > 0?c3t1 * T1c3 : 0); c3 <= ((c3t1 * T1c3 + (T1c3 + -1) < n + -1? c3t1 * T1c3 + (T1c3 + -1) : n + -1)); c3++)
                  (C[c1])[c2] += ((((A[c1])[c3]) * ((B[c3])[c2])) * alpha);
    }
  }
}
--polyopt-parallel-only
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel execution of the
SCoP while improving data locality. In contrast to the other paths, no
tiling is applied on the generated program. Parallel loops are marked
with OpenMP pragmas. The Pluto module is used to compute a loop
transformation sequence that exposes coarse-grain parallelism when
possible.
Giving the flag --polyopt-parallel-only
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-parallel
--polyopt-generate-pragmas
The result of --polyopt-parallel-only
on the above dgemm
example is shown below. Note that pre-vectorization is disabled in
this mode; fixed tiling must be enabled for it to be active. To
prevent the distribution of the two statements, the user can rely on
the fine-tuning flags, e.g., --polyopt-pluto-fuse-maxfuse.
{
  int c2;
  int c1;
  int c3;
  if (n >= 1) {
#pragma omp parallel for private(c2)
    for (c1 = 0; c1 <= n + -1; c1++)
      for (c2 = 0; c2 <= n + -1; c2++)
        (C[c1])[c2] *= beta;
#pragma omp parallel for private(c3, c2)
    for (c1 = 0; c1 <= n + -1; c1++)
      for (c2 = 0; c2 <= n + -1; c2++)
        for (c3 = 0; c3 <= n + -1; c3++)
          (C[c1])[c2] += ((((A[c1])[c3]) * ((B[c3])[c2])) * alpha);
  }
}
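For reference, the fused shape that --polyopt-pluto-fuse-maxfuse aims to preserve can be sketched by hand as follows (an illustration using linearized arrays, not verbatim PolyOpt output):

```c
/* Hand-written sketch of fused dgemm: the beta scaling and the
   update share one (c1, c2) loop nest, so each C element stays in
   cache across both statements instead of being traversed twice. */
void dgemm_fused(int n, double alpha, double beta,
                 const double *A, const double *B, double *C)
{
  #pragma omp parallel for
  for (int c1 = 0; c1 < n; c1++)
    for (int c2 = 0; c2 < n; c2++) {
      C[c1 * n + c2] *= beta;
      for (int c3 = 0; c3 < n; c3++)
        C[c1 * n + c2] += A[c1 * n + c3] * B[c3 * n + c2] * alpha;
    }
}
```

Fusion is legal here because it simply restores the statement order of the original input program.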