Three main optimization paths are available in PolyOpt/C. They are geared towards improving data locality (fewer data cache misses) and towards both coarse- and fine-grain shared-memory parallelization with OpenMP. They are applied to all Static Control Parts (SCoPs) automatically detected in the input program. Program transformations are generated via the PoCC polyhedral engine.
--polyopt-fixed-tiling
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel blocked execution
of the SCoP, when possible. The default tile size is 32 and can be
specified at compile time only. Parallel loops are marked with OpenMP
pragmas, and inner-most vectorizable loops are marked with ivdep
pragmas. Parallel or pipeline-parallel tile execution is achieved when
tiling is possible.
The Pluto module is used to compute the loop transformation sequence, in the form of a series of affine multidimensional schedules.
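As an illustration of the tiling shape this path produces, here is a minimal hand-written sketch of fixed-size rectangular tiling of a doubly nested loop (an illustration with a 32x32 tile, not actual PolyOpt output):

```c
#include <assert.h>

#define T 32  /* fixed tile size: a compile-time constant in this path */

/* Hand-written sketch of fixed-size rectangular tiling: the outer
   (tile) loops enumerate TxT blocks, the inner (point) loops scan one
   block; the min-style guards handle partial tiles when n % T != 0. */
static void scale_tiled(int n, double *A, double beta)
{
  for (int ti = 0; ti < n; ti += T)
    for (int tj = 0; tj < n; tj += T)
      for (int i = ti; i < (ti + T < n ? ti + T : n); ++i)
        for (int j = tj; j < (tj + T < n ? tj + T : n); ++j)
          A[i * n + j] *= beta;
}
```

Every iteration of the original nest is executed exactly once, only in a block-by-block order that improves cache reuse.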
Giving the flag --polyopt-fixed-tiling
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-tile
--polyopt-pluto-parallel
--polyopt-pluto-prevector
--polyopt-generate-pragmas
Given the input program:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    C[i][j] *= beta;
    for (k = 0; k < n; k++)
      C[i][j] += A[i][k] * B[k][j] * alpha;
  }
One can optionally specify a file to set the tile sizes, to
override the default 32 value. This file must be called
tile.sizes
, and be stored in the current working directory. It
must contain one tile size per dimension to be tiled. For example:
$> cat tile.sizes
16
64
1
The result of --polyopt-fixed-tiling on the above example, with the
specified tile.sizes file, is shown below. Note that if a tile.sizes
file exists in the current working directory, it will always be used.
{
  int c6;
  int c3;
  int c1;
  int c2;
  int c4;
  if (n >= 1) {
#pragma omp parallel for private(c4, c2, c6)
    for (c1 = 0; c1 <= (((n + -1) * 16 < 0? ((16 < 0?-((-(n + -1) + 16 + 1) / 16) : -((-(n + -1) + 16 - 1) / 16))) : (n + -1) / 16)); c1++)
      for (c2 = 0; c2 <= (((n + -1) * 64 < 0? ((64 < 0?-((-(n + -1) + 64 + 1) / 64) : -((-(n + -1) + 64 - 1) / 64))) : (n + -1) / 64)); c2++)
        for (c4 = 16 * c1; c4 <= ((16 * c1 + 15 < n + -1? 16 * c1 + 15 : n + -1)); c4++)
#pragma ivdep
#pragma vector always
          for (c6 = 64 * c2; c6 <= ((64 * c2 + 63 < n + -1? 64 * c2 + 63 : n + -1)); c6++)
            (C[c4])[c6] *= beta;
#pragma omp parallel for private(c4, c2, c3, c6)
    for (c1 = 0; c1 <= (((n + -1) * 16 < 0? ((16 < 0?-((-(n + -1) + 16 + 1) / 16) : -((-(n + -1) + 16 - 1) / 16))) : (n + -1) / 16)); c1++)
      for (c2 = 0; c2 <= (((n + -1) * 64 < 0? ((64 < 0?-((-(n + -1) + 64 + 1) / 64) : -((-(n + -1) + 64 - 1) / 64))) : (n + -1) / 64)); c2++)
        for (c3 = 0; c3 <= n + -1; c3++)
          for (c4 = 16 * c1; c4 <= ((16 * c1 + 15 < n + -1? 16 * c1 + 15 : n + -1)); c4++)
#pragma ivdep
#pragma vector always
            for (c6 = 64 * c2; c6 <= ((64 * c2 + 63 < n + -1? 64 * c2 + 63 : n + -1)); c6++)
              (C[c4])[c6] += ((((A[c4])[c3]) * ((B[c3])[c6])) * alpha);
  }
}
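The convoluted conditional expressions bounding c1 and c2 in the listing above are the code generator's open-coded form of floor division on a possibly negative numerator. A hand-written equivalent helper (an illustration only, not part of the generated code) is:

```c
#include <assert.h>

/* floord(a, b): integer floor division, matching the open-coded
   conditionals in the generated listing (b is assumed positive).
   Plain C division truncates toward zero, so a correction is needed
   when a is negative. */
static int floord(int a, int b)
{
  return a < 0 ? -((-a + b - 1) / b) : a / b;
}
```

For instance, the upper bound of the c1 tile loop in the listing is simply floord(n - 1, 16).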
--polyopt-parametric-tiling
NOTE: The parametric tiling path is still experimental, and the correctness of the generated code is not guaranteed in all cases. In particular, a known issue arises when parametric tiling is applied to a loop nest whose outer loop is sequential (so that wavefront creation is required) and whose inner loops are permutable but not fusable. We are working hard to fix this remaining problem.
To the best of our knowledge, the generated code is correct when all
statements in a (possibly imperfectly nested) loop nest can be
maximally fused. Remember that polyhedral transformations are
automatically computed before the parametric tiling pass to enforce
this property on the code when possible. The above issue impacts only
program parts where it is not possible to exhibit a polyhedral
transformation making either the outer loop parallel, or all loops
fusable in the loop nest. This is not a frequent pattern: for
instance, none of the 28 benchmarks of the PolyBench 2.0 test suite
exhibits this issue.
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel blocked execution
of the SCoP. The generated code is parametrically tiled when possible,
and the tile sizes can be specified at runtime via the
__pace_tile_sizes[]
array. By default, the tile sizes are set
to 32. Parallel loops are marked with OpenMP pragmas.
The Pluto module is used to compute a loop transformation sequence that makes tiling legal, when possible, and the PTile module performs parametric tiling. Parallel or pipeline-parallel tile execution is achieved if tiling is possible.
Giving the flag --polyopt-parametric-tiling
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-parallel
--polyopt-codegen-use-ptile
--polyopt-codegen-insert-ptile-api
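The effect of parametric tiling on a single loop can be sketched as follows (a hand-written illustration, not actual PTile output):

```c
#include <assert.h>

/* Hand-written sketch of parametric tiling: unlike fixed tiling, the
   tile size T is an ordinary runtime parameter, so the same binary
   can be run with different tile sizes without recompiling. */
static long sum_tiled(const long *a, int n, int T)
{
  long s = 0;
  for (int t = 0; t < n; t += T)                       /* tile loop  */
    for (int i = t; i < (t + T < n ? t + T : n); ++i)  /* point loop */
      s += a[i];
  return s;
}
```

The price of this flexibility is that loop bounds now involve the symbolic parameter T, which is why the generated code below manipulates the tile sizes through floating-point temporaries and round() calls rather than simple integer constants.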
The PACE tiling API requires the use of the function
PACETileSizeVectorInit(int*, int, int)
to fill in the tile
sizes. This function takes an array of integers, the number of tile
size parameters, and a unique identifier for the SCoP. It can be
defined in another compilation unit, inserted automatically by the
PACE compiler, or added manually by the user. It allows the tile
sizes to be selected at run time, before the computation starts.
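A possible user-supplied definition is sketched below. The function name and signature come from the PACE API above; the comma-separated TILE_SIZES environment variable is purely an assumption made for illustration.

```c
#include <stdlib.h>

/* Hypothetical user-supplied definition of the PACE tiling API entry
   point.  Defaults every tile size to 32 (PolyOpt/C's default), then
   overrides from an assumed TILE_SIZES environment variable, e.g.
   TILE_SIZES=16,64,1.  The scop_id could be used to select per-SCoP
   sizes; it is ignored in this sketch. */
void PACETileSizeVectorInit(int *sizes, int num_sizes, int scop_id)
{
  (void) scop_id;
  for (int i = 0; i < num_sizes; ++i)
    sizes[i] = 32;
  const char *p = getenv("TILE_SIZES");
  if (!p)
    return;
  for (int i = 0; i < num_sizes && *p; ++i) {
    sizes[i] = atoi(p);
    while (*p && *p != ',')   /* skip to the next comma, if any */
      ++p;
    if (*p == ',')
      ++p;
  }
}
```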
The result of --polyopt-parametric-tiling
on the above
dgemm
example is shown below.
{
  int ___pace_tile_sizes[3];
  PACETileSizeVectorInit(___pace_tile_sizes, 3, 2);
  int c2;
  int c2t1;
  int c1;
  int c3;
  int c1t1;
  float T1c3 = (float )(___pace_tile_sizes[0]);
  int c3t1;
  float T1c2 = (float )(___pace_tile_sizes[1]);
  float T1c1 = (float )(___pace_tile_sizes[2]);
  if (n >= 1) {
    {
      int tmpLb;
      int tmpUb;
      tmpLb = round(-1 + 1 / T1c1);
      tmpUb = round(n * (1 / T1c1) + 1 / T1c1 * -1);
#pragma omp parallel for private(c2t1, c1, c2)
      for (c1t1 = tmpLb; c1t1 <= tmpUb; ++c1t1)
        for (c2t1 = round(-1 + 1 / T1c2); c2t1 <= round(n * (1 / T1c2) + 1 / T1c2 * -1); ++c2t1)
          for (c1 = (c1t1 * T1c1 > 0?c1t1 * T1c1 : 0); c1 <= ((c1t1 * T1c1 + (T1c1 + -1) < n + -1? c1t1 * T1c1 + (T1c1 + -1) : n + -1)); c1++)
            for (c2 = (c2t1 * T1c2 > 0?c2t1 * T1c2 : 0); c2 <= ((c2t1 * T1c2 + (T1c2 + -1) < n + -1? c2t1 * T1c2 + (T1c2 + -1) : n + -1)); c2++)
              (C[c1])[c2] *= beta;
    }
    {
      int tmpLb;
      int tmpUb;
      tmpLb = round(-1 + 1 / T1c1);
      tmpUb = round(n * (1 / T1c1) + 1 / T1c1 * -1);
#pragma omp parallel for private(c2t1, c3t1, c1, c2, c3)
      for (c1t1 = tmpLb; c1t1 <= tmpUb; ++c1t1)
        for (c2t1 = round(-1 + 1 / T1c2); c2t1 <= round(n * (1 / T1c2) + 1 / T1c2 * -1); ++c2t1)
          for (c3t1 = round(-1 + 1 / T1c3); c3t1 <= round(n * (1 / T1c3) + 1 / T1c3 * -1); ++c3t1)
            for (c1 = (c1t1 * T1c1 > 0?c1t1 * T1c1 : 0); c1 <= ((c1t1 * T1c1 + (T1c1 + -1) < n + -1? c1t1 * T1c1 + (T1c1 + -1) : n + -1)); c1++)
              for (c2 = (c2t1 * T1c2 > 0?c2t1 * T1c2 : 0); c2 <= ((c2t1 * T1c2 + (T1c2 + -1) < n + -1? c2t1 * T1c2 + (T1c2 + -1) : n + -1)); c2++)
                for (c3 = (c3t1 * T1c3 > 0?c3t1 * T1c3 : 0); c3 <= ((c3t1 * T1c3 + (T1c3 + -1) < n + -1? c3t1 * T1c3 + (T1c3 + -1) : n + -1)); c3++)
                  (C[c1])[c2] += ((((A[c1])[c3]) * ((B[c3])[c2])) * alpha);
    }
  }
}
--polyopt-parallel-only
This path automatically computes and applies a complex, SCoP-specific
sequence of loop transformations to enable parallel execution of the
SCoP while improving data locality. In contrast to the other paths, no
tiling is applied on the generated program. Parallel loops are marked
with OpenMP pragmas. The Pluto module is used to compute a loop
transformation sequence that exposes coarse-grain parallelism when
possible.
Giving the flag --polyopt-parallel-only
to PolyOpt/C is
equivalent to giving the sequence:
--polyopt-pluto-fuse-smartfuse
--polyopt-pluto-parallel
--polyopt-generate-pragmas
The result of --polyopt-parallel-only
on the above dgemm
example is shown below. Note that pre-vectorization is disabled in
this mode; fixed tiling must be enabled for it to be active. To
prevent the distribution of the two statements, the user can rely on
the fine-tuning flags, e.g., --polyopt-pluto-fuse-maxfuse.
{
  int c2;
  int c1;
  int c3;
  if (n >= 1) {
#pragma omp parallel for private(c2)
    for (c1 = 0; c1 <= n + -1; c1++)
      for (c2 = 0; c2 <= n + -1; c2++)
        (C[c1])[c2] *= beta;
#pragma omp parallel for private(c3, c2)
    for (c1 = 0; c1 <= n + -1; c1++)
      for (c2 = 0; c2 <= n + -1; c2++)
        for (c3 = 0; c3 <= n + -1; c3++)
          (C[c1])[c2] += ((((A[c1])[c3]) * ((B[c3])[c2])) * alpha);
  }
}
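For reference, the fused shape that --polyopt-pluto-fuse-maxfuse aims to preserve can be sketched by hand as follows (an illustration using linearized arrays, not verbatim PolyOpt output):

```c
/* Hand-written sketch of fused dgemm: the beta scaling and the
   update share one (c1, c2) loop nest, so each C element stays in
   cache across both statements instead of being traversed twice. */
void dgemm_fused(int n, double alpha, double beta,
                 const double *A, const double *B, double *C)
{
  #pragma omp parallel for
  for (int c1 = 0; c1 < n; c1++)
    for (int c2 = 0; c2 < n; c2++) {
      C[c1 * n + c2] *= beta;
      for (int c3 = 0; c3 < n; c3++)
        C[c1 * n + c2] += A[c1 * n + c3] * B[c3 * n + c2] * alpha;
    }
}
```

Fusion is legal here because it simply restores the statement order of the original input program.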