PoCC is a flexible source-to-source iterative and model-driven compiler, embedding most of the state-of-the-art tools for polyhedral compilation.
PoCC embeds powerful Free software for polyhedral compilation. Each tool is accessible from the main driver, and several IR conversion functions allow the passes of the compiler to communicate easily with one another.
Communication: three groups are available for subscription
Please note that the following two documents are highly preliminary and incomplete. The documentation will be improved soon; in the meantime, don't hesitate to contact the author with any questions.
The stable mode of PoCC does not require any software beyond a working GNU toolchain and Perl. Several other modes are available through the SVN version of PoCC (available on request). To use those other modes, such as base, devel and irregular, additional software is required to build the development version of PoCC.
PoCC features several modes to configure the compiler. The default mode for the installer is stable. To change it and use a development mode, change the value of the POCC_VERSION variable at the beginning of the install.sh file.
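For instance, a minimal sketch of the relevant line at the top of install.sh, assuming the variable takes the mode names above as values:
POCC_VERSION=devel    # one of: stable, base, devel, irregular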
The installation of PoCC is packaged in an installer script, install.sh. PoCC is not meant to be installed in a specific location on the system. Instead, append the pocc-1.1/bin directory to your PATH variable to use PoCC from any location.
$> tar xzf pocc-1.1.tar.gz
$> cd pocc-1.1
$> ./install.sh
$> export PATH=$PATH:`pwd`/bin
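To make the PATH change persistent and verify that the driver is found, something along these lines works (a sketch assuming a bash-style shell; substitute the actual location of your pocc-1.1/bin directory):
$> echo 'export PATH=$PATH:/path/to/pocc-1.1/bin' >> ~/.bashrc
$> pocc --version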
For a test run of the compiler:
$> pocc gemver.c
[PoCC] Compiling file: gemver.c
[PoCC] INFO: pass-thru compilation, no optimization enabled
[PoCC] Running Clan
[PoCC] Running Candl
[PoCC] Starting Codegen
[PoCC] Running CLooG
[CLooG] INFO: 3 dimensions (over 5) are scalar.
[PAST] Converting CLAST to PoCC AST
[PoCC] Using the PAST back-end
[PoCC] Output file is gemver.pocc.c.
[PoCC] All done.
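The generated file can then be compiled with any C compiler. Alternatively, PoCC can invoke the C compiler itself through the --compile option described below (the default compilation command is gcc -O3 -lm, and can be changed with --compile-cmd):
$> pocc --compile gemver.c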
To inspect the available options:
$> pocc --help
PoCC, the Polyhedral Compiler Collection, version 1.1.
Written by Louis-Noel Pouchet <pouchet@cse.ohio-state.edu>
Major contributions by Cedric Bastoul and Uday Bondhugula.

Available options for PoCC are:
-h --help                 Print this help
-v --version              Print version information
-o --output <arg>         Output file [filename.pocc.c]
   --output-scop          Output scoplib file to filename.pocc.scop
   --cloogify-scheds      Create CLooG-compatible schedules in the scop
   --bounded-ctxt         Parser: bound all global parameters >= -1
   --default-ctxt         Parser: bound all global parameters >= 32
   --inscop-fakearray     Parser: use FAKEARRAY[i] to explicitly declare write dependences
   --read-scop            Parser: read SCoP file instead of C file as input
   --no-candl             Dependence analysis: don't run candl [off]
   --candl-dep-isl-simp   Dependence analysis: simplify with ISL [off]
   --candl-dep-prune      Dependence analysis: prune redundant deps [off]
   --polyfeat             Run Polyhedral Feature Extraction [off]
   --polyfeat-rar         Consider RAR dependences in PolyFeat [off]
-d --delete-files         Delete files previously generated by PoCC [off]
   --verbose              Verbose output [off]
   --quiet                Minimal output [off]
-l --letsee               Optimize with LetSee [off]
   --letsee-space <arg>   LetSee: search space: [precut], schedule
   --letsee-walk <arg>    LetSee: traversal heuristic: [exhaust], random, skip, m1, dh, ga
   --letsee-dry-run       Only generate source files [off]
   --letsee-normspace     LetSee: normalize search space [off]
   --letsee-bounds <arg>  LetSee: search space bounds [-1,1,-1,1,-1,1]
   --letsee-mode-m1 <arg> LetSee: scheme for M1 traversal [i+p,i,0]
   --letsee-rtries <arg>  LetSee: number of random draws [50]
   --letsee-prune-precut  LetSee: prune precut space
   --letsee-backtrack     LetSee: allow backtracking in schedule mode
-p --pluto                Optimize with PLuTo [off]
   --pluto-parallel       PLuTo: OpenMP parallelization [off]
   --pluto-tile           PLuTo: polyhedral tiling [off]
   --pluto-l2tile         PLuTo: perform L2 tiling [off]
   --pluto-fuse <arg>     PLuTo: fusion heuristic: maxfuse, [smartfuse], nofuse
   --pluto-unroll         PLuTo: unroll loops [off]
   --pluto-ufactor <arg>  PLuTo: unrolling factor [4]
   --pluto-polyunroll     PLuTo: polyhedral unrolling [off]
   --pluto-prevector      PLuTo: perform prevectorization [off]
   --pluto-multipipe      PLuTo: multipipe [off]
   --pluto-rar            PLuTo: consider RAR dependences [off]
   --pluto-rar-cf         PLuTo: consider RAR dependences for cost function only [off]
   --pluto-lastwriter     PLuTo: perform lastwriter dep. simp. [off]
   --pluto-scalpriv       PLuTo: perform scalar privatization [off]
   --pluto-bee            PLuTo: use Bee [off]
   --pluto-quiet          PLuTo: be quiet [off]
   --pluto-ft             PLuTo: ft [off]
   --pluto-lt             PLuTo: lt [off]
   --pluto-ext-candl      PLuTo: Read dependences from SCoP [off]
   --pluto-tile-scat      PLuTo: Perform tiling inside scatterings [off]
   --pluto-bounds <arg>   PLuTo: Transformation coefficients bounds [+inf]
-n --no-codegen           Do not generate code [off]
   --cloog-cloogf <arg>   CLooG: first level to scan [1]
   --cloog-cloogl <arg>   CLooG: last level to scan [-1]
   --print-cloog-file     CLooG: print input CLooG file
   --no-past              Do not use the PAST back-end [off]
   --past-hoist-lb        Hoist loop bounds [off]
   --pragmatizer          Use the AST pragmatizer [off]
   --ptile                Use PTile for parametric tiling [off]
   --ptile-fts            Use full-tile separation in PTile [off]
   --punroll              Use PAST loop unrolling [off]
   --register-tiling      PAST register tiling [off]
   --punroll-size <arg>   PAST unrolling size [4]
   --vectorizer           Post-transform for vectorization [off]
   --codegen-timercode    Codegen: insert timer code [off]
   --codegen-timer-asm    Codegen: insert ASM timer code [off]
   --codegen-timer-papi   Codegen: insert PAPI timer code [off]
-c --compile              Compile program with C compiler [off]
   --compile-cmd <arg>    Compilation command [gcc -O3 -lm]
   --run-cmd-args <arg>   Program execution arguments []
   --prog-timeout <arg>   Timeout for compilation and execution, in seconds [unlimited]
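For example, a model-driven (non-iterative) optimization run with tiling and OpenMP parallelization combines the PLuTo flags listed above (one representative combination among many):
$> pocc --pluto --pluto-tile --pluto-parallel gemver.c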
To run an iterative search among possible precuts, with tiling and parallelization enabled:
$> pocc --letsee --pluto-tile --pluto-parallel --codegen-timercode --verbose gemver.c
[...]
We experimented on three high-end machines: a 4-socket Intel hexa-core Xeon E7450 (Dunnington) at 2.4GHz with 64 GB of memory (24 cores, 24 hardware threads), a 4-socket AMD quad-core Opteron 8380 (Shanghai) at 2.5GHz (16 cores, 16 hardware threads) with 64 GB of memory, and a 2-socket IBM dual-core Power5+ at 1.65GHz (4 cores, 8 hardware threads) with 16 GB of memory.
All systems were running Linux 2.6.x. We used Intel ICC 10.0 with options -fast -parallel -openmp (referred to as icc-par) and with -fast (referred to as icc-nopar); GCC 4.3.3 with options -O3 -msse3 -fopenmp (referred to as gcc); and IBM XLC 10.1, compiled for Power5, with options -O3 -qhot=nosimd -qsmp -qthreaded (referred to as xlc-par) and -O3 -qhot=nosimd (referred to as xlc-nopar). We report the performance of the precut iterative compilation mode of PoCC as iter-xx when used on top of the xx compiler. Precut search is enabled in PoCC with the option --letsee-space precut, and tiling and parallelization for precuts with --pluto-tile --pluto-parallel.
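Putting these options together, a representative invocation for one iterative precut run would look as follows (a sketch: the timing and compilation flags are taken from the option list above, not from the exact experimental scripts):
$> pocc --letsee --letsee-space precut --pluto-tile --pluto-parallel --codegen-timercode --compile gemver.c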
We consider 8 benchmarks, typical of compute-intensive sequences of linear algebra operations: atax, bicg and gemver are compositions of BLAS operations; ludcmp solves simultaneous linear equations by LU decomposition; advect3d is an advection kernel for weather modeling; and doitgen is an in-place 3D-2D matrix product. correl creates a correlation matrix and varcovar creates a variance-covariance matrix; both are used in Principal Component Analysis in the StatLib library. The time to compute the search space, pick a candidate and compute a full transformation is negligible with respect to the compilation and execution time of the tested versions. In our experiments, the full compilation process takes a few seconds for the smaller benchmarks, and up to about 1 minute for correl on the Xeon.
For doitgen, correl and varcovar, three compute-bound benchmarks, our technique exposes a program version with a significant parallel speedup, of up to 112x on the Opteron machine. Our optimization technique goes far beyond parallelizing programs: for these benchmarks, the framework also achieved locality and vectorization improvements. For advect3d, atax, bicg and gemver we also observe a significant speedup, but it is limited by memory bandwidth, as these benchmarks are memory-bound. Still, we achieve a solid performance improvement over the native compilers, of up to 3.8x for atax on the Xeon machine and 5x for advect3d on the Opteron machine. For ludcmp, although parallelism was exposed, the speedup remains limited, as the program offers little opportunity for high-level optimizations. Even so, our technique outperforms the native compiler by a factor of up to 2x on the Xeon machine.
For the Xeon and Opteron machines, the iterative process outperforms ICC 10 with auto-parallelization, by a factor ranging from 1.2x for gemver on the Xeon to 15.3x for doitgen. For both of these kernels, we also compared against implementations using Intel Math Kernel Library (MKL) 10.0 and AMD Core Math Library (ACML) 4.1.0 on the Xeon and Opteron machines respectively, and we obtain a speedup of 1.5x to 3x over these vendor libraries.
For varcovar, our technique outperforms the native compiler by a factor of up to 15x. Although maximal fusion significantly improved performance, the best iteratively found fusion structure allows for a much better improvement, up to 1.75x better. Maximal fusion is also outperformed for all benchmarks except ludcmp and doitgen, for which it is outperformed on some machines only. This highlights the power of the method to discover an efficient balance between parallelism (both coarse-grain and fine-grain) and locality.
On the Power5+ machine, for all benchmarks but advect3d the iterative process outperforms XLC with auto-parallelization, by a factor ranging from 1.1x for atax to 21x for varcovar.
[Figures: per machine, speedup over the best single-threaded version, and performance improvement over maximal fusion and over the best reference auto-parallelizing compiler.
Machines: 4-socket Intel Dunnington Xeon E7450 (24 H/W threads); 4-socket AMD Shanghai Opteron 8380 (16 H/W threads); 2-socket IBM dual-core Power5+ (8 H/W threads).]
PoCC was supported in part by the EU-funded ACOTES project (Advanced Compiler Technologies for Embedded Streaming), under the European Union's Sixth Framework IST programme, together with NXP, STMicroelectronics, Nokia, INRIA, IBM Haifa Research Lab and Universitat Politecnica de Catalunya.
3/12/2012: Release of pocc-1.1, including a fix for using the --pluto-tile-scat and --cloogify-scheds options together.