Objectives
The objective of this assignment is to explore implementations of
Polynomial Multiplication. Record the performance of your programs on
a Fish machine
(Anchovy to Wahoo in the machines list), and experimentally determine the
gains you get with 1-8 threads.
Download and untar PA3.tar. You should not need to
update the Makefile. Do not set the number of threads in the program, set it
at the Linux OS level (using the export or setenv command). Otherwise we
cannot run your program for different numbers of threads. Write your report
in pdf format.
The following two sequential programs have been implemented for you.
- [PolyMultGSQ] (Grade School Quadratic) This is the naive
implementation of Polynomial Multiplication.
- [PolyMultGold] This (sequential) implementation includes all of
the optimizations up to PolyMultBLQ. It is only provided as a binary. Use
it as a benchmark against which to compare your programs.
You will need to implement the following programs.
- [PolyMultINQ] (Inverted Quadratic)
- Invert the inner and outer loops of both computational loops from
the PolyMultGSQ implementation.
- Hint: once the loops have been inverted you can condense the
computation into a single double for loop.
- Explore loop parallelism. Report the optimal strategy and number
of threads.
- [PolyMultOPQ] (Outer Product Quadratic)
- Rewrite PolyMultGSQ to use a square iteration space (as shown in
lecture).
- Explore loop parallelism. Report the optimal strategy and number
of threads.
- [PolyMultBLQ] (Blocked Quadratic)
- Block both the inner and outer loops from PolyMultOPQ.
- Use the same parameter (tune1) to set the block size for each
loop.
- Explore loop parallelism. Report the optimal strategy, number of
threads, and block size.
- [PolyMultDCQ] (Divide and Conquer Quadratic)
- Rewrite polyMultOPQ to use the divide and conquer strategy
presented in lecture.
- Use tune2 to set the minimum leaf size.
- Explore task parallelism and test the use of different non
recursive algorithms in the leafs
(e.g. PolyMult{INQ,OPQ,BLQ}). Determine and report which configuration
gives you the best performance.
Report
- Include the details from the explore steps in each of the programs
above.
- When measuring speedup and finding optimal tuning parameters use a
degree of 65535 (2^16-1)
- Justify the optimal design and configuration of each program with
either experimental evidence or design principles from this class (for
some programs you may need to use both).
- For each parallelization, report speedup on 1-8 threads, using the
speedup equation from the lecture/book.
- As always, report the details of the machine where you are running
your tests.
Submission
Submit a PA3.tar file through Checkin. Create this file
by using the rule for PA3.tar in the provided Makefile. (This will ensure
that your project has the correct directory structure.)
Testing
Good luck and have fun!