The objective of this assignment is for you to implement the Divide and Conquer Dynamic Programming algorithm for the 0/1 knapsack problem, parallelize it in CUDA, and report your performance on the department's Fermi-based machines (pomegranates). The assignment has three coding parts (HW4A-HW4C), each incrementally building on the previous, plus a report (HW4D). By the end you should be able to make the code compute bound and achieve a good value of Gops per second.
- HW4A: knapAlgo [15 points]
Start from the sequential code that you wrote for HW3 and implement the fill_table function in CUDA (you will essentially be using the code that you wrote for HW0, which issues a sequence of kernel calls, one for each row of the table; a sketch of this row-by-row structure is given below). Make sure that your program now produces correct answers for large values of the capacity. Then modify the program so that, beyond a certain depth, fill_table is executed on the host rather than on the GPU, and study and report the depth at which the overhead of GPU parallelization is no longer worth the parallelization gains.
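For reference, here is a minimal sketch of that row-by-row structure. It assumes a row-major table of (N+1) x (C+1) ints with row 0 already initialized to zero on the device; the kernel, host wrapper, and names (fill_row, dev_table, dev_w, dev_v) are illustrative and not part of the provided starter code.

```cuda
// Sketch only: one kernel launch per row of the DP table.
// table[i][c] = best value using the first i objects with capacity c,
// stored row-major with (C+1) entries per row. Row 0 is assumed to be zeroed.
__global__ void fill_row(int *table, const int *w, const int *v, int i, int C)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per capacity value
    if (c > C) return;

    int best = table[(i - 1) * (C + 1) + c];          // skip object i
    if (w[i - 1] <= c) {                              // take object i if it fits
        int with = table[(i - 1) * (C + 1) + (c - w[i - 1])] + v[i - 1];
        if (with > best) best = with;
    }
    table[i * (C + 1) + c] = best;
}

// Host side: one kernel call per row, exactly as in HW0.
void fill_table_gpu(int *dev_table, const int *dev_w, const int *dev_v, int N, int C)
{
    int threads = 256;
    int blocks = (C + threads) / threads;             // enough blocks to cover capacities 0..C
    for (int i = 1; i <= N; i++)
        fill_row<<<blocks, threads>>>(dev_table, dev_w, dev_v, i, C);
    cudaDeviceSynchronize();
}
```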
- HW4B: knapAlgoSmallN (Making the CUDA code compute bound) [40 points]
Using the techniques described in the lecture, modify the CUDA code so that
the fill_table function is evaluated in a single kernel call.
We suggest that you proceed in the following steps.
- First, write a simple CUDA function that accomplishes the correct
synchronization between the threadblocks. Think of writing a toy program
similar to Waruna's example for __syncthreads (except that that example
synchronized threads within a single threadblock, whereas here you need to
synchronize multiple threadblocks); a toy sketch of one possible inter-block
barrier is given after this list.
- Next, embellish this with the code that correctly updates (with only one
thread per threadblock active) the section of the array that is allocated to
that threadblock. Make sure that the correct values are written to and read
from global memory between the synchronizations. For this to work there must
be enough global memory for each threadblock to store its entire "output"
(i.e., N*WMAX or N*Σ(Wi) memory per threadblock). Hence, if you want to keep
many threadblocks active, this scheme will only work for relatively small
values of N.
- After ensuring that this code produces the correct answers, parallelize
the computation within each threadblock. This is the part that was not
completely detailed in class, since there are a few different options that
you could pursue.
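As a starting point for the first step above, the following is a toy sketch of one possible inter-block barrier (an atomic counter that every block increments, with thread 0 of each block spinning until all blocks have arrived), together with a stand-in per-block update. This is only one plausible approach, not necessarily the scheme from lecture, and the names (global_barrier, toy_steps, g_count) are made up.

```cuda
// Toy sketch of synchronization between threadblocks inside one kernel.
// Assumes all threadblocks are resident on the GPU at once; if the launch has
// more blocks than can be co-resident, the spin loop deadlocks.
// g_count persists across launches; reset it with cudaMemcpyToSymbol if the
// kernel is launched more than once.
__device__ unsigned int g_count = 0;      // total number of barrier arrivals so far

__device__ void global_barrier(unsigned int goal)
{
    __syncthreads();                      // all threads of this block reach the barrier
    if (threadIdx.x == 0) {
        __threadfence();                  // flush this block's global-memory writes
        atomicAdd(&g_count, 1);           // announce arrival
        while (atomicAdd(&g_count, 0) < goal)
            ;                             // spin until every block has arrived
        __threadfence();                  // make other blocks' writes visible before continuing
    }
    __syncthreads();                      // release the remaining threads of the block
}

// Toy kernel: in each step, one thread per block updates that block's section of
// the array, then all blocks synchronize before anyone reads neighbouring sections.
__global__ void toy_steps(int *data, int section, int steps)
{
    for (int s = 0; s < steps; s++) {
        if (threadIdx.x == 0) {
            int base = blockIdx.x * section;
            for (int j = 0; j < section; j++)
                data[base + j] += 1;      // stand-in for the real per-block update
        }
        global_barrier((unsigned int)(s + 1) * gridDim.x);
    }
}
```

Note that a barrier of this kind only works if every threadblock of the launch can be resident on the GPU at the same time, which is another reason the number of blocks (and hence N) must stay modest in Part B.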
- HW4C: knapAlgoLargeN [25 points]
In 4B, the total amount of global memory required is (Number of Blocks)*N*WMAX, which restricts the number of objects your code can handle. Now modify your program so that an arbitrary value of N can be handled. For this you will repeatedly (in a sequence of kernel calls) call the code of Part B, each call handling the next group of objects; a host-side sketch of this loop is given below.
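One possible shape for that driver loop is sketched below. The kernel name knap_small_n, its signature, and the launch configuration are assumptions standing in for your own Part B code, extended so that it can start from an already-filled row of the table.

```cuda
// Sketch of the HW4C driver: process the objects in chunks small enough for the
// Part B scheme. The declaration below is a guess at what your Part B kernel,
// extended to fill rows first..last in one launch, might look like.
__global__ void knap_small_n(int *table, const int *w, const int *v,
                             int first, int last, int C);

void fill_table_large_n(int *dev_table, const int *dev_w, const int *dev_v,
                        int N, int C, int chunk,            // chunk = largest N Part B supports
                        int num_blocks, int threads_per_block)
{
    for (int first = 1; first <= N; first += chunk) {
        int last = (first + chunk - 1 <= N) ? (first + chunk - 1) : N;
        // Each call reads row first-1 from global memory and fills rows first..last.
        knap_small_n<<<num_blocks, threads_per_block>>>(dev_table, dev_w, dev_v,
                                                        first, last, C);
        // Not required for correctness (same-stream kernels serialize),
        // but convenient if you want to time each chunk separately.
        cudaDeviceSynchronize();
    }
}
```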
- HW4D: Report [20 points]
Your report should present the
performance, i.e., the total number of Gops per second achieved for large values
of N and C. Use k50kx1M.txt for reporting performance. If you need a smaller input file for debugging, use k5kx100k.txt or any input file from previous assignments. You can also use generateInput.c to generate your own input files.
Please be sure to report the times for the complete application,
including any code that is executed on the host as well as the time for data
transfer to and from the host. Measure each of the different components and
present an Amdahl's law analysis.
Analyze your code's performance with respect to the system peak performance.
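One minimal way to gather these numbers, sketched below, is to bracket each component (host-to-device copy, kernel execution, device-to-host copy, host code) with CUDA events. The operation count of N*C (one operation per table cell) is an illustrative assumption; use whatever count your own analysis defines.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal timing/reporting sketch (names are illustrative, not part of the assignment code).
// Assumes, for illustration only, one operation per DP-table cell, so total work is about N*C ops.
void time_one_component(long long N, long long C)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... the component being timed: the host-to-device copy, the fill_table
    //     kernel calls, or the device-to-host copy; time each one separately ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double gops = (double)N * (double)C / (ms * 1e6);   // ms -> s and ops -> Gops
    printf("component time: %.2f ms  (%.3f Gops/s if this were the whole run)\n", ms, gops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

For the Amdahl's law analysis, the components that do not speed up with additional GPU parallelism (host code and the transfers) form the serial fraction of the total time.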
Submit, using the Checkin tab, your codes, your timing and speedup
statistics, and your observations and analysis in a report. For all three
codes, the verbose options should be similar to HW3.