See Lab 5 for information on compiling and running CUDA programs. Run all experiments on the fish machines. *** Do not use wahoo for your CUDA experiments. ***
For this assignment, you will modify two provided CUDA kernels and empirically test and analyze their performance.
Download PA5.tar. It contains a general Makefile and the starter and timing files for your exercises.
The provided kernel is part of a program that performs vector MAX reduction. Notice that the MAX operator is commutative, i.e., MAX(a,b) = MAX(b,a), and associative, i.e., MAX(a,MAX(b,c)) = MAX(MAX(a,b),c). This means that the MAXs can be executed in any order, so you can apply the coalescing thread transformation you experimented with in Lab 5. In the provided kernel, each thread computes the maximum of one contiguous partition of the input array and writes it to global memory. The host then copies the result array back and computes the max of all the maxes.
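As a rough sketch of this structure (the kernel name and signature here are illustrative assumptions, not the actual starter code), the contiguous-partition kernel looks something like this:

    __global__ void vecMaxKernel00(const float *in, float *out, long elemsPerThread) {
        // Global thread id across the whole 1D grid.
        long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
        // Each thread owns one contiguous slice of the input.
        long start = tid * elemsPerThread;
        float m = in[start];
        for (long i = 1; i < elemsPerThread; ++i)
            m = fmaxf(m, in[start + i]);  // any evaluation order is valid: MAX is commutative and associative
        out[tid] = m;  // one partial max per thread; the host reduces these
    }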
Given a 1D grid of 80 thread blocks, a 1D thread block of 128 threads, and a problem size n = 1,280,000,000, each thread computes the maximum of 125,000 elements. The input array has size n, whereas the result array has size n/125,000, in this case 80*128 = 10,240.
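A matching launch configuration, as a sketch under the same naming assumptions as above (d_in and d_out stand for device pointers already allocated and populated by the host):

    dim3 dimGrid(80);    // 1D grid of 80 thread blocks
    dim3 dimBlock(128);  // 128 threads per block
    long n = 1280000000L;
    vecMaxKernel00<<<dimGrid, dimBlock>>>(d_in, d_out, n / (80L * 128L));  // 125,000 elements per thread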
In your modified kernel, each thread will read, in a coalescing fashion, n / (80*128) interleaved global array elements. For n = 1,280,000,000, each thread again reads 125,000 elements. Leave the grid and thread block dimensions the same. Again, each thread computes one maximum, now over an interleaved partition. The intermediate maxes computed by the GPU threads will differ from the ones computed by the original kernel; however, the max of maxes computed by the host will be the same as the original max of maxes.
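One way to structure the interleaved reads (a sketch under the same naming assumptions as above; your own vecMaxKernel01.cu may be organized differently):

    __global__ void vecMaxKernel01(const float *in, float *out, long n) {
        long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
        long stride = (long)gridDim.x * blockDim.x;  // total threads: 80 * 128 = 10,240
        float m = in[tid];
        // Consecutive threads read consecutive addresses on every iteration,
        // so each warp's loads coalesce into a few wide memory transactions.
        for (long i = tid + stride; i < n; i += stride)
            m = fmaxf(m, in[i]);
        out[tid] = m;
    }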
Measure and report the difference in performance of the two codes.
Additional files relevant to this part of the PA are:
You will write a new device kernel, called vecMaxKernel01.cu, to compute vector maxima using coalesced memory reads. Change the Makefile to compile a program called vecMax01 that uses your kernel rather than the provided one.
The tar file above provides the following files for Matmult:
Running make produces a binary called matmult00, which is invoked like this:
$ ./matmult00 X
where X controls the problem size (i.e., the matrices are NxN with N = X*FOOTPRINT_SIZE). FOOTPRINT_SIZE is defined in matmultKernel.h; deriving N this way avoids nasty padding issues. If you run it with X=100, for example, you see this:
$ ./matmult00 100
Data dimensions: 1600x1600
Grid Dimensions: 100x100
Block Dimensions: 16x16
Footprint Dimensions: 16x16
Time: 0.017199 (sec), nFlops: 8192000000, GFlopsS: 476.305669

The "Grid Dimensions" represent the number of CUDA thread blocks being used and match the value passed as X. The "Block Dimensions" represent the size of each CUDA thread block (i.e., the number of threads per block); this is controlled by the compile-time constant BLOCK_SIZE in matmultKernel.h. The "Footprint Dimensions" represent the size of the patch of C computed by each CUDA block; this is controlled by FOOTPRINT_SIZE. In the provided matmult00 implementation, FOOTPRINT_SIZE and BLOCK_SIZE are the same, which means that each thread updates a single element of C.
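To make the relationship between X, BLOCK_SIZE, FOOTPRINT_SIZE, and N concrete, the host-side setup behaves roughly like the sketch below (the kernel and variable names are illustrative assumptions; consult the provided host code for the real version):

    // X thread blocks per grid dimension; BLOCK_SIZE threads per block dimension.
    // Each block computes a FOOTPRINT_SIZE x FOOTPRINT_SIZE patch of C, so the
    // matrices are N x N with N = X * FOOTPRINT_SIZE (1600 for X = 100 above).
    int N = X * FOOTPRINT_SIZE;
    dim3 dimGrid(X, X);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);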
Your task is to modify the kernel (copy matmultKernel00.cu into matmultKernel01.cu) and produce a new binary, matmult01, in which each thread updates more than one element of C, based on the values of FOOTPRINT_SIZE and BLOCK_SIZE. For example, to have each thread update 4 elements of its block's patch, you would set FOOTPRINT_SIZE to twice BLOCK_SIZE. Your submitted code should have each thread compute 4 values of the resulting C matrix; a sketch of the indexing idea follows.
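The sketch below illustrates only the indexing: with FOOTPRINT_SIZE = 2*BLOCK_SIZE, each thread owns a 2x2 sub-tile of its block's footprint. It reads A and B straight from global memory and omits any shared-memory tiling the provided kernel may use, so treat it as a starting point under stated assumptions, not the solution; all names besides BLOCK_SIZE and FOOTPRINT_SIZE are hypothetical.

    #define BLOCK_SIZE 16
    #define FOOTPRINT_SIZE (2 * BLOCK_SIZE)

    __global__ void MatMulKernel01(const float *A, const float *B, float *C, int N) {
        // Top-left corner of this block's FOOTPRINT_SIZE x FOOTPRINT_SIZE patch of C.
        int patchRow = blockIdx.y * FOOTPRINT_SIZE;
        int patchCol = blockIdx.x * FOOTPRINT_SIZE;
        int ty = threadIdx.y, tx = threadIdx.x;

        // Each thread accumulates 4 dot products: a 2x2 sub-tile of the patch,
        // with its two rows and two columns separated by BLOCK_SIZE.
        float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
        for (int k = 0; k < N; ++k) {
            float a0 = A[(patchRow + ty) * N + k];
            float a1 = A[(patchRow + ty + BLOCK_SIZE) * N + k];
            float b0 = B[k * N + patchCol + tx];
            float b1 = B[k * N + patchCol + tx + BLOCK_SIZE];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        C[(patchRow + ty) * N + patchCol + tx] = c00;
        C[(patchRow + ty) * N + patchCol + tx + BLOCK_SIZE] = c01;
        C[(patchRow + ty + BLOCK_SIZE) * N + patchCol + tx] = c10;
        C[(patchRow + ty + BLOCK_SIZE) * N + patchCol + tx + BLOCK_SIZE] = c11;
    }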
Here is an example of running the solution code's matmult01:
$ ./matmult01 100
Data dimensions: 3200x3200
Grid Dimensions: 100x100
Block Dimensions: 16x16
Footprint Dimensions: 32x32
Time: 0.108151 (sec), nFlops: 65536000000, GFlopsS: 605.967812

Notice that parameter 100 now creates a 100*32 = 3200 square-matrix problem, because the footprint has become 32x32. Also notice a nice speedup.
Investigate how each of the following factors influences performance of matrix multiplication in CUDA: