For this assignment, you will modify two provided CUDA kernels: one from a program that performs vector addition, and one from a program that performs matrix multiplication. The provided vector addition program does not coalesce its memory accesses; you will modify it so that it does. For the matrix multiplication program, you will investigate its performance and how it can be optimized by changing how the task is parallelized: you will create an improved version of the program, empirically measure the time it takes to run, and analyze the results of your timing tests.
You will turn in all code you have written and used for this assignment, a makefile that will compile your code, and a standard lab report documenting your experiments and results. We need to be able to compile your code by executing "make".
Resources that may be useful for this assignment include:
For this assignment, you will need to use a Linux machine with an NVIDIA graphics card. In the CSU labs, the machines in rooms 120 and 325 have such a graphics card. In general, the machines in 325 are preferred for use with CUDA. These machines are the same as the ones used in your lab.
You can use ssh to work on the lab machines remotely since your work with CUDA will not require that you have access to the main console of the machine you are using (you will not be doing any graphics).
To use CUDA on the lab machines at CSU, you will need to set the right environment variables. It's convenient to edit your .cshrc or your .profile file (depending on whether you are using tcsh or bash) to set them when you log in. You should add
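The exact lines to add depend on where CUDA is installed on the lab machines, so the paths below are only illustrative; check the actual install location (for example with `ls /usr/local/cuda*`) before copying them.

```shell
# Illustrative settings only -- the CUDA install path varies by machine.
# For bash, in .profile:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# For tcsh, the equivalent lines in .cshrc would be:
#   setenv PATH /usr/local/cuda/bin:$PATH
#   setenv LD_LIBRARY_PATH /usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

After editing the file, log out and back in (or source the file) and verify the setup with `which nvcc`.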
Download and untar the CUDA1.tar file.
To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided makefile as a starting point for this assignment. Initially, you will use the provided makefile to verify that your environment is correctly configured to use CUDA.
As discussed in class, vecadd is a micro-benchmark for measuring the effectiveness of coalescing. You are provided with a non-coalesced version; your job is to create a coalesced version and to measure the difference in performance between the two programs.
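To see why the provided version does not coalesce, consider the access pattern it uses. The sketch below is illustrative (the names and exact layout are assumptions; consult the provided vecAddKernel00.cu for the real code): each thread owns a contiguous chunk of N elements, so at any instant the threads of a warp touch addresses N elements apart, and the hardware cannot combine those loads into a single memory transaction.

```cuda
// Sketch of a non-coalesced vector-add kernel (illustrative, not the
// provided source). Each thread processes N consecutive elements, so
// neighboring threads access addresses that are N floats apart.
__global__ void vecAddKernel00(const float *A, const float *B, float *C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x * N;
    int threadEndIndex   = threadStartIndex + N;

    for (int i = threadStartIndex; i < threadEndIndex; ++i)
        C[i] = A[i] + B[i];
}
```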
Here is a set of files provided in CUDA1.tar
Compile and run the provided program vecadd00 and collect data on the time the program takes with the following number of values per thread: 500, 1000, 2000.
Include a short comment in your makefile describing what your various programs do, and fully describe and analyze the experiments you performed in your lab report.
In vecadd01 you will use a new device kernel, called vecAddKernel01.cu, to perform vector addition using coalesced memory reads. Change the makefile to compile a program called vecadd01 using your kernel rather than the provided kernel. Modify the makefile as appropriate so the clean and tar commands will deal with any files you have added.
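One way to coalesce the accesses (a sketch under the same assumptions as above, not the only correct solution) is to keep each thread responsible for N elements but interleave the work: on every loop iteration, consecutive threads read consecutive elements, so each warp's loads fall in one contiguous region that the hardware can combine.

```cuda
// Sketch of a coalesced vector-add kernel: thread t starts at offset t
// within the block's region and strides by the block width, so at each
// iteration the threads of a warp access adjacent addresses.
__global__ void vecAddKernel01(const float *A, const float *B, float *C, int N)
{
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int blockEndIndex   = blockStartIndex + blockDim.x * N;

    for (int i = blockStartIndex + threadIdx.x; i < blockEndIndex; i += blockDim.x)
        C[i] = A[i] + B[i];
}
```

Note that both kernels perform exactly the same additions; only the assignment of elements to threads changes, which is why any timing difference isolates the effect of coalescing.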
Test your new program and compare the time it takes to perform the same work as the original. Note and analyze your results, and document your observations in your report.
Here is a set of provided files in CUDA1.tar for Matmult:
You should investigate how each of the following factors influences performance of matrix multiplication in CUDA:
When run with a single parameter, the provided code multiplies that parameter by FOOTPRINT_SIZE (set to 16 in matmult00) and creates square matrices of the resulting size. This was done to avoid nasty padding issues: you always have data blocks perfectly fitting the grid. Here is an example of running matmult00 on figs:
figs 46 # matmult00 32
Data dimensions: 512x512
Grid Dimensions: 32x32
Block Dimensions: 16x16
Footprint Dimensions: 16x16
Time: 0.016696 (sec), nFlops: 268435456, GFlopsS: 16.077853
Notice that the parameter value 32 creates a 32*16 = 512 square-matrix problem.
In your new kernel, each thread computes four values in the resulting C block rather than one, so FOOTPRINT_SIZE becomes 32. (Notice that this is taken care of by the Makefile.) You will time the execution of this new program with matrices of the sizes listed above to document how your changes affect the performance of the program. To get good performance, you will need to be sure that the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops that you might be inclined to insert into your code.
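One way to arrange the four values per thread (a sketch only; the names, the 2x2 output layout, and the tiling details are assumptions, and other layouts also work) is to have each thread in a 16x16 block compute the four C elements at (row, col), (row, col+16), (row+16, col), and (row+16, col+16), so the block covers a 32x32 footprint. Each thread also loads four elements of each shared-memory tile, with consecutive threads touching consecutive addresses so the loads coalesce, and the inner loop is unrolled.

```cuda
// Sketch: each 16x16 thread block computes a 32x32 footprint of C,
// four output elements per thread. Assumes n is a multiple of 32.
#define FOOTPRINT_SIZE 32
#define BLOCK_SIZE 16

__global__ void MatMulKernel01(const float *A, const float *B, float *C, int n)
{
    __shared__ float sA[FOOTPRINT_SIZE][FOOTPRINT_SIZE];
    __shared__ float sB[FOOTPRINT_SIZE][FOOTPRINT_SIZE];

    int ty = threadIdx.y, tx = threadIdx.x;
    int row = blockIdx.y * FOOTPRINT_SIZE + ty;   // top-left of this thread's outputs
    int col = blockIdx.x * FOOTPRINT_SIZE + tx;

    float c00 = 0, c01 = 0, c10 = 0, c11 = 0;

    for (int m = 0; m < n; m += FOOTPRINT_SIZE) {
        // Each thread cooperatively loads four elements of each tile;
        // consecutive tx values read consecutive addresses (coalesced).
        sA[ty][tx]                           = A[row * n + m + tx];
        sA[ty][tx + BLOCK_SIZE]              = A[row * n + m + tx + BLOCK_SIZE];
        sA[ty + BLOCK_SIZE][tx]              = A[(row + BLOCK_SIZE) * n + m + tx];
        sA[ty + BLOCK_SIZE][tx + BLOCK_SIZE] = A[(row + BLOCK_SIZE) * n + m + tx + BLOCK_SIZE];
        sB[ty][tx]                           = B[(m + ty) * n + col];
        sB[ty][tx + BLOCK_SIZE]              = B[(m + ty) * n + col + BLOCK_SIZE];
        sB[ty + BLOCK_SIZE][tx]              = B[(m + ty + BLOCK_SIZE) * n + col];
        sB[ty + BLOCK_SIZE][tx + BLOCK_SIZE] = B[(m + ty + BLOCK_SIZE) * n + col + BLOCK_SIZE];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < FOOTPRINT_SIZE; ++k) {
            c00 += sA[ty][k] * sB[k][tx];
            c01 += sA[ty][k] * sB[k][tx + BLOCK_SIZE];
            c10 += sA[ty + BLOCK_SIZE][k] * sB[k][tx];
            c11 += sA[ty + BLOCK_SIZE][k] * sB[k][tx + BLOCK_SIZE];
        }
        __syncthreads();
    }

    // Coalesced writes: consecutive tx values write consecutive addresses.
    C[row * n + col]                            = c00;
    C[row * n + col + BLOCK_SIZE]               = c01;
    C[(row + BLOCK_SIZE) * n + col]             = c10;
    C[(row + BLOCK_SIZE) * n + col + BLOCK_SIZE] = c11;
}
```

This layout does four times the arithmetic per thread while reusing each shared-memory tile for four outputs, which is where the speedup comes from.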
Here is an example of running my matmult01 on figs:
figs 28>matmult01 16
Data dimensions: 512x512
Grid Dimensions: 16x16
Block Dimensions: 16x16
Footprint Dimensions: 32x32
Time: 0.012185 (sec), nFlops: 268435456, GFlopsS: 22.030248
Notice that parameter 16 now creates a 16*32 = 512 square-matrix problem, because the footprint has become 32x32. Also notice the nice speedup.
Note the results of each experiment in your report.