For this assignment, you will modify two provided CUDA kernels: one from a program that performs vector addition, and one from a program that performs matrix multiplication. The provided vector addition program does not coalesce its memory accesses; you will modify it so that it does. For the matrix multiplication program, you will investigate its performance and how it can be optimized by changing how the task is parallelized. You will create an improved version of the program, empirically measure its running time, and analyze the results of your timing tests.
You will turn in all code you have written for this assignment, a makefile that will compile your code, and a PDF report documenting your experiments and results. Name the report using your first and last names, in the form FirstnameLastnameCUDA1.pdf. Your grade will be based on the clarity of the report (30 pts), the performance of your code (40 pts), and your analysis of your results (30 pts). Your analysis must state clearly what gave you improved performance.
Resources that may be useful for this assignment include:
For this assignment, you will need to use a Linux machine with an NVIDIA graphics card. In the CSU labs, the machines in rooms 215, 225, and 315 have such a graphics card. In general, the machines in 315 are preferred for use with CUDA. These machines are: dates, figs, grapes, huckleberries, kiwis, lemons, melons, nectarines, peaches, pears, raspberries, pomegranates, kumquats, bananas, coconuts, apples, and oranges.
You should use all but bananas, coconuts, apples, and oranges for your development and debugging, then use those four machines for your timing tests; those four machines have more powerful Tesla GPU cards. You may use your own personal computer for this assignment, but the materials you turn in should run without alteration on the lab machines. The instructor for this class cannot assist you in configuring your personal computer to run CUDA programs.
You can use ssh to work on the lab machines remotely, since your work with CUDA does not require access to the main console of the machine you are using (you will not be doing any graphics). To find other machines suitable for this assignment (those in rooms 215 and 225), consult the list of CS department machines noted above. Note that the machines in room 120 (those named after vegetables) do not have NVIDIA graphics cards and are not suitable for this assignment.
To use CUDA on the lab machines at CSU, you will need to set the right environment variables. The lab machines have been upgraded to CUDA 5 and Fedora 17; however, CUDA 5 requires a gcc version earlier than 4.7, while Fedora 17's default gcc is 4.7. Therefore, to compile CUDA code, you need to switch to gcc 4.6.3.
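One common way to do this is to point nvcc at the older compiler with its --compiler-bindir (short form -ccbin) option; the install path below is hypothetical, so check where gcc 4.6.3 actually lives on the lab machines:

nvcc -ccbin /usr/local/gcc-4.6.3/bin -o vecadd00 vecadd00.cu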
It's convenient to edit your .cshrc or your .profile file (depending on whether you are using tcsh or bash) so these variables are set each time you log in.
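The exact values depend on where CUDA is installed on the lab machines; the /usr/local/cuda path below is an assumption, so verify it before using these lines. For tcsh, a typical .cshrc addition looks like:

setenv PATH ${PATH}:/usr/local/cuda/bin
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64

and for bash, a typical .profile addition looks like:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64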
To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided makefile as a starting point for your makefile for this assignment. Initially, you will use the provided makefile to verify that your environment is correctly configured to use CUDA. When you work on the remainder of the assignment, you will use the makefile to keep a record or notebook of the versions of your code and the results of your experiments. If you are not familiar with the make command or makefiles, you may want to review the documentation for make.
You should create a directory for this assignment, copy the provided files into it, and cd into it. Then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile both of the provided programs by typing
make
or you can compile each program separately by typing
make vecadd00 or make matmult00
Then use ls to verify that executable programs were generated, and run them by typing (for example)
vecadd00 16 or matmult00 16
Here is a set of provided files:
Compile and run the provided program vecadd00 and collect data on the time the program takes with the following numbers of values per thread: 500, 1000, 2000, 4000, and 8000. Also try the closest powers of two (e.g., 512, 1024, etc.). Some of the larger problem sizes may only work on the larger (Tesla) GPUs (apples, oranges, bananas, and coconuts).
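If you need finer-grained measurements than whole-program timing, one option is CUDA events. The sketch below is illustrative only; the kernel launch is a placeholder for whatever kernel and launch configuration your program actually uses:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernel you want to time here, e.g.
// AddVectors<<<grid, block>>>(d_A, d_B, d_C, valuesPerThread);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);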
Include a short comment in your makefile describing what your various programs do, and fully describe and analyze the experiments you performed in your report.
In vecadd01 you will use a new device kernel, vecAddKernel01.cu, that performs vector addition using coalesced memory reads; a sketch of the access pattern appears below. Change the makefile to compile a program called vecadd01 that uses your kernel rather than the provided one, and modify the makefile as appropriate so the clean and tar targets handle any files you have added (there should be just one in this case, and the changes should amount to uncommenting existing lines and adding vecadd01 to the EXECS variable). Test your new program and compare the time it takes to perform the same work as the original. Note and analyze your results in your makefile.
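For reference, here is a minimal sketch of the coalesced pattern. The signature is an assumption and the provided vecAddKernel00 may differ; the key idea is that each thread strides through the array by the total thread count, so on every pass consecutive threads touch consecutive elements:

// Sketch of a coalesced vector-add kernel.  N is the number of values
// each thread adds; the array length is N * (total threads in the grid).
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int totalThreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // On iteration i, thread t touches element t + i * totalThreads, so
    // the threads of a warp access consecutive addresses and the global
    // loads and stores coalesce.
    for (int i = 0; i < N; ++i) {
        int idx = tid + i * totalThreads;
        C[idx] = A[idx] + B[idx];
    }
}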
Here is a set of provided files: (you still need the Makefile)
You should investigate how each of the following factors influences performance of matrix multiplication in CUDA:
When run with a single parameter, the provided code multiplies that parameter by BLOCK_SIZE (set to 16) and creates square matrices of the resulting size; for example, a parameter of 8 produces 128 x 128 matrices. Thus, you should run the provided code with initial parameters of 8, 16, 32, 64, 128, 256, and 512 (also try values that are not powers of two). This will provide you with initial data on how the size of the matrices influences run time.
In your new kernel, each thread computes four values in the resulting C block rather than one. Time the execution of this new program with matrices of the sizes listed above to document how your changes affect performance. To get good performance, you will need to make sure the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops you might be inclined to insert into your code. In some cases you may be able to use #pragma unroll; in others, you will probably need to unroll the loops by hand. A sketch of the general pattern follows.
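The sketch below is a hedged illustration of the four-outputs-per-thread pattern, not the assignment's required structure: it assumes raw row-major float arrays and N a multiple of 4 * BLOCK_SIZE, whereas the provided code may pass a Matrix struct instead. Each 16 x 16 thread block computes a 16 x 64 tile of C, so every thread produces four values:

#define BLOCK_SIZE 16

// Illustrative kernel: A, B, and C are row-major N x N arrays.
// Launch with dim3 grid(N / (4 * BLOCK_SIZE), N / BLOCK_SIZE) and
// dim3 block(BLOCK_SIZE, BLOCK_SIZE).
__global__ void MatMulFourPerThread(const float* A, const float* B,
                                    float* C, int N)
{
    __shared__ float sA[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE][4 * BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * 4 * BLOCK_SIZE + threadIdx.x;

    // Four accumulators, one per output element, held in registers.
    float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f, c3 = 0.0f;

    for (int m = 0; m < N / BLOCK_SIZE; ++m) {
        // Load one tile of A and four adjacent tiles of B.  In every
        // statement, consecutive threadIdx.x values read consecutive
        // addresses, so all of these global loads coalesce.
        sA[threadIdx.y][threadIdx.x] = A[row * N + m * BLOCK_SIZE + threadIdx.x];
        int bRow = (m * BLOCK_SIZE + threadIdx.y) * N;
        sB[threadIdx.y][threadIdx.x                 ] = B[bRow + col];
        sB[threadIdx.y][threadIdx.x +     BLOCK_SIZE] = B[bRow + col + BLOCK_SIZE];
        sB[threadIdx.y][threadIdx.x + 2 * BLOCK_SIZE] = B[bRow + col + 2 * BLOCK_SIZE];
        sB[threadIdx.y][threadIdx.x + 3 * BLOCK_SIZE] = B[bRow + col + 3 * BLOCK_SIZE];
        __syncthreads();

        // BLOCK_SIZE is a compile-time constant, so #pragma unroll lets
        // the compiler unroll this inner product completely.
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            float a = sA[threadIdx.y][k];   // reused for all four outputs
            c0 += a * sB[k][threadIdx.x];
            c1 += a * sB[k][threadIdx.x + BLOCK_SIZE];
            c2 += a * sB[k][threadIdx.x + 2 * BLOCK_SIZE];
            c3 += a * sB[k][threadIdx.x + 3 * BLOCK_SIZE];
        }
        __syncthreads();
    }

    // Each of these four stores is coalesced across the warp because a
    // thread's four outputs are BLOCK_SIZE columns apart.
    C[row * N + col]                  = c0;
    C[row * N + col + BLOCK_SIZE]     = c1;
    C[row * N + col + 2 * BLOCK_SIZE] = c2;
    C[row * N + col + 3 * BLOCK_SIZE] = c3;
}

Spreading a thread's four outputs BLOCK_SIZE columns apart (rather than giving each thread four adjacent elements) is what keeps the accesses coalesced, while each element of A is loaded once and reused for all four outputs from a register.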
Note the results of each experiment in your Lab Notes. The type of code you have been asked to write can reach speeds of about 220 GFLOPS on a Tesla C1060. Can you reach that figure?
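To convert your timings into this figure: multiplying two N x N matrices takes 2 * N^3 floating-point operations (one multiply and one add for each of the N^3 inner-product steps), so GFLOPS = 2 * N^3 / (time in seconds * 10^9).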
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
Last updated January 2013