For this assignment, you will modify two provided CUDA kernels: one from a program that performs vector addition, and one from a program that performs matrix multiplication. The provided vector addition program does not coalesce its memory accesses; you will modify it so that it does. For the matrix multiplication program, you will investigate its performance and how it can be optimized by changing how the task is parallelized. You will create an improved version of the program, empirically measure its running time, and analyze the results of your timing tests.
You will turn in all code you have written for this assignment, a makefile that will compile your code, and a PDF report documenting your experiments and results. Name the report using your first and last names, in the form FirstnameLastnameCUDA1.pdf. Your grade will be based on the clarity of the report (30 pts), the performance of your code (40 pts), and your analysis of your results (30 pts). Your analysis must state clearly what gave you improved performance.
Resources that may be useful for this assignment include:
For this assignment, you will need to use a Linux machine with an NVIDIA graphics card. In the CSU labs, the machines in rooms 215, 225, and 315 have such a graphics card. In general, the machines in 315 are preferred for use with CUDA. These machines are: dates, figs, grapes, huckleberries, kiwis, lemons, melons, nectarines, peaches, pears, raspberries, pomegranates, kumquats, bananas, coconuts, apples, and oranges.
You should use all but bananas, coconuts, apples, and oranges for your development and debugging, then use those four machines for your timing tests; those four machines have more powerful Tesla GPU cards. You may use your own personal computer for this assignment, but the materials you turn in should run without alteration on the lab machines. The instructor for this class cannot assist you in configuring your personal computer to run CUDA programs.
You can use ssh to work on the lab machines remotely, since your work with CUDA does not require access to the main console of the machine you are using (you will not be doing any graphics). To find other machines suitable for this assignment (those in rooms 215 and 225), consult the list of CS department machines noted above. Note that the machines in room 120 (those named after vegetables) do not have NVIDIA graphics cards and are not suitable for this assignment.
To use CUDA on the lab machines at CSU, you will need to set the right environment variables. The lab machines have been upgraded to CUDA 5 and Fedora 17; however, CUDA 5 requires a gcc version earlier than 4.7, while Fedora 17's default gcc is 4.7. Therefore, to compile CUDA code, you need to switch to gcc 4.6.3.
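One common way to do this is to point nvcc at the older compiler with its --compiler-bindir (short form -ccbin) option; the install path below is hypothetical, so check where gcc 4.6.3 actually lives on the lab machines:

nvcc -ccbin /usr/local/gcc-4.6.3/bin -o vecadd00 vecadd00.cu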
It's convenient to edit your .cshrc or your .profile file (depending on whether you are using tcsh or bash) so these variables are set each time you log in.
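The exact values depend on where CUDA is installed on the lab machines; the /usr/local/cuda path below is an assumption, so verify it before using these lines. For tcsh, a typical .cshrc addition looks like:

setenv PATH ${PATH}:/usr/local/cuda/bin
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64

and for bash, a typical .profile addition looks like:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64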
To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided makefile as a starting point for your makefile for this assignment. Initially, you will use the provided makefile to verify that your environment is correctly configured to use CUDA. When you work on the remainder of the assignment, you will use the makefile to keep a record or notebook of the versions of your code and the results of your experiments. If you are not familiar with the make command or makefiles, you may want to review the documentation for make.
You should create a directory for this assignment, copy the provided files into it, and cd into it. Then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile both of the provided programs by typing
make
or you can compile each program separately by typing
make vecadd00 or make matmult00
Then use ls to verify that executable programs were generated, and run them by typing (for example)
vecadd00 16 or matmult00 16
Here is a set of provided files:
Compile and run the provided program vecadd00 and collect data on the time the program takes with the following numbers of values per thread: 500, 1000, 2000, 4000, and 8000. Also try the closest powers of two (e.g., 512, 1024, etc.). Some of the larger problem sizes may only work on the larger (Tesla) GPUs (apples, oranges, bananas, and coconuts).
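If you need finer-grained measurements than whole-program timing, one option is CUDA events. The sketch below is illustrative only; the kernel launch is a placeholder for whatever kernel and launch configuration your program actually uses:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernel you want to time here, e.g.
// AddVectors<<<grid, block>>>(d_A, d_B, d_C, valuesPerThread);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);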
Include a short comment in your makefile describing what your various programs do, and fully describe and analyze the experiments you performed in your report.
In vecadd01 you will use a new device kernel, vecAddKernel01.cu, that performs vector addition using coalesced memory reads; a sketch of the access pattern appears below. Change the makefile to compile a program called vecadd01 that uses your kernel rather than the provided one, and modify the makefile as appropriate so the clean and tar targets handle any files you have added (there should be just one in this case, and the changes should amount to uncommenting existing lines and adding vecadd01 to the EXECS variable). Test your new program and compare the time it takes to perform the same work as the original. Note and analyze your results in your makefile.
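For reference, here is a minimal sketch of the coalesced pattern. The signature is an assumption and the provided vecAddKernel00 may differ; the key idea is that each thread strides through the array by the total thread count, so on every pass consecutive threads touch consecutive elements:

// Sketch of a coalesced vector-add kernel.  N is the number of values
// each thread adds; the array length is N * (total threads in the grid).
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int totalThreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // On iteration i, thread t touches element t + i * totalThreads, so
    // the threads of a warp access consecutive addresses and the global
    // loads and stores coalesce.
    for (int i = 0; i < N; ++i) {
        int idx = tid + i * totalThreads;
        C[idx] = A[idx] + B[idx];
    }
}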
Here is a set of provided files: (you still need the Makefile)
You should investigate how each of the following factors influences performance of matrix multiplication in CUDA:
When run with a single parameter, the provided code multiplies that parameter by BLOCK_SIZE (set to 16) and creates square matrices of the resulting size; for example, a parameter of 8 produces 128 x 128 matrices. Thus, you should run the provided code with initial parameters of 8, 16, 32, 64, 128, 256, and 512 (also try values that are not powers of two). This will provide you with initial data on how the size of the matrices influences run time.
In your new kernel, each thread computes four values in the resulting C block rather than one. Time the execution of this new program with matrices of the sizes listed above to document how your changes affect performance. To get good performance, you will need to make sure the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops you might be inclined to insert into your code. In some cases you may be able to use #pragma unroll; in others, you will probably need to unroll the loops by hand. A sketch of the general pattern follows.
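The sketch below is a hedged illustration of the four-outputs-per-thread pattern, not the assignment's required structure: it assumes raw row-major float arrays and N a multiple of 4 * BLOCK_SIZE, whereas the provided code may pass a Matrix struct instead. Each 16 x 16 thread block computes a 16 x 64 tile of C, so every thread produces four values:

#define BLOCK_SIZE 16

// Illustrative kernel: A, B, and C are row-major N x N arrays.
// Launch with dim3 grid(N / (4 * BLOCK_SIZE), N / BLOCK_SIZE) and
// dim3 block(BLOCK_SIZE, BLOCK_SIZE).
__global__ void MatMulFourPerThread(const float* A, const float* B,
                                    float* C, int N)
{
    __shared__ float sA[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE][4 * BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * 4 * BLOCK_SIZE + threadIdx.x;

    // Four accumulators, one per output element, held in registers.
    float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f, c3 = 0.0f;

    for (int m = 0; m < N / BLOCK_SIZE; ++m) {
        // Load one tile of A and four adjacent tiles of B.  In every
        // statement, consecutive threadIdx.x values read consecutive
        // addresses, so all of these global loads coalesce.
        sA[threadIdx.y][threadIdx.x] = A[row * N + m * BLOCK_SIZE + threadIdx.x];
        int bRow = (m * BLOCK_SIZE + threadIdx.y) * N;
        sB[threadIdx.y][threadIdx.x                 ] = B[bRow + col];
        sB[threadIdx.y][threadIdx.x +     BLOCK_SIZE] = B[bRow + col + BLOCK_SIZE];
        sB[threadIdx.y][threadIdx.x + 2 * BLOCK_SIZE] = B[bRow + col + 2 * BLOCK_SIZE];
        sB[threadIdx.y][threadIdx.x + 3 * BLOCK_SIZE] = B[bRow + col + 3 * BLOCK_SIZE];
        __syncthreads();

        // BLOCK_SIZE is a compile-time constant, so #pragma unroll lets
        // the compiler unroll this inner product completely.
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            float a = sA[threadIdx.y][k];   // reused for all four outputs
            c0 += a * sB[k][threadIdx.x];
            c1 += a * sB[k][threadIdx.x + BLOCK_SIZE];
            c2 += a * sB[k][threadIdx.x + 2 * BLOCK_SIZE];
            c3 += a * sB[k][threadIdx.x + 3 * BLOCK_SIZE];
        }
        __syncthreads();
    }

    // Each of these four stores is coalesced across the warp because a
    // thread's four outputs are BLOCK_SIZE columns apart.
    C[row * N + col]                  = c0;
    C[row * N + col + BLOCK_SIZE]     = c1;
    C[row * N + col + 2 * BLOCK_SIZE] = c2;
    C[row * N + col + 3 * BLOCK_SIZE] = c3;
}

Spreading a thread's four outputs BLOCK_SIZE columns apart (rather than giving each thread four adjacent elements) is what keeps the accesses coalesced, while each element of A is loaded once and reused for all four outputs from a register.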
Note the results of each experiment in your Lab Notes. The type of code you have been asked to write can reach speeds of about 220 GFLOPS on a Tesla C1060. Can you reach that figure?
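To convert your timings into this figure: multiplying two N x N matrices takes 2 * N^3 floating-point operations (one multiply and one add for each of the N^3 inner-product steps), so GFLOPS = 2 * N^3 / (time in seconds * 10^9).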
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
Last updated January 2013