Lab 4: Getting Started with Intel Tools

Goal:

This lab will let you try some of the frequently used Intel tools on the CS department machines.

References:

Source:

Background:

The Intel Math Kernel Library (MKL) helps you achieve maximum performance with a math computing library of highly optimized, extensively parallelized routines for CPU and GPU. In this lab, we will focus on the matrix multiplication / dgemm kernel.
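As a point of comparison for the MKL routine, here is a minimal, unoptimized sketch of the operation dgemm performs, C = alpha * A * B + beta * C, restricted to square row-major matrices for simplicity. The name naive_dgemm is hypothetical and is not part of MKL. This is not necessarily the same as the simple matrix multiply code provided with the lab.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (unoptimized) version of the dgemm operation:
 * C = alpha * A * B + beta * C, for row-major n x n matrices.
 * naive_dgemm is an illustrative name, not an MKL routine. */
void naive_dgemm(size_t n, double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

The triple loop performs 2 * n^3 floating-point operations, which is the count usually used when converting a measured time into GFLOP/s.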

The Roofline Toolkit provides a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks. A Roofline chart produces a plot of GFLOPs vs the Arithmetic Intensity as shown below:
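Separately from the chart itself, the bound that a Roofline plot draws can be sketched in a few lines of C: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The function names and the peak and bandwidth parameters are assumptions for illustration; the real values depend on the machine you run on.

```c
#include <assert.h>

/* Roofline bound: performance is capped either by the compute peak or by
 * memory bandwidth times arithmetic intensity.
 * peak_gflops: compute peak in GFLOP/s (machine-specific, assumed known)
 * bw_gbytes:   memory bandwidth in GB/s (machine-specific, assumed known)
 * ai:          arithmetic intensity in FLOP/byte */
double roofline_gflops(double peak_gflops, double bw_gbytes, double ai)
{
    assert(peak_gflops > 0.0 && bw_gbytes > 0.0 && ai >= 0.0);
    double memory_roof = bw_gbytes * ai;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Ridge point: the arithmetic intensity at which the two roofs meet. */
double ridge_point(double peak_gflops, double bw_gbytes)
{
    assert(peak_gflops > 0.0 && bw_gbytes > 0.0);
    return peak_gflops / bw_gbytes;
}
```

A kernel whose arithmetic intensity lies to the left of the ridge point sits under the sloped (bandwidth) roof, and one to the right sits under the flat (compute) roof.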

Tasks:

Task A: Testing MKL kernels
Steps:

Load the right module using the command: module load compilers/icc
Download the mkl-samples and the simple matrix multiply code from the sources section of this document
Extract the downloaded files and enter the directory.
Run the following two commands to compile the files: make; make run_dgemm_example
Now run the following command to do some testing: ./release/matrix_multiplication

Submission: Compare the execution time of the simple matrix multiply code and the dgemm code (MKL) for similarly sized inputs. Report approximate execution times, repeating each run as needed so that major run-to-run variation does not skew the numbers.
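One common way to reduce the effect of run-to-run variation is to time the same kernel several times and keep the fastest run. The sketch below assumes a C11 compiler (for timespec_get); the names min_time_seconds, sum_args, and sum_kernel are hypothetical, and sum_kernel is only a stand-in for the matrix multiply you actually want to time.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run kernel(arg) reps times and return the fastest run in seconds. */
double min_time_seconds(void (*kernel)(void *), void *arg, int reps)
{
    assert(kernel != NULL && reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double start = wall_seconds();
        kernel(arg);
        double elapsed = wall_seconds() - start;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

/* Stand-in kernel (hypothetical): sums the integers 1..n. */
struct sum_args { long n; long result; };

void sum_kernel(void *arg)
{
    struct sum_args *a = arg;
    long total = 0;
    for (long i = 1; i <= a->n; i++) {
        total += i;
    }
    a->result = total;
}
```

Wall-clock time is used rather than clock() because clock() reports CPU time, which would add up the time of all threads in a multithreaded library call.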

Task B: Testing the Roofline Toolkit
Steps:

Load the right module using the command: module load dev/intel-advisor
Use the following command to launch the toolkit: advixe-gui. If you are working over an SSH session, make sure X display forwarding is set up for this task.
Create a new project and use the compiled binaries created in Task A.
Tests with higher overhead are repeated a few times so that the reported results are consistent. The Roofline chart is displayed after the tests complete.
The red circles in the chart represent the parts of the code where the most time is spent. Hover over a circle to see more details.
Look at the sources section and try to identify the part of the code that consumes the most time. (The -g flag must be enabled during compilation to see this.)

Screenshots showing how to run the Roofline Toolkit: Roofline MKL Example

Submission: Based on the image in the Background section, determine whether the tested binaries are bandwidth bound or compute bound. (Hint: Try experimenting with sizes small enough to fit in cache and with sizes large enough to spill into DRAM.)
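To pick sizes on either side of the cache boundary mentioned in the hint, it can help to estimate the working set of a matrix multiply. The sketch below assumes three square n x n matrices of doubles; the function names are hypothetical, and the cache size is machine-specific (you can look it up with a tool such as lscpu).

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by the three n x n double matrices in C = A * B. */
size_t matmul_working_set_bytes(size_t n)
{
    return 3 * n * n * sizeof(double);
}

/* Returns 1 if all three matrices fit in a cache of cache_bytes, else 0.
 * cache_bytes is an assumption you supply for your own machine. */
int fits_in_cache(size_t n, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return matmul_working_set_bytes(n) <= cache_bytes;
}
```

For example, with 8-byte doubles, n = 100 needs 240,000 bytes and fits in a 1 MiB cache, while n = 1000 needs 24,000,000 bytes and does not.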

Task C: Profile Your Own Code Using the Roofline Toolkit
Steps:

Consider the provided (sequential) code for the 2D stencil, for PA1.
Modify the compilation of this project so that it uses icc instead of gcc.
Use advixe-gui to profile the original and optimized versions of the code from this project.
Revisit Lab2 and do the same there: profile each of the different permutations and compare their plots.
Optional: Study what happens with any tiled version of syr2k and report what you observe.

Submission:

What were the bottlenecks of this project, i.e., did you identify which line was limiting performance? How can you be sure?
How did your optimization(s) change the bottleneck?
What is the best performance achieved, as a percentage of TMP?
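For the last question, a minimal sketch of the arithmetic is given below. It assumes that TMP refers to the machine's theoretical maximum (peak) performance, which this handout does not spell out, and that you have counted the floating-point operations your kernel performs. The function names are hypothetical.

```c
#include <assert.h>

/* Achieved performance in GFLOP/s from a FLOP count and a measured time. */
double achieved_gflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1e9;
}

/* Achieved performance as a percentage of the peak.
 * peak is assumed to be the theoretical maximum in the same units. */
double percent_of_peak(double achieved, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved / peak;
}
```

For instance, a kernel that performs 2e9 floating-point operations in 0.5 seconds achieves 4 GFLOP/s, which is 25 percent of a hypothetical 16 GFLOP/s peak.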

What to Turn In

Work with your group to write a report containing your results from the three Submission sections from the tasks above. If you had any "aha moments," please describe what the particular insights were. Submit it to Canvas.

After you are done, you may find it useful to watch one of the YouTube videos, like this, this, or this, about how to "read" a logarithmic plot.

Created by Vidit Save 1/30/2023