SA-C Image Processing Library Performance

Introduction

To test the performance of SA-C programs on simple image processing procedures, we implemented 32 routines to exactly match the API of routines from Intel's Image Processing Library (IPL). These routines were verified by comparing their results to output from Intel's routines; the results match exactly, including the use of saturating arithmetic and (for some routines) rounding remainders of .5 up or down depending on the column number.
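To illustrate what matching the saturating arithmetic involves, here is a minimal Python sketch. It is not the SA-C or IPL source. It assumes unsigned 8-bit pixels and an AddS-style operation that adds a scalar constant to every pixel; the function names are hypothetical. The column-dependent rounding rule is not modeled, since its details are not given here.

```python
def add_scalar_saturating(pixels, value, lo=0, hi=255):
    """Add `value` to every pixel, clamping to [lo, hi] instead of wrapping."""
    return [min(hi, max(lo, p + value)) for p in pixels]


def outputs_match(a, b):
    """Exact, element-by-element comparison of two result images."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


row = [0, 100, 200, 250]
print(add_scalar_saturating(row, 10))   # 250 + 10 clamps to 255
print(add_scalar_saturating(row, -50))  # 0 - 50 clamps to 0
```

An exact comparison such as outputs_match is the kind of check the verification describes: any pixel that wraps instead of clamping would show up as a mismatch.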

The comparisons were made by compiling the SA-C routines with the November 2000 version of the SA-C compiler and executing them on an Annapolis Microsystems StarFire with a Xilinx XV-1000 FPGA. The Intel IPL routines were executed under Windows NT on a 450MHz Pentium II. (We believe the XV-1000 and the 450MHz Pentium II are of approximately the same age.) The test images were 8-bit 512x512 images.

Execution times are reported in seconds, as are the data download and upload times for the reconfigurable system (RCS). FPGA clock frequencies are reported in MHz.

Performance Results


Routine	Pentium Exec. (s)	RCS Exec. (s)	RCS data download (s)	RCS data upload (s)	Frequency (MHz)
AddS	0.081531	0.008355	0.02221	0.03292	39.5
And	0.003179	0.008492	0.04418	0.03298	38.9
AndS	0.001865	0.008331	0.02222	0.03275	39.6
Close	0.018069	0.012800	0.02337	0.03308	25.0
Convolve2D	0.006548	0.006624	0.02341	0.03385	25.1
Dilate	0.011578	0.028910	0.02376	0.03385	25.0
Erode	0.016764	0.028910	0.02375	0.03385	25.0
Gaussian3x3	0.005670	0.006637	0.03386	0.03390	25.1
Greater	0.000109	0.008461	0.04430	0.00413	39.0
GreaterS	0.011431	0.009563	0.02220	0.02238	34.5
LShiftS	0.001469	0.008537	0.02239	0.02256	38.7
Less	0.000074	0.008438	0.04434	0.00413	39.1
LessS	0.011567	0.008179	0.02176	0.00409	40.4
MaxFilter	0.005189	0.021259	0.02304	0.03341	28.2
MinFilter	0.005328	0.021755	0.02304	0.03342	27.5
Multiply	0.003541	0.009055	0.04322	0.03306	36.4
MultiplyS	0.039078	0.008707	0.02228	0.03291	37.9
MultiplySScale	0.002057	0.009470	0.02224	0.03293	34.9
MultiplyScale	0.003659	0.011854	0.04510	0.03336	27.8
NormC	0.001053	0.010593		0.00008	31.0
NormL1	0.002023	0.009099		0.00008	36.0
Not	0.001883	0.008530	0.02227	0.03291	38.7
Open	0.017859	0.034167	0.02386	0.03380	25.0
Or	0.003093	0.008492	0.04490	0.03289	38.6
OrS	0.001976	0.008331	0.02220	0.03272	39.6
RShiftS	0.001359	0.008537	0.02233	0.03302	38.7
Square	0.045160	0.008530	0.02613	0.03297	38.7
Subtract	0.030182	0.007858	0.04330	0.03248	42.0
SubtractS	0.001854	0.008546	0.02226	0.03293	38.6
Threshold	0.001251	0.007806	0.02223	0.03703	42.0
Xor	0.002554	0.009390	0.04451	0.03302	35.1
XorS	0.001367	0.008331	0.02183	0.03272	39.6

Analysis

As you would expect, the Pentium II generally outperforms the reconfigurable system on simple image processing operators. This is because these tasks are I/O bound: the I/O paths on the FPGAs are no wider than those on the Pentium, and they operate at a slower clock speed. As a result, the FPGA is unable to exploit its advantage in parallelism. (See ARAGTAP for an example of the reconfigurable system outperforming the Pentium on more complex tasks.)
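One rough way to look at the I/O bound is to convert an RCS execution time into FPGA clock cycles per pixel. The Python sketch below does this for the And row of the table. It assumes the reported clock frequency held for the whole run; the helper name is hypothetical.

```python
PIXELS = 512 * 512  # one 8-bit 512x512 test image


def cycles_per_pixel(exec_seconds, freq_mhz, pixels=PIXELS):
    """FPGA clock cycles spent per output pixel for one routine."""
    cycles = exec_seconds * freq_mhz * 1e6
    return cycles / pixels


# And: 0.008492 s at 38.9 MHz
print(round(cycles_per_pixel(0.008492, 38.9), 2))  # about 1.26
```

The result is a little over one cycle per pixel for a simple point operation, which is consistent with the routine being paced by moving pixels rather than by the arithmetic itself.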

Readers should also note the other numbers provided here: the time required to download the source image(s) to the RCS, and the time required to upload the output back to the host. These times are artifacts of the FPGAs being on a separate co-processor board, while the Pentium is the main processor. Future reconfigurable systems may have the FPGAs and RISC processors on the same chip, in which case these transfer times would go away. In the meantime, it takes about 0.02 seconds to download a 512x512 8-bit image across our PCI bus; operators that take two images as arguments have twice that download time. Upload times depend on whether the result is an image or a single value and, if it is an image, on whether it is binary, 8-bit, or wider.
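The transfer figures can be folded into a rough end-to-end estimate. The Python sketch below, with hypothetical helper names, computes the effective download bandwidth implied by the 0.02 second figure and sums download, execution, and upload for the And row of the table. It assumes the three phases run back to back with no overlap, which may not hold on the actual system.

```python
IMAGE_BYTES = 512 * 512  # 8-bit pixels, one byte each


def effective_bandwidth_mb_s(image_bytes, seconds):
    """Effective transfer rate in megabytes (10**6 bytes) per second."""
    return image_bytes / seconds / 1e6


def rcs_total_seconds(download, execute, upload):
    """End-to-end RCS time, assuming no overlap between the phases."""
    return download + execute + upload


# About 0.02 s to download one image across the PCI bus.
print(round(effective_bandwidth_mb_s(IMAGE_BYTES, 0.02), 1))  # about 13.1

# And row: download 0.04418 s (two source images), execute 0.008492 s,
# upload 0.03298 s; the Pentium takes 0.003179 s for the same routine.
total = rcs_total_seconds(0.04418, 0.008492, 0.03298)
print(round(total, 6))  # 0.085652
```

For this row the transfers account for most of the end-to-end time, which is why the table reports them separately from execution.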
