A tridiagonal matrix system is an equation of the form Ax=b, where x and b are vectors, and A is a tridiagonal matrix. In other words, A is necessarily square, and has non-zero entries only along its diagonal and immediately adjacent to its diagonal. The goal is to find x, given A and b.
This is not the type of application SA-C was originally designed for, but two of our colleagues from the Colorado State University Department of Atmospheric Science (Graeme Stephens and Philip Gabriel) suggested that the SA-C compiler could be applied to natural resource problems, and suggested that we look at solving systems of tridiagonal matrices as a start. We thank them for their interest and advice.
We have implemented a simple iterative algorithm (suggested by Philip Gabriel) for solving diagonally dominant tridiagonal matrices. The source code below contains initial code for reading the input matrix and vector, and for computing various terms for debugging purposes. The piece that solves the linear system and runs on the FPGA is clearly commented in the middle of the code. The performance results below are with regard to this section of the code.
#define SIZE 8
#define ITERATIONS 10
fix16.14[SIZE] main (float A[SIZE,SIZE], float X[SIZE]) {
bool debug = true;
uint8 rA, uint8 cA = extents (A);
uint8 rX = extents (X);
assert ((rA==cA), "A not square");
assert ((rA==rX), "Sizes A and X not compliant");
print(debug,"A: ", A);
print(debug,"X: ", X);
float B[:] =
for VA (~,:) in A {
float ip = for a in VA dot x in X
return (sum (a * x));
} return ( vector(ip));
print(debug,"B = A.X: ", B);
float Dinv[:] =
for uint8 i in [rA] return(array((float)(1.0)/A[i,i]));
float L[:], float U[:] =
for uint8 i in [rA] {
float Li = i==0 ? (float)(0.0) : A[i,i-1];
float Ui = i==rA-1 ? (float)(0.0) : A[i,i+1];
} return(array(Li),array(Ui) );
// DiB = Dinv*B DiL,DiU = Dinv*(L+U)
float DiB[:], float DiL[:], float DiU[:] =
for d in Dinv dot b in B dot l in L dot u in U
return( array(d*b), array(d*l), array(d*u) );
print(debug,"DiB: ", DiB);
print(debug,"DiL: ", DiL);
print(debug,"DiU: ", DiU);
fix16.14 DiBtc[:] = DiB;
fix16.14 DiUtc[:] = DiU;
fix16.14 DiLtc[:] = DiL;
print(debug,"DiBtc: ", DiBtc);
print(debug,"DiLtc: ", DiLtc);
print(debug,"DiUtc: ", DiUtc);
fix16.14 Y[:] = DiB; // This goes to the FPGA
fix16.14 res[:] =
// PRAGMA(no_unroll)
for uint8 iter in [ITERATIONS] {
fix16.14 Yperim[:] =
for uint8 i in [rA+2]
return(array(i==0 || i==(rA+1) ? (fix16.14)(0.0) : Y[i-1]));
// Compute next Y = DiB - (DiL+DiU) * Y
next Y =
for b in DiBtc dot l in DiLtc dot u in DiUtc dot window WY[3] in Yperim
return(array(b - (l*WY[0] + u*WY[2])));
}return(final(Y)); // End of FPGA code.
print(debug,"Iterative solution of A.X = B: ");
}return(res);
Performance was tested on an 8x8 diagonally dominant tridiagonal matrix. The reconfigurable processor was an Annapolis Microsystems (AMS) Starfire, containing a single Xilinx Virtex 1000 series FPGA. For comparison purposes, we ran an equivalent C program on an 800MHz Pentium III.
seconds (800 MHz P3)
|
microseconds (AMS)
|
Frequency |
Logic Blocks
|
2.3x10-6
|
0.5x10-6
|
25.2 MHz |
35%
|