This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. Two algorithms are simulated on the Trident processor: the standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine _GEBRD, which mixes vector, matrix-vector, and matrix operations. We show how to use the Trident parallel execution units, ring, and communication registers to perform effectively the vector, matrix-vector, and matrix operations needed to bidiagonalize a matrix. The number of clock cycles per FLOP is the metric used to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K×32K matrix with 128 Trident lanes, using matrix-vector operations in the standard Golub and Kahan algorithm gives a speedup of around 1.5 times over using vector operations. Using matrix operations in the _GEBRD subroutine, however, gives a speedup of around 3 times over vector operations and around 2 times over matrix-vector operations in the standard Golub and Kahan algorithm.
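For readers unfamiliar with the standard algorithm named in the abstract, the following is a minimal, unblocked pure-Python sketch of Householder bidiagonalization. It is an illustration only, not the Trident implementation or the LAPACK code; the function names `householder` and `bidiagonalize` are hypothetical, and the sketch assumes a real matrix with at least as many rows as columns.

```python
import math


def householder(x):
    """Return (v, beta) so that (I - beta * v v^T) x has zeros after its first entry."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return list(x), 0.0  # nothing to annihilate; beta = 0 gives the identity
    alpha = -math.copysign(norm, x[0])  # sign chosen to avoid cancellation
    v = list(x)
    v[0] -= alpha
    return v, 2.0 / sum(t * t for t in v)


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n, list of rows) to upper bidiagonal form.

    Returns a new matrix; the orthogonal factors are not accumulated.
    """
    a = [row[:] for row in a]
    m, n = len(a), len(a[0])
    for k in range(n):
        # Left reflection: zero column k below the diagonal.
        v, beta = householder([a[i][k] for i in range(k, m)])
        # Matrix-vector product: w = beta * A[k:, k:]^T v
        w = [beta * sum(v[i - k] * a[i][j] for i in range(k, m))
             for j in range(k, n)]
        # Rank-1 update: A[k:, k:] -= v w^T
        for i in range(k, m):
            for j in range(k, n):
                a[i][j] -= v[i - k] * w[j - k]
        if k < n - 2:
            # Right reflection: zero row k to the right of the superdiagonal.
            v, beta = householder(a[k][k + 1:])
            # Matrix-vector product: w = beta * A[k:, k+1:] v
            w = [beta * sum(a[i][j] * v[j - k - 1] for j in range(k + 1, n))
                 for i in range(k, m)]
            # Rank-1 update: A[k:, k+1:] -= w v^T
            for i in range(k, m):
                for j in range(k + 1, n):
                    a[i][j] -= w[i - k] * v[j - k - 1]
    return a
```

Each step consists of a matrix-vector product followed by a rank-1 update, which is why the standard algorithm is described as rich in matrix-vector operations. Blocked variants such as _GEBRD aggregate several of these updates so that most of the work becomes matrix-matrix operations.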
Index Terms:
Bidiagonal form, block Householder transformation, BLAS, scalable architecture
Citation:
Mostafa I. Soliman, Stanislav G. Sedukhin, "Matrix Bidiagonalization on the Trident Processor," International Parallel and Distributed Processing Symposium (IPDPS'03), p. 257b, 2003