Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms Using Block-Cyclic Data Distributions
|
Abstract—In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various $cyclic(B)$ distributions in contrast to the normally used physical track distribution $cyclic(B_d)$, where $B_d$ is the physical disk block size. We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas.
[1] 297 G.E. Blelloch,Vector Models for Data-Parallel Computing. The MIT Press, 1990.
[2] A. Choudhary, I. Foster, G. Fox, K. Kennedy, C. Kesselman, C. Koelbel, J. Saltz,, and M. Snir, “Languages, Compilers, and Runtime Systems Support for Parallel Input-Output,” Technical Report CCSF-39, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.
[3] T.H. Cormen, “Virtual Memory for Data-Parallel Computing,” PhD Thesis, Massachussetts Inst. of Technology, MIT/LCS/TR-559, 1992.
[4] T.H. Cormen and D. Kotz, “Integrating Theory and Practice in Parallel File Systems,” Technical Report PCS-TR93-188, Dept. of Math and Computer Science, Dartmouth College, Mar. 1993.
[5] D.L. Dai, S.K.S. Gupta, S.D. Kaushik, J.H. Lu, R.V. Singh, C.-H. Huang, P. Sadayappan,, and R.W. Johnson, “EXTENT: A Portable Programming Environment for Designing and Implementing High Performance Block Recursive Algorithms,” Proc. Supercomputing '94, pp. 49–58, 1994.
[6] J.O. Eklundh, “A Fast Computer Method for Matrix Transposing,” IEEE Trans. Computers, vol. 20, no. 7, pp. 801–803, July 1972.
[7] D.G. Feitelson, P.F. Corbett, Y. Hsu,, and J.-P. Prost, “Parallel I/O Systems and Interfaces for Parallel Computers,” Multiprocessor Systems—Design and Integration. C.-L. Wu, ed., World Scientific, 1995.
[8] J. Granta, M. Conner,, and R. Tolimieri, “Recursive Fast Algorithms and the Role of the Tensor Product,” IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 2,921–2,930, Dec. 1992.
[9] S.K.S. Gupta, “Synthesizing Communication-Efficient Distributed-Memory Parallel Programs for Block Recursive Algorithms,” PhD thesis, The Ohio State Univ., Mar. 1995.
[10] S.K.S. Gupta, Z. Li,, and J.H. Reif, “Generating Efficient Programs for Two-Level Memories from Tensor Products,” Proc. Seventh IASTED/ISMM Int'l Conf. Parallel and Distributed Computing and Systems, pp. 510–513, Washington D.C., Oct. 1995.
[11] C.-H. Huang, J.R. Johnson,, and R.W. Johnson, “Generating Parallel Programs from Tensor Product Formulas: A Case Study of Strassen's Matrix Multiplication Algorithm,” Proc. Int'l Conf. Parallel Processing 1992, pp. 104–108, Aug. 1992.
[12] J.R. Johnson, R.W. Johnson, D. Rodriguez,, and R. Tolimieri, “A Methodology for Designing, Modifying and Implementing Fourier Transform Algorithms on Various Architectures,” Circuits Systems and Signal Processing, vol. 9, no. 4, pp. 450–500, 1990.
[13] R.W. Johnson, C.-H. Huang,, and J.R. Johnson, “Multilinear Algebra and Parallel Programming,” J. Supercomputing, vol. 5, pp. 189–218, 1991.
[14] S.D. Kaushik et al., "Efficient Transposition Algorithms for Large Matrices," Proc. Supercomputing '93, IEEE Computer Soc. Press, Los Alamitos, Calif., 1993, pp. 656-665.
[15] S.D. Kaushik, C.-H. Huang, R.W. Johnson,, and P. Sadayappan, “A Methodology for Generating Efficient Disk-Based Algorithms from Tensor Product Formulas,” Proc. Sixth Ann. Workshop Languages and Compilers for Parallel Computing, pp. 358–338, Aug. 1993.
[16] B. Kumar, C.-H. Huang, P. Sadayappan,, and R.W. Johnson, “An Algebraic Approach to Cache Memory Characterization for Block Recursive Algorithms,” Proc. 1994 Int'l Computer Symp., pp. 336–342, 1994.
[17] Z. Li, “Computational Model and Program Synthesis for Parallel Out-of-Core Computation,” PhD thesis, Duke Univ., 1996.
[18] C.V. Loan,Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.
[19] R. Paige, J.H. Reif,, and R. Wachter,Parallel Algorithm Derivation and Program Transformation. Kluwer Academic, 1993.
[20] H.S. Stone, “Parallel Processing with the Perfect Shuffle,” IEEE Trans. Computers, vol. 20, no. 2, pp. 153–161, Feb. 1971.
[21] R. Thakur, R. Bordawekar, and A. Choudhary, “Compilation of Out-of-Core Data Parallel Programs for Distributed Memory Machines,” Proc. IPPS '94 Second Ann. Workshop Input/Output in Parallel Computer Systems, pp. 54-72, Apr. 1994. Also appears in ACM Computer Architecture News, vol. 22, no. 4, Sept. 1994.
[22] J.S. Vitter and E.A.M. Shriver, “Algorithms for Parallel Memory I: Two-Level Memories,” Algorithmica, vol. 12, nos. 2-3, pp. 110–147, 1994.
[23] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
Index Terms:
Parallel I/O, program synthesis, data distribution, tensor product, block recursive algorithm, fast Fourier transform.
Citation:
Zhiyong Li, John H. Reif, Sandeep K.S. Gupta, "Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms Using Block-Cyclic Data Distributions," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 3, pp. 297-315, Mar. 1999, doi:10.1109/71.755830