loading...
Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2004.1255787January 2004 (vol. 53 no. 1) pp. 1-19
 This Article 
 
PDF
HTML
 
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   

Abstract—One of the key challenges facing computer architects and compiler writers is the increasing discrepancy between processor cycle times and main memory access times. To alleviate this problem in array-intensive embedded signal and video processing applications, compilers may employ either control-centric transformations that change data access patterns of nested loops or data-centric transformations that modify memory layouts of multidimensional arrays. Most of the memory layout optimizations proposed so far either modify the layout of each array independently or are based on explicit data reorganizations at runtime. This paper focuses on a compiler technique, called array regrouping, that automatically maps multiple arrays into a single data (array) space to improve data access pattern. We present a mathematical framework that enables us to systematically derive suitable mappings for a given array-intensive embedded application. The framework divides the arrays accessed in a given program into several groups and each group is independently layout-transformed to improve spatial locality and reduce the number of conflict misses. As compared to the previous approaches, the proposed technique makes two new contributions: 1) It presents a graph based formulation of the array regrouping problem and 2) it demonstrates potential benefits of this aggressive array-regrouping strategy in optimizing behavior of embedded systems. Extensive experimental results demonstrate significant improvements in cache miss rates and execution times. An important advantage of this approach over the previous techniques that target conflict misses is that it reduces conflict misses without increasing the data space requirements of the application being optimized. This is a very desirable property in many embedded/portable environments where data space requirements determine the minimum physical memory capacity. In addition to performance related issues, with the increased use of embedded/portable devices, improving energy efficiency of applications is becoming a critical issue. To develop a truly energy-efficient system, energy constraints should be taken into account early in the design process, i.e., at the source level in software compilation and behavioral level in hardware compilation. Source-level optimizations are particularly important in data-dominated media applications. In this paper, we also show how our array regrouping strategy increases energy savings from using multiple low-power operating modes provided in current memory modules. Using a set of array-intensive benchmarks, we observe significant savings in memory system energy.

[1] 1 I. Al-Furaih and S. Ranka, Memory Hierarchy Management for Iterative Graph Structures Proc. Int'l Parallel Processing Symp., Apr. 1998.
[2] F. Catthoor, S. Wuytack, E. D. Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, Custom Memory Management Methodology Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic, June 1998.
[3] A. Chandrakasan, W.J. Bowhill, and F. Fox, Design of High-Performance Microprocessor Circuits. IEEE Press, 2001.
[4] T. Chilimbi, M. Hill, and J. Larus, Cache-Conscious Structure Layout Proc. SIGPLAN '99 Conf. Programming Language Design and Implementation, 1999.
[5] M. Cierniak and W. Li, Unifying Data and Control Transformations for Distributed Shared Memory Machines Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation, 1995.
[6] S. Coleman and K. McKinley, Tile Size Selection Using Cache Organization and Data Layout Proc. ACM SIGPLAN '95 Conf. Programming Language Design and Implementation, 1995.
[7] V. Delaluz et al., Compiler-Directed Interleaving for Reducing Energy in Multi-Bank Memories Proc. Seventh Asia and South Pacific Design Automation Conf. and 15th Int'l Conf. VLSI Design, Jan. 2002.
[8] V. Delaluz et al., DRAM Energy Management Using Software and Hardware Direct Power Mode Control Proc. Seventh Int'l Conf. High Performance Computer Architecture. Jan. 2001.
[9] Dinero-IV Trace-Driven Uniprocessor Cache Simulator http://www.cs.wisc.edu/~markhillDineroIV /, 2000.
[10] C. Ding, Improving Effective Bandwidth through Compiler Enhancement of Global and Dynamic Cache Reuse PhD thesis, Dept. of Computer Science, Rice Univ., 2000.
[11] C. Ding and K. Kennedy, Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Runtime Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, 1999.
[12] C. Ding and K. Kennedy, Inter-Array Data Regrouping Proc. 12th Workshop Languages and Compilers for Parallel Computing, 1999.
[13] C. Ding and K. Kennedy, Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse Proc. Int'l Parallel and Distributed Processing Symp., 2001.
[14] Dspstone Benchmark suite http://www.ert. rwth-aachen.de/Projekte/ Tools/DSPSTONEdspstone.html, 1996.
[15] C. Eisenbeis, S. Lelait, and B. Marmol, The Meeting Graph: A New Model for Loop Cyclic Register Allocation Proc. IFIP WG 10.3 Working Conf. Parallel Architectures and Compilation Techniques, 1995.
[16] X. Fan, C.S. Ellis, and A.R. Lebeck, Memory Controller Policies for DRAM Power Management Proc. Int'l Symp. Low Power Electronics and Design, 2001.
[17] P. Feautrier, Dataflow Analysis of Array and Scalar References Int'l J. Parallel Programming, vol. 20, no. 1, pp. 23-51, 1991.
[18] A.H. Farrahi and M. Sarrafzadeh, System Partitioning to Maximize Sleep Time Proc. Int'l Conf. Computer Aided Design, 1995.
[19] B. Franke and M.F.P O'Boyle, Compiler Transformation of Pointers to Explicit Array Accesses in DSP Applications Proc. Int'l Conf. Compiler Construction '01, 2001.
[20] D. Genius and S. Lelait, A Case for Array Merging in Memory Hierarchies Proc. Ninth Int'l Workshop Compilers for Parallel Computers, pp. 375-384, 2001.
[21] D. Gannon, W. Jalby, and K. Gallivan, Strategies for Cache and Local Memory Management by Global Program Transformations J. Parallel and Distributed Computing, vol. 5, no. 5, pp. 587-616, 1988.
[22] H. Han and C.-W. Tseng, Improving Locality for Adaptive Irregular Scientific Codes Proc. 13th Int'l Workshop Languages and Compilers for Parallel Computing, 2000.
[23] W.-M.W. Hwu, Embedded Microprocessor Comparison http://www.crhc.uiuc.edu/IMPACT/ece412/public_html/ Notes/412_lec1ppframe.htm, 1997.
[24] F. Irigoin and R. Triolet, Super-Node Partitioning Proc. 15th Ann. ACM Symp. Principles of Programming Languages, pp. 319-329, Jan. 1988.
[25] Y. Ju and H. Dietz, Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation Proc. Workshop Languages and Compiler for Parallel Computing, pp. 344-358, 1992.
[26] M. Kandemir, Array Unification: A Locality Optimization Technique Proc. Int'l Conf. Compiler Construction (ETAPS 2001) 2001.
[27] M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam, A Hyperplane Based Approach for Optimizing Spatial Locality in Loop Nests Proc. 12th ACM Int'l Conf. Supercomputing, 1998.
[28] M. Kandemir and I. Kolcu, Influence of Loop Optimizations on Energy Consumption of Multi-Bank Memory Systems Proc. Int'l Conf. Compiler Construction, 2002.
[29] M. Kandemir, J. Ramanujam, and A. Choudhary, A Compiler Algorithm for Optimizing Locality in Loop Nests Proc. 11th ACM Int'l Conf. Supercomputing, pp. 269-276, 1997.
[30] I. Kodukula, N. Ahmed, and K. Pingali, Data-Centric Multi-Level Blocking Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, 1997.
[31] M. Lam, E. Rothberg, and M. Wolf, The Cache Performance of Blocked Algorithms Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1991.
[32] R. Lebeck, X. Fan, H. Zeng, and C.S. Ellis, Power-Aware Page Allocation Proc. Ninth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2000.
[33] S.-T. Leung and J. Zahorjan, Optimizing Data Locality by Array Restructuring Technical Report TR 95-09-01, Dept. of Computer Science and Eng., Univ. of Washington, 1995.
[34] W. Li, Compiling for NUMA Parallel Machines PhD thesis, Dept. of Computer Science, Cornell Univ., 1993.
[35] S.Y. Liao, Code Generation and Optimization for Embedded Digital Signal Processors PhD thesis, Dept. of Electrical Eng. and Computer Science, Massachusetts Inst. of Tech nology, 1996.
[36] K. McKinley, S. Carr, and C.W. Tseng, Improving Data Locality with Loop Transformations ACM Trans. Programming Languages and Systems, 1996.
[37] J. Mellor-Crummey, D. Whalley, and K. Kennedy, Improving Memory Hierarchy Performance for Irregular Applications Proc. 13th ACM Int'l Conf. Supercomputing, 1999.
[38] N. Mitchell, L. Carter, and J. Ferrante, Localizing Nonaffine Array References Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, 1999.
[39] S.S. Muchnick, Advanced Compiler Design and Implementation, first ed. Morgan Kaufmann, 1997.
[40] NWCHEM,http://www.emsl.pnl.gov:2080/docsnwchem/, 2001.
[41] M. O'Boyle and P. Knijnenburg, Non-Singular Data Transformations: Definition, Validity, Applications Proc. Sixth Workshop Compilers for Parallel Computers, pp. 287-297, 1996.
[42] M. O'Boyle and P. Knijnenburg, “Integrating Loop and Data Transformations for Global Optimisation,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), Oct. 1998.
[43] W-T. Shiue and C. Chakrabarti, Memory Exploration for Low Power, Embedded Systems Technical Report CLPE-TR-9-1999-20, Center for Low Power Electronics, Arizona State Univ., 1999,
[44] O. Temam, E. Granston, and W. Jalby, To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should Be Used to Eliminate Cache Conflicts Proc. IEEE Supercomputing '93, 1993.
[45] Rambus Inc.,http:/www.rambus.com/, 2003.
[46] Rambus Inc., 128/144-mbit Direct RDRAM Data Sheet 1999.
[47] G. Rivera and C.-W. Tseng, Data Transformations for Eliminating Conflict Misses Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, 1998.
[48] K.O. Thabit, Cache Management by the Compiler PhD thesis, Dept. of Computer Science, Rice Univ., 1981.
[49] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. Kim, and W. Ye, “Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower,” Proc. Int'l Symp. Computer Architecture, 2000.
[50] R. Wilson et al., Suif: An Infrastructure for Research on Parallelizing and Optimizing Compilers ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, 1994.
[51] M. Wolf and M. Lam, A Data Locality Optimizing Algorithm Proc. ACM SIGPLAN '91 Conf. Programming Language Design and Implementation, pp. 30-44, 1991.
[52] M. Wolfe, High-Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[53] J. Xue, On Tiling as a Loop Transformation Parallel Processing Letters, vol. 7, no. 4, pp. 409-424, 1997.

Index Terms:
Array regrouping, layout optimizations, memory energy consumption, cache locality, embedded systems.
Citation:
Victor De La Luz, Mahmut Kandemir, "Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications," IEEE Transactions on Computers, vol. 53, no. 1, pp. 1-19, Jan. 2004, doi:10.1109/TC.2004.1255787
Usage of this product signifies your acceptance of the Terms of Use.