PSCR: A Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Shared-Memory Multiprocessors
|
Abstract—In high-performance general-purpose workstations and servers, the workload can be typically constituted of both sequential and parallel applications. Shared-bus shared-memory multiprocessor can be used to speed-up the execution of such workload. In this environment, the scheduler takes care of the load balancing by allocating a ready process on the first available processor, thus producing process migration. Process migration and the persistence of private data into different caches produce an undesired sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus and coping with such a problem may represent a challenging design problem for these machines. Many protocols use smart solutions to limit the overhead to maintain coherence among shared copies. None of these studies treats passive-sharing directly, although some indirect effect is present while dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
[1] 742 S.V. Adve, M.D. Hill, and M. Vernon, “Comparison of Hardware and Software Cache Coherence Schemes,” Proc. 18th Int'l Symp. Computer Architecture, pp. 298-308, May 1991.
[2] S.V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," Computer, Dec. 1996, pp. 66-76.
[3] A.V. Aho, R. Sethi, and J.D. Ullman, Compilers, Principles, Techniques and Tools.New York: Addison-Wesley, 1985.
[4] C. Anderson and J.-L. Baer, “Design and Evaluation of a Sub-Block Cache Coherence Protocol for Bus-Based Multiprocessors,” Technical Report TR-94-05-02, Dept. of Computer Science and Eng., Univ. of Washington May 1994.
[5] C. Anderson and A.R. Karlin, “Two Adaptive Hybrid Cache Coherency Protocols,” Proc. Second Int'l Symp. High-Performance Computer Architecture, pp. 303-313, Feb. 1996.
[6] J. Archibald and J.L. Baer, "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Trans. Computer Systems, vol. 4, no. 4, Nov. 1986.
[7] J.K. Archibald, “The Cache Coherence Problem in Shared-Memory Multiprocessor,” PhD thesis, Univ. of Washington, Mar. 1987.
[8] J. K. Archibald,“A cache coherence approach for large multiprocessor system,”inProc. 2nd Int. Conf. Supercomput., 1988, pp. 337–345.
[9] M. Brorsson and P. Stenström, “Modelling Accesses to Migratory and Producer-Consumer Characterized Data in a Shared Memory Multiprocessor,” Proc. Sixth Symp. Parallel and Distributed Processing, pp. 612-619, Oct. 1994.
[10] J. Carter, J. Bennett, and W. Zwaenepoel, "Techniques for Reducing Consistency-Related Communication in Distributed Shared Memory Systems," ACM Trans. Computer Systems, Vol. 13, No. 3, Aug. 1995, pp. 205-244.
[11] L.M. Censier and P. Feautrier, “A New Solution to the Coherence Problems in Multicache Systems,” IEEE Trans. Computers, vol. 27, no. 12, pp. 1,112-1,118, Dec. 1978.
[12] H. Cheong and A.V. Veidenbaum, “A Cache Coherence Scheme with Fast Selective Invalidation,” Proc. 15th Int'l Symp. Computer Architecture, pp. 299-307, Honolulu, Hawaii, May-June 1988.
[13] A.L. Cox and R.J. Fowler, "Adaptive Cache Coherency for Detecting Migratory Shared Data," Proc. 20th Ann. Int'l Symp. Computer Architecture, IEEE Computer Soc. Press, Los Alamitos, Calif., 1993, pp. 98-108.
[14] D. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, 1998.
[15] DECChip 21064 - A RISC Microprocessor Preliminary Data Sheet, DEC - Digital Equipment Corp., Maynard, Mass., 1993.
[16] M. Dubois, C. Scheurich, and F.A. Briggs, “Synchronization, Coherence, and Event Ordering in Multiprocessors,” Computer, vol. 21, no. 2, pp. 9-21, Feb. 1998.
[17] M. Dubois and J.-C. Wang, “Shared Block Contention in a Cache Coherence Protocol,” IEEE Trans. Computers, vol. 40, no. 5, pp. 640-644, May 1991.
[18] M. Dubois, J.C. Wang, L.A. Barroso, K. Lee, and Y.-S. Chen, “Delayed Consistency and its Effect on the Miss Rate of Parallel Programs,” Proc. Fourth Conf. Supercomputing, pp. 197-207, Alburquerque, N.M., Nov. 1991.
[19] S.J. Eggers and R.H. Katz, "A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation," Proc. 15th Ann. Int'l Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1988, pp. 373-382.
[20] S.J. Eggers, “Simulation Analysis of Data Sharing in Shared Memory Multiprocessors,” PhD thesis UCB/CSD 89/501, Univ. of California, Berkeley, Apr. 1989.
[21] S.J. Eggers and R.H. Katz, "Evaluating the Performance of Four Snooping Cache Coherence Protocols," Proc. 16th Ann. Int'l Symp. Computer Architecture, IEEE Computer Soc. Press, Los Alamitos, Calif., 1989, pp. 2-15.
[22] S.J. Eggers, D.R. Keppel, E.J. Koldinger, and H.M. Levy, “Techniques for Efficient In-Line Tracing on a Shared-Memory Multiprocessor,” Proc. ACM SIGMetrics Int'l Conf. Measurement and Modeling of Computer Systems, pp. 37–47, 1990.
[23] S.J. Eggers, “Simplicity versus Accuracy in a Model of Cache Coherency Overhead,” IEEE Trans. Computers, vol. 40, no. 8, pp. 893-906, Aug. 1991.
[24] M.J. Flynn, Computer Architecture Pipelined and Parallel Processor Design, Jones and Bartlett Publishers, Boston, 1995.
[25] S.J. Frank, “Tightly Coupled Multiprocessor System Speeds Memory Access Times,” Electronics, vol. 57, no. 1, pp. 164-169, Jan. 1984.
[26] J.G. Gee and A.J. Smith, “Absolute and Comparative Performance of Cache Consistency Algorithms,” Technical Report UCB//CSD-93-753, EECS Computer Science Division, Univ. of California, Berkeley, 1993.
[27] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,” Proc. 17th Ann. Int'l Symp. Computer Architecture, 1990.
[28] R. Giorgi, C. Prete, G. Prina, and L. Ricciardi, “A Hybrid Approach to Trace Generation for Performance Evaluation of Shared-Bus Multiprocessors,” Proc. 22nd EuroMicro Int'l. Conf., pp. 207-214, Prague, Sept. 1996.
[29] R. Giorgi, C. Prete, G. Prina, and L. Ricciardi, “Trace Factory: Generating Workloads for Trace-Driven Simulation of Shared-Bus Multiprocessors,” IEEE Concurrency, vol. 5, no. 4, pp. 54-68, Oct. 1997.
[30] S.R. Goldschmidt and J.L. Hennessy, “The Accuracy of Trace-Driven Simulations of Multiprocessors,” Proc. ACM Sigmetrics Conf. Measurement and Modeling of Computer Systems, pp. 146-157, May 1993.
[31] J.R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proc. 10th Ann. Symp. Computer Architecture, pp. 124-132, 1983.
[32] H. Grahn and P. Stenström, “Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection,” J. Parallel and Distributed Computing, vol. 39, no. 2, pp. 168-180, Dec. 1996.
[33] A. Gupta and W.-D. Weber, "Cache Invalidation Patterns in Shared-Memory Multiprocessors," IEEE Trans. Computers, vol. 41, no. 7, pp. 794-810, July 1992.
[34] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1995.
[35] M.A. Hollyday and C.S. Ellis, "Accuracy of Memory Reference Traces of Parallel Computations in Trace-Driven Simulation," IEEE Trans. Parallel and Distributed Systems, Vol. 3, No. 1, Jan. 1992, pp. 97-109.
[36] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.
[37] T. Jeremiassen and S. Eggers, “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations,” Proc. SIGPLAN Symp. Principles and Practices of Parallel Programming, pp. 179-188, July 1995.
[38] M. Kadiyala and L.N. Bhuyan, “A Dynamic Cache Sub-block Design to Reduce False Sharing,” Technical Report TR95-010, Texas A&M Univ., 1995.
[39] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, “Competitive Snoopy Caching,” Proc. 27th Symp. Foundations of Computer Science, pp. 244-254, Oct. 1986.
[40] R.H. Katz et al., "Implementing a Cache Consistency Protocol," Proc. 12th Ann. Int'l Symp. Computer Architecture, June 1985, pp. 158-166.
[41] G. Lee, B. Quattlebaum, and L. Kinney, “Protocol Mapping for a Bus-Based COMA Multiprocessor,” Technical Report DICE #4, Dept. of Electrical Eng., Univ. of Minnesota, Mar. 1994.
[42] E.P. Markatos and T.J. LeBlanc, “Load Balancing vs. Locality Management In Shared-Memory Multiprocessors,” Proc. 1992 Int'l Conf. Parallel Processing, vol. 1: Architecture, pp. 258-267, Ann Arbor, Mich., Aug. 1992.
[43] E.M. McCreight, “The Dragon Computer System: An Early Overview,” NATO Advanced Study Institute on Microarchitecture of VLSI Computer, July 1984.
[44] A.R. Lebeck and D.A. Wood, “Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors,” Proc. 22nd Int'l Symp. Computer Architecture, pp. 48-59, New York, 22-24 June 1995.
[45] T.C. Mowry, “Tolerating Latency in Multiprocessors Through Compiler-Inserted Prefetching,” ACM Trans. Computer Systems, vol. 16, no. 1, pp. 55-92 1998.
[46] H. Nilsson and P. Stenström, “An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic,” Parallel Architectures and Languages Europe, pp. 363-374, July 1994.
[47] M. Paramarcos and J. Patel,“A low-overhead coherence solution for multiprocessors with private cache memories,” Proc. 11th Int’l Symp. Computer Architecture, pp. 348-354, June 1984.
[48] C.A. Prete, “A New Solution of Coherence Protocol for Tightly Coupled Multiprocessor Systems,” Microprocessing and Microprogramming, vol. 30, nos. 1-5, pp. 207-214, 1990.
[49] C.A. Prete, “RST Cache Memory Design for a Tightly Coupled Multiprocessor System,” IEEE Micro, vol. 11, no. 2, pp. 16-19, 40-52, Apr. 1991.
[50] C.A. Prete, G. Prina, and L. Ricciardi, "A Trace-Driven Simulator for Performance Evaluation of Cache-Based Multiprocessor Systems," IEEE Trans. Parallel and Distributed Systems, Vol. 6, No. 9, Sept. 1995, pp. 915-929.
[51] C.A. Prete, G. Prina, and L. Ricciardi, “A Selective Invalidation Strategy for Cache Coherence,” IEICE Trans. Information and Systems, vol. E78-D, no. 10, pp. 1,316-1,320, Oct. 1995.
[52] C.A. Prete, G. Prina, R. Giorgi, and L. Ricciardi, “Some Considerations About Passive Sharing in Shared-Memory Multiprocessors,” IEEE TCCA Newsletter, pp. 34-40, Mar. 1997.
[53] L. Rudolph and Z. Segall,“Dynamic decentralized cache schemes for mimd parallel processors,” Proc. Int’l Symp. Computer Architecture, pp. 340-347, 1984.
[54] J.P. Singh, W.D. Weber, and A. Gupta, "SPLASH: Stanford Parallel Applications for Shared Memory," Proc. 19th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., May 1992, pp. 5-14.
[55] J. Skeppstedt and P. Stenstrom, "A Compiler Algorithm to Reduce Ownership Overhead in Cache Coherence Protocols," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1994, pp. 286-296.
[56] J. Skeppstedt and P. Stenstrom, "A Compiler Algorithm that Reduces Read Latency in Ownership-Based Cache Coherence Protocols," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IEEE Computer Soc. Press, Los Alamitos, Calif., 1995, pp. 69-78.
[57] M.S. Squillante and E.D. Lazowska, "Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling," IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 2, Feb. 1993, pp. 131-143.
[58] S. Srbljic, Z.G. Vranesic, M. Stumm, and L. Budin, “Analytical Prediction of Performance for Cache Coherence Protocols,” IEEE Trans. Computers, vol. 46, no. 11, pp. 1,155-1,173, Nov. 1997.
[59] P. Stenström, "A Survey of Cache Coherence Scheme for Multiprocessors," Computer, vol. 23, no. 6, pp. 12-24, Jun.e 1990.
[60] P. Stenström, M. Brorsson, and L. Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing," Proc. 20th Ann. Int'l Symp. Computer Architecture, pp. 109-118, 1993.
[61] P. Stenström, E. Hagersten, D.J. Lilja, M. Martonosi, and M. Venugopal, “Trends in Shared Memory Multiprocessing,” Computer, vol. 30, no. 12, pp. 44-50, Dec. 1997.
[62] C.B. Stunkel,B. Janssens,, and W.K. Fuchs,“Address tracing for parallel machines,” Computer, pp. 31-38, 1991.
[63] P. Sweazey and A.J. Smith, “A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus,” Proc. 13th Ann. Int'l Symp. Computer Architecture, pp. 414-423, June 1986.
[64] M. Takahashi, H. Takano, E. Kaneko, and S. Suzuki, “A Shared-bus Control Mechanism and a Cache Coherency Protocol for High-Performance On-Chip Multiprocessor,” Proc. Second IEEE Int'l Symp. High-Performance Computer Architecture, Feb. 1996.
[65] C. Thacker, L. Stewart, and E. Satterthwaite, “Firefly: A Multiprocessor Workstation,” IEEE Trans. Computers, vol. 37, no. 8, pp. 909-920, Aug. 1988.
[66] M. Tomasevic and V. Milutinovic, “A Simulation Study of Snoopy Cache Coherence Protocols,” Proc. 25thHawaii Int'l Conf. System Sciences (HICSS-25), vol. I, pp. 427-436, Jan. 1992.
[67] M. Tomasevic and V. Milutinovic, Tutorial on the Cache Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions, IEEE Computer Society Press, Los Alamitos, Calif., 1993.
[68] M. Tomasevic and V. Milutinovic,"Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors, Part 1," IEEE Micro, Oct. 1994, pp. 52-59.
[69] M. Tomasevic and V. Milutinovic, “The Word-Invalidate Cache Coherence Protocol,” Microprocessors and Microsystems, vol. 20, pp. 3-16, Mar. 1996.
[70] J. Torrellas, M.S. Lam, and J.L. Hennessy, “Share Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” Proc. 1990 Int'l Conf. Parallel Processing, vol. 2: Software, pp. 266-270, Urbana-Champaign, Ill., Aug. 1990.
[71] J. Torrellas, A. Tucker, and A. Gupta, “Evaluating the Performance of Cache-Affinity Scheduling in Shared-Memory Multiprocessors,” J. Parallel and Distributed Computing, vol. 24, no. 2, pp. 139-151, Feb. 1995.
[72] R.A. Uhlig and T.N. Mudge, "Trace-Driven Memory Simulation: A Survey," ACM Computing Surveys, Vol. 29, No. 2, June 1997, pp. 128-170.
[73] B. Vashaw, “Address Trace Collection an Trace Driven Simulation of Bus Based, Shared Memory Multiprocessors,” Technical Report CMUCDS-93-4, Carnegie Mellon Univ., Mar. 1993.
[74] R. Vaswani and J. Zahorjan,“The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors,”inACM Symp. Operat. Syst. Princip., Pacific Grove, 1991, pp. 26–40.
[75] J. Veenstra and R. Fowler, "A Performance Evaluation of Optimal Hybrid Cache Coherency Protocols," Proc. 1992 Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 149-157, Oct. 1992.
[76] J.E. Veenstra and R.J. Fowler, “The Prospects for On-Line Hybrid Coherency Protocols on Bus-Based Multiprocessors,” Technical Report TR 490, Computer Science Dept., Univ. of Rochester Mar. 1994.
[77] C. Xia and J. Torrellas, “Improving the Data Cache Performance of Multiprocessor Operating Systems,” Proc. Second Int'l Symp. High-Performance Computer Architecture, pp. 85-94, Feb. 1996.
[78] J.C. Mogul and A. Borg, “The Effect of Context Switches on Cache Performance,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 75-84, Santa Clara, Calif., Apr. 1991.
Index Terms:
Cache memory, coherence protocol, multiprocessor, performance evaluation.
Citation:
Roberto Giorgi, Cosimo Antonio Prete, "PSCR: A Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 7, pp. 742-763, July 1999, doi:10.1109/71.780868