710 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4 HiPEAC, Jan. 2011, pp.147-157. [48] Shen X, Shaw J, Meeker B, Ding C. Locality approximation using time. In Proc. the 34th POPL, Jan. 2007, pp.55-61. [49] Shen X, Shaw J. Scalable implementation of efficient locality approximation. In Proc. the 21st LCPC Workshop, July 31-August 2, 2008, pp.202-216. [50] Jiang Y, Zhang E Z, Tian K, Shen X. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proc. the 19th CC, Mar. 2010, pp.264-282. [51] Shen X, Shaw J, Meeker B, Ding C. Locality approximation using time. Technical Report, TR 901, Department of Computer Science, University of Rochester, December 2006. [52] Jiang Y, Tian K, Shen X. Combining locality analysis with online proactive job co-scheduling in chip multiprocessors. In Proc. HiPEAC, Jan. 2010, pp.201-215. [53] West R, Zaroo P, Waldspurger C A, Zhang X. Online cache modeling for commodity multicore processors. Operating Systems Review, 2010, 44(4): 19-29. [54] Fedorova A, Seltzer M, Smith M D. Improving performance isolation on chip multiprocessors via an operating system scheduler. In Proc. the 16th PACT, Sept. 2007, pp.25-38. [55] Zhou S. An efficient simulation algorithm for cache of random replacement policy. In Proc. the IFIP Int. Conf. Network and Parallel Computing, Sept. 2010, pp.144-154. [56] Arnold M, Ryder B G. A framework for reducing the cost of instrumented code. In Proc. PLDI, June 2001, pp.168-179. [57] Hirzel M, Chilimbi T M. Bursty tracing: A framework for low-overhead temporal profiling. In Proc. ACM Workshop on Feedback-Directed and Dynamic Optimization, Dec. 2001. [58] Cascaval C, Duesterwald E, Sweeney P F, Wisniewski R W. Multiple page size modeling and optimization. In Proc. the 14th PACT, Sept. 2005, pp.339-349. [59] Zhong Y, Chang W. Sampling-based program locality approximation. In Proc. the 7th ISMM, June 2008, pp.91-100. [60] Tam D K, Azimi R, Soares L, Stumm M. 
RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations. In Proc. the 14th ASPLOS, Mar. 2009, pp.121-132. [61] Niu Q, Dinan J, Lu Q, Sadayappan P. PARDA: A fast parallel reuse distance analysis algorithm. In Proc. IPDPS, May 2012. [62] Cui H, Yi Q, Xue J, Wang L, Yang Y, Feng X. A highly parallel reuse distance analysis algorithm on GPUs. In Proc. the 26th IPDPS, May 2012, pp.1284-1294. [63] Gupta S, Xiang P, Yang Y, Zhou H. Locality principle revisited: A probability-based quantitative approach. In Proc. the 26th IPDPS, May 2012, pp.995-1009. [64] Moseley T, Shye A, Reddi V J, Grunwald D, Peri R. Shadow profiling: Hiding instrumentation costs with parallelism. In Proc. CGO, March 2007, pp.198-208. [65] Wallace S, Hazelwood K. SuperPin: Parallelizing dynamic instrumentation for real-time performance. In Proc. CGO, Mar. 2007, pp.209-220. [66] Cascaval C, Padua D A. Estimating cache misses and locality using stack distances. In Proc. the 17th ICS, June 2003, pp.150-159. [67] Allen R, Kennedy K. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers, 2001. [68] Beyls K, D'Hollander E H. Generating cache hints for improved program efficiency. Journal of Systems Architecture, 2005, 51(4): 223-250. [69] Pugh W, Wonnacott D. Eliminating false data dependences using the Omega test. In Proc. PLDI, June 1992, pp.140-151. [70] Chauhan A, Shei C Y. Static reuse distances for locality-based optimizations in MATLAB. In Proc. the 24th ICS, June 2010, pp.295-304. [71] Shen X, Gao Y, Ding C et al. Lightweight reference affinity analysis. In Proc. the 19th ICS, June 2005, pp.131-140. [72] Bao B, Ding C. Defensive loop tiling for shared cache. In Proc. CGO, Feb. 2013, pp.1-11. [73] Bao B. Peer-aware program optimization [Ph.D. Thesis]. Computer Science Dept., Univ. of Rochester, January 2013. [74] Yuan L, Ding C, Stefankovič D, Zhang Y. Modeling the locality in graph traversals. In Proc.
the 41st ICPP, Sept. 2012, pp.138-147. [75] Agarwal A, Hennessy J L, Horowitz M. Cache performance of operating system and multiprogramming workloads. ACM Transactions on Computer Systems, 1988, 6(4): 393-431. [76] Ding C, Chilimbi T. A composable model for analyzing locality of multi-threaded programs. Technical Report, MSR-TR-2009-107, Microsoft Research, August 2009. [77] Strohmaier E, Shan H. APEX-Map: A parameterized scalable memory access probe for high-performance computing systems. Concurrency and Computation: Practice and Experience, 2007, 19(17): 2185-2205. [78] Ibrahim K Z, Strohmaier E. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. In Proc. ICPP, Sept. 2010, pp.353-362. [79] He L, Yu Z, Jin H. FractalMRC: Online cache miss rate curve prediction on commodity systems. In Proc. IPDPS, May 2012, pp.1341-1351. [80] Saltzer J H. A simple linear model of demand paging performance. Communications of the ACM, 1974, 17(4): 181-186. [81] Strecker W D. Transient behavior of cache memories. ACM Transactions on Computer Systems, 1983, 1(4): 281-293. [82] King W F. Analysis of demand paging algorithms. In Proc. IFIP Congress, August 1971, pp.485-490. [83] Fagin R, Price T G. Efficient calculation of expected miss ratios in the independent reference model. SIAM Journal on Computing, 1978, 7(3): 288-297. [84] Dan A, Towsley D F. An approximate analysis of the LRU and FIFO buffer replacement schemes. In Proc. SIGMETRICS, May 1990, pp.143-152. [85] Gu X, Ding C. Reuse distance distribution in random access. Technical Report, URCS #930, University of Rochester, January 2008. [86] Denning P J, Slutz D R. Generalized working sets for segment reference strings. Communications of the ACM, 1978, 21(9): 750-759. [87] Easton M C, Fagin R. Cold-start vs. warm-start miss ratios. Communications of the ACM, 1978, 21(10): 866-872. [88] Shedler G, Tung C. Locality in page reference strings.
SIAM Journal on Computing, 1972, 1(3): 218-241. [89] Stone H S, Turek J, Wolf J L. Optimal partitioning of cache memory. IEEE Transactions on Computers, 1992, 41(9): 1054-1068. [90] Thiébaut D, Stone H S, Wolf J L. Improving disk cache hit ratios through cache partitioning. IEEE Transactions on Computers, 1992, 41(6): 665-676. [91] Falsafi B, Wood D A. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 1997, 7(1): 104-130. [92] Wu M J, Yeung D. Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis. In Proc. the ACM SIGPLAN Workshop on Memory System Performance and Correctness, June 2012, pp.2-11. [93] Fedorova A, Blagodurov S, Zhuravlev S. Managing contention for shared resources on multicore processors. Communications of the ACM, 2010, 53(2): 49-57. [94] Zhuravlev S, Blagodurov S, Fedorova A. Addressing shared resource contention in multicore processors via scheduling. In Proc. ASPLOS, March 2010, pp.129-142.
Chen Ding et al.: Performance Metrics and Models for Shared Cache 711 [95] Blagodurov S, Zhuravlev S, Fedorova A. Contention-aware scheduling on multicore systems. ACM Transactions on Computer Systems, 2010, 28(4): Article No.8. [96] Chen X E, Aamodt T M. A first-order fine-grained multithreaded throughput model. In Proc. HPCA, Feb. 2009, pp.329-340. [97] Xie Y, Loh G H. Dynamic classification of program memory behaviors in CMPs. In Proc. CMP-MSI Workshop, June 2008. [98] Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach (4th edition). Morgan Kaufmann, 2006. [99] Sun X H, Wang D. APC: A performance metric of memory systems. ACM SIGMETRICS Performance Evaluation Review, 2012, 40(2): 125-130. [100] Zhao J, Feng X, Cui H et al. An empirical model for predicting cross-core performance interference on multicore processors. In Proc. PACT, Sept. 2013, pp.201-212. [101] Wang W, Dey T, Davidson J W et al. DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. In Proc. HPCA, Feb. 2014. [102] Kim M, Kumar P, Kim H, Brett B. Predicting potential speedup of serial code via lightweight profiling and emulations with memory performance model. In Proc. IPDPS, May 2012, pp.1318-1329. [103] Zhang X, Zhong R, Dwarkadas S, Shen K. A flexible framework for throttling-enabled multicore management (TEMM). In Proc. ICPP, Sept. 2012, pp.389-398. [104] Liu L, Cui Z, Xing M et al. A software memory partition approach for eliminating bank-level interference in multicore systems. In Proc. PACT, Sept. 2012, pp.367-376. [105] Jiang Y, Tian K, Shen X, Zhang J, Chen J, Tripathi R. The complexity of optimal job co-scheduling on chip multiprocessors and heuristics-based solutions. IEEE Trans. Parallel and Distributed Systems, 2011, 22(7): 1192-1205. [106] Jiang Y, Shen X, Chen J, Tripathi R. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proc. PACT, Oct. 2008, pp.220-229. 
[107] Snavely A, Tullsen D M. Symbiotic jobscheduling for a simultaneous multithreading processor. In Proc. ASPLOS, Nov. 2000, pp.234-244. [108] Shen K. Request behavior variations. In Proc. ASPLOS, Mar. 2010, pp.103-116. [109] Knauerhase R, Brett P, Hohlt B, Li T, Hahn S. Using OS observations to improve performance in multicore systems. IEEE Micro, 2008, 38(3): 54-66. [110] Denning P J. Equipment configuration in balanced computer systems. IEEE Transactions on Computers, 1969, C-18(11): 1008-1012. [111] Wulf W A. Performance monitors for multi-programming systems. In Proc. the ACM Symposium on Operating System Principles, Oct. 1969, pp.175-181. [112] Mars J, Tang L, Skadron K, Soffa M L, Hundt R. Increasing utilization in modern warehouse-scale computers using bubble-up. IEEE Micro, 2012, 32(3): 88-99. [113] Delimitrou C, Kozyrakis C. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proc. ASPLOS, March 2013, pp.77-88. [114] Ahn D H, Vetter J S. Scalable analysis techniques for microprocessor performance counter metrics. In Proc. ACM/IEEE Conf. Supercomputing, Nov. 2002. [115] Rodríguez G, Badia R M, Labarta J. Generation of simple analytical models for message passing applications. In Proc. Euro-Par, Aug. 31-Sept. 3, 2004, pp.183-188. [116] Jacquet A, Janot V, Leung C et al. An executable analytical performance evaluation approach for early performance prediction. In Proc. IPDPS, April 2003. [117] Miller B P, Callaghan M D, Cargille J M et al. The Paradyn parallel performance measurement tool. IEEE Computer, 1995, 28(11): 37-46. [118] Kerbyson D J, Hoisie A, Wasserman H J. Modelling the performance of large-scale systems. IEE Proceedings - Software, 2003, 150(4): 214-222. [119] Wall D W. Predicting program behavior using real or estimated profiles. In Proc. PLDI, June 1991, pp.59-70. [120] Tian K, Jiang Y, Zhang E Z, Shen X. An input-centric paradigm for program dynamic optimizations. In Proc. OOPSLA, Oct. 2010, pp.125-139.
[121] Shen X, Zhong Y, Ding C. Regression-based multi-model prediction of data reuse signature. In Proc. the 4th Annual Symposium of the Los Alamos Computer Science Institute, Oct. 2003. [122] Marin G, Mellor-Crummey J. Scalable cross-architecture predictions of memory hierarchy response for scientific applications. In Proc. the Symposium of the Los Alamos Computer Science Institute, Oct. 2005. [123] Shen X, Ding C. Parallelization of utility programs based on behavior phase analysis. In Proc. the International Workshop on Languages and Compilers for Parallel Computing, Oct. 2005, pp.425-432. [124] Shen X, Zhong Y, Ding C. Locality phase prediction. In Proc. ASPLOS, Oct. 2004, pp.165-176. [125] Shen X, Zhong Y, Ding C. Predicting locality phases for dynamic memory optimization. Journal of Parallel and Distributed Computing, 2007, 67(7): 783-796. [126] Mao F, Shen X. Cross-input learning and discriminative prediction in evolvable virtual machines. In Proc. CGO, Mar. 2009, pp.92-101. [127] Jiang Y, Zhang E Z, Tian K et al. Exploiting statistical correlations for proactive prediction of program behaviors. In Proc. the 8th CGO, April 2010, pp.248-256. [128] Cavazos J, Moss J E B. Inducing heuristics to decide whether to schedule. In Proc. PLDI, June 2004, pp.183-194. [129] Wu B, Zhao Z, Shen X, Jiang Y, Gao Y, Silvera R. Exploiting inter-sequence correlations for program behavior prediction. In Proc. OOPSLA, Oct. 2012, pp.851-866. [130] Arnold M, Welc A, Rajan V T. Improving virtual machine performance using a cross-run profile repository. In Proc. OOPSLA, Oct. 2005, pp.297-311. [131] Tian K, Zhang E Z, Shen X. A step towards transparent integration of input-consciousness into dynamic program optimizations. In Proc. OOPSLA, Oct. 2011, pp.445-462. [132] Chen Y, Huang Y, Eeckhout L et al. Evaluating iterative optimization across 1000 datasets. In Proc. PLDI, June 2010, pp.448-459. [133] Wu B, Zhou M, Shen X et al. 
Simple profile rectifications go a long way: Statistically exploring and alleviating the effects of sampling errors for program optimizations. In Proc. the European Conference on Object-Oriented Programming, July 2013, pp.654-678. [134] Srivastava A, Eustace A. ATOM: A system for building customized program analysis tools. In Proc. PLDI, June 1994, pp.196-205. [135] Luk C, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi V J, Hazelwood K. Pin: Building customized program analysis tools with dynamic instrumentation. In Proc. PLDI, June 2005, pp.190-200. [136] Wagner Meira Jr., LeBlanc T, Poulos A. Waiting time analysis and performance visualization in Carnival. In Proc. ACM SIGMETRICS Symposium on Parallel and Distributed Tools, May 1996. [137] Reed D A, Elford C L, Madhyastha T M, Smirni E, Lamm S E. The next frontier: Interactive and closed loop performance steering. In Proc. ICPP Workshop, Aug. 1996, pp.20-31.
[138] Darema-Rogers F, Pfister G F, So K. Memory access patterns of parallel scientific programs. In Proc. SIGMETRICS, May 1987, pp.46-58. [139] Browne S, Dongarra J, Garner N, Ho G, Mucci P. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 2000, 14(3): 189-204. [140] Adhianto L, Banerjee S, Fagan M, Krentel M, Marin G, Mellor-Crummey J, Tallent N R. HPCTOOLKIT: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 2010, 22(6): 685-701. [141] Shende S, Malony A D. The TAU parallel performance system. International Journal of High Performance Computing Applications, 2006, 20(2): 287-311. [142] Schulz M, Galarowicz J, Maghrak D, Hachfeld W, Montoya D, Cranford S. Open|SpeedShop: An open source infrastructure for parallel performance analysis. Scientific Programming, 2008, 16(2/3): 105-121. [143] Hauswirth M, Sweeney P F, Diwan A. Temporal vertical profiling. Software: Practice and Experience, 2010, 40(8): 627-654. [144] Childers B, Davidson J, Soffa M L. Continuous compilation: A new approach to aggressive and adaptive code transformation. In Proc. Symp. Parallel and Distributed Processing, April 2003. [145] Cascaval C, Duesterwald E, Sweeney P F, Wisniewski R W. Performance and environment monitoring for continuous program optimization. IBM Journal of Research and Development, 2006, 50(2/3): 239-248. [146] McCurdy C, Vetter J S. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In Proc. ISPASS, March 2010, pp.87-96. [147] Liu X, Mellor-Crummey J M. Pinpointing data locality problems using data-centric analysis. In Proc. the 9th CGO, April 2011, pp.171-180. [148] Liu X, Mellor-Crummey J. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proc.
the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2014, pp.259-272. [149] Zhuang X, Serrano M J, Cain H W, Choi J. Accurate, effi- cient, and adaptive calling context profiling. In Proc. PLDI, June 2006, pp.263-271. [150] Ding C, Yuan L. Program interaction on multicore: Theory and applications. Computer Engineering and Science, 2014, 36(1): 1-5. (In Chinese) Chen Ding received his Ph.D. degree from Rice University, M.S. degree from Michigan Technological University, and B.S. degree from Beijing University, all in computer science before joining University of Rochester in 2000. His research received young investigator awards from NSF and DOE. He co-founded the ACM SIGPLAN Workshop on Memory System Performance and Correctness (MSPC) and was a visiting researcher at Microsoft Research and a visiting associate professor at MIT. He is an external faculty fellow at IBM Center for Advanced Studies. Xiaoya Xiang graduated in 2005 from Huazhong University of Science and Technology with a B.S. degree in computer science and technology and at the same time from Wuhan University with a B.S. degree in finance. She got her M.S. degree in computer science from Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2008. She earned her Ph.D. degree in computer science at the University of Rochester in 2013. She is now a software engineer at Twitter Inc., where her main focus is the runtime performance of the Twitter services in a cloud environment. Bin Bao is a senior software engineer at Qualcomm Technologies, Inc. Prior to joining Qualcomm in 2013, Bin spent one year at Adobe Inc. as a computer scientist. He received his Ph.D. degree in computer science from University of Rochester in 2013, M.S. degree in computer science from Institute of Computing Technology, Chinese Academy of Sciences in 2007, and B.S. degree in software engineering from the University of Science and Technology of China in 2004. 
His current research interests include program analysis and compilation for graphics processors.

Hao Luo is a third-year Ph.D. student in the Department of Computer Science, University of Rochester. His research interests lie in performance modeling of multithreaded applications, locality-aware task management, and program behavior analysis.

Ying-Wei Luo received his Ph.D. degree in computer science from Peking University in 1999. He is a full professor of computer science in the School of Electronics Engineering and Computer Science (EECS) at Peking University. His research interests include operating systems, system virtualization, and cloud computing.

Xiao-Lin Wang received his Ph.D. degree in computer science from Peking University in 2001. He is now an associate professor of computer science in the School of EECS at Peking University. His research interests include operating systems, system virtualization, and cloud computing.
J. Parallel Distrib. Comput. 64 (2004) 108–134

Improving effective bandwidth through compiler enhancement of global cache reuse

Chen Ding (Department of Computer Science, University of Rochester, P.O. Box 270226, Rochester, NY 14627, USA) and Ken Kennedy (Center for High Performance Software Research (HiPerSoft), Rice University, Houston, TX, USA)

Received 20 November 2002; revised 11 September 2003

Abstract

The performance of modern machines is increasingly limited by insufficient memory bandwidth. One way to alleviate this bandwidth limitation for a given program is to minimize the aggregate data volume the program transfers from memory. In this article we present compiler strategies for accomplishing this minimization. Following a discussion of the underlying causes of bandwidth limitations, we present a two-step strategy to exploit global cache reuse: the temporal reuse across the whole program and the spatial reuse across the entire data set used in that program. In the first step, we fuse computation on the same data using a technique called reuse-based loop fusion to integrate loops with different control structures. We prove that optimal fusion for bandwidth is NP-hard and we explore the limitations of computation fusion using perfect program information. In the second step, we group data used by the same computation through the technique of affinity-based data regrouping, which intermixes the storage assignments of program data elements at different granularities. We show that the method is compile-time optimal and can be used on array and structure data. We prove that two extensions, partial and dynamic data regrouping, are NP-hard problems. Finally, we describe our compiler implementation and experiments demonstrating that the new global strategy, on average, reduces memory traffic by over 40% and improves execution speed by over 60% on two high-end workstations. © 2003 Elsevier Inc. All rights reserved.
Keywords: Reference affinity; Data locality; Program analysis; Loop fusion; Data transformation; Global cache reuse

1. Introduction

Over the past two decades, the computing power of single-chip microprocessors has increased by a factor of over 6400, in sharp contrast with the much slower rate of improvement for off-chip memory bandwidth, which has increased by a factor of no more than 150 over the same period.1 To bridge the growing gap between CPU and memory, modern systems employ a hierarchy of cache memory. For the purposes of this paper, we define effective bandwidth as the bandwidth at which a memory hierarchy can service a program's demand for data. The goal of the work reported here is to improve effective bandwidth by reducing the total volume of memory access in a program. In the remainder of this section, we discuss the memory bandwidth problem and present the basic ideas underlying our approach.

1.1. The problem of limited memory bandwidth

We measure the balance between the data demand of a program and the memory supply of a machine as defined by Callahan et al. [13]. The demand of a program, or program balance, is the number of bytes of memory data the program needs to support a single CPU operation on average. The supply of a machine, or machine balance, is the ratio of the maximal number of bytes the machine can transfer on each cycle to the maximal number of operations the machine can perform on each cycle. If the machine balance is significantly lower than the program balance, the CPU will be

* Parts of this work have been published in the 2001 and 2000 International Parallel and Distributed Processing Symposium (IPDPS'01 and IPDPS'00) and the 1999 International Workshop on Languages and Compilers for Parallel Computing. Corresponding author. E-mail addresses: cding@cs.rochester.edu (Chen Ding), ken@rice.edu (Ken Kennedy).
1 We derived this estimate based on historical data about CPU speed, pin count, and pin-bandwidth increases compiled by Burger et al. [11].

0743-7315/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2003.09.005
partially idle because the memory system cannot deliver data fast enough to keep it busy on the given program. As a preliminary to our compiler work, we conducted a performance study in which we measured the machine balance of a 195 MHz MIPS R10K processor on the SGI Origin2000 and the program balance of six scientific programs, which include four kernels (convolution, matrix multiply, FFT, and dmxpy, the matrix-vector multiply from Linpack) and two full applications: SP, a fluid dynamics simulation program from NAS, and Sweep3D, a particle transport simulator from DoE. The study used machine parameters, micro-benchmarks, and hardware event counters to measure program and machine balance. All programs were compiled with full optimization from the SGI MIPSpro compiler (except for matrix multiply, for which we used a lower optimization level, -O2, because its performance was not memory bound after loop blocking). This study was presented in detail in an earlier paper [20], but we review it briefly here.

Table 1 shows the ratio of program balance to machine balance between four levels of the memory hierarchy: registers, level-one cache, level-two cache, and main memory. A ratio of x means that the program needs x times the bandwidth at this memory level to achieve full CPU utilization. Each of these numbers, save one, is significantly greater than one, demonstrating that the bandwidth is insufficient at every level. The numbers in the last column are several times larger than the numbers in the other two columns, establishing that bandwidth from main memory is the most limited. To run at the full CPU speed, these programs would require 3.4 to 10.5 times more memory bandwidth than the SGI delivers. This insufficient bandwidth severely limits program performance. Even in the best case (lowest ratio), the programs cannot on average exceed 33% of the peak CPU performance.
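The demand-to-supply ratios in Table 1 follow directly from the two definitions above. The following sketch (not from the paper; all numbers are illustrative) computes the ratio of program balance to machine balance:

```python
# Illustrative sketch of the balance computation defined in Section 1.1.
# Program balance: bytes of memory data needed per CPU operation.
# Machine balance: bytes transferable per cycle divided by operations per cycle.

def balance_ratio(program_bytes_per_op: float,
                  machine_bytes_per_cycle: float,
                  machine_ops_per_cycle: float) -> float:
    """Ratio > 1 means the program demands more bandwidth than the machine supplies."""
    machine_balance = machine_bytes_per_cycle / machine_ops_per_cycle
    return program_bytes_per_op / machine_balance

# Hypothetical example: a kernel needing 3.2 bytes/op on a machine that moves
# 1.6 bytes/cycle while retiring 2 ops/cycle (machine balance = 0.8 bytes/op).
print(balance_ratio(3.2, 1.6, 2.0))  # 4.0: the kernel needs 4x the supplied bandwidth
```

A ratio of 4.0 would mean the kernel can use at most a quarter of the peak CPU rate, which is how the 33% best-case figure above is obtained from the lowest ratio in Table 1.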
The problem is worse in large applications: the average CPU utilization can be no more than 16% for SP and 10% for Sweep3D. The bandwidth constraint, as described here, is different from the latency constraint. We define memory latency as the time needed for a datum to travel from main memory to the CPU without any resource contention. Memory latency can be tolerated by fetching data early. However, prefetching cannot alleviate the bandwidth problem because it does not reduce the aggregate volume of data transfer from memory. In fact, it often aggravates the bandwidth problem by generating unnecessary prefetches. Bandwidth is a fundamental constraint on program performance. For example, if a program needs 10 GB of memory transfer on a machine with 1 GB/s memory bandwidth, the execution would take at least 10 s, even when the machine has zero memory latency, infinite CPU speed, and arbitrarily early and accurate data prefetching.

Previous compiler techniques have fallen far short of solving the bandwidth problem. The SGI Origin had one of the highest memory bandwidths for its time: 300 MB/s on a 195 MHz processor. The problem is worse on newer systems because CPU speed is increasing much faster than memory bandwidth is improving. The programs we used in this study were compiled with an excellent commercial compiler that implemented extensive loop and scalar optimizations. Data prefetching was performed by both the hardware and the compiler. Yet, most of these programs could not utilize more than a fraction of the CPU capacity. Thus, the bandwidth constraint has become a principal factor limiting the performance of modern processors.

1.2. A two-step solution strategy

We present a software solution strategy that attempts to minimize the total memory demand of a program. We describe the basic idea of this strategy through an example. Fig. 1(a) shows a sequence of seven accesses to three data elements.
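Before turning to the Fig. 1 example, note that the 10-second figure argued above is just total traffic divided by bandwidth; a two-line sketch (ours, not the paper's) makes the arithmetic explicit:

```python
# Sketch of the bandwidth lower bound: even with zero latency, infinite CPU
# speed, and perfect prefetching, execution time is at least the total data
# transferred divided by the memory bandwidth.

def bandwidth_lower_bound(total_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Minimum execution time in seconds imposed by memory bandwidth alone."""
    return total_bytes / bandwidth_bytes_per_sec

GB = 1e9
# The example above: 10 GB of memory transfer on a 1 GB/s machine.
print(bandwidth_lower_bound(10 * GB, 1 * GB))  # 10.0 seconds
```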
Assume that the CPU can either access memory directly for data or access cache for a copy. Given a single-element cache with the commonly

[Fig. 1. An example use of our two-step strategy: (a) a sequence of data accesses (a d c a a c d); (b) fuse computation on the same data (a a a d d c c); (c) group data used by the same computation (memory layout .. a c d ..).]

Table 1
Ratios of program balance (bandwidth demand) to machine balance (bandwidth supply) on SGI Origin2000

Applications     Bandwidth demand vs. supply
                 L1-Reg    L2-L1    Mem-L2
convolution       1.6       1.3      6.5
dmxpy             2.1       2.1     10.5
mmjki (-O2)       6.0       2.1      7.4
FFT               2.1       0.8      3.4
SP                2.7       1.6      6.1
Sweep3D           3.8       2.3      9.8
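The effect of the first step on the Fig. 1 traces can be checked with a short simulation. The sketch below (ours, not the paper's) counts memory transfers under a one-element cache for the original access order (a) and the fused order (b):

```python
# Counting memory transfers for the Fig. 1 traces under a tiny LRU cache.
# With a one-element cache, every access to an element other than the one
# currently cached must go to memory.

def memory_transfers(trace, cache_size=1):
    """Count cache misses (memory transfers) under LRU replacement."""
    cache = []                 # most recently used element at the end
    misses = 0
    for x in trace:
        if x in cache:
            cache.remove(x)    # hit: refresh recency
        else:
            misses += 1        # miss: fetch from memory
            if len(cache) == cache_size:
                cache.pop(0)   # evict the least recently used element
        cache.append(x)
    return misses

original = list("adcaacd")     # Fig. 1(a): seven accesses to three elements
fused    = list("aaaddcc")     # Fig. 1(b): computation on the same data fused

print(memory_transfers(original))  # 6 transfers: only one access reuses the cache
print(memory_transfers(fused))     # 3 transfers: one per distinct element
```

Fusing computation on the same data halves the memory traffic in this toy case, which is the intuition behind the reuse-based loop fusion step; the second step, data regrouping, would then arrange a, c, and d contiguously in memory as in Fig. 1(c).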