Chen Ding et al.: Performance Metrics and Models for Shared Cache 705

misses. For performance analysis, the next question is the combined effect on the execution time, and the ultimate question is the limit to which the performance can be improved.

3.7.1 From Cache Misses to CPU Cycles

The effect of cache on execution time is traditionally given by two metrics, AMP (average miss penalty) and AMAT (average memory access time), which give the average number of cycles for, respectively, a cache miss and a memory access[98].

On modern processors, the timing effect is increasingly complex. A recent analysis was conducted by Sun and Wang[99], who explained that AMAT is affected by the processor techniques for improving instruction-level parallelism, including pipelining, multiple functional units, out-of-order execution, branch prediction, and speculation, and by the techniques for improving memory performance, including pipelined, multi-port, multi-bank cache, non-blocking cache, and data prefetching. The increasing complexity motivates the development of new metrics such as APC (access per cycle), studied in their paper[99].

Much of the timing delay is caused by events outside the CPU and the cache, in particular, the memory controller, the memory bus, and the DRAM modules. Zhao et al.[100] developed a model of pressure that includes both cache and memory bandwidth sharing, using regression analysis to identify a piece-wise linear correlation between the memory latency and the memory bandwidth utilization. The model is not peer specific. The same utilization may be caused by one program or a group of programs. Wang et al.[101] gave an event model called DraMon to capture the probability of DRAM hits, misses, and conflicts and the effect of contention and concurrency at the level of a DRAM bank. The event model was shown to be more accurate than linear and logarithmic regression[102].
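The two classic metrics from the start of this subsection can be illustrated with a small sketch. The latencies and the miss ratio below are made-up values for demonstration, not measurements from any of the cited studies:

```python
def amat(hit_time, miss_ratio, amp):
    """Average memory access time (cycles): the hit time plus the
    miss ratio weighted by the average miss penalty (AMP)."""
    return hit_time + miss_ratio * amp

def apc(accesses, cycles):
    """Access per cycle (APC): a throughput-style metric, in contrast
    to the latency-style AMAT."""
    return accesses / cycles

# Hypothetical cache: 4-cycle hits, 25% miss ratio, 200-cycle miss penalty.
print(amat(4, 0.25, 200))   # -> 54.0 cycles per access on average
print(apc(1000, 4000))      # -> 0.25 accesses per cycle
```

As the text notes, AMAT alone is an increasingly coarse summary on modern processors, which is why throughput-style metrics such as APC were proposed.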
It is important to manage contention and sharing at the memory layer, as shown by two recent techniques, bus-cycle modulation for execution throttling[103] and memory partitioning to reduce bank-level interference[104]. Next we turn the attention back to cache and review the techniques for reducing the cache interference.

3.7.2 Characterization of Interference

Xie and Loh[97] gave an animalistic classification of program interference. Based on the behavior in shared cache, a program belongs to one of four animal classes. A turtle makes little use of shared cache. A rabbit and a sheep both have a low miss rate. A rabbit is sensitive and tends to be affected by co-run peers, but a sheep is not. Both have small impacts on others. The last class is a devil, which has a high miss rate and impairs the performance of others but is not affected by others.

Other classifications include coloring of miss intensity, dual metrics of cache partitioning, and utility of cache space to performance. These are reviewed by Xie and Loh[97].

Jiang et al.[52] classified programs along two locality dimensions. The sensitivity is computed from the classic composition model (Subsection 3.6.4). It shows how a program is affected by others. The competitiveness is distinct data blocks per cycle (DPC), which is equivalent to the average footprint. If we divide each locality dimension into two halves, we have four classes, which we may call locality classes. Locality classes are not the same as animal classes. For example, a program can be extremely competitive, i.e., devilish, but may also be either sensitive or insensitive. This phenomenon was observed by Zhuravlev et al.[94], who showed that "devils were some of the most sensitive applications".

3.7.3 Optimal Co-Scheduling

Given a set of programs, co-scheduling divides them into co-run groups, where each group is run together.
The goal is to minimize the interference within these groups, so as to maximize resource utilization and co-run throughput. The interference depends mostly on the memory hierarchy, and the effect is non-linear and asymmetric.

While a locality model may predict the cache interference, the impact on performance depends on many other factors including the CPU speed, the effect of prefetching, the available memory bandwidth, and, if a program is I/O intensive, the speed of the disk or the network. Direct testing can most accurately measure the performance interference. Complete testing, however, has an exponential cost, since the number of subsets of an n-element set is 2^n. Note that solo executions are needed to compute the slowdown in group executions.

For pairwise co-runs, the interference can be represented by a complete graph where nodes are programs and edges have weights equal to pair-run slowdowns. Jiang et al.[105]⑤ showed that for pairs the optimization is min-weight perfect matching, while the general problem, with groups larger than two, is NP-hard. They gave an approximation algorithm that produces near-optimal schedules.

The throughput is often not the only goal. Other desirable properties include fairness, i.e., no program is

⑤ First published by Jiang et al.[106].
706 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4

penalized disproportionally due to unfair sharing, and quality of service (QoS), i.e., a program must maintain a certain level of performance.

As inputs, an optimal solution requires accurate prediction of co-run degradation. Prior solutions are either locality based (see Subsection 3.6) or performance based (this subsection). It is difficult for them to produce accurate prediction without expensive testing. For co-run miss rates, the footprint gives near real-time prediction, with an accuracy similar to exhaustive testing[10].

3.7.4 Heuristics-Based Co-Scheduling

In symbiotic scheduling (SOS), Snavely and Tullsen[107] used a sampling phase to test a number of possible co-run schedules and select the best one from these samples for the next (symbiosis) phase. They showed that a small number of possible schedules (instead of exhaustive testing) is sufficient to produce good improvements. The system was designed and tested for simultaneous multi-threading. Symbiotic scheduling assumes that program co-run behavior does not vary significantly over time, so the sampling phase is representative of performance in the remaining execution. Testing does not require program instrumentation.

Fedorova et al.[54] addressed the problem of performance isolation by suspending a program execution when needed. They gave a cache-fair algorithm to ensure that a program runs at least at the speed it would have with a fair cache allocation. The technique is based on the assumption that if two programs have the same frequency of cache misses, they have the same amount of data in cache. In locality modeling, the assumption means a uniform distribution of the accesses in cache. While the assumption is not always valid, the model is efficient for use in an OS scheduler to manage cache sharing in real time. The two techniques are dynamic and do not need off-line profiling.
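The sampling idea behind symbiotic scheduling can be sketched as follows. Here `measure_throughput` is a hypothetical stand-in for actually running a candidate schedule during the sampling phase, and the pairing into co-run groups of two is an illustrative simplification:

```python
import random

def sample_best_schedule(programs, measure_throughput, samples=10, seed=0):
    """Try a few randomly drawn pairings instead of all of them
    (exhaustive testing), and keep the best for the symbiosis phase."""
    rng = random.Random(seed)
    best_schedule, best_tp = None, float("-inf")
    for _ in range(samples):
        order = programs[:]
        rng.shuffle(order)
        # group consecutive programs into co-run pairs
        schedule = [tuple(order[i:i + 2]) for i in range(0, len(order), 2)]
        tp = measure_throughput(schedule)
        if tp > best_tp:
            best_schedule, best_tp = schedule, tp
    return best_schedule, best_tp
```

For eight programs there are already 105 distinct pairings; the observation behind SOS is that a handful of sampled schedules is often enough to find a good one.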
However, on-line analysis may not be accurate and cannot predict interference in other program combinations. Furthermore, non-symbiotic pairing (during sampling) and throttling (for fairness) do not maximize the throughput.

Blagodurov et al.[95]⑥ developed the Pain classification. The degree of pain that application A suffers while it co-runs with B is affected by A's cache sensitivity, which is computed using the reuse distance profile (PRD), and B's cache intensity, which is measured by the number of last-level cache accesses per million instructions. The Pain model is similar to the classic composition model described in Subsection 3.6.4 except that Pain uses the miss frequency rather than the footprint. The choice is partly for efficiency. Other online techniques also use the last-level cache miss rate as cache use intensity[108-109].

Pain is an offline solution. The idea is extended into an online solution called Distributed Intensity (DI), which uses only the miss rate. An application is classified as intensive if its miss rate is above the average and non-intensive otherwise. The scheduler then tries to group high-resource-intensity program(s) with low-resource-intensity program(s) on a multicore to mitigate the conflicts on shared resources[3,18,93-95,110-111].

Cache misses represent only a (small) subset of program accesses. In comparison, the footprint includes the effect of all cache accesses. Furthermore, the miss frequency depends on co-run peers and has the effect of circular feedback, since the peers are affected by the program itself. The result of counter-based modeling is specific to one grouping situation and may not be usable in other groupings. In comparison, footprint analysis collects "clean-room" statistics, unaffected by co-run peers, program instrumentation, or the analyzer code, and usable for predicting interference with any peers (which may be unknown at the time of footprint analysis).
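The DI heuristic described above admits a short sketch. The miss rates are invented for illustration, and the pairing rule (most intensive with least intensive) is one simple way to realize the grouping goal, not necessarily the exact policy of the cited schedulers:

```python
def classify_di(miss_rates):
    """Distributed Intensity: an application is intensive if its miss
    rate is above the average over all applications."""
    avg = sum(miss_rates.values()) / len(miss_rates)
    return {p: ("intensive" if r > avg else "non-intensive")
            for p, r in miss_rates.items()}

def pair_di(miss_rates):
    """Pair the most intensive program with the least intensive one,
    the second most with the second least, and so on."""
    ranked = sorted(miss_rates, key=miss_rates.get, reverse=True)
    return [(ranked[i], ranked[-1 - i]) for i in range(len(ranked) // 2)]

# Hypothetical last-level miss rates for four programs.
rates = {"mcf": 0.12, "milc": 0.10, "namd": 0.010, "povray": 0.005}
print(classify_di(rates))  # mcf and milc intensive; namd and povray not
print(pair_di(rates))      # -> [('mcf', 'povray'), ('milc', 'namd')]
```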
With the new theory in this paper, the footprint can be obtained with near real-time efficiency.

In an offline solution, Jiang et al.[105] defined the concept of politeness for a program as "the reciprocal of the sum of the degradations of all co-run groups that include the job." The politeness is measured by the effect on the execution time, not just the miss ratio, and is used to approximate optimal job scheduling.

In an online solution, the high cost of co-run testing is addressed in a strategy called Bubble-Up[112]. The strategy has two steps. First, a program is co-run against an expanding bubble to produce a sensitivity curve. The bubble is a specially designed probe program. In the second step, the pressure of the program is reported by another probe in a probing run. Bubble-Up is a composable strategy, since each program is tested individually without testing all program combinations. Bubble-Up extracts the factors that determine the program execution time. In comparison, the footprint theory has a narrower scope, which includes just the factors that determine the program behavior in cache.

Two recent solutions use machine learning. Delimitrou and Kozyrakis[113] built a data center scheduler called Paragon. The design of Paragon identifies 10 sources of interference. It uses offline training (through probe programs) to build parameterized models of their performance impact. During online use, Paragon feeds the history information to a learning algorithm

⑥ Journal version of [93-94].
called collaborative filtering. Collaborative filtering supports sparse learning. Based on a small amount of past data, it can predict application-application interference and application-machine match.

Statistical techniques have had many uses in performance analysis of parallel code, including clustering, factoring, and correlation[114], linear models (with non-linear components)[115], queuing models[116], directed searches[117], and analytical models[118].

Machine learning is general and can consider different types of resources together. It is also scalable, as more factors can be added by having additional learning. Paragon's learning technique observes the co-run results but has to be given the solo-run speed to compute the co-run slowdown. The cache model complements performance models, which can include the specialized model as a component. Locality metrics such as the footprint can be used as an input to a learning algorithm. While the strength of machine learning is the breadth and the general framework, the strength of the locality theory is the depth and the focused formulation. As a benefit of the latter, we can now understand the shared cache with mathematically tractable models and derive precise co-run miss ratios.

3.7.5 Performance Scaling Models

Using the PRD/CRD model[13], Wu et al.[14] conducted experiments on a wide range of symmetric multithreaded benchmarks with modest problem sizes and core counts and used their scaling framework to study the performance (average memory access time, AMAT) over cache hierarchy scaling for large problem sizes on large-scale chip multiprocessors (LCMPs). The study focuses on the effect of hardware characteristics such as core counts, cache sizes, and cache organizations on different programs and program inputs, but not on hardware-independent program characterization.
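The kind of question such scaling studies answer can be illustrated with a toy calculation: given a program's miss ratio at several cache sizes (a made-up miss ratio curve, not data from the cited study), AMAT can be evaluated across the scaled hierarchy:

```python
def amat_curve(miss_ratio_at_size, hit_time, miss_penalty):
    """Evaluate AMAT = hit_time + miss_ratio * miss_penalty at each
    point of a miss ratio curve (cache size in MB -> miss ratio)."""
    return {size: hit_time + mr * miss_penalty
            for size, mr in miss_ratio_at_size.items()}

# Hypothetical miss ratio curve of one program as the shared cache scales.
mrc = {1: 0.25, 2: 0.125, 4: 0.0625, 8: 0.03125}
print(amat_curve(mrc, hit_time=4, miss_penalty=200))
# -> {1: 54.0, 2: 29.0, 4: 16.5, 8: 10.25}
```

A scaling framework in the style of the cited work varies the hardware parameters (core counts, cache sizes, organization) while a hardware-independent characterization, such as the footprint, fixes the program side of the equation.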
3.8 Related Techniques

3.8.1 Input-Centric Analysis

The early work in profiling examines multiple executions to identify what behavior is common. For example, Wall compared the hot variables and functions found in different executions of the same program[119]. Recent work has gone one step further to identify the patterns of change and predict how the behavior will differ from run to run. Shen called it input-centric analysis[120].

Input-centric analysis covers the intermediate ground between dynamic analysis, which is for a single execution, and static analysis, which is for all executions. For problems such as reuse distance and footprint, dynamic analysis is too specific, because the result is limited to what happens in one execution. Static analysis is too general, since it assumes all code paths are possible. Input-centric analysis provides a way to overcome these limitations.

Imperative to input-centric analysis is a metric whose results can be compared between different executions. The access of a memory location, for example, is not comparable, because a program may allocate the same datum to different locations in different runs. Neither is the instruction making the access, since the same access may be made from different code in different runs. Reuse distance is the first metric to enable input-centric analysis, since it is not tied to specific memory allocation or control flow and can be compared between different runs.

The first group of work studied how the reuse distance changes in different runs and developed statistical models of locality prediction (called whole-program locality)[5,36,121], miss-rate prediction[28], performance prediction (not just cross-input but also cross-architecture)[25,122], critical load instruction prediction[29], and locality phases[123-125]. Zhong et al.
surveyed these and other techniques and categorized them as behavior-based (rather than code-based) analysis, analogous to observation and prediction in the physical and biological sciences[5].

More recent work combined behavior and code analysis and, in particular, showed how to predict the loop bounds in different runs. To characterize program inputs, Mao and Shen defined an extensible input characterization language (XICL)[126]. Jiang et al. defined the notion of seminal behavior, which is the smallest set of program values that collectively determine the iteration count of all loops[127]. Learning techniques such as classification trees were used to identify the seminal behavior[126,128]. Wu et al. later expanded the loop analysis to capture sequence patterns[129].

Input-centric analysis has been used to improve the feedback-driven program optimization (FDO) in the IBM XL C compiler[127] and the just-in-time (JIT) compiler in Java virtual machines[120,130-131]. Profiles from different inputs are routinely used in feedback-driven and iterative compiler optimization. The quality of optimization depends on the quality of profiles. The dependence has been examined using statistics[132-133].

3.8.2 Profiling and Performance Monitoring

The term profiling broadly refers to techniques that extract and analyze a subset of events in a program execution. Locality profiling extracts and analyzes the sequence of memory accesses. It does so by program instrumentation. At each memory reference, it inserts
708 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4

a call to pass the memory address to an analyzer. The instrumentation can be done at the source or binary level. Source-level instrumentation is performed by a compiler such as GCC, Open64, or LLVM, usually at the level of the intermediate code. Binary instrumentation is done by a binary rewriting tool. Both can be done statically, i.e., without running the program. Binary rewriting can also be done dynamically, while a program is running. The main problem of profiling is the cost of the instrumentation. A compiler can optimize the instrumented code statically. Another advantage is that the instrumentation tool is portable if the compiler is portable. In comparison, binary rewriting is architecture specific. For example, ATOM instruments only Alpha binaries[134], and Pin only x86 binaries[135]. On the other hand, Pin can instrument dynamically loaded libraries, which a static tool cannot do. Profiling does not model the timing effect, for which we need to either monitor an execution on actual hardware or reproduce it in a simulator. Performance monitoring for parallel code has a long history[136-138]. Modern processors provide hardware counters to monitor hardware events with little or no run-time cost. The events related to memory performance include the frequency of cache misses, cache coherence misses, and various cycle counts, including stalled cycles. When many events are monitored in a large system over a long execution, the large volume of results presents two problems. The first is the time and space cost of collecting and storing these results. The second is analysis: how to identify high-level information from low-level measurements. These problems are addressed by monitoring and visualization tools, including commercial ones such as Intel VTune Amplifier, AMD CodeAnalyst, and CrayPat, and open-source projects such as the PAPI library[139], HPCToolkit[140], TAU[141], and Open|SpeedShop[142].
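As a minimal sketch of the instrumentation-driven analysis described above (class and function names are ours, not from any particular tool): an instrumentation framework such as Pin arranges for an analysis routine to be called at every memory reference, and the analyzer can then build a reuse distance histogram, from which the miss ratio of a fully associative LRU cache of any size follows directly. The naive per-access scan below is only for illustration; production analyzers use tree-based or approximate algorithms.

```python
class ReuseDistanceAnalyzer:
    """Receives one callback per memory reference, as an instrumentation
    tool would deliver, and builds a reuse distance histogram."""
    def __init__(self):
        self.last_access = {}   # address -> logical time of its last access
        self.trace = []         # all addresses seen so far, in order
        self.histogram = {}     # reuse distance -> count
        self.time = 0

    def record_access(self, addr):
        if addr in self.last_access:
            # distinct addresses accessed since the previous use of addr
            window = self.trace[self.last_access[addr] + 1:]
            dist = len(set(window))
        else:
            dist = float('inf')  # cold (first) access
        self.histogram[dist] = self.histogram.get(dist, 0) + 1
        self.last_access[addr] = self.time
        self.trace.append(addr)
        self.time += 1

def miss_ratio(histogram, c):
    """A fully associative LRU cache of c blocks misses exactly when the
    reuse distance is at least c (cold accesses count as misses)."""
    total = sum(histogram.values())
    misses = sum(n for d, n in histogram.items() if d >= c)
    return misses / total
```

For example, the trace a, b, c, a, b, c yields three cold accesses and three reuses at distance 2, so a 3-block cache misses half the time while a 2-block cache misses on every access.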
The aggregation of information is usually code centric, which shows performance by program functions and instructions. Vertical profiling identifies performance problems across a software stack[143]. Continuous program optimization (CPO) not only finds performance problems but also optimizes performance automatically[58,60,144-145]. In recent work, data-centric aggregation is used to pinpoint locality problems more effectively, not just for cache misses but also for non-uniform memory access (NUMA) latency[146-148].

Bursty Sampling and Shadow Profiling. Arnold and Ryder pioneered a general framework to sample Java code, profiling the first few invocations of a function or the beginning iterations of a loop[56]. It has been adopted for hot-stream prefetching in C/C++ in bursty sampling[20] and extended to sample both static and dynamic bursts for calling context profiling[149]. Shadow profiling pauses a program at preset intervals and forks a separate process to profile in parallel with the base program[64-65]. Reuse distance analysis is not a good target for these techniques because of the uncertain length of the reuse windows. However, the footprint can be easily sampled using shadow profiling, and the reuse distance can then be computed using the conversion theory.

4 Conclusions

In this paper we have described the recent footprint theory of locality, including the definition and formal properties, especially the footprint composition and the conversion between window-based statistics, i.e., the footprint, and reuse-based statistics, e.g., the miss ratio. We have surveyed a large literature, more than 140 publications over the past four decades, focusing on the working set theory, which lays the foundation of this research field, and on recent performance models, which address the complex challenges posed by the modern multicore memory hierarchy.
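The window-to-reuse conversion mentioned above can be illustrated with a small sketch (function names are ours). It uses the relation from the footprint theory that the miss ratio at cache size c is approximately the growth rate of the average footprint fp(w) at the window length w where fp(w) fills the cache. The direct enumeration of all windows here is quadratic and for illustration only; the linear-time algorithm of [9] avoids this cost.

```python
def avg_footprint(trace, w):
    """Average number of distinct addresses over all length-w windows
    (direct enumeration; illustration only)."""
    n = len(trace)
    windows = (trace[i:i + w] for i in range(n - w + 1))
    return sum(len(set(win)) for win in windows) / (n - w + 1)

def miss_ratio_from_footprint(trace, c):
    """Conversion sketch: find the window length w at which the average
    footprint reaches the cache size c, and estimate the miss ratio by
    the footprint growth rate fp(w+1) - fp(w) at that point."""
    n = len(trace)
    for w in range(1, n):
        if avg_footprint(trace, w) >= c:
            return avg_footprint(trace, w + 1) - avg_footprint(trace, w)
    return 0.0  # the footprint never fills the cache: no capacity misses
```

For the alternating trace a, b, a, b, ..., the footprint grows from 1 to 2 between window lengths 1 and 2 and then stays flat, so the estimate is a miss ratio of 1 for a 1-block cache and 0 for a 2-block cache, matching LRU behavior after warmup.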
Through the review, we have appraised their strengths and weaknesses and pointed out their relation to the new footprint theory. Niklaus Wirth titled his 1976 book “Algorithms + Data Structures = Programs” to emphasize the core subjects and their relations. We would modernize the figurative equation for use on today’s machines and say “(Algorithms + Data Structures) × Locality = Efficient Programs”. In theory, locality is fundamental to understanding the nature of computation. In practice, memory optimization is necessary in the design and use of every computing system. Locality research has made tremendous progress and had an immense impact. This review focuses on the growing body of research that uncovers the essential aspects of program behavior in shared cache and, as a result, enhances our ability to understand and manage program interaction on multicore systems.

Acknowledgment We thank Peter Denning and Xipeng Shen for always patiently and promptly answering our questions about their work, the many people who worked with us at Rochester, and the reviewers and the organizers of this special issue. Given the scope and depth of the past research, it is inevitable that the presentation fails to be complete and completely precise. We apologize for any omission and misrepresentation. Any error is entirely ours. We appreciate reader feedback, which can be sent to the email address listed on the first page of the paper. A Chinese version of the first two sections has been co-authored with Yuan Liang and published in the Journal of Computer Engineering and Science[150].
Chen Ding et al.: Performance Metrics and Models for Shared Cache 709

References

[1] Zhang X, Dwarkadas S, Shen K. Towards practical page coloring-based multicore cache management. In Proc. the EuroSys Conference, April 2009, pp.89-102.
[2] Denning P J. Working sets past and present. IEEE Transactions on Software Engineering, 1980, 6(1): 64-84.
[3] Denning P J. The working set model for program behaviour. Communications of the ACM, 1968, 11(5): 323-333.
[4] Brock J, Luo H, Ding C. Locality analysis: A nonillion time window problem. In Proc. Big Data Analytics Workshop, June 2013.
[5] Zhong Y, Shen X, Ding C. Program locality analysis using reuse distance. ACM TOPLAS, 2009, 31(6): 1-39.
[6] Zhong Y, Orlovich M, Shen X, Ding C. Array regrouping and structure splitting using whole-program reference affinity. In Proc. PLDI, June 2004, pp.255-266.
[7] Ding C, Chilimbi T. All-window profiling of concurrent executions. In Proc. the 13th PPoPP (Poster Paper), Feb. 2008, pp.265-266.
[8] Xiang X, Bao B, Bai T, Ding C, Chilimbi T M. All-window profiling and composable models of cache sharing. In Proc. PPoPP, Feb. 2011, pp.91-102.
[9] Xiang X, Bao B, Ding C, Gao Y. Linear-time modeling of program working set in shared cache. In Proc. PACT, Oct. 2011, pp.350-360.
[10] Xiang X, Ding C, Luo H, Bao B. HOTL: A higher order theory of locality. In Proc. ASPLOS, March 2013, pp.343-356.
[11] Xiang X, Bao B, Ding C, Shen K. Cache conscious task regrouping on multicore processors. In Proc. the 12th CCGrid, May 2012, pp.603-611.
[12] Xiang X. A higher order theory of locality and its application in multicore cache management [Ph.D. Thesis]. Computer Science Dept., Univ. of Rochester, 2014.
[13] Wu M, Yeung D. Coherent profiles: Enabling efficient reuse distance analysis of multicore scaling for loop-based parallel programs. In Proc. PACT, Oct. 2011, pp.264-275.
[14] Wu M, Zhao M, Yeung D. Studying multicore processor scaling via reuse distance analysis. In Proc. the 40th ISCA, June 2013, pp.499-510.
[15] Thiébaut D, Stone H S. Footprints in the cache. ACM Transactions on Computer Systems, 1987, 5(4): 305-329.
[16] Suh G E, Devadas S, Rudolph L. Analytical cache models with applications to cache partitioning. In Proc. the 15th ICS, June 2001, pp.1-12.
[17] Chandra D, Guo F, Kim S, Solihin Y. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proc. the 11th HPCA, Feb. 2005, pp.340-351.
[18] Belady L A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966, 5(2): 78-101.
[19] Denning P J. Thrashing: Its causes and prevention. In Proc. AFIPS Fall Joint Computer Conference, Part 1, Dec. 1968, pp.915-922.
[20] Chilimbi T M, Hirzel M. Dynamic hot data stream prefetching for general-purpose programs. In Proc. PLDI, June 2002, pp.199-209.
[21] Mattson R L, Gecsei J, Slutz D, Traiger I L. Evaluation techniques for storage hierarchies. IBM Systems Journal, 1970, 9(2): 78-117.
[22] Jiang S, Zhang X. LIRS: An efficient low inter-reference recency set replacement to improve buffer cache performance. In Proc. SIGMETRICS, June 2002, pp.31-42.
[23] Smith A J. On the effectiveness of set associative page mapping and its applications in main memory management. In Proc. the 2nd ICSE, Oct. 1976, pp.286-292.
[24] Hill M D, Smith A J. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 1989, 38(12): 1612-1630.
[25] Marin G, Mellor-Crummey J. Cross architecture performance predictions for scientific applications using parameterized models. In Proc. SIGMETRICS, June 2004, pp.2-13.
[26] Snir M, Yu J. On the theory of spatial and temporal locality. Technical Report, DCS-R-2005-2564, Computer Science Dept., Univ. of Illinois at Urbana-Champaign, 2005.
[27] Fang C, Carr S, Önder S, Wang Z. Path-based reuse distance analysis. In Proc. the 15th CC, Mar. 2006, pp.32-46.
[28] Zhong Y, Dropsho S G, Shen X, Studer A, Ding C. Miss rate prediction across program inputs and cache configurations. IEEE Transactions on Computers, 2007, 56(3): 328-343.
[29] Fang C, Carr S, Önder S, Wang Z. Instruction based memory distance analysis and its application to optimization. In Proc. PACT, Sept. 2005, pp.27-37.
[30] Beyls K, D'Hollander E H. Discovery of locality-improving refactorings by reuse path analysis. In Proc. the 2nd Int. Conf. High Performance Computing and Communications, Sept. 2006, pp.220-229.
[31] Beyls K, D'Hollander E H. Intermediately executed code is the key to find refactorings that improve temporal data locality. In Proc. the 3rd ACM Conference on Computing Frontiers, May 2006, pp.373-382.
[32] Kelly T, Cohen I, Goldszmidt M, Keeton K. Inducing models of black-box storage arrays. Technical Report, HPL-2004-108, HP Laboratories Palo Alto, 2004.
[33] Almeida V, Bestavros A, Crovella M, de Oliveira A. Characterizing reference locality in the WWW. In Proc. the 4th International Conference on Parallel and Distributed Information Systems (PDIS), December 1996, pp.92-103.
[34] Bennett B T, Kruskal V J. LRU stack processing. IBM Journal of Research and Development, 1975, 19(4): 353-357.
[35] Olken F. Efficient methods for calculating the success function of fixed space replacement policies. Technical Report, LBL-12370, Lawrence Berkeley Laboratory, 1981.
[36] Ding C, Zhong Y. Predicting whole-program locality through reuse distance analysis. In Proc. PLDI, June 2003, pp.245-257.
[37] Zhong Y, Ding C, Kennedy K. Reuse distance analysis for scientific programs. In Proc. Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, March 2002.
[38] Schuff D L, Kulkarni M, Pai V S. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proc. the 19th PACT, Sept. 2010, pp.53-64.
[39] Kim Y H, Hill M D, Wood D A. Implementing stack simulation for highly-associative memories. In Proc. SIGMETRICS, May 1991, pp.212-213.
[40] Sugumar R A, Abraham S G. Multi-configuration simulation algorithms for the evaluation of computer architecture designs. Technical Report, University of Michigan, August 1993.
[41] Burger D, Austin T. The SimpleScalar tool set, version 2.0. Technical Report, CS-TR-97-1342, Department of Computer Science, University of Wisconsin, June 1997.
[42] Almasi G, Cascaval C, Padua D A. Calculating stack distances efficiently. In Proc. the ACM SIGPLAN Workshop on Memory System Performance, June 2002, pp.37-43.
[43] Denning P J, Schwartz S C. Properties of the working set model. Communications of the ACM, 1972, 15(3): 191-198.
[44] Berg E, Hagersten E. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In Proc. ISPASS, March 2004, pp.20-27.
[45] Berg E, Hagersten E. Fast data-locality profiling of native execution. In Proc. SIGMETRICS, June 2005, pp.169-180.
[46] Eklov D, Hagersten E. StatStack: Efficient modeling of LRU caches. In Proc. ISPASS, March 2010, pp.55-65.
[47] Eklov D, Black-Schaffer D, Hagersten E. Fast modeling of shared caches in multicore systems. In Proc. the 6th