Chen Ding et al.: Performance Metrics and Models for Shared Cache 695

affected by circular feedback. As a result, we cannot directly add the solo-run miss ratios to compute the co-run miss ratio, as we can with the footprint.

Another locality metric is reuse distance. Reuse distance does not depend on cache parameters, but as we will explain in Subsection 3.2, it is not composable either.

As mentioned earlier, we can measure the average as well as the distribution of footprints. The average footprint is immediately composable. The distribution, although composable, requires a convolution, which is expensive to compute and difficult to visualize. In the following, the term “footprint” means the average footprint.

The next question is whether the footprint composability can help in analyzing the miss ratio and other locality metrics in shared cache. This is solved in the third part of the new theory.

Locality Metrics Conversion. Locality has different measurements, just as temperature can be measured in different scales, Celsius or Fahrenheit. For locality, the two most common metrics are the miss ratio for hardware design and the reuse distance for program optimization.

Central to a locality theory is the conversion between different metrics. The footprint theory shows that the footprint is convertible with a number of other metrics. Let mr(c) be the miss ratio for cache size c. It can be computed from the footprint using the following formula[10]:

mr(c) = (fp(x + Δx) − fp(x)) / Δx,

where c = fp(x). If these are continuous functions, we would say that the miss ratio is the derivative of the footprint.

The higher-order mathematics implies mathematical properties. Since the derived metric, the miss ratio, is non-increasing, the source metric, the footprint, must be not just non-decreasing but also concave. Indeed, the monotonicity and concavity were proved in two successive papers[9-10].

The conversion is reversible.
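As a concrete sketch of this conversion (the brute-force footprint routine, function names, and example trace are ours, for illustration only; the real analysis computes all window sizes in a single pass), the miss ratio can be read off as a finite difference of the average footprint:

```python
def avg_footprint(trace, w):
    """fp(w): average number of distinct data elements over all
    length-w windows of the trace (brute force for illustration)."""
    n = len(trace)
    if w >= n:
        return float(len(set(trace)))
    spans = [len(set(trace[i:i + w])) for i in range(n - w + 1)]
    return sum(spans) / len(spans)

def miss_ratio(trace, x, dx=1):
    """mr(c) = (fp(x + dx) - fp(x)) / dx, for cache size c = fp(x)."""
    return (avg_footprint(trace, x + dx) - avg_footprint(trace, x)) / dx

trace = list("abcabcabc")
c = avg_footprint(trace, 3)          # cache size c = fp(3) = 3 blocks
print(c, miss_ratio(trace, 3))       # → 3.0 0.0
```

A size-3 cache holds the cyclic working set, so the derived miss ratio is zero there; at c = fp(2) = 2 the same computation yields a miss ratio of 1, matching LRU behavior on this trace. The monotonicity and concavity of fp can likewise be checked numerically on any trace.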
If we have the miss ratios of all cache sizes, we can reverse the formula and compute the average footprint. The reverse process is the analog of integration for a discrete function.

Combining footprint composition and metrics conversion, we can see immediately that the co-run miss ratio (the miss ratio seen by the shared cache) can be computed from the aggregate footprint. Fig.3 shows the derivation by adding the individual footprints and then converting the sum into the co-run miss ratio.

Fig.3. Joint use of two theoretical properties: composition (dotted line) and conversion (solid lines).

Since the conversion formula is reversible, we can switch between the footprint and the miss ratio and compose the latter indirectly through the former. First, we compute the individual footprint from the individual miss ratios (of all cache sizes). Then we add the individual footprints and finally compute the co-run miss ratio in the shared cache (of all sizes). Fig.3 shows this type of deduction and others that are made possible by composition and conversion. In particular, the figure shows how to compose another locality metric, the reuse distance. We use the terms private reuse distance (PRD) and concurrent reuse distance (CRD), as introduced in [13-14].

The solution of composition raises the problem of decomposition. The co-run miss ratio does not tell us the contribution from each program. To see the individual effects, we need more elaborate models.

Composable Locality Models. We say a model is composable if the co-run result can be computed from solo-run results, not just for the co-run group as a whole, but also for the co-run effect on each individual program. In other words, a composable model must be both composable and decomposable.

As a composable metric, the footprint has the following useful traits:

• Machine Independent. The analysis is based on data accesses, not cache misses.
It takes a single pass to analyze a trace for all cache sizes, and the result is not affected by program instrumentation. In comparison, direct measurement is inevitably affected by instrumentation.

• Clean-Room Statistics. The footprint of one program can be measured in a co-run environment, unperturbed by other programs. The clean-room effect solves the chicken-or-egg problem of direct measurement: the behavior of one program depends on its peer, but the peer behavior in turn depends on itself.

• Peer Independent. The footprint of a program is independent of co-run peers. The analysis of cache sharing does not require actual cache sharing. The co-run effect is computed rather than measured.

• Statically Composable. There are 2^P co-run combinations for P programs. The footprint model can predict the interference in these 2^P runs by testing P single-program runs. For the P sequential runs, we can
696 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4

choose to run them one by one or some of them in parallel to increase speed. The composition is static if there is no actual co-run; otherwise we say the composition is dynamic. Here dynamic composition means parallel testing, while static composition does not need parallel testing at all.

To compute the co-run effect on each individual program, this dissertation describes three models. The models solve the decomposition problem as a composition problem: how one program is affected by its peers.

• Composition by Reuse Distance and Footprint. Variations of this model were invented by Thiebaut and Stone[15] and Suh et al.[16] for time-sharing systems (time-switched cache sharing) and Chandra et al.[17] for multicore (continuous cache sharing). These studies estimated the footprint since there was no feasible way to measure it. After the invention of the fast measurement, the cost of the model became limited by the time required for reuse distance measurement[9-10].

• Composition by Footprint Only. The second model converts the footprint into reuse distance, so it no longer needs to measure the reuse distance and can be hundreds of times faster[10].

• Composition by Program Pressure and Sensitivity. The last model is as fast as the second model but more intuitive and easier to use. It characterizes the behavior of a program by two factors, pressure and sensitivity. The two can be visualized as two curves. Performance composition is as simple as looking up related values on the two curves[12].

The composable models provide answers to a number of long-standing questions about shared cache, including a machine-independent way to compare programs by their shared cache behavior, the correlation between a program's cache interference and its miss ratio, and the performance of cache sharing compared with cache partitioning[12].

These models are theoretical, and they are appealing partly due to their generality.
The footprint is defined on a program trace without knowing co-run peers or machine parameters (other than having shared cache). There are many sources of error due to the fact that the basic models do not consider the effect of cache associativity, program phase behavior, the time dilation due to interference, the filtering effect in a multi-level cache hierarchy, and the impact of the prefetcher. A theory must be validated to be practically relevant. Past studies have used experiments on real systems to evaluate the theoretical models and compare their predictions with actual miss counts measured by hardware counters[8-10,12]. They also showed extensions of the models to consider time dilation[8,12], cache associativity, and program phases[10].

3 Locality Theory from 1968

Locality started as an observation that programs do not use all their data at all times[2]. After decades of research, it has been developed into an important scientific field. At its foundation are locality metrics, so that the concept and its effect can be measured. Among the basic problems are the measurement speed and accuracy of these metrics.

3.1 Miss Ratio and Execution Time

The metric of miss ratio was first used by Belady[18] to find out how often individual policies caused page faults. It was challenging at that time to measure page traces and simulate the various policies on them. Today, the hardware performance counters on modern machines enable a tool to measure program speed and count cache misses in real time with little cost. The performance of a single program or a group of programs can be observed directly. However, direct observation has difficulties in characterizing the locality cleanly due to dependences on the observation environment. These dependences include:

• Machine Dependence. Different machines have different memory hierarchies and processors, so we cannot compare the locality in different programs entirely based on their performance.
• Instrumentation Dependence. The analysis code itself consumes processor and cache resources. It may not be possible to completely separate the effect of the instrumentation.

• Peer Dependence. It is unknown how the performance has changed due to cache sharing. It would have required another test on an unloaded system. It is also unknown how the performance will change if the peer programs change.

The effect of cache on performance is often disruptive. This phenomenon was first discussed by Denning[19] and described as thrashing, which happens when the sum of the working sets exceeds the available memory. Chilimbi once compared the phenomenon to strolling leisurely until suddenly falling over a cliff②. The danger is greater in a shared environment. As more programs are added, the combined working set grows. When it exceeds the size of the shared cache, sharp performance drops ensue. Being peer and machine dependent, direct testing cannot foresee a pending calamity. What is worse, it cannot even tell whether a

②Trishul Chilimbi made this analogy in a presentation in 2002[20].
Chen Ding et al.: Performance Metrics and Models for Shared Cache 697

given parallel mix is efficient or not without testing them individually first.

3.2 Reuse Distance

The most common metric in program characterization is the reuse distance. For each memory access in a trace, the reuse distance is the number of distinct data elements accessed between this and the previous access to the same data. Mattson et al. first defined the concept (to model the performance of an LRU stack) and called it the LRU stack distance[21]. LRU is a cache management method that favors recently used data. Recognizing it as a measure of recency, Jiang and Zhang[22] called the metric the inter-reference recency (IRR).

For example, the reuse distance shows the locality of the stack access and streaming access traces in Fig.4. When a block is first accessed, the reuse distance is infinite. When the block is reused, the reuse distance is the number of distinct blocks accessed between the previous access and the reuse. The reuse would miss in (fully associative LRU) cache if and only if its reuse distance is greater than the cache size. The figure shows that the stack trace can reuse the data in cache when the cache size is no less than 3 but the streaming trace cannot.

Fig.4. The locality of two traces, stack accesses on the left and streaming accesses on the right, measured by the reuse distance of each memory reference. An access is a miss in fully associative LRU cache if and only if its reuse distance is greater than the cache size.

The reuse distance quantifies the locality of every memory access. The locality of a program, or a loop or a function inside it, is the collection of all its reuse distances. The collective result can be represented as a distribution. It is called a locality signature[5] and a locality profile[14].
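A brute-force sketch of these definitions (the function names and example trace are ours; first accesses get infinite distance, and the miss rule follows the text: a reuse misses iff its distance exceeds the cache size):

```python
def reuse_distances(trace):
    """For each access, the number of distinct data elements accessed
    between it and the previous access to the same datum (inf if none)."""
    last = {}                            # datum -> position of previous access
    dists = []
    for i, x in enumerate(trace):
        if x in last:
            dists.append(len(set(trace[last[x] + 1:i])))
        else:
            dists.append(float("inf"))   # first access: infinite distance
        last[x] = i
    return dists

# Stack-like trace: the reuses have short distances.
stack = list("abccba")
print(reuse_distances(stack))            # → [inf, inf, inf, 0, 1, 2]

def lru_miss_ratio(trace, c):
    """Fraction of accesses that miss in fully associative LRU cache of
    size c, using the rule in the text: miss iff reuse distance > c."""
    d = reuse_distances(trace)
    return sum(rd > c for rd in d) / len(d)
```

On the trace above, a size-2 cache misses only on the three cold accesses, so `lru_miss_ratio(stack, 2)` is 0.5. Note that conventions differ by one across the literature depending on whether the reused datum itself is counted in the distance.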
3.2.1 Relation with Cache Performance

In the absence of cache sharing, the capacity miss ratio can be written as the fraction of the reuse distances that exceed the cache size[21]. Let the test program be A and the cache size be C; we have

P(capacity miss by A) = P(A's reuse distance > C). (1)

The reuse distance is machine independent but can give the capacity miss ratio for cache of all sizes, as the formula shows. The locality signature can be viewed as a discrete probability density function, showing the probability of a memory access having a certain degree of locality. The miss ratio is then the probability, for a given cache size, of an access being a miss. A probabilistic adjustment invented by Smith can estimate the effect of cache conflicts in set-associative cache[23-25]. Combining the reuse distance and the Smith formula, we can compute the miss ratio in the cache of all sizes.

Miss Ratio Curve (MRC). The miss ratio curve (MRC) shows the miss ratio for all cache sizes as a discrete function. It is easy to visualize and shows directly the trade-off between performance and cache size. For fully associative LRU cache, the miss ratio curve is equivalent to the reuse distance distribution, as the preceding formula shows. In theory, it is thus equivalent whether one measures the miss ratio curve or the reuse distance. In practice, the miss ratio curve is defined for only practical cache sizes, i.e., powers of two within some range, e.g., 32 KB to 8 MB. The reuse distance has the full range between 1 and the size of program data. The full range of reuse distance represents the complete temporal locality. The miss ratio curve is a projection of the full information onto a subset of cache sizes. The two would be equivalent if the miss ratio were defined for all cache sizes between 1 and infinity.
The unbounded size of the representation is necessary, as shown by the theoretical result of Snir and Yu[26] that temporal locality cannot be fully encoded using a bounded number of bits. In the following, we review the prior work on both the reuse distance and the miss ratio curve. 3.2.2 Locality Analysis and Optimization Reuse distance has found many uses. The locality signature shows how the cache behavior changes with the program input, and the changes can be predicted by whole-program locality analysis[5,25,27], which was used to predict the miss ratio of all inputs and cache sizes[28] . Fang et al. modeled locality signature for each memory reference and used it to find critical memory loads and important program paths[27,29]. Marin et al. [25] modeled the locality signature at reference, loop, and function levels to predict performance across different computer architectures. Beyls and D’Hollander[30-31] built a program tuning tool SLO, which identifies the cause of long distance reuses and gives improvement suggestions for restructuring the code. In addition to cache misses, reuse distance has been used to analyze the response time in server systems[32] and the usage pattern in web reference streams[33]. Zhong et al. [5] classified these and other uses of reuse distance as “Five Dimen-
698 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4

sions of Locality” and reviewed the analysis techniques for program input, code, data, execution phase, and program interaction.

Reuse distance provides a common foundation to model program behavior, predict machine performance, and guide program optimization. Locality analysis and profiling are to infer, measure, and decompose reuse distances, and locality optimization is to shorten long reuse distances. The analysis and the optimization are free of machine, instrumentation, and peer dependences. The downside, however, is the complexity of measuring reuse distance.

3.2.3 Direct Measurement

Reuse distance is one of the stack distances defined in the seminal paper in 1970 by Mattson et al.[21] The stack algorithm in the paper needs O(nm) time to profile a trace with n accesses to m distinct data. The efficiency has been steadily improved over the past four decades. In 1975, Bennett and Kruskal[34] organized the trace as a tree and reduced the cost to O(n log n). In 1980, Olken[35] made the tree compact and reduced the cost further to O(n log m).

The Olken algorithm remained the most efficient asymptotic solution (for full reuse-distance measurement) until 2003, when Ding and Zhong gave an approximation algorithm[5,36]. The approximation guarantees a relative precision, e.g., 99%, and takes O(n log log m) time, which is effectively linear in n for any practical data size m. Zhong et al. also gave an algorithm that guarantees a constant error bound but does not reduce the asymptotic cost[37]. In an independent implementation, Schuff et al.[38] reported that even the O(n log log m) method incurs a slowdown as high as several thousand times.

Kim et al.[39] gave a linear-time algorithm to measure the miss ratio for a fixed number of cache sizes, which may be used to approximate reuse distance.

There are practical improvements to the Olken algorithm.
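To make the measured quantity concrete before turning to those improvements, the original Mattson-style stack processing can be sketched as follows (a minimal, hypothetical Python illustration, not any of the cited implementations): the reuse distance of an access is its depth in the LRU stack, i.e., the number of distinct data touched since the previous access to the same datum.

```python
# Naive Mattson-style stack processing: O(nm) time, as in the 1970 paper.
# The linear search for the stack depth is the cost that the tree-based
# algorithms (Bennett-Kruskal, Olken) reduce to O(log n) or O(log m).
def reuse_distances(trace):
    stack = []                       # LRU stack: most recent datum first
    dists = []
    for x in trace:
        if x in stack:
            d = stack.index(x)       # stack depth = reuse distance
            stack.pop(d)
        else:
            d = float('inf')         # first access: infinite distance
        stack.insert(0, x)
        dists.append(d)
    return dists

print(reuse_distances("abcba"))      # [inf, inf, inf, 1, 2]
```

For the trace a b c b a, the second b touches one distinct datum (c) since its previous access, and the second a touches two (b and c).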
Cheetah implemented the Olken algorithm using a splay tree[40]. It became part of the widely used SimpleScalar simulator[41]. Almasi et al.[42] used a different tree representation to further improve the efficiency. Greater efficiency can be obtained through sampling and parallelization (Subsections 3.2.6 and 3.2.7).

Zhong et al. gave a lower-bound result showing that the space cost of precise measurement is at least O(n log n), indicating that reuse distance is a fundamentally harder problem than streaming, i.e., counting the number of 1's in a sliding window, which can be done using O(n) space[5].

3.2.4 Approximation by Reuse Time

While the reuse distance counts the number of distinct memory accesses, the reuse time counts all accesses. It is simply the difference in logical time between the previous access and the current reuse, and it can be measured quickly in O(n) time. The working set theory uses the reuse time (inter-reference gap) to compute the time-window miss rate[43]. If we take the time-window miss rate as an approximation of the LRU miss rate, we may say that the working set theory is the first approximation technique.

Two series of more recent studies have used the reuse time to compute the reuse distance. The first is StatCache and StatStack by Hagersten and his students[44-47]③, and the second is time-based locality approximation[48-50]. For brevity, we name the latter technique after its lead author and call it the Shen conversion.

Berg and Hagersten solved the following equation for the miss ratio R[44]. Let N be the length of the trace, h(t) be the number of accesses whose reuse time is t, and f(k) be the probability that a cache block is replaced after k misses. The cache is assumed to use random replacement, so f(k) = 1 − (1 − 1/C)^k for a cache with C blocks. The total number of misses can be computed in two ways, and the two counts should be equal:

    NR = Σ_{t=1}^{N} h(t) f(tR),

where tR approximates the number of misses that occur during a reuse interval of length t.
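The equation is implicit in R, which appears inside f on the right-hand side. A minimal fixed-point sketch (hypothetical Python over a toy reuse-time histogram; StatCache's own numerical method may differ):

```python
def f(k, C):
    """Probability that a block is evicted after k random-replacement misses."""
    return 1.0 - (1.0 - 1.0 / C) ** k

def solve_miss_ratio(h, C, iters=200):
    """Solve N*R = sum_t h(t)*f(t*R) by fixed-point iteration."""
    N = sum(h.values())              # trace length
    R = 0.5                          # initial guess in (0, 1)
    for _ in range(iters):
        R = sum(cnt * f(t * R, C) for t, cnt in h.items()) / N
    return R

# Toy histogram: reuse time -> number of accesses, cache of C = 64 blocks.
h = {8: 500, 100: 300, 5000: 200}
R = solve_miss_ratio(h, C=64)
print(round(R, 3))
```

Short reuse times contribute few misses (small tR keeps f near zero) while long reuse times almost certainly miss, which is the intuition behind the equation.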
StatCache solves this implicit equation for the miss ratio R using numerical methods.

In the Shen conversion[48,51], the key measure is the interval access probability p(Δ), which is the probability that a randomly chosen datum is accessed during a time interval of length Δ. For a reuse at time distance Δ, the probability that its reuse distance is k is:

    p(k, Δ) = (N choose k) p(Δ)^k (1 − p(Δ))^{N−k}.

The formula computes the probability for k distinct data items to appear in a Δ-long interval. It assumes a binomial distribution given the interval access probability p(Δ), which is computed as

    p(Δ) = Σ_{t=1}^{Δ} Σ_{δ=t+1}^{T} p_T(δ)/(N − 1),    (2)

where p_T(t) = h(t)/N is the probability that an access has the time distance t. The derivation of p(Δ) can be found in a technical report[51].

③ Berg and Hagersten used the term reuse distance for what we mean by reuse time[44].
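The two Shen-conversion formulas can be transcribed directly (hypothetical Python over a toy reuse-time histogram; T is taken to be the largest observed time distance):

```python
import math

def interval_access_prob(h, delta):
    """p(delta) per (2): sum over t = 1..delta and d = t+1..T of
    p_T(d)/(N - 1), where p_T(d) = h(d)/N is the time-distance probability."""
    N = sum(h.values())
    T = max(h)
    return sum(h.get(d, 0) / N / (N - 1)
               for t in range(1, delta + 1)
               for d in range(t + 1, T + 1))

def reuse_distance_prob(k, delta, h):
    """p(k, delta): binomial probability that the reuse distance is k."""
    N = sum(h.values())
    p = interval_access_prob(h, delta)
    return math.comb(N, k) * p ** k * (1 - p) ** (N - k)

h = {1: 4, 3: 4, 10: 2}              # toy histogram: time distance -> count
probs = [reuse_distance_prob(k, 2, h) for k in range(sum(h.values()) + 1)]
print(round(sum(probs), 6))          # a binomial distribution sums to 1
```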
The two statistical techniques are successful in predicting performance. StatCache was used to model first private cache[44-45] and then shared cache[46-47]. The Shen conversion was used first for sequential code[48-49] and then for multi-threaded code[50,52].

Although both use statistical analysis, StatCache and the Shen conversion are fundamentally different: one models the random cache, and the other the LRU cache. Next we explore the difference between random and LRU modeling in greater depth.

3.2.5 Random vs LRU

Any statistical analysis of locality invariably makes some assumptions about randomness. We examine three such assumptions.

The first is random access to a cache set, which means that a data access can happen at any cache set with equal probability. The Smith formula uses the assumption to calculate the contention in a cache set and the effect of cache associativity[23].

The second is random cache replacement, which means that a miss may evict any cache block with equal probability. Under this assumption, the lifetime of a block in cache is binomially distributed over the number of cache misses. Not knowing the miss rate, StatCache uses the relation to compute the miss rate from the reuse time[44]. Knowing the miss rate, West et al. computed the cache occupancy[53]. Fedorova et al. devised a fair scheduling policy based on the assumption that a set of applications divide the cache equally if they have the same miss rate[54].

Since real caches do not use random replacement, the accuracy of the assumption needs to be examined. For cache occupancy, West et al. compared the prediction with the actual measurement (through cache simulation) and found that the prediction is accurate for caches using random replacement but less so for caches using LRU[53].

Random replacement has two other differences from LRU.
One is well known: a random-replacement cache is fully associative by definition. The other is less recognized: the cache performance is not deterministic, because the replacement decisions change randomly every time a program is run. Fortunately, the latter problem was recently solved. Zhou gave an ingenious algorithm to compute the average miss ratio in a single pass, without having to simulate multiple times to compute the average[55].

The way to model LRU is to use the reuse distance. Knowing the reuse distance, the Smith formula uses it to model the LRU replacement within a cache set[23]. Not knowing the reuse distance, the Shen conversion needs a way to compute it[48,51]. It assumes a third type of randomness: in a time window, each data block is uniformly randomly accessed. By computing the reuse distance, the Shen conversion models LRU rather than random cache replacement.

Cache models can be divided by the replacement policy: LRU or random. There is a second dimension along which to compare them: the metric used to measure window-based locality. For random replacement, we want to know the number of misses in a time window. There is a (trivial) linear relation between the miss count and the window length. For LRU, we want to know the footprint in a window. The relation is non-linear, and it is the main source of complexity in the Shen conversion, in particular the derivation of the interval access probability.

Cache models thus use two types of window-based locality: the miss count and the footprint. The miss count is linear but cache-size dependent. In comparison, the footprint is non-linear but cache-size independent. For example, StatCache has to solve its equation for every cache size, while the Shen conversion produces the reuse distance and the miss ratio for all cache sizes. The past solutions represent different trade-offs between modeling simplicity and power.
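The non-linear relation between window length and footprint can be seen with a naive sketch (hypothetical Python that enumerates every window; the footprint theory obtains the same average in a single near-linear pass):

```python
def avg_footprint(trace, delta):
    """Average number of distinct data over all length-delta windows."""
    windows = [set(trace[i:i + delta]) for i in range(len(trace) - delta + 1)]
    return sum(len(w) for w in windows) / len(windows)

trace = "ababcabab"
for d in (1, 2, 4, 8):
    print(d, avg_footprint(trace, d))
```

The average footprint grows with the window length but sublinearly, unlike the miss count, whose expected value is simply proportional to the window length under a fixed miss rate.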
With the footprint theory, we have a new option: compute the reuse distance using the footprint, which we can measure as accurately as the miss count.

The three modeling methods, StatCache, the Shen conversion, and the footprint conversion, are not guaranteed to always give the correct reuse distance. Indeed, a precise linear-time solution is unlikely given the lower-bound result of Zhong et al.[5] Among the three, the footprint theory is unique in formulating the condition for its correctness, which is the reuse hypothesis[10].

3.2.6 Sampling Analysis

Sampling is usually effective in reducing the cost of profiling. Choosing a low sampling rate may reduce the amount of profiling work by factors of hundreds or thousands. In program analysis, bursty tracing is widely used, where the execution alternates between short sampling periods and long hibernation periods[20,56-57]. During hibernation, the execution happens in the original code and incurs no analysis overhead.

Locality sampling, however, is trickier. Locality is about the time of data reuse, but the time is unknown until the access actually happens. The uncertainty has two consequences. First, the length of the sampling period cannot be bounded if it is to cover a sampled data-reuse pair. Second, the analyzer has to keep examining every data access; complete hibernation is effectively impossible.

The problem of locality sampling is addressed by a series of studies, including the publicly available SLO tool[30], continuous program optimization[58], bursty