700 J. Comput. Sci. & Technol., July 2014, Vol.29, No.4

reuse distance sampling[59], and multicore reuse distance analysis[38]. Sampling can drastically reduce the cost if sampled windows accurately reflect the behavior of other windows[45-47].

SLO was developed by Beyls and D'Hollander[30]. It instruments a program to skip every k accesses and take the next address as a sample. A bounded number of samples are kept in a sample reservoir, hence the name reservoir sampling. To capture the reuse, SLO checks each access to see if it reuses some sample data in the reservoir. The instrumentation code is carefully engineered in GCC to have just two conditional statements for each memory access (one for address and the other for counter checking). Reservoir sampling reduces the time overhead from a 1000-fold slow-down to only a factor of 5 and the space overhead to within 250 MB of extra memory. The sampling accuracy is 90% with 95% confidence. The accuracy is measured in reuse time, not reuse distance or miss rate.

To accurately measure reuse distance, a record must be kept to count the number of distinct data that appeared in a reuse window. Zhong and Chang[59] developed bursty reuse distance sampling, which divides a program execution into sampling and hibernation periods. In the sampling period, the counting uses a tree structure and costs O(log log M) per access. If a reuse window extends beyond a sampling period into the subsequent hibernation period, counting uses a hash-table, which reduces the cost to O(1) per access. Multicore reuse distance analysis by Schuff et al.[38] uses a similar scheme for analyzing multi-threaded code. Its fast mode improves over hibernation by omitting the hash-table access at times when no samples are being tracked. Both methods track reuse distance accurately.

StatCache by Berg and Hagersten[45] is based on unbiased uniform sampling.
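SLO's reservoir sampling, described earlier in this subsection, can be sketched as follows. This is a simplified Python illustration, not the actual GCC instrumentation; the fixed-skip sampling and the random-eviction policy are our assumptions. Note that, as in SLO, what is measured is reuse time (elapsed accesses), not reuse distance:

```python
import random

def reservoir_reuse_sampling(trace, skip=100, reservoir_size=1000, seed=0):
    """Sketch of SLO-style reservoir sampling for reuse time.

    Every `skip`-th access is taken as a sample and stored with its
    logical time. Each access is checked against the reservoir; if it
    reuses a sampled address, the reuse time (number of elapsed
    accesses) is recorded and the sample is retired.
    """
    rng = random.Random(seed)
    reservoir = {}        # address -> logical time when it was sampled
    reuse_times = []
    for t, addr in enumerate(trace):
        if addr in reservoir:                 # reuse of a sampled address
            reuse_times.append(t - reservoir.pop(addr))
        if t % skip == 0:                     # take the next address as a sample
            if len(reservoir) >= reservoir_size:
                # bound the reservoir by evicting a random sample
                reservoir.pop(rng.choice(list(reservoir)))
            reservoir[addr] = t
    return reuse_times
```

As in the instrumented code, each access needs only two checks: a reservoir lookup and a sampling-counter test.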
After a data sample is selected, StatCache puts the page under OS protection (at page granularity) to capture the next access to the same datum. It uses the hardware counters to measure the time distance until the reuse. OS protection is limited by the page granularity. Two other systems, developed by Cascaval et al.[58] and Tam et al.[60], use the special support on IBM processors to trap accesses to specific data addresses. To reduce the cost, these methods use a small number of samples. Cascaval et al.[58] used the Hellinger Affinity Kernel to infer the accuracy of sampling. Tam et al.[60] predicted the miss rate curve in real time.

3.2.7 Parallel Analysis

Schuff et al.[38] combined sampling and parallel analysis for parallel code on multicore. At the IPDPS conference in 2012, three groups of researchers reported that they made the analysis of even sequential programs many times faster with parallel algorithms. Niu et al.[61] parallelized the analysis to run on a computer cluster, while Cui et al.[62] and Gupta et al.[63] parallelized it for the GPU.

Unlike the reuse distance, the footprint can be easily sampled and analyzed in parallel using shadow profiling[64-65]. By measuring the footprint and converting it to reuse distance, we have shown the equivalent of parallel sampling analysis for reuse distance, which can be done in near real time, with just 0.5% visible cost on average[10]. We note that the accuracy of footprint conversion is conditional[10], but direct (parallel) measurements are always accurate.

3.2.8 Compiler Analysis

Reuse distance can be analyzed statically for scientific code. Cascaval and Padua[66] used dependence analysis[67], and Beyls and D'Hollander[68] defined reuse distance equations and used the Omega solver[69]. While they analyzed conventional loops, Chauhan and Shei[70] analyzed MATLAB scripts using dependence analysis.
Unlike profiling, whose results are usually input specific, static analysis can identify and model the effect of program parameters. Beyls and D'Hollander[68] used the reuse distance equations for cache hint insertion, in particular, conditional hints, where the caching decision is based on program run-time parameters. Shen et al.[71] used static and lightweight reuse analysis in the IBM compiler for array regrouping and structure splitting.

Using the static reuse distance analysis and the footprint theory, Bao and Ding demonstrated a compiler technique for analyzing the program footprint and discussed the potential use in peer-aware program optimization[72-73]. In [72], they used the tiled matrix multiply (Fig.5) as an example to show the reuse distance computed at the source level (Table 1).

for (jj = 0; jj < N; jj = jj + Bj)
  for (kk = 0; kk < N; kk = kk + Bk)
    for (i = 0; i < N; i = i + 1)
      for (j = jj; j < min(jj + Bj, N); j = j + 1)
        for (k = kk; k < min(kk + Bk, N); k = k + 1)
          C[i][j] = beta × C[i][j] + alpha × A[i][k] × B[k][j];

Fig.5. Loop nest of tiled matrix multiply.

Table 1. Reuse Distance as a Function of the Loop Bounds

Loop  Array    Reuse Distance (Bytes)
k     C[i][j]  8 × 3
j     A[i][k]  8 × 1 + 8 × Bk + 8 × Bk
i     B[k][j]  8 × Bj + 8 × Bk + 8 × Bk × Bj
kk    C[i][j]  8 × N × Bj + 8 × N × Bk + 8 × Bk × Bj
jj    A[i][k]  8 × N × Bj + 8 × N × N + 8 × N × Bj

They also
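The reuse distance used throughout this section is the LRU stack distance: the number of distinct elements accessed since the previous access to the same datum, counting that datum itself. A naive measurement sketch (our illustration; the tools surveyed above use tree structures to make the count efficient) reproduces the first row of Table 1: in the innermost k loop, consecutive accesses to C[i][j] are separated only by A[i][k] and B[k][j], so the distance is 3 elements, or 8 × 3 bytes for 8-byte array elements:

```python
def reuse_distances(trace):
    """LRU stack distance of each access: the number of distinct
    elements accessed since the previous access to the same element,
    counting that element itself. A cold (first) access yields inf.
    Naive O(n*m) version for a trace of n accesses to m elements."""
    stack = []            # LRU stack, most recently used at the front
    dists = []
    for x in trace:
        if x in stack:
            dists.append(stack.index(x) + 1)   # distinct data since last use
            stack.remove(x)
        else:
            dists.append(float('inf'))         # cold access
        stack.insert(0, x)
    return dists

# Hypothetical trace of two k-iterations for fixed i, j:
# C[i][j], A[i][0], B[0][j], C[i][j], A[i][1], B[1][j], C[i][j]
trace = ['C', 'A0', 'B0', 'C', 'A1', 'B1', 'C']
```

Running `reuse_distances(trace)` gives distance 3 for each repeated access to C, matching the k-loop row of Table 1.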
Chen Ding et al.: Performance Metrics and Models for Shared Cache 701

showed the use of the conversion theory (Subsection 2.2) to compute the miss ratio curve and a measure of shared-cache friendliness called the fill time.

3.2.9 Domain-Specific Modeling

To model graph algorithms, Yuan et al.[74] defined the notion of vertex distance and used statistical analysis to derive the reuse distance. The study examines random graphs and scale-free graphs. It shows the dual benefits of domain-specific analysis. On the one hand, the structure of a graph facilitates locality analysis. On the other hand, locality analysis reveals the relation between the properties of a graph, e.g., edge density, and the efficiency of its computation.

3.2.10 Discussion

Reuse distance is a powerful tool for program analysis. It quantifies the locality of every program instruction. For a single sequential execution, the metric is composable. For example, the composition can happen structurally to show the locality of larger program units such as loops, functions, and the whole program, or it can happen temporally to show program executions as (integer-valued) signals.

There are at least two limitations. First, reuse distance is insufficient to analyze program interaction. While programs interact at all times in the shared cache, reuse distance provides locality information for only reuse windows, not all windows. Second, precise reuse distance is still costly to measure. Despite all of the advances in sampling and parallelization, the asymptotic cost is still more than linear. These problems will be addressed indirectly through the study of another locality metric, the footprint.

3.3 Early Footprint

Measuring footprint requires counting distinct data elements, and the result depends on observation windows. The problem has long been studied in measuring various types of reuse distances as discussed before. However, footprint measurement is a more difficult problem than reuse distance measurement.
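The difficulty can be made concrete with a brute-force sketch (our illustration, not an algorithm from the surveyed work): the footprint of a window is its number of distinct elements, and averaging over all windows of one length already requires a distinct-element count per window, with quadratically many windows across all lengths.

```python
def avg_footprint(trace, w):
    """Average footprint over all length-w windows of the trace,
    by brute force. The footprint of a window is the number of
    distinct elements it contains."""
    n = len(trace)
    assert 1 <= w <= n
    total = sum(len(set(trace[i:i + w])) for i in range(n - w + 1))
    return total / (n - w + 1)
```

Evaluating this for every window length w costs far more than linear time in the trace length, which is the cost the approximate and sampling approaches below try to avoid.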
Given a trace of length n, there are only O(n) reuse windows but in total O(n²) footprint windows. This subsection focuses on the measurement problem, which prior work solved by either selecting a window subset to measure or constructing a model to approximate.

Direct Counting for Subset Windows. Agarwal et al.[75] counted the number of cold-start misses for all windows starting from the beginning of a trace (cumulative cold misses). Cumulative cold misses, together with warm-start region misses, were used to evaluate cache performance degradation caused by operating system and multiprogramming activity.

The footprint in single-length execution windows can be computed in linear time. On time-shared systems, the window of concern is the scheduling quantum. On these systems, the cached data of one process may be evicted by data brought in by the next process. Thiebaut and Stone computed what is essentially the single-window footprint by dividing a trace by the fixed interval of the CPU scheduling quantum and taking the average amount of data access of each quantum[15]. Ding and Chilimbi[7] gave a sampling solution. At each access, it measures the footprint of a window ending at the current access. The length of the measured window is chosen at random.

For an execution of length n, direct counting measures the footprint in O(n) windows. If we use direct counting to estimate the all-window footprint, we have a sampling rate of O(1/n). The sampling rate may be too low to be statistically meaningful, or it may be sufficient in practice. Without a solution for all-window analysis, we would not have a way to evaluate the accuracy of direct counting.

Footprint Equations. Suh et al.[16] and Chandra et al.[17] used a recursive equation to estimate the footprint. As a window of size w is increased to w + 1, the change in the footprint depends on whether the new access is a miss.
The equation is as follows: consider a random window w_t of size t being played out on some cache of infinite size. As we increase t, the footprint increases with every cache miss. Let E[w_t] be the expected footprint of w_t, and M(E[w_t]) be the probability of a miss at the end of w_t. For window size t + 1, the footprint either increases by an increment of one or stays the same, depending on whether the (t + 1)-th access is a cache miss:

E[w_{t+1}] = E[w_t] (1 − M(E[w_t])) + (E[w_t] + 1) M(E[w_t]).

The term M(E[w_t]) requires simulating sub-traces of all size-t windows, which is impractical. Suh et al.[16] solved it as a differential equation and made the assumption of linear window growth when the range of window sizes under consideration is small. On the other hand, Chandra et al.[17] computed the recursive relation bottom up. Neither method can guarantee a bound on the accuracy, i.e., how the estimate may deviate from the actual footprint.

In addition, these approaches produce the average footprint, not the distribution. The distribution can be important. Consider two sets of footprints, A and B. One tenth of A has size 10N and the rest has size 0. All of B has size N. A and B have the same average footprint N, but their different distribution can lead
to very different types of cache interference. With the footprint distribution analysis[8], we now have a way to evaluate whether the average footprint produces the same composition results as the footprint distribution. The past solutions on reuse distance often make similar estimates because the reuse distance is the footprint in a reuse window. These techniques[45,48-50,76] were mentioned in Subsection 3.2. They do not guarantee the precision of the estimation.

3.4 Analytical Models

Instead of measuring the reuse distance or footprint, a mathematical model may be used to characterize the cache performance. Apex-Map uses a parameterized model and a probe program to quickly find the model parameters for a program and a machine[77]. Ibrahim and Strohmaier[78] compared the result of synthetic probing and that of reuse distance profiling, while He et al.[79] used a fractal model to estimate the miss rate curve through efficient online analysis.

There was much earlier work on analytical models for memory paging performance. An extensive survey can be found in [2]. Saltzer[80], a designer of the Multics system, gave one simple formula (Subsection 3.6.1). He explained that "Although it is only occasionally that a mathematically tractable model happens to exactly represent the real-world situation, often an approximate model is good enough for many engineering calculations. The challenge ... is to maintain mathematical tractability in the face of obvious flaws and limitations in the range of applicability and yet produce a useful result." Saltzer's formula has been used by Strecker[81] in cache modeling.

Another type of analytical model is the independent reference model. Given a program with n pages, where each page has an independent access probability and the probabilities sum to 1, King[82] showed that a steady miss rate exists for fully associative caches managed by LFU, LRU, and FIFO replacement policies.
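The steady-state behavior of the independent reference model can be observed with a small simulation over an LRU cache. This is our own sketch for illustration, not code from the cited work; the function name and parameters are assumptions:

```python
import random

def irm_lru_misses(probs, cache_size, n_accesses, seed=0):
    """Simulate the independent reference model (IRM) over a fully
    associative LRU cache: each access draws page i independently
    with probability probs[i]. Returns the observed miss count."""
    rng = random.Random(seed)
    pages = list(range(len(probs)))
    lru = []                       # most recently used at the front
    misses = 0
    for _ in range(n_accesses):
        x = rng.choices(pages, weights=probs)[0]
        if x in lru:
            lru.remove(x)          # hit: promote to the front
        else:
            misses += 1            # miss: fetch, evicting if needed
            if len(lru) >= cache_size:
                lru.pop()          # evict the least recently used
        lru.insert(0, x)
    return misses
```

For example, with two equiprobable pages and a one-block cache, a miss occurs whenever the drawn page differs from the previous one, so the miss rate settles near one half, illustrating King's steady miss rate.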
Later studies gave efficient approximation methods for LRU and FIFO[83-84]. Gu and Ding[85] proved a simple relation between random access and the reuse distance distribution (which is uniform). The method of Dan and Towsley[84] can be used to analyze a more general case where data is divided into multiple groups and has different (random access) probabilities. It is a type of composable model.

3.5 Metrics Conversion and Denning's Law

Footprint is a form of working set. The working set theory is the scientific basis as much for memory management as it is for cache management. Denning defined the working set precisely as "the set of distinct pages referred to in a backward window of fixed size T."④ The average footprint for window length T is the average working set size for all size-T windows.

A breakthrough in this area is a simple formula discovered by Denning④ and first published in 1972[43]. It shows the relation between the working set size, which is difficult to measure, and the frequency and interval of data reuses, which are easy to measure. The formula converts between two locality metrics. Metrics conversion is at the heart of the science of locality, because it shows that memory behavior and performance are different displays of the same underlying property.

While the proof of Denning and Schwartz[43] depends on idealized conditions in infinitely long executions, subsequent research has shown that the working set theory is accurate and effective in managing physical memory for real applications.

There are three ways to quantify the working set: as a limit value in Denning's original paper[3], as the time-space product defined by Denning and Slutz[86], and as the all-window footprint just defined in Subsection 3.3 (initially in [7]). The equation Denning discovered holds in all three cases. In our 2013 paper[10], we stated it as a law of locality and named it after its discoverer:

Denning's Law of Locality 1.
The working set is the second-order sum of the reuse frequency, and conversely, the reuse frequency is the second-order difference of the working set.

The footprint theory subsumes the infinitely long case in the original working set theory and proves Denning's law for all executions. It gives a theoretical explanation to the long-observed effectiveness of the working set theory in practice.

Easton and Fagin[87] gave another important formula for the conversion between the cold-start and warm-start miss ratios. The authors called it their "recipe". The recipe reveals that the (cold-start) lifetime in cache size C is the sum of the inter-miss times of the (warm) cache for sizes smaller than C. They found that their "estimate was almost always within 10∼15 percent of the directly observed average cold-start miss ratio." They further quoted the analysis of [88] as corroborating evidence. In these studies, as in the work of Denning and Schwartz[43], a program is assumed to be a stationary stochastic process. In the footprint theory, the Easton-Fagin formula can be derived directly, and the theory shows the correctness condition when it is used for finite-length program executions.

④ Personal communication, December 17, 2013.
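The conversion can be sketched from the Denning-Schwartz recurrence s(T+1) = s(T) + P(reuse time > T), where a cold access counts as an infinite reuse time; summing the tail probabilities twice is what makes the working set the second-order sum of the reuse-time frequency. This is an idealized illustration under the stationarity assumption; on a short finite trace it only approximates the true all-window average footprint, which is the condition the footprint theory makes precise:

```python
from collections import Counter

def denning_working_set(trace, T_max):
    """Estimate the average working set size s(T) for window lengths
    1..T_max from the reuse-time histogram of the trace, using the
    Denning-Schwartz recurrence s(T+1) = s(T) + P(reuse time > T).
    Reuse time = accesses since the previous access to the same
    element; a first (cold) access counts as infinite reuse time."""
    n = len(trace)
    last = {}                 # element -> time of its previous access
    rt = Counter()            # reuse-time histogram
    cold = 0
    for t, x in enumerate(trace):
        if x in last:
            rt[t - last[x]] += 1
        else:
            cold += 1
        last[x] = t
    s = [0.0]
    for T in range(T_max):
        tail = cold + sum(c for k, c in rt.items() if k > T)
        s.append(s[-1] + tail / n)       # add P(reuse time > T)
    return s                             # s[T] ~ average footprint of size-T windows
```

For the cyclic trace a b a b ..., every reuse time is 2, so the estimate climbs to 2 by window length 2 and stays nearly flat, matching the true average footprint of 2 distinct elements.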
3.6 Locality Models of Shared Cache

3.6.1 Early Models

There are two types of cache sharing: the sharing between multiple time-switched programs, and the sharing between the instruction and data of the same program. Easton and Fagin[87] studied the former, comparing the difference between cold-start and warm-start miss ratios and computing the effect of task interruptions as a weighted average of expected cold-start miss ratios. Thiebaut and Stone[15] defined a precise measure called the reload transient. For a departing process, the reload transient is the amount of its cached data lost when it returns after another process is run. To compute the reload transient, Thiebaut and Stone[15] defined the cache footprint, which is the number of data blocks a program has in cache. Given two programs A, B, the reload transient of A after B is the overlap between their cache footprints.

To compute footprints and their overlap, Thiebaut and Stone[15] assumed that a program has an equal probability of accessing any cache block. The probability is independent and identically distributed. The overlap is then computed from expectations of binomial distributions.

Instead of discrete probabilistic models, Strecker[81] put forward an intuitive notion that a program is a continuous flow and fills the cache at a rate that is the product of two probabilities: the chance of a miss and the chance that the miss results in a new location in the cache being filled. A differential equation was constructed since the fill rate is the derivative of the footprint over time. To compute the miss ratio, Strecker[81] used an analytical formula by Saltzer[80]. Saltzer[80] computed the inter-miss time, which he called the headway, as the number of hits between successive misses.

The second type of cache sharing happens between the instruction and the data of a program. Stone et al.[89] investigated whether LRU produces the optimal allocation. Assuming that the miss rate functions for instruction and data are continuous and differentiable, the optimal allocation happens at the points "when miss-rate derivatives are equal"[90]. The miss rate functions, one for instruction and one for data, were modeled instead of measured. The authors showed that LRU is not optimal, but left open the question of whether there is a bound on how close LRU allocation is to optimal allocation. The footprint theory can be used to compute the effective cache allocation (LRU allocation) among any group of programs.

As a component of the Wisconsin Wind Tunnel (WWT) project, Falsafi and Wood[91] developed a performance model for cache. They used the formulation of Thiebaut and Stone[15] but computed the overlap using a queuing model. In implementation, they measured the cold-start miss rate and used a reverse mapping to estimate the footprint. Since WWT ran the concurrent processes of a parallel program, the instruction code was shared between processes. The sharing was modeled as the shared footprint in the overall process footprint.

Falsafi and Wood[91] revised the terminology of Thiebaut and Stone[15] and redefined the footprint as the set of unique data blocks a program accesses. The projection of the footprint is the set of data blocks that the program leaves in cache. Viewed in another way, the footprint is the program data in an infinite cache, and the projection is the data in a finite cache. The footprint theory uses their definition of the word footprint.

3.6.2 Reuse Distance in Parallel Code

Reuse distance measures the locality of a program directly and does not rely on the assumptions that are necessary for analytical models. In a parallel program, we have two types of reuse distance. One considers only the accesses of a single task, and the other considers the interleaved accesses of all tasks. Using the terminology of Wu and Yeung[13] and Jiang et al.[50], we call them private reuse distance (PRD) and concurrent reuse distance (CRD). The new problem in analyzing parallel locality is the relation between PRD and CRD.

Recent work has studied several solutions. Ding and Chilimbi[76] built models of data sharing and thread interleaving to compose CRD. Jiang et al.[50] tackled the composition problem using probabilistic analysis, in particular, the interval access probability based on [48], discussed in Subsection 3.2. Multicore reuse distance by Schuff et al.[38] measures CRD directly using improved algorithms made efficient by sampling and parallelization. For loop-based code, Wu and Yeung gave a scaling model to predict how the reuse distance, both PRD and CRD, changes when the work is divided by a different number of threads[13]. These modeling techniques have found uses in co-scheduling[52] and multicore cache hierarchy design[13-14,92].

3.6.3 Non-Composability of Reuse Distance

A model is composable if the locality of a parallel execution can be computed from the locality of individual tasks. However, the reuse distance is insufficient to build composable models.
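The PRD/CRD distinction can be made concrete with a minimal sketch (the traces and helper names below are illustrative, not taken from the paper). Reuse distance is the number of distinct other data accessed between two accesses to the same datum; PRD is measured on a task's own trace, while CRD is measured on the interleaved trace, so interleaving can only lengthen a reuse.

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct other data
    accessed since the previous access to the same datum; None marks a
    first access (infinite distance)."""
    last_pos = {}
    dists = []
    for i, x in enumerate(trace):
        if x in last_pos:
            dists.append(len(set(trace[last_pos[x] + 1 : i])))
        else:
            dists.append(None)
        last_pos[x] = i
    return dists

# Private traces of two tasks, and one round-robin interleaving of them.
task_a = ["a", "b", "a"]
task_c = ["c", "d", "e"]
interleaved = ["a", "c", "b", "d", "a", "e"]

prd = reuse_distances(task_a)        # PRD of the second "a" is 1
crd = reuse_distances(interleaved)   # CRD of that same reuse grows to 3
```

The quadratic set scan is only for clarity; a production analyzer would use a tree or scalable counting, as the sampling techniques discussed earlier do.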
We illustrate this limitation by a counterexample, first published in [8]. Fig.6 shows three short program traces. Programs A, B have the same set of private reuse distances (PRD). However, when running with a third program C, the pair A, C produces a different set of concurrent reuse distances (CRD) than the pair B, C. Assuming that the cache size is 4, the pair A, C has no capacity miss, but B, C does. The example also shows the same limitation for the miss ratio. With identical reuse distances, A, B have the same number of misses in the private cache. But in the shared cache, co-running with the same program C, they incur a different number of cache misses.

Fig.6 is a disproof by counterexample. It shows conclusively that PRD is not enough to compute CRD, and the solo-run miss ratio is not enough to compute the co-run miss ratio.

In the example, the reason for the different co-run locality is the different interaction based on the time span of a reuse. Consider the data accesses to a in A, B. They have the same private reuse distance, 2, but very different (logical) reuse times, 3 in A and 7 in B. When co-running with C, the reuse distance is lengthened because of the data accesses by C. Since the reuse in B spans a longer time, it is affected more by cache sharing. As a result, the concurrent reuse distance for a is 4 in the A, C co-run but 5 in the B, C co-run.

Chandra et al.[17] described three models of cache sharing. A simple one is the composition of reuse distance, called (LRU) stack distance competition (SDC). Since the model uses the reuse distance as the only input, it would have given the same prediction in our example for A, C and B, C. Therefore, it is a flawed model. A number of earlier studies have reached the same conclusion through experiments[93-95].

3.6.4 Classic Composition Model

Let A, B be two programs that share the same cache but do not share data. The effect of B on the locality of A is:

P(capacity miss by A when running with B)
  = P(A's reuse distance + B's footprint > cache size).

In this model, the cache interference (i.e., CRD) is computed by combining the footprint (i.e., the interference) and the reuse distance, i.e., the per-task locality. Specialized versions of this model were first developed by Suh et al.[16] for time-sharing systems and Chandra et al.[17] for multicore cache sharing. While Chandra et al.[17] described and evaluated the composition for two programs, Chen and Aamodt[96] improved the accuracy when analyzing more programs with a greater number of cache conflicts. A later study by Jiang et al.[52] gives the general form of the classic model not tied to cache parameters such as associativity.

In the work of Suh et al.[16] and Chandra et al.[17], the footprint equation is iterative (see Subsection 3.3), while in the work of Jiang et al.[52], the footprint equation is statistical (see Subsection 3.2). Another footprint equation is the conversion formula by Denning and Schwartz[43]. These equations are not completely constrained, so the solution is not unique and depends on modeling assumptions.

The classic model is not simple as presented in the previous publications. In the work of Chandra et al.[17], hardware and program factors were considered together. Xie and Loh[97] noted that the model by Chandra et al. "is fairly involved; the large number of complex statistical computations would be very difficult to directly implement in hardware." In addition, the model has a high cost. It was not used in the comparison study of Zhuravlev et al.[94], because it was not "computationally fast enough to be used in the robust scheduling algorithm." There is another weakness in usability. The two inputs, reuse distance and footprint, do not have a simple effect on the composed output. The complexity hinders the use of composable models in practice.
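The classic composition formula can be turned into a short sketch. The code below is hypothetical (the function and variable names are not from the cited work); it assumes A's accesses are summarized as (reuse distance, reuse time) pairs and that B's average footprint is given as a function of window length, and it predicts A's capacity miss ratio in the shared cache.

```python
def corun_miss_ratio(reuses_a, footprint_b, cache_size):
    """Classic composition model: a reuse of A becomes a capacity miss
    in the shared cache when A's reuse distance plus B's footprint over
    the reuse window exceeds the cache size."""
    misses = sum(
        1 for rd, rt in reuses_a
        if rd + footprint_b(rt) > cache_size
    )
    return misses / len(reuses_a)

# Two reuses with the same distance but different reuse times, echoing
# the Fig.6 example: the longer reuse admits more interference from B.
reuses_a = [(2, 3), (2, 7)]
footprint_b = lambda w: min(w / 2, 8)   # a toy footprint function of window length
ratio = corun_miss_ratio(reuses_a, footprint_b, cache_size=4)
```

Note that the reuse time, not just the reuse distance, determines how much of B's footprint intrudes, which is exactly why reuse distance alone is non-composable.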
As introduced in Subsection 2.2, the footprint theory shows many equivalent methods of composition. The subsection lists two other methods that are faster and easier to use.

3.7 Performance and Optimization

Cache is one of the factors in machine performance. Locality models show the frequency of cache hits and

Fig.6. Non-composability of reuse distance. Programs A, B have the same set of reuse distances ("−" means ∞), but the A, C co-run has a different set of reuse distances than the B, C co-run does.