110 Chen Ding, Ken Kennedy / J. Parallel Distrib. Comput. 64 (2004) 108–134
used LRU replacement policy, we have one cache reuse, which is the third access of element a. The best caching policy stores the data with the closest reuse in the future, as shown by Best in the context of the first Fortran compiler [7] and by Belady for virtual memory [10]. For this example, it would keep a in cache from the beginning and get two cache reuses. This is the best hardware can do, given complete information of data accesses.

By transforming a program in two distinct steps, we can dramatically improve cache performance. The first step is computation fusion, which brings together different computations on the same data. For the previous example, it would cluster accesses to the same data element and, assuming no other constraints on access order, produce the sequence in Fig. 1(b). The new sequence has four cache reuses, twice that of the best hardware method.

The second step of our strategy is data regrouping, which organizes data elements used by the same computation into contiguous locations in memory. Because cache is organized in non-unit size cache blocks, grouping data in this manner improves bandwidth utilization by ensuring that, when a cache block is loaded, more than one of its data elements is used before it is evicted. In the previous example, the data regrouping step would place the three data elements into contiguous memory locations so that they will occupy the same cache block, as shown in Fig. 1(c).

The two-step transformation achieves optimal caching: useful data are loaded once and only once in a minimal number of cache blocks. Thus the transformed sequence has the lowest possible amount of aggregate memory transfer.

To be effective in large programs, computation fusion must recombine all functions, and data regrouping must re-shuffle the entire data layout. Current compiler techniques are not adequate to carry out these tasks. Most loop transformations target a single loop nest and do not exploit data reuse among disjoint loops.
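The gap between LRU and the Best/Belady policy, and the further gain from clustering accesses, can be checked with a short simulation. The trace and cache size below are illustrative choices of ours, not the exact sequence of Fig. 1:

```python
def lru_hits(trace, size):
    """Count cache reuses under LRU replacement with `size` cache blocks."""
    cache, hits = [], 0          # kept in recency order, least recent first
    for x in trace:
        if x in cache:
            hits += 1
            cache.remove(x)      # refresh recency
        elif len(cache) == size:
            cache.pop(0)         # evict the least recently used element
        cache.append(x)
    return hits

def belady_hits(trace, size):
    """Count cache reuses under the optimal policy: on a miss,
    evict the element whose next use is furthest in the future."""
    cache, hits = set(), 0
    for t, x in enumerate(trace):
        if x in cache:
            hits += 1
            continue
        if len(cache) == size:
            def next_use(y):
                try:
                    return trace.index(y, t + 1)
                except ValueError:
                    return len(trace)    # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return hits

trace = list("abcabc")                 # three elements, each used twice
print(lru_hits(trace, 2))              # 0 reuses under LRU
print(belady_hits(trace, 2))           # 2 reuses under the optimal policy
print(lru_hits(list("aabbcc"), 2))     # 3 reuses once accesses are clustered
```

Clustering the accesses, which is the effect computation fusion aims for, beats even the optimal replacement policy applied to the original order.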
Most data transformations change a single array or object and do not recombine useful data across array and object boundaries.

In what follows, we begin by presenting computation fusion, which attempts to group all accesses to the same datum in a program. We first describe reuse-based loop fusion, which fuses loops of different control structures in large applications. Then we prove that optimal fusion for bandwidth is NP-hard. Finally, we examine the limit of computation fusion using perfect program information.

Next, we present data regrouping, which attempts to group all data involved in the same computation. A set of data elements is said to exhibit reference affinity if these elements are always used together in a program; that is, if one of the elements is used in a program, the others will be as well. We then present affinity-based data regrouping, which ensures that program data elements with reference affinity are allocated together in memory, and we show that the regrouping method presented is compile-time optimal. Then we describe two extensions, partial and dynamic data regrouping, and prove that they are NP-hard problems. We show the use of data regrouping on structure data in addition to array data.

The new global strategies presented here complement rather than replace existing techniques that work on a single loop nest or a single array. Examples of local techniques include unroll-and-jam, loop blocking, loop interchange, loop skewing, and single-array data transformations. Our global techniques target data reuses across loop and array boundaries. They combine loops and arrays but they do not change the relative order of accesses in a loop (except in small ways), nor do they alter the relative placement of data elements in an array. For example, if a loop nest benefits from some local technique, the fused loop nest benefits as well.
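The reference-affinity idea can be sketched in a simplified model of our own: summarize each computation by the set of elements it touches; elements that appear in exactly the same computations are always used together and can therefore be laid out contiguously.

```python
from collections import defaultdict

def affinity_groups(uses):
    """uses: one set per computation, listing the data elements it touches.
    Elements with identical appearance patterns across all computations are
    always used together, i.e. they exhibit reference affinity."""
    by_pattern = defaultdict(list)
    for e in sorted(set().union(*uses)):
        by_pattern[tuple(e in u for u in uses)].append(e)
    return sorted(by_pattern.values())

uses = [{'x', 'y', 'a'}, {'x', 'y'}, {'a', 'b'}]
print(affinity_groups(uses))   # [['a'], ['b'], ['x', 'y']]

# Regrouping then allocates each affinity group contiguously:
layout = [e for group in affinity_groups(uses) for e in group]
print(layout)                  # ['a', 'b', 'x', 'y']
```

Here x and y are always touched by the same computations, so they share a group; a and b do not, despite being used by a common computation.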
In fact, the local technique would be more beneficial since the amount of data and data reuse often increase after loop fusion. The global techniques in this article are designed as an independent step before applying local techniques. Joint optimization may produce better results but it is outside the scope of this article.

The next two sections of the paper present computation fusion and data regrouping. The combined strategy is evaluated in Section 4. Section 5 discusses related work and Section 6 provides a summary of the paper's contributions.

2. Reuse-based computation fusion

In this section we cover three main topics: reuse-based loop fusion, a form of computation fusion that focuses on loops; an analysis of the complexity of optimal fusion; and an experimental exploration of the practical limits of fusion given complete program information.

2.1. Reuse-based loop fusion

In most programs, data reuses occur primarily in loops, hence computation fusion equates to loop fusion. This section describes the three components of reuse-based loop fusion: pair-wise fusion, sequential greedy fusion, and multi-level fusion.

2.1.1. Program model

We begin with a program model that abstracts away details that are not relevant to improving data reuse.

* A program consists of a list of loops and non-loop statements. The body of each loop consists of a
similar list. A structured branch is treated as a single meta-statement. A function call is either in-lined or treated as a meta-statement. Programs with unstructured goto statements will not be considered, as these can be eliminated by systematic transformation (see for example Chapter 7 of Allen and Kennedy [5]).

* Each subscript position of an array reference is in one of the two forms: a[i+t] and a[t], where a is the array name, i is a loop index, and t represents a loop-invariant value. Otherwise we assume the subscript ranges over all involved data dimensions.

We use this simple model because it is sufficient for us to test loop fusion on a set of commonly used benchmark programs. It simplifies the description of our fusion algorithms. In principle, we can extend the three new techniques described in this section to optimize more complex loops by using more powerful models such as affine subscript expressions and integer-set mappings, although at the expense of a slower fusion algorithm.

2.1.2. Pair-wise fusion

The loops in real programs often have different control structures, such as single statements, loop nests with different numbers of subloops, and perfect or imperfect nesting. Previous fusion techniques have been limited to fusing loops of the same general type and control structure. Pair-wise fusion as specified in this section strives to be more general than earlier strategies by fusing two loops at a time based on their data reuse rather than their control structure.

Given two loops, pair-wise fusion analyzes the data access patterns in the outer-most loop and then combines the iterations that share the same data. The two loops can have any control structure. A single statement is considered to be a degenerate loop. Pair-wise fusion classifies data sharing into the following three cases and uses different transformations for each case. Examples of pair-wise fusion will be given later in Figs. 2 and 3.
* Loop fusion and alignment, when data are shared between iterations of the two loops. It interleaves the data-sharing iterations of the two loops. It may align the loops in two directions. It may shift the second loop down to preserve data dependences, or it may shift the second loop up to bring together data reuse.

* Loop embedding, when data are shared between one loop and one or more iterations of the other loop. It embeds the former loop into the latter at the earliest data-sharing iteration.

* Iteration reordering, when two loops cannot be fused entirely. It breaks up the loops and fuses iterations that can be fused. Special cases of iteration reordering include loop splitting and loop reversal. In this paper, we consider only splitting at loop boundaries.

A compiler determines the data sharing between two loops by array section analysis [25]. It analyzes the data access of each iteration of each loop. The iteration access (its data footprint) is parameterized by the loop index of this and all outer loops. It also summarizes the data access patterns of all inner loops. For each dimension of an array, a loop accesses either the whole dimension, a number of elements on the border, or a loop-variant section (a range parameterized by the loop index variable). Data dependence is tested by the intersection of data footprints. The alignment factor is also determined by comparing data footprints.

Algorithm 1 shows the steps of pair-wise fusion. It first determines whether to apply loop fusion, embedding, or splitting. It then finds the alignment factor for fusion and embedding and finally returns the fused loop.
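A much-simplified, one-dimensional version of the footprint test can be sketched as follows; the offsets-and-interval model is our simplification of array section analysis, not the full method:

```python
def footprint(offsets, lo, hi):
    """Data footprint of a loop i = lo..hi over one array, given the
    constant offsets t of its references a[i+t]: the interval of touched
    elements."""
    return (lo + min(offsets), hi + max(offsets))

def footprints_intersect(f, g):
    """Data dependence is tested by intersecting footprints."""
    return f[0] <= g[1] and g[0] <= f[1]

# Loop 1: for i = 1..10 accesses a[i]; loop 2: for i = 2..10 accesses a[i-1].
f1 = footprint([0], 1, 10)    # (1, 10)
f2 = footprint([-1], 2, 10)   # (1, 9)
print(footprints_intersect(f1, f2))           # True: the loops share data
print(footprints_intersect((1, 5), (6, 10)))  # False: disjoint sections
```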
Algorithm 1 Pair-wise fusion

procedure PairwiseFusion(s, p)
Require: s and p are either a loop or a non-loop statement
  if s and p do not share data then
    return
  end if
  {determine whether to use loop fusion, embedding, or splitting}
  {find the alignment factor for fusion and embedding}
  for each array accessed in both s and p do
    find the smallest alignment factor that
      (1) satisfies data dependence, and
      (2) has the closest reuse
    if a constant alignment factor is not possible then
      try splitting off boundary iterations
    else
      the alignment factor is infinite
    end if
  end for
  find the largest of all alignment factors, f
  if f is not a constant then
    fusion failed
    return
  else
    fuse or embed loops with the alignment f
    return the merged loop and pieces after splitting, if any
  end if
end PairwiseFusion

When two loops share more than one array, PairwiseFusion may have multiple fusion choices. One example is shown in part (a) of Fig. 2. At the top level, we may fuse the two loops at i-level to reuse array A, or we may embed the first loop as one i-iteration to reuse array B. In general, we need to choose between fusion and embedding and between embedding one loop and
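The alignment-selection step of Algorithm 1 can be sketched for a restricted case of our own devising: loop 1 writes a[i+w] and loop 2 reads a[i+r], with one reference pair per shared array. This is a simplification, not the full algorithm.

```python
def smallest_alignment(w, r):
    """Loop 1 writes a[i+w]; loop 2 reads a[i+r]. Shifting loop 2 down by
    f places its iteration i at fused step i+f, so element a[k] is written
    at step k-w and read at step k-r+f. The dependence requires f >= r-w,
    and equality gives the closest reuse."""
    return r - w

def choose_alignment(shared):
    """Mirror of Algorithm 1's selection: per shared array take the
    smallest legal factor, then take the largest across all arrays."""
    return max(smallest_alignment(w, r) for w, r in shared)

print(choose_alignment([(0, -1)]))         # -1: shift the second loop up by one
print(choose_alignment([(0, 0), (0, 2)]))  # 2: the tighter constraint dominates
```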
embedding the other. We resolve the conflict by choosing the one that reuses the largest arrays or the largest number of arrays. Otherwise, we make a random choice, which would be the case for the example in Fig. 2(a).

One advantage of pair-wise fusion is the efficient test of fusibility. The fusible relation is not transitive. For example, for the two loops and the statement in Fig. 2(b), any two of the three are fusible but all three together are not, because the alignment required to safely fuse all three would leave no iterations in common between the first and third loop. Since the fusibility of a group of loops cannot be inferred from the fusibility of its subsets, finding all fusion choices would incur a cost exponential in program size. Pair-wise fusion avoids this exponential cost by incremental fusion. However, because it does not examine all possible choices, pair-wise fusion does not always produce the best result. The next section describes a heuristic. Sections 2.2 and 2.3 discuss the problem of optimal fusion and the practical limit of the heuristic-based fusion.

Pair-wise fusion works in the same way for programmer-written loops as for partially fused loops. The two algorithms we present shortly apply pair-wise fusion loop by loop and level by level until no more loops can be fused.

2.1.3. Single-level sequential greedy fusion

We use a heuristic we call sequential greedy fusion: for every statement or loop from the beginning of a program to the end, we fuse it forward as much as possible toward the previous data-sharing statement or loop. The heuristic is sequential because it considers loops in program order. It is greedy because it tries to merge all the uses of the same data into the place of its first definition. The heuristic is symmetrical to the policy of Best and Belady: while their scheme evicts the data that has the furthest reuse, sequential greedy fusion executes the instruction that has the closest reuse.
This section applies this heuristic at the source level. Section 2.3 studies its potential by applying it to the execution trace.

Algorithm 2 gives the basic steps of single-level fusion. For each statement s, the algorithm finds the closest predecessor p that shares data with s. Then it invokes pair-wise fusion described in the previous section. If fusion succeeds, the process is repeated for the fused loop because it now accesses a larger set of data and may share data with its predecessors.

Algorithm 2 Single-level sequential greedy fusion

procedure SingleLevelFusion(p)
Require: Program p is a list of statements and loops
  for each statement or loop s[i] do
    GreedyFusion(s[i])
  end for
end SingleLevelFusion

procedure GreedyFusion(s)
Require: s is a loop or a statement
  search backward from s to find the last data-sharing predecessor p
  if p does not exist or (s, p) has been marked as not fusible then
    return
  end if
  PairwiseFusion(s, p)
  if fusion failed then
    mark (s, p) as not fusible, return
  else
    {let q be the fused loop}
    GreedyFusion(q)
    for each remaining piece t after splitting do
      GreedyFusion(t)
    end for
  end if
end GreedyFusion

for i=1, N
  for j=1, N
    D[i,j] = A[i,j] + B[i,j]
  end for
end for
for i=1, N
  for j=1, N
    for k=1, N
      C[i,j] += A[i,k] + B[k,j]
    end for
  end for
end for
(a)

for i=2, N
  A[i] -= A[i-1]
end for
A[1] = A[N]
for i=2, N
  A[i] -= A[i-1]
end for
(b)

Fig. 2. Two examples of pair-wise fusion: (a) an example showing multiple choices of reuse-based fusion; (b) an example showing that the fusible relation is not transitive.
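Abstracting each loop or statement to the set of data it touches, and ignoring the legality checks that PairwiseFusion would perform, the greedy driver can be sketched as:

```python
def greedy_fusion(stmts):
    """stmts: the data set touched by each loop/statement, in program order.
    Each one is fused backward into its last data-sharing predecessor; a
    successful fusion is retried, because the fused loop touches more data
    and may now share data with earlier predecessors."""
    out = []
    for s in map(set, stmts):
        while True:
            for j in range(len(out) - 1, -1, -1):
                if out[j] & s:        # last data-sharing predecessor
                    s |= out.pop(j)   # fuse, then repeat for the fused loop
                    break
            else:
                break                 # no predecessor shares data: stop
        out.append(s)
    return out

# The retry matters: fusing {'A','C'} with {'B','C'} exposes sharing with {'A'},
# so all four fragments collapse into one fused loop.
print(greedy_fusion([{'A'}, {'B'}, {'B', 'C'}, {'A', 'C'}]))
print(len(greedy_fusion([{'A'}, {'B'}, {'A', 'C'}])))   # 2 fused groups
```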
The example in Fig. 3(a) illustrates the sequential greedy heuristic and pair-wise fusion. The program has two loops sharing access to array A. The loops cannot be fused directly because two intervening statements also access parts of A. GreedyFusion examines each loop and statement in program order. It embeds the two statements into the first loop. The two remaining loops are not fusible because A[1] is assigned by the last iteration of the first loop but used by the first iteration of the second loop. PairwiseFusion uses iteration splitting to peel off the first iteration of the second loop so that all later iterations can be fused with the first loop. Finally, PairwiseFusion uses loop alignment to shift up the iterations of the second loop so that they directly reuse A[i-1]. The fused program is shown in Fig. 3(b).

The cache locality is greatly improved. Before fusion, most elements of array A cannot be cached if the size of the array is larger than cache. After fusion, most of array A is cached regardless of the size of the array.

Reuse-based loop fusion may add significant instruction overhead because of the inserted branches. Branch statements can be avoided using a code-generation scheme from Allen and Kennedy (Section 8.6.3 of [5]), but at the cost of replicated loop bodies. Future processors will be better equipped to handle branches with features such as predicated execution. Even on current machines, our study in Section 1 shows that programs spend most time waiting for memory, so higher instruction overhead may be tolerated. Section 4 will evaluate reuse-based loop fusion on current machines.

For single-level loops, the algorithm ensures bounded reuse distance for most data reuses in the fused loop, which can then be cached by a constant-size cache regardless of the volume of the input data. We now establish this bound. Our program model allows two forms of array references, A(c) and A(i+c), where i is the loop index and c is a loop invariant.
When the number of loop iterations is sufficiently large, most data elements are accessed by array references of the form A(i + c). Let A(i1 + c1) and A(i2 + c2) be two array references in two initial loops with index variables i1 and i2, respectively. If the two loops are fused with no alignment, the reuses of an A array element are separated by |c1 - c2| iterations. Now we consider the effect of alignment. At each step of pair-wise fusion, the alignment factor must be a constant. Suppose that N_loops loops are fused; the total alignment is then O(N_loops). The reuses of an A element are separated by O(N_loops + |c1 - c2|), or O(N_loops), iterations in the fused loop. We assume that each initial loop contributes a constant number of references of the form B(i + c) for each array B. Hence, the fused loop has at most O(N_loops) references of that form for each array, covering a section of O(N_loops) elements in each iteration. The size of the section would be O(N_loops + k) for k consecutive iterations. Replacing k with N_loops, we see that the reuses of an A element are separated by at most O(N_loops + N_loops), or O(N_loops), elements from each array. The maximal reuse distance is bounded by the amount of accessed data from all arrays, which is O(N_arrays N_loops).

The upper bound is tight because a worst-case example can be constructed as follows: the body of the first loop is B1(i) = A(i - n); next are n loops, each with the body B1(i) = B1(i + 1) + B2(i + 1) + ... + Bm(i + 1); finally is a loop with the body A(i) = B1(i). Let all loops run from 1 to N except for the first loop, which runs from n + 1 to N. According to our informal proof, the reuse distance between the reuses of an A element is bounded by nm in the fused loop. This asymptotic bound is the lowest possible because the value of each A array element has to flow through n elements of the B1 array and combine the values of n(m - 1) elements of arrays B2 to Bm before returning to itself.
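The notion of reuse distance used in this argument can be made concrete with a small measurement routine. The sketch below is our own illustration, not the authors' tool; it takes reuse distance to be the number of distinct elements accessed between two consecutive uses of the same element, and shows that for two separate passes over an array the distance grows with N, while in an interleaved (fused) access sequence it stays constant:

```python
def reuse_distances(trace):
    """For each repeated access in the trace, count the distinct
    elements touched since the previous access to the same element."""
    last_seen = {}  # element -> position of its last access
    dists = []
    for pos, x in enumerate(trace):
        if x in last_seen:
            # distinct elements accessed strictly between the two uses
            between = set(trace[last_seen[x] + 1 : pos])
            dists.append(len(between))
        last_seen[x] = pos
    return dists

N = 100
# Two separate loops over A: every reuse distance is N - 1,
# so a cache smaller than the array captures no reuse.
separate = [('A', i) for i in range(N)] + [('A', i) for i in range(N)]
# Fused loop: the two accesses to each A[i] are adjacent; distance 0.
fused = [x for i in range(N) for x in (('A', i), ('A', i))]

assert max(reuse_distances(separate)) == N - 1
assert max(reuse_distances(fused)) == 0
```

In the bound above, fusion with bounded alignment plays the same role: it moves the two uses of an element to within O(N_loops) accessed elements of each other, independent of N.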
Therefore, the fusion algorithm achieves the tightest asymptotic upper bound on the length of reuse distances in a fused loop.

Being a heuristic, sequential-greedy fusion has two major weaknesses. First, it is not an optimal solution. It fuses loops upward as much as possible and often results in very large loops at the beginning. The second and related problem is that the heuristic is unconstrained, so a fused loop may access too much data and consequently overflow the limited register and cache resources. We have recently developed a method for resource-constrained fusion [29], which could be used to ameliorate this problem, but it is still under evaluation and not used in the results reported here. The evaluation section will measure the effect of the current, unconstrained fusion method.

(a) for i=2, N
      A[i] = f(A[i-1])
    end for
    A[1] = A[N]
    A[2] = 0.0
    for i=3, N
      B[i] = g(A[i-2])
    end for

(b) for i=2, N
      A[i] = f(A[i-1])
      if (i==3)
        A[2] = 0.0
      else if (i==N)
        A[1] = A[N]
      end if
      if (i>2 and i<N)
        B[i+1] = g(A[i-1])
      end if
    end for
    B[3] = g(A[1])

Fig. 3. Examples of pair-wise and sequential greedy fusion: (a) an example program; (b) transformed program.

Chen Ding, Ken Kennedy / J. Parallel Distrib. Comput. 64 (2004) 108–134

2.1.4. Multi-level fusion

Multi-level fusion first decides the order of loop levels and then applies single-level fusion level by level. It tries
to minimize the total distance of data reuses by minimizing the number of fused loops at outer levels. Like pair-wise fusion, multi-level fusion is based on the data access patterns rather than the control structures of loops. While all data are considered for the correctness of fusion, only large arrays are used to determine the profitability. Different choices lead to different loop fusions. Our current heuristic chooses the arrays that have the largest size. It requires that if a loop accesses more than one of the chosen arrays, it must traverse the same data dimension of each of these arrays. Otherwise, it chooses the largest array subset that meets the requirement. In the worst case, it chooses only one array. As a result, each loop level accesses either no data dimension or a unique data dimension of the chosen array(s). Two loop levels may access the same data dimension. In the following description, data dimension refers to a dimension of a chosen array.

Algorithm 3 shows the steps of multi-level fusion. For each loop level starting from the outermost, MultiLevelFusion determines the nesting order at each level L in three steps. First, it examines all data dimensions that are iterated by loops at L or deeper levels. For each data dimension s, it performs a hypothetical analysis in which it first moves all s-traversing loops to level L if possible, then performs sequential greedy fusion, and finally measures the number of fused loops at level L. The analysis in this step is based on the technique developed by McKinley et al., which permutes the loops that access the contiguous data dimension to the innermost level [43]. We use a variation of this method to permute loops that access a chosen data dimension to level L.

Algorithm 3. Multi-level loop fusion

procedure MultiLevelFusion(S, L)
Require: S is the set of data dimensions; L is the current loop level
  {Step 1. find the best data dimension for level L}
  for each dimension s in S, test hypothetically do
    LoopInterchange(s, L)
    apply SingleLevelFusion at level L
    count the number of fused loops
  end for
  choose the dimension s0 that yields the fewest fused loops
  {Step 2. fuse loops for level L on dimension s0}
  LoopInterchange(s0, L)
  apply SingleLevelFusion at level L
  {Step 3. continue fusion at level L + 1}
  for the body of each loop at level L do
    MultiLevelFusion(S - {s}, L + 1), where s is the data dimension iterated at level L
  end for
end MultiLevelFusion

procedure LoopInterchange(s, L)
Require: s is a data dimension; L is the current loop level
  for each loop nest do
    if loop level t (t > L) iterates data dimension s then
      interchange level t to L if possible; otherwise do nothing
    end if
  end for
end LoopInterchange

The second step of Algorithm 3 picks the dimension that yields the smallest number of loops after fusion at level L. The third step recursively applies MultiLevelFusion at the next loop level. Note that after fusion, not all level-L loops iterate over the same data dimension. Since loop interchange may not always succeed, some level-L loops may access a different data dimension than others. In the algorithm, the dimension s in the third step is not always the dimension s0 found in the second step. Loop fusion may take place even if the loops access a dimension other than s0.

We now analyze the time complexity of multi-level loop fusion. Assuming loop splitting at boundary iterations is the only form of iteration reordering, pair-wise fusion takes O(N_arrays), and single-level fusion takes O(N_init_loops x N_fused_loops x N_arrays), where N_init_loops is the number of loops or statements before fusion, N_fused_loops is the number of loops after fusion, and N_arrays is the number of data arrays in the program.
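The greedy single-level pass invoked in Steps 1 and 2 can be sketched in a toy model. This is our own abstraction, not the authors' implementation: each loop is represented only by the set of arrays it accesses, and the real dependence test is replaced by a caller-supplied predicate. Each candidate loop is fused into the earliest compatible fused group, which is the source of the O(N_init_loops x N_fused_loops x N_arrays) cost quoted above:

```python
def greedy_fusion(loops, fusible):
    """Sequential greedy single-level fusion over a toy model.

    loops   -- list of frozensets: the arrays each loop accesses
    fusible -- predicate(group_arrays, loop_arrays) -> bool; a
               stand-in for the real data-dependence test
    Each loop is fused into the earliest prior group that shares an
    array with it and passes the test; otherwise it starts a new group.
    """
    groups = []  # each group: set of arrays accessed by a fused loop
    for arrays in loops:                 # O(N_init_loops) candidates
        for g in groups:                 # O(N_fused_loops) groups
            if g & arrays and fusible(g, arrays):  # O(N_arrays) set ops
                g |= arrays
                break
        else:
            groups.append(set(arrays))
    return groups

# Three loops sharing array A fuse into one group; the loop over C
# shares no data and stays separate.
loops = [frozenset({'A', 'B'}), frozenset({'A'}),
         frozenset({'C'}), frozenset({'A', 'D'})]
groups = greedy_fusion(loops, fusible=lambda g, a: True)
assert groups == [{'A', 'B', 'D'}, {'C'}]
```

The "fuses upward as much as possible" behavior criticized in Section 2.1 is visible here: with a permissive predicate, the first group absorbs every loop that shares any array with it.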
Although this is quadratic in the number of loops in the program, we believe that the running time will be manageable because the number of loops to which fusion will be applied will not usually be large and, if the application is developed in a modular style, this number should not grow linearly with the size of the program. Other types of iteration reordering may increase the time complexity of the algorithm. However, boundary splitting is sufficient for the programs used in our evaluation. Since multi-level fusion examines each dimension at each loop level, the total cost is O(N_dimensions) times the cost of single-level fusion, where N_dimensions is the number of data dimensions in the data arrays.

2.2. Optimal computation fusion

This section formulates the problem of optimal fusion and proves that the problem is NP-hard.

We use the following program model in this section. A program is a sequence of computation units. We assume no cache reuse between different units but perfect cache reuse within the same unit. In other words, when a unit is executed, it loads its data from memory once and only once. Thus, a unit can be thought of as a single loop nest or program section in which all reuse is out of cache.
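Under this program model, the memory transfer of a program is the sum, over its units, of the distinct data each unit touches, and fusing two units saves exactly the data they share. A minimal sketch of this cost model (our own illustration; it ignores the ordering constraints that make optimal fusion NP-hard):

```python
def memory_transfer(units):
    """Total memory transfer under the paper's model: each unit loads
    its distinct data exactly once (perfect reuse inside a unit, none
    across units)."""
    return sum(len(u) for u in units)

def fuse(units, i, j):
    """Fuse units i and j (i < j) into one unit; their shared data is
    then loaded once instead of twice. Legality of the fusion is not
    checked in this sketch."""
    merged = units[i] | units[j]
    return [u for k, u in enumerate(units) if k not in (i, j)] + [merged]

u = [{'a', 'b'}, {'b', 'c'}, {'d'}]
assert memory_transfer(u) == 5
fused = fuse(u, 0, 1)
assert memory_transfer(fused) == 4  # saved one load of the shared 'b'
```

Optimal computation fusion then amounts to choosing which units to merge, subject to the program's ordering constraints, so that the total transfer is minimized.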