tem workloads across different environments, Roselli et al. found that even small caches can produce a high hit rate, but larger caches would have diminishing returns [28], similar to our conclusions on the ETC workload (Sec. 4.3). In the work describing Facebook's photo storage system [8], the authors presented statistics of I/O requests for the photos, which exhibit clear diurnal patterns, consistent with our observations in this paper (Sec. 3.3). Web caches are widely deployed as a caching infrastructure for speeding up Internet access. Their workloads have been collected and analyzed in Web servers [6, 25], proxies [5, 10, 16], and clients [7, 12]. In a study of requests received by Web servers, Arlitt and Williamson found that 80% of requested documents are smaller than ≈ 10 KB. However, requests to these documents generate only 26% of the data bytes retrieved from the server [6]. This finding is consistent with the distribution we describe in Sec. 3.2. In an analysis of traces of client-side requests, Cunha et al. show that many characteristics of Web use can be modeled using power-law distributions, including the distribution of document sizes, the popularity of documents, the distribution of user requests for documents, and the number of references to documents as a power law of their overall popularity rank (Zipf's law) [12]. Our modeling work on the ETC trace (Sec. 5) also shows power-law distributions in most request properties. In light of the increasing popularity and deployment of KV-stores, several schemes have been proposed to improve their performance, energy efficiency, and cost effectiveness [4, 9, 15, 26, 29]. Absent well-publicized workload traces, in particular large-scale production traces, many works used hypothetical or synthetic workloads [29].
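The power-law behavior noted above is also what makes simple synthetic load generators plausible. The sketch below is purely illustrative, with a hypothetical key format and hypothetical default parameters (key count, skew, GET/PUT mix) rather than the settings of any cited study: it draws key ranks from a Zipf-like distribution and emits a configurable mix of fixed-size GET and PUT requests.

```python
import random

def make_workload(num_keys=1000, alpha=1.0, put_fraction=0.1, seed=42):
    """Yield (op, key, value) requests whose key popularity follows a
    Zipf-like power law: rank r is drawn with probability ~ 1 / r**alpha."""
    rng = random.Random(seed)
    ranks = range(1, num_keys + 1)
    weights = [1.0 / r ** alpha for r in ranks]   # power-law popularity
    while True:
        rank = rng.choices(ranks, weights=weights)[0]
        key = f"key-{rank:016d}"                  # fixed-size 20-byte key
        if rng.random() < put_fraction:
            yield "PUT", key, b"x" * 100          # fixed-size dummy value
        else:
            yield "GET", key, None
```

Because the popularity weights are heavy-headed, a small fraction of the keys absorbs most of the generated requests, mimicking the skew that makes even small caches effective.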
For example, to evaluate SILT, a KV-store design that constructs a three-level store hierarchy on flash memory with a memory-based index, the authors assumed a workload of 10% PUT and 90% GET requests using 20B keys and 100B values, as well as a workload of 50% PUT and 50% GET requests for 64B KV pairs [23]. Andersen et al. used queries of constant size (256B keys and 1KB values) in the evaluation of FAWN, a KV-store designed for nodes consisting of low-power embedded CPUs and small amounts of flash storage [4]. In the evaluation of CLAM, a KV-store design that places both the hash table and data items on flash, the authors used synthetic workloads that generate keys from a random distribution, and a number of artificial workload mixes [3]. There are also some studies that used real workloads in KV-store evaluations. In two works on flash-based KV-store design, Debnath et al. adopted workloads from online multi-player gaming and a storage de-duplication tool from Microsoft [13, 14]. Amazon's production workload was used to evaluate its Dynamo KV-store [15]. However, these papers did not specifically disclose the workload characteristics. Finally, there are multiple studies offering analytical models of observed, large-scale workloads. Of those, a good survey of the methods is presented in Lublin and Feitelson's analysis of supercomputer workloads [17, 24].

8. CONCLUSION AND FUTURE WORK

This paper presented a dizzying number of views into a very large data set. Together, these views tell a coherent story of five different Memcached workloads at Facebook. We have ETC, the largest and most heterogeneous of the five. It has many keys that are requested millions of times a day, and yet its average hit rate is only 81.4%, because half of its keys are accessed infrequently, and because a few large values take up a disproportionate amount of the storage.
We have APP, which represents a single application and consequently has more uniform objects: 90% of them have roughly the same size. It also mirrors the interest of Facebook's users in specific popular objects, as evidenced in load spikes that are accompanied by improved locality metrics. We have VAR, a transient store for non-persistent performance data. It has three times as many writes as reads, and 70% of its keys occur only once. But its 94% hit rate provides noticeable improvements to the user experience. We have USR, which is more like a RAM-based store for immutable two-byte values than a cache. It may not be the best fit for Memcached, but its overall data size is small enough that even a few Memcached servers can deliver a hit rate of 98.2%. And finally, we have SYS, another RAM-based store that exhibits unique behavior, because its clients already cache its data. They only access SYS when new data or new clients show up, resulting in a low request rate and a nearly bimodal distribution of temporal locality: either a key is accessed many times in a short period, or virtually not at all. This study has already answered questions pertinent to improving Facebook's Memcached usage. For example, Fig. 7 shows the relatively marginal benefit of significantly increasing the cache size for the ETC pool. As another example, the analysis in Sec. 6 demonstrated both the importance of increasing Memcached's hit rate, especially on larger data, and the upper bound on the potential increase. The data presented here can also be used as a basis for new studies on key-value stores. We have also provided a simple analytical model of ETC's performance metrics to enable synthetic generation of more representative workloads. The treatment of workload modeling and synthetic load generation in this paper only scratches the surface, and deserves its own focus in a follow-up publication.
We plan to focus on this area and model the remaining workload parameters for ETC (such as key reuse), as well as other workloads. With these models, we would like to create representative synthetic load generators and share them with the community. We would also like to see improvements in the memory allocation model so that more room is saved for items in high demand. Areas of investigation include adaptive slab allocation, eliminating slabs altogether, and predicting item locality based on the analysis in this study. Finally, we are looking into replacing Memcached's replacement policy. LRU is not optimal for all workloads, and can be quite slow. We have already started prototyping alternative replacement schemes, and the initial results are encouraging.

Acknowledgements

We would like to thank Marc Kwiatkowski for spearheading this project and Mohan Srinivasan for helping with the kernel module. We are also grateful for the valuable feedback provided by the following: Goranka Bjedov, Rajiv Krishnamurthy, Rajesh Nishtala, Jay Parikh, and Balaji Prabhakar.
9. REFERENCES

[1] http://voldemort-project.com.
[2] Ahmad, I. Easy and efficient disk I/O workload characterization in VMware ESX server. In Proceedings of IEEE International Symposium on Workload Characterization (Sept. 2007).
[3] Anand, A., Muthukrishnan, C., Kappes, S., Akella, A., and Nath, S. Cheap and large CAMs for high performance data-intensive networked systems. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (Apr. 2010).
[4] Andersen, D. G., Franklin, J., Kaminsky, M., Phanishayee, A., Tan, L., and Vasudevan, V. FAWN: a fast array of wimpy nodes. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP) (Big Sky, Montana, 2009), ACM, pp. 1–14.
[5] Arlitt, M., Friedrich, R., and Jin, T. Workload characterization of a web proxy in a cable modem environment. ACM SIGMETRICS Performance Evaluation Review 27 (1999), 25–36.
[6] Arlitt, M. F., and Williamson, C. L. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking 5 (Oct. 1997), 631–645.
[7] Barford, P., Bestavros, A., Bradley, A., and Crovella, M. Changes in web client access patterns. In World Wide Web Journal, Special Issue on Characterization and Performance Evaluation (1999).
[8] Beaver, D., Kumar, S., Li, H. C., Sobel, J., and Vajgel, P. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (Oct. 2010).
[9] Berezecki, M., Frachtenberg, E., Paleczny, M., and Steele, K. Many-core key-value store. In Proceedings of the Second International Green Computing Conference (Orlando, FL, Aug. 2011).
[10] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the 18th Annual IEEE International Conference on Computer Communications (1999).
[11] Carns, P., Latham, R., Ross, R., Iskra, K., Lang, S., and Riley, K. 24/7 characterization of petascale I/O workloads. In Proceedings of the 4th Workshop on Interfaces and Architectures for Scientific Data Storage (Nov. 2009).
[12] Cunha, C. R., Bestavros, A., and Crovella, M. E. Characteristics of WWW client-based traces. Technical Report TR-95-010, Boston University Department of Computer Science (July 1995).
[13] Debnath, B. K., Sengupta, S., and Li, J. FlashStore: High throughput persistent key-value store. Proceedings of the 36th International Conference on Very Large Data Bases (VLDB) 3, 2 (2010).
[14] Debnath, B. K., Sengupta, S., and Li, J. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the Annual ACM SIGMOD Conference (June 2011), pp. 25–36.
[15] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP) (Stevenson, WA, 2007), pp. 205–220.
[16] Duska, B. M., Marwood, D., and Feeley, M. J. The measured access characteristics of world-wide web client proxy caches. In Proceedings of USENIX Symposium on Internet Technologies and Systems (Dec. 1997).
[17] Feitelson, D. G. Workload modeling for performance evaluation. In Performance Evaluation of Complex Systems: Techniques and Tools, M. C. Calzarossa and S. Tucci, Eds., vol. 2459 of Lecture Notes in Computer Science. Springer-Verlag, Sept. 2002, pp. 114–141. www.cs.huji.ac.il/~feit/papers/WorkloadModel02chap.ps.gz.
[18] Fitzpatrick, B. Distributed caching with memcached. Linux Journal, 124 (Aug. 2004), 72–78. www.linuxjournal.com/article/7451?page=0,0.
[19] Jiang, S., and Zhang, X. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance.
In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2002), SIGMETRICS '02, ACM, pp. 31–42.
[20] Kavalanekar, S., Worthington, B., Zhang, Q., and Sharda, V. Characterization of storage workload traces from production Windows servers. In Proceedings of IEEE International Symposium on Workload Characterization (Sept. 2008).
[21] Keeton, K., Alistair Veitch, D. O., and Wilkes, J. I/O characterization of commercial workloads. In Proceedings of the 3rd Workshop on Computer Architecture Evaluation using Commercial Workloads (Jan. 2000).
[22] Kim, Y., Gunasekaran, R., Shipman, G. M., Dillow, D. A., Zhang, Z., and Settlemyer, B. W. Workload characterization of a leadership class storage cluster. In Proceedings of Petascale Data Storage Workshop (Nov. 2010).
[23] Lim, H., Fan, B., Andersen, D. G., and Kaminsky, M. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (Oct. 2011).
[24] Lublin, U., and Feitelson, D. G. The workload on parallel supercomputers: Modeling the characteristics of rigid jobs. Journal of Parallel and Distributed Computing 63, 11 (Nov. 2003), 1105–1122. www.cs.huji.ac.il/~feit/papers/Rigid01TR.ps.gz.
[25] Manley, S., and Seltzer, M. Web facts and fantasy. In Proceedings of USENIX Symposium on Internet Technologies and Systems (Dec. 1997).
[26] Petrovic, J. Using Memcached for data distribution in industrial environment. In Proceedings of the Third International Conference on Systems (Washington, DC, 2008), IEEE Computer Society, pp. 368–372.
[27] Reddi, V. J., Lee, B. C., Chilimbi, T., and Vaid, K. Web search using mobile cores: Quantifying and mitigating the price of efficiency. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA) (June 2010), ACM. portal.acm.org/citation.cfm?id=1815961.1816002.
[28] Roselli, D., Lorch, J. R., and Anderson, T. E.
A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference (June 2000).
[29] Vasudevan, V. R. Energy-Efficient Data-intensive Computing with a Fast Array of Wimpy Nodes. PhD thesis, Carnegie Mellon University, Oct. 2011.
[30] Wang, F., Xin, Q., Hong, B., Miller, E. L., Long, D. D. E., Brandt, S. A., and McLarty, T. T. File system workload analysis for large scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies (Apr. 2004).
[31] Zhao, H. HipHop for PHP: Move fast. https://developers.facebook.com/blog/post/358/, Feb. 2010.
Ding C, Xiang X, Bao B et al. Performance metrics and models for shared cache. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 29(4): 692–712 July 2014. DOI 10.1007/s11390-014-1460-7

Performance Metrics and Models for Shared Cache

Chen Ding1 (丁晨), Xiaoya Xiang1 (向晓娅), Bin Bao1 (包斌), Hao Luo1 (罗昊), Ying-Wei Luo2 (罗英伟) and Xiao-Lin Wang2 (汪小林)

1 Department of Computer Science, University of Rochester, Rochester, NY 14627-0226, U.S.A.
2 School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

E-mail: cding@cs.rochester.edu; {sappleing, bin.bao}@gmail.com; hluo@cs.rochester.edu; {lyw, wxl}@pku.edu.cn

Received March 1, 2014; revised May 14, 2014.

Abstract: Performance metrics and models are prerequisites for scientific understanding and optimization. This paper introduces a new footprint-based theory and reviews the research in the past four decades leading to the new theory. The review groups the past work into metrics and their models, in particular those of reuse distance, metrics conversion, models of shared cache, performance and optimization, and other related techniques.

Keywords: memory performance metric, cache sharing, reuse distance

1 Introduction

Computing is ubiquitous in science, engineering, business, and everyday life. Most of today's applications, whether for cloud, desktop, or handheld, run on multicore processors. As a result, they interact with peer programs. It is beneficial to minimize the negative interaction. The benefit is important not just for good performance but also for stable performance, not just for parallel code but also for sequential applications running in parallel. This paper surveys the theories and techniques to measure and improve program interaction on multicore processors. A program is either a sequential application or a parallel application being treated as a single party in interaction. Here we assume that programs do not share data or computation, but they share the hardware host.
We call it a solo-run if a program runs by itself on a machine and a co-run if multiple programs run in parallel. Cache sharing is a primary cause of co-run interference. Modern applications spend most of their time accessing memory, and most memory accesses (typically over 99%) are served from cache. A commodity system today has 2 to 8 processors (sockets), 2 to 6 physical cores per processor, and 2 to 4 hyperthreaded logical cores per physical core. Nearly a hundred programs can run together in parallel. Partitioned cache solves the interference problem via program isolation. However, cache partitioning is wasteful when only one program is running and inefficient when co-run programs share data. Current multicore processors use a mix of private and shared cache. For example, Intel Nehalem has 256 KB of L2 cache per core and 4 MB to 8 MB of L3 cache shared by all cores. IBM Power 7 has 8 cores, with 256 KB of L2 cache per core and 32 MB of L3 cache shared by all cores. Depending on which CPU they are using, programs interact in different ways. Physical cores have private caches at the first and second levels but share the last-level cache. Logical cores share the caches at all levels. Different processors do not share the caches. However, they share the memory bandwidth, and the demand for memory bandwidth depends entirely on the performance of the cache. In addition, some caching policies, e.g., inclusive cache on Intel machines, may induce indirect interaction, where a program may lose data in its private cache due to data accesses by another program in the shared cache. The advent of cache sharing in the 2000s is reminiscent of the mid-1960s, when time sharing was invented. Since then, the problem of memory management has been well studied and solved, and modern operating systems routinely manage memory for a large number of programs. However, the problem of cache sharing is more complex.
Survey The work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61232008, the NSFC Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao under Grant No. 61328201, the National Science Foundation of USA under Contract Nos. CNS-1319617, CCF-1116104, CCF-0963759, an IBM CAS Faculty Fellowship and a research grant from Huawei. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding organizations. Xiang has graduated and is now working at Twitter Inc. Bao has graduated and is now working at Qualcomm Inc. ©2014 Springer Science + Business Media, LLC & Science Press, China
Cache is managed by hardware, not the operating system. Cache has multiple levels and varying mixes of exclusivity and sharing. Events of cache accesses and replacements are orders of magnitude more frequent than memory accesses and paging. A single program may access cache a billion times a second and can wipe out the entire content of the cache in less than a millisecond. The intensity multiplies when more programs run in parallel. Furthermore, the size of the cache is fixed on a given machine. One cannot go online and buy more cache as one can with memory. Cache interference is asymmetrical, non-linear, and circular. The asymmetry was shown experimentally by Zhang et al. [1] at Rochester and confirmed by later studies. In a pair-run experiment we conducted using Zhang's setup, one program became 85% slower, while its partner was only 15% slower. The interference changes from program to program. The effect depends not as much on how many programs are running as on which programs are running. Finally, the effect is circular: as a program affects its peers, it is also affected by them. The solution to these problems requires a special theory called the theory of locality. Locality is a basic property of a computing system. Denning [2] defined locality as "a concept that a program favors a subset of its segments during extended intervals (phases)." There is a difference between the data that a program has and the data that the program is actively using. The "active" data is a subset, which Denning [3] called the working set. Performance depends on how fast a computer system provides access to the active data subset. The access time of the other data is irrelevant.
Locality analysis is therefore a prerequisite to memory design, for the oft-quoted reason "we cannot improve what we cannot measure." In this article, we review the metrics for measuring and the techniques for improving performance in shared cache.

2 Footprint Theory of Locality

2.1 Footprint

As a locality metric, the footprint measures the amount of active data usage. Given a program execution, we extract the data accesses as a linear sequence of memory addresses or object IDs. The sequence is called an access trace or an address string. A window is a sub-sequence of consecutive accesses. The length of a window is measured by time, either logically, based on the number of accesses in the window, or physically, based on the times when the first and the last accesses were made. Given a window, the footprint is the amount of data accessed in the window, i.e., the size of the "active" data. For an execution, the footprint is defined for each window length as the average footprint of all windows of that length. In a dynamic execution, the data usage may change in windows of different lengths and in different windows of the same length. The footprint shows the change over all window lengths. For each length, it shows the average footprint, which is a single, unique value. For example, consider three data blocks a, b, c. Fig.1 shows two patterns of data accesses. One has a stack access pattern, where the data block last accessed is first reused. The other has a streaming pattern, where the blocks are traversed in the same order. The footprints are shown for all length-3 windows, four in each trace. The footprint of a trace is the average. For length-3 windows, the footprint, fp(3), is 2.5 in the stack trace and 3 in the streaming trace. Therefore, the streaming access has greater data activity for that window length. The complete footprint is defined for all window lengths and counts the amount of data accessed in all windows of all lengths. Fig.1.
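The Fig.1 numbers can be reproduced with a short brute-force sketch. The two length-6 traces below (a b c c b a for the stack pattern, a b c a b c for the streaming pattern) are assumed from the figure's description, since the text does not spell them out:

```python
def avg_footprint(trace, length):
    """Average number of distinct data blocks over all windows of `length` accesses."""
    windows = [trace[i:i + length] for i in range(len(trace) - length + 1)]
    return sum(len(set(w)) for w in windows) / len(windows)

stack = list("abccba")   # stack pattern: the block accessed last is reused first
stream = list("abcabc")  # streaming pattern: blocks traversed in the same order

print(avg_footprint(stack, 3))   # 2.5 -- average over the four length-3 windows
print(avg_footprint(stream, 3))  # 3.0 -- every length-3 window touches all three blocks
```

Enumerate-then-average is the quadratic baseline; the measurement algorithms surveyed in Subsection 2.2 exist precisely to avoid this enumeration.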
Amount of data accessed in length-3 windows in two access traces: (a) stack accesses and (b) streaming accesses. The footprint of a trace is the average amount. It is defined for each window length. When the length is 3, the footprint, fp(3), is 2.5 in the stack trace and 3 in the streaming trace.

In practice, the footprints are too numerous to enumerate. The number of time windows is quadratic in the length of the trace①. Assuming a program running for 10 seconds on a 3 GHz processor, we have 3E10 CPU cycles in the execution and 4.5E20 distinct windows. Brock et al. [4] described program analysis as a Big Data problem, and showed the scale of the problem by the number of time windows in an execution. Fig.2 shows that as the length of execution increases from 1 second to 1 month, the number of CPU cycles (n) ranges from 3E9 to 2E15, and the number of distinct execution windows, C(n, 2), from 4.5E18 to 5.8E29, that is, from 4 sextillion to over a half nonillion. As a dynamic analysis problem, the scale quickly reaches the size of any static problem. As a comparison, the figure shows the radius of the Milky Way in centimeters, 48 sextillion, and the radius of the observable universe, 44 octillion.

① If the trace length is n, the number of windows (and hence footprints) is C(n, 2) + n = n(n+1)/2, or O(n²) asymptotically.
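The footnote's window count is easy to confirm by direct enumeration; a minimal check:

```python
from math import comb

def count_windows(n):
    """Count the sub-sequences of consecutive accesses [i, j] in a trace of length n."""
    return sum(1 for i in range(n) for j in range(i, n))

# C(n, 2) windows of length >= 2, plus the n single-access windows: n(n+1)/2 in total.
for n in (6, 1000):
    assert count_windows(n) == comb(n, 2) + n == n * (n + 1) // 2

print(count_windows(6))  # 21
```

At n = 3E10 accesses (the 10-second run above), n(n+1)/2 gives the quoted 4.5E20 windows.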
Fig.2. Scale of the problem shown by the number of footprint windows in a program execution, compared to the size of a galaxy and the universe. Reproduced from [4].

For system design, it may not be very useful to consider very large windows, since caching decisions are usually based on information about the recent execution. For programming, however, it is necessary to analyze the full execution to find opportunities for global optimization. This is shown by Zhong et al. in whole-program locality analysis, which analyzes the full length of reuse distances to see how it changes with the input [5], and in affinity-based data layout, which groups structure fields based on the distribution of long reuse distances [6]. The purpose of a footprint theory is to overcome the enormity of the analysis problem, characterize the active data usage in all windows, and make the result useful for system analysis and optimization.

2.2 Footprint Theory

For locality analysis, the basic unit of information is a data access, and the basic relation is a data reuse. The theory of locality is concerned with the fundamental properties of data accesses and reuses, just as graph theory is with nodes and their links. The footprint theory consists of a set of formal definitions, algorithms, and properties based on the concept of the footprint. This subsection introduces the four components of the theory and their supporting techniques, based on the material published in a series of papers [7-12].

Footprint Measurement. The enormous scale of all-window analysis is tackled by a series of three algorithms, each two orders of magnitude more efficient than the previous one.

• Footprint distribution analysis, which enumerates all O(n²) footprints in O(n log m) time, where n is the length of the trace and m the maximal footprint.
• Average footprint analysis, which reduces the cost to linear time O(n) by computing the average without enumerating all footprints.

• Footprint sampling, which samples limited-size windows and further reduces the cost.

The distribution analysis is the first algorithm to measure the all-window footprint. As it actually enumerates all footprints, it finds the largest, smallest, median, average, and any percentile footprint for each window length. However, the cost is sometimes a thousand-fold slowdown compared to the speed of the original program. The second algorithm computes just the average footprint, and the cost is reduced from a thousand-fold slowdown to about 20 times. Being a linear-time algorithm, it is scalable in that the cost increases proportionally to the length of the program execution. The cache on a real machine has a finite size, so an analysis does not have to consider windows whose footprint is greater than the cache size. In addition, the behavior of a long-running program tends to repeat itself. Furthermore, on modern processors, the analysis can be carried out on a separate core in parallel with the analyzed execution. Footprint sampling specializes and parallelizes the analysis for a specific machine and program. The average cost is reduced to 0.5% of the running time of the unmodified execution. The algorithmic development attains immense gains in both computational complexity and implementation efficiency. As the baseline, the distribution analysis is the first viable solution for precise all-window analysis. The second and the third algorithms each improve efficiency by another order of magnitude, eventually making it fast enough for real-time analysis. This has a beneficial impact elsewhere, because the footprint can be used to compute other locality metrics, as we will see in the third part of the footprint theory. Composability.
A locality metric is composable if the metric of a co-run can be computed from the metrics of the solo-runs. If co-run programs do not share data, the footprint is composable. Let the average footprint of a program be prog.fp(x) for window length x. If we have k programs prog_1, prog_2, ..., prog_k actively sharing the cache, the aggregate footprint is the sum of the individual footprints:

corun.fp(x) = Σ_{i=1}^{k} prog_i.fp(x).

In comparison, the miss ratio is not composable. We will prove it later in Subsection 3.6.3. Intuitively, the co-run miss ratio will change compared to the solo-run miss ratio, since each program now has a fraction instead of the whole cache. The change in miss ratio, as mentioned earlier, is asymmetrical, non-linear, and