We Must Find the New Distribution for Big Data Accesses
• The Internet stores all kinds of huge data sets
  – The rapid growth and wide distribution of Internet media content is a representative case study of big data
  – The media content is carried by scalable distributed systems
• We hope the distribution model we develop is
  – General purpose, applicable to other big data applications
  – Consistent with the scalable nature of both data and systems

The Zipf distribution is believed to be the general model of data access patterns
• Zipf distribution (power law)
  – Characterizes the property of scale invariance
  – Heavy tailed, scale free
  – $y_i \propto i^{-\alpha}$, where i is the rank of an object, y_i is its number of references, and α is 0.6~0.8
  – On a log-log plot of y against i, this is a straight line with slope -α; on linear axes the curve has a heavy tail
• 80-20 rule
  – Income distribution: 80% of social wealth is owned by 20% of people (Pareto law)
  – Web traffic: 80% of Web requests access 20% of pages (Breslau, INFOCOM'99)
• System implications
  – Caching the working set in a proxy
  – Significantly reduces network traffic
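
The rank-frequency relation on this slide can be checked numerically. The following is a minimal sketch, not taken from the talk: the function names, the object count (10,000), the request total (1,000,000), and the choice α = 0.8 are all assumptions made for illustration. It builds ideal Zipf reference counts, measures the share of references that go to the top 20% of objects, and recovers the log-log slope by least squares.

```python
import math


def zipf_counts(n_objects, alpha, total_requests):
    """Ideal reference counts y_i proportional to i^(-alpha), ranks 1..n_objects."""
    weights = [i ** -alpha for i in range(1, n_objects + 1)]
    norm = sum(weights)
    return [total_requests * w / norm for w in weights]


def top_share(counts, fraction):
    """Share of all references that go to the top `fraction` of ranked objects."""
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


def loglog_slope(counts):
    """Least-squares slope of log(y_i) against log(i)."""
    xs = [math.log(i) for i in range(1, len(counts) + 1)]
    ys = [math.log(y) for y in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical workload: 10,000 objects, 1,000,000 requests, alpha = 0.8.
counts = zipf_counts(10_000, 0.8, 1_000_000)
print("share of references to top 20% of objects:", top_share(counts, 0.2))
print("fitted log-log slope:", loglog_slope(counts))
```

Because the counts are generated from an exact power law, the fitted slope should match -α up to floating-point error. Real traces are noisier, and a simple least-squares fit on log-log axes is only a rough diagnostic of whether a workload is Zipf-like.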

Does Internet media traffic follow Zipf's law?
• Web media systems
  – Chesire, USITS'01: Zipf-like
  – Cherkasova, NOSSDAV'02: non-Zipf
• VoD media systems
  – Acharya, MMCN'00: non-Zipf
  – Yu, EUROSYS'06: Zipf-like
• Live streaming and IPTV systems
  – Veloso, IMW'02: Zipf-like
  – Sripanidkulchai, IMC'04: non-Zipf
• P2P media systems
  – Gummadi, SOSP'03: non-Zipf
  – Iamnitchi, INFOCOM'04: Zipf-like

Inconsistent media access pattern models
• Still based on the Zipf model (heuristic assumptions)
  – Zipf with exponential cutoff
  – Zipf-Mandelbrot distribution
  – Generalized Zipf-like distribution
  – Two-mode Zipf distribution
  – Fetch-at-most-once effect
  – Parabolic fractal distribution
  – ...
• All case studies
  – Based on one or two workloads
  – Different from or even in conflict with each other
• An insightful understanding is essential to
  – Content delivery system design
  – Internet resource provisioning
  – Performance optimization
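
For reference, three of the variants listed on this slide are commonly written as follows, using the same i (rank of an object) and y_i (number of references) as the Zipf slide. The extra parameters κ, q, b, and c are notation assumed here for illustration; the individual studies may parameterize these models differently.

```latex
\begin{align*}
\text{Zipf with exponential cutoff:} \quad & y_i \propto i^{-\alpha}\, e^{-i/\kappa} \\
\text{Zipf-Mandelbrot:} \quad & y_i \propto (i + q)^{-\alpha} \\
\text{Parabolic fractal:} \quad & \log y_i = \log y_1 - b \log i - c\,(\log i)^2
\end{align*}
```

Each form adds one parameter to the plain power law to bend the straight log-log line, which is why the slide groups them as models that are still based on Zipf.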

Research Objectives
• Find a general distribution model of Internet media access patterns as a case for big data
  – Comprehensive measurements and experiments
  – Rigorous mathematical analysis and modeling
  – Insights into media system designs