MapReduce Operations on Hadoop
• Calculate the average salary of every department
  – Input from HDFS: {name: (org., salary)}; desired output: {org.: avg. salary}
  – Map tasks read the input records and emit (org., salary) pairs
  – Shuffle the data using org. as the Partition Key (PK), so that records of "org-1" and records of "org-2" are routed to different reduce tasks
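The map side of this example can be sketched as follows. This is a minimal illustration, not Hadoop code; the comma-separated record format and the function name `map_record` are assumptions for the sketch.

```python
# Sketch of the map side: each input record carries (name, org, salary),
# and the mapper emits (org, salary) so org can serve as the Partition Key.
def map_record(line):
    name, org, salary = line.strip().split(",")
    return (org, float(salary))

records = [
    "alice,org-1,1000",
    "bob,org-1,3000",
    "carol,org-2,2000",
]
map_output = [map_record(r) for r in records]
# map_output == [("org-1", 1000.0), ("org-1", 3000.0), ("org-2", 2000.0)]
```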
MapReduce Operations on Hadoop
• Calculate the average salary of every department
  – After the shuffle, one Reduce (Avg.) task calculates the average salary for "org-1" and another calculates the average salary for "org-2"
  – The results {org.: avg. salary} are written back to HDFS
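A single Reduce (Avg.) task from the slide can be sketched like this (an illustrative helper, assuming the reducer receives all salaries shuffled to it for one org):

```python
# Sketch of one Reduce (Avg.) task: given an org and all salaries that
# were shuffled to this task for that org, emit (org, average salary).
def reduce_avg(org, salaries):
    return (org, sum(salaries) / len(salaries))

print(reduce_avg("org-1", [1000.0, 3000.0]))  # -> ('org-1', 2000.0)
```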
Key/Value Pairs in MapReduce
• A simple but effective programming model designed to process huge volumes of data concurrently on a cluster
• Map: (k1, v1) → (k2, v2)
  – e.g. (name, org & salary) → (org, salary)
• Reduce: (k2, v2) → (k3, v3)
  – e.g. (org, salary) → (org, avg. salary)
• Shuffle: Partition Key (it may be the same as k2, or not)
  – The Partition Key determines how a key/value pair in the map output is transferred to a reduce task
  – e.g. the org. name is used to partition the map output files accordingly
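The whole (k1, v1) → (k2, v2) → (k3, v3) flow above can be simulated in memory. This is a toy sketch of the programming model, not Hadoop itself; the helper name `run_mapreduce` and the sample data are illustrative, and the Partition Key is taken to be k2.

```python
from collections import defaultdict

# Toy in-memory simulation of Map -> Shuffle -> Reduce.
def run_mapreduce(records, map_fn, reduce_fn):
    # Map: (k1, v1) -> (k2, v2)
    intermediate = [map_fn(k1, v1) for k1, v1 in records]
    # Shuffle: group map output values by the Partition Key (here k2 itself)
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: (k2, [v2, ...]) -> (k3, v3)
    return dict(reduce_fn(k2, vs) for k2, vs in groups.items())

employees = [("alice", ("org-1", 1000.0)),
             ("bob",   ("org-1", 3000.0)),
             ("carol", ("org-2", 2000.0))]

result = run_mapreduce(
    employees,
    map_fn=lambda name, ov: (ov[0], ov[1]),           # (name, (org, salary)) -> (org, salary)
    reduce_fn=lambda org, s: (org, sum(s) / len(s)),  # (org, salaries) -> (org, avg. salary)
)
# result == {"org-1": 2000.0, "org-2": 2000.0}
```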
MR (Hadoop) Job Execution Patterns
• The execution of a MR job involves 6 steps
  – 1: Job submission — the MR program (job) is submitted to the master node
  – 2: Assign tasks — the master node does control-level work (e.g. job scheduling and task assignment), dispatching Map tasks and Reduce tasks to worker nodes
• Worker nodes do the data processing work specified by the Map or Reduce function
• Data is stored in a Distributed File System (e.g. Hadoop Distributed File System)
MR (Hadoop) Job Execution Patterns
• The execution of a MR job involves 6 steps (continued)
  – 3: Map phase — concurrent Map tasks run on the worker nodes and produce map output
  – 4: Shuffle phase — map output is shuffled to different reduce tasks based on Partition Keys (PKs), which are usually the map output keys
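The shuffle routing in step 4 can be sketched as follows. Hadoop's default `HashPartitioner` assigns a map output record to reducer `hashCode(key) mod numReduceTasks`; here Python's built-in `hash()` stands in for Java's `hashCode()`, so this is an illustration of the idea rather than the exact Hadoop computation.

```python
# Sketch of shuffle routing: all records sharing a Partition Key land on
# the same reduce task, because the partition index depends only on the key.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

num_reduce_tasks = 2
# Records of "org-1" always map to one reducer, records of "org-2" to another
# (possibly the same one if the hashes collide modulo num_reduce_tasks).
routes = {org: partition(org, num_reduce_tasks) for org in ["org-1", "org-2"]}
```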