Data Communication in Computer Systems
[Figure: data transfer from Source to Destination, with labels "Latency Time" and "Bandwidth Time"]
• Destination-perceived latency reduction is still limited due to the imbalanced improvement of bandwidth and latency
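The point of this slide can be illustrated with a minimal sketch. It assumes a simple model that the slide only hints at through its labels: the time the destination waits is the latency time plus the bandwidth time (size divided by bandwidth). The function names and all numbers below are hypothetical and chosen only for illustration.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed model: perceived time = latency time + bandwidth time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def perceived_speedup(size_bytes, latency_s, bandwidth_bytes_per_s,
                      latency_factor, bandwidth_factor):
    """Speedup seen by the destination when latency shrinks by
    latency_factor and bandwidth grows by bandwidth_factor."""
    old = transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    new = transfer_time(size_bytes,
                        latency_s / latency_factor,
                        bandwidth_bytes_per_s * bandwidth_factor)
    return old / new


# Hypothetical baseline: 1 ms latency, 1 MB/s bandwidth.
# Latency improves 10X, bandwidth improves 100X.
for size in (1_000, 1_000_000, 1_000_000_000):
    s = perceived_speedup(size, 1e-3, 1e6, 10, 100)
    print(f"{size:>13} bytes: perceived speedup {s:.1f}X")
```

Under these assumptions, small transfers gain far less than the 100X bandwidth improvement because the latency term dominates; only very large transfers approach it.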
Latency Lags Bandwidth (CACM, Patterson)
• In the last 20 years:
  – 100–2000X improvement in bandwidth
  – 5–20X improvement in latency
• Between CPU and on-chip cache:
  – bandwidth: 2250X increase
  – latency: 20X reduction
• Between off-cache and DRAM:
  – bandwidth: 125X increase
  – latency: 4X reduction
• Between DRAM and disk:
  – bandwidth: 150X increase
  – latency: 8X reduction
• Between two nodes via a LAN:
  – bandwidth: 100X increase
  – latency: 15X reduction
[Figure: relative bandwidth improvement versus relative latency improvement for microprocessor, memory, disk, and network, with a reference line where latency improvement equals bandwidth improvement]
Note that latency improved about 10X while bandwidth improved about 100X to 1000X.
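To make the imbalance concrete, the sketch below divides each bandwidth improvement by the matching latency improvement, using only the figures listed on this slide. The dictionary and function names are hypothetical; a ratio of 1 would mean the two had improved in step.

```python
# (bandwidth improvement, latency improvement) as listed on the slide.
IMPROVEMENTS = {
    "CPU <-> on-chip cache": (2250, 20),
    "off-cache <-> DRAM": (125, 4),
    "DRAM <-> disk": (150, 8),
    "LAN node <-> node": (100, 15),
}


def imbalance_ratio(bandwidth_gain, latency_gain):
    """How many times more bandwidth improved than latency did."""
    return bandwidth_gain / latency_gain


for link, (bw_gain, lat_gain) in IMPROVEMENTS.items():
    ratio = imbalance_ratio(bw_gain, lat_gain)
    print(f"{link:<24} bandwidth {bw_gain}X, latency {lat_gain}X, "
          f"ratio {ratio:.1f}")
```

Every ratio comes out well above 1, which is the sense in which latency lags bandwidth.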
Three Balance Theories of Economics
• Adam Smith (The Wealth of Nations, 1776)
  – Commodity price is determined by an invisible hand in the market
  – Merits: high incentive; fast, market-based economic progress
  – Limits: social fairness, gap between rich and poor, ...
• Karl Marx (Das Kapital, 1867)
  – Commodity price is determined by the necessary labor time
  – Merits: fairness is addressed
  – Limits: low efficiency and slow economic progress
• John Maynard Keynes (General Theory of E, I, and M, 1936)
  – "Effective demand" includes both market needs and investments
  – Merits: government investments can balance differences in society and narrow the gap between rich and poor
  – Limits: a big government can easily build a "relaxed and lazy" society
  – (E: employment, I: interest, and M: money)
How is Resource Supply/Demand Balanced?
• Slow down CPU speed (Smith: reduction of oversupplied production)
  – Earth Simulator: NEC AP, 500 MHz (4-way SU, a VU).
  – Blue Gene/L: IBM PowerPC 440, 700 MHz.
  – Columbia: SGI Altix 3700 (Intel Itanium 2), 1.5 GHz (commodity processors; no choice about its high speed).
• Very low latency data accesses (Keynes: creating effective demand)
  – Earth Simulator: 128K L1 cache and 128 large registers.
  – Blue Gene/L: on-chip L3 cache (2 MB).
  – Columbia: on-chip L3 cache (6 MB).
• Fast accesses to huge/shared memory (Marx/Keynes: increase investment in public infrastructure)
  – Earth Simulator: crossbar switches between AP and memory.
  – Blue Gene/L: cached DRAM memory and 3-D torus connection.
  – Columbia: SGI NUMAlink's data block transfer time: 50 ns.
• Further latency reductions: prefetching and caching
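As a rough illustration of why caching reduces perceived latency, the sketch below uses the standard average memory access time formula (hit time plus miss rate times miss penalty). The formula is not taken from the slide, and the function name and all numbers are hypothetical.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Standard average-access-time model for a single cache level:
    every access pays the hit time; misses also pay the penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical numbers: 1 ns cache hit, 50 ns penalty to reach memory.
for miss_rate in (1.0, 0.5, 0.1, 0.01):
    t = average_access_time(1.0, miss_rate, 50.0)
    print(f"miss rate {miss_rate:>5.2f}: average access time {t:.2f} ns")
```

In this model, prefetching helps in the same way: data that arrives in the cache before it is requested turns a would-be miss into a hit, lowering the effective miss rate.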
Computing Operations Versus Data Movement
• Computation is much cheaper than data movement
  – In 0.13 µm CMOS, a 64-bit FPU occupies < 1 mm²; 16 FPUs can easily be placed on a 14 mm × 14 mm, 1 GHz chip ($200).
  – Processing data from 16 registers (256 GB/s): < $12.5/GFlop (60 mW/GFlop)
  – Processing data from on-chip caches (100 GB/s): $32/GFlop (1 W/GFlop)
  – Processing data from off-chip memory (16 GB/s): $200/GFlop (many W/GFlop)
  – Processing data from further locations increases cost dramatically
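The dollar figures above can be reproduced with a small sketch under two assumptions that the slide does not state: each FPU completes one floating-point operation per cycle, and each operation consumes two 64-bit operands (16 bytes). With those assumptions the sustained rate is capped by whichever is lower, the peak compute rate or the rate the data source can feed. All names below are hypothetical.

```python
CHIP_PRICE_USD = 200.0   # from the slide
NUM_FPUS = 16            # from the slide
CLOCK_GHZ = 1.0          # from the slide
BYTES_PER_FLOP = 16      # ASSUMPTION: two 64-bit operands per operation

# ASSUMPTION: one operation per FPU per cycle.
PEAK_GFLOPS = NUM_FPUS * CLOCK_GHZ


def sustained_gflops(bandwidth_gb_per_s):
    """Rate the chip can sustain when fed at the given bandwidth."""
    return min(PEAK_GFLOPS, bandwidth_gb_per_s / BYTES_PER_FLOP)


def cost_per_gflop(bandwidth_gb_per_s):
    """Chip price spread over the sustained rate, in USD per GFlop."""
    return CHIP_PRICE_USD / sustained_gflops(bandwidth_gb_per_s)


for source, bw in (("registers", 256), ("on-chip caches", 100),
                   ("off-chip memory", 16)):
    print(f"{source:<16} {bw:>4} GB/s -> "
          f"{sustained_gflops(bw):.2f} GFlops, "
          f"${cost_per_gflop(bw):.2f}/GFlop")
```

Under this reading, the same $200 chip delivers 16 GFlops from registers but only 1 GFlop from off-chip memory, so the cost per GFlop is set by data movement rather than by the arithmetic units.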