



## First planar integrated circuit (1961)

### Intel's 90 nm Montecito processor (2004), Itanium Processor Family



- Transistors: 1.72 billion
- Frequency: >1.7 GHz
- Power: ~100 W

Source: Intel Developer Forum, September 2004

#### Adrian Ionescu, October 2005

### Cost limit



#### Fundamental limits

From: thermodynamics, quantum mechanics, electromagnetics

- Limit on energy transfer during a binary switching transition:
   $E(min) = (\ln 2)\,kT$ ($= kT \log_e N$, $N = 2$) (J. von Neumann)
- Heisenberg's uncertainty principle:
   $\Delta E > h/\Delta t$ → forbidden region for power-delay
- Electromagnetics: $\tau > L/c_0$ (the time an electromagnetic wave needs to travel across an interconnect of length $L$)


### Why $E(min) = kT \times ln2$ ?

Binary signal discrimination: the slope of the static transfer curve of a (CMOS) binary logic gate must be greater than unity in absolute value at the transition point where input and output voltage levels are equal  $\rightarrow$  **CMOS** inverter

$$V_{dd}(min) = 2\,\frac{kT}{q} \left[ 1 + \frac{C_{fs}}{C_{ox} + C_{d}} \right] \ln\left(2 + \frac{C_{d}}{C_{ox}}\right)$$

$$V_{dd}(min) \approx 2(\ln 2)\,\frac{kT}{q} = 1.38\,\frac{kT}{q} = 0.036\ \text{V} \ @\ T = 300\ \text{K}$$

Min signal energy stored on gate:

$$E_s(min) = \frac{1}{2} Q_g V_{dd} = \frac{1}{2}\, q \times 2(\ln 2)\,\frac{kT}{q} = kT \times \ln 2 = 0.693\,kT$$

with:

$$C_g = \frac{\epsilon_{ox}L_{min}^2}{t_{ox}} \Longrightarrow \quad L_{min} = \left[ \frac{t_{ox}}{\epsilon_{ox}} \cdot \frac{q^2}{2(\ln 2)\,kT} \right]^{1/2} = 9.3\ \text{nm} \ @\ t_{ox} = 1\ \text{nm}$$

Source: J.D. Meindl, J. A. Davis, IEEE JSSC, Vol. 35, October 2000, pp. 1515-1516.
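The two headline numbers of this derivation can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, the constants are the standard SI values of $k$ and $q$, and it covers the ideal-MOSFET case ($C_{fs} = 0$, $C_d \rightarrow 0$) only.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann constant k, in J/K
Q_ELECTRON = 1.602176634e-19  # elementary charge q, in C


def vdd_min(temperature_k: float) -> float:
    """Ideal-case minimum supply voltage, Vdd(min) = 2 ln(2) kT/q, in volts."""
    return 2.0 * math.log(2.0) * K_BOLTZMANN * temperature_k / Q_ELECTRON


def es_min(temperature_k: float) -> float:
    """Minimum signal energy for one electron on the gate, (1/2) q Vdd(min), in joules."""
    return 0.5 * Q_ELECTRON * vdd_min(temperature_k)


T = 300.0
print(f"Vdd(min) at {T:.0f} K: {vdd_min(T):.4f} V")
print(f"Es(min) at {T:.0f} K: {es_min(T):.3e} J = {es_min(T) / (K_BOLTZMANN * T):.3f} kT")
```

The energy comes out as $\ln 2 \approx 0.693$ times $kT$ regardless of temperature, which is the point of the slide: the electrical argument lands on the same limit as the thermodynamic one.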

#### **Fundamental limits**

Average power transfer during a binary transition, $P$, versus transition time, $t_d$. The red, orange, and green zones are forbidden by fundamental, silicon material, and 50-nm channel length transistor device level limits, respectively.



Source: J. D. Meindl, Q. Chen, J. A. Davis, Science, Vol. 293, pp. 2044-2049, September 2001



IBM POWER5

### **CMOS Computer Performance**







Oxide thickness is approaching a few atomic layers



faster w/o scaling



Heat Density

IBM Cell

Hetero- Many- core





Moore's Law: based on CMOS scaling

Figure: Intel process roadmap. Recoverable node labels: 90 nm (2003), 65 nm, 45 nm (2007), 32 nm (2009), 22 nm, 16 nm (2013), 11 nm (2015), 8 nm (2017); the period labels 2005-2012 and 2013-2017 and a "Research" label also appear.

Source: Intel, March 2008

Ringvorlesung ETH Spring Semester 2010, Ronald Luijten, IBM Research - Zurich, "IT energy challenges - a new approach", © 2010 IBM Corporation







### An example for a planned 2012 machine: Blue Waters

- 10 PFlop/s ($10^{16}$ operations per second) sustained
- 300'000 compute cores = 37'500 CPU chips = 9375 QCM = 1172 drawers = 98 racks
- 800 W / QCM → 7.5 MW in CPUs
- New building being finished
- 24 transformers @ 2 MW
- Blue Waters PUE = 1.1
- http://www.ncsa.illinois.edu/BlueWaters/
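The packaging hierarchy and the CPU power figure above follow from simple division and multiplication. The sketch below reproduces them; the variable names are chosen for this example, and applying the PUE to the CPU power alone is an illustrative simplification (PUE properly applies to the whole IT load).

```python
# Figures quoted on the slide.
CORES = 300_000
CPU_CHIPS = 37_500
QCMS = 9_375
DRAWERS = 1_172
RACKS = 98
WATTS_PER_QCM = 800
PUE = 1.1
TRANSFORMERS = 24
MW_PER_TRANSFORMER = 2

# Ratios between adjacent packaging levels.
cores_per_chip = CORES / CPU_CHIPS
chips_per_qcm = CPU_CHIPS / QCMS
qcms_per_drawer = QCMS / DRAWERS    # close to 8; the last drawer is not full
drawers_per_rack = DRAWERS / RACKS  # close to 12

# Power drawn by the CPU modules, and that share grossed up by the PUE.
cpu_power_mw = QCMS * WATTS_PER_QCM / 1e6
cpu_share_at_facility_mw = cpu_power_mw * PUE
transformer_capacity_mw = TRANSFORMERS * MW_PER_TRANSFORMER

print(f"{cores_per_chip:.0f} cores/chip, {chips_per_qcm:.0f} chips/QCM, "
      f"{qcms_per_drawer:.2f} QCM/drawer, {drawers_per_rack:.2f} drawers/rack")
print(f"CPU power: {cpu_power_mw:.2f} MW, with PUE: {cpu_share_at_facility_mw:.2f} MW")
print(f"Installed transformer capacity: {transformer_capacity_mw} MW")
```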







 Redundant Cooling CRAH Eliminated



### Connecting 50,000 servers is challenging

High bandwidth at low cost

### Hierarchy of network

Rack switch, array switch, L3 switch, border routers

### BlueGene/L - Holistic Design in Practice



- Using the industry-standard LINPACK benchmark, the IBM Blue Gene/L system attained a sustained performance of <u>70.72 Teraflops</u>, eclipsing the three-year-old top mark of <u>35.86 Teraflops</u> for the Japanese Earth Simulator and the recent mark of <u>42.7 Teraflops</u> at NASA's Ames Research Center.
- The BlueGene/L system is 1/100th the physical size (320 vs 32,500 square feet) and consumes 1/28th the power (216 kW vs 6,000 kW) as compared to the Earth Simulator.
- Dr. Bernard S. Meyerson
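The size and power ratios quoted above, and the resulting energy efficiency, can be recomputed directly from the slide's numbers. This is a small sketch; the helper name `mflops_per_watt` is made up for this example.

```python
# LINPACK performance, power, and floor space as quoted on the slide.
BLUE_GENE_L = {"tflops": 70.72, "kw": 216.0, "sqft": 320.0}
EARTH_SIMULATOR = {"tflops": 35.86, "kw": 6000.0, "sqft": 32500.0}


def mflops_per_watt(system: dict) -> float:
    """Sustained MFLOPS delivered per watt (TFLOPS * 1e6 MFLOPS / (kW * 1e3 W))."""
    return system["tflops"] * 1e6 / (system["kw"] * 1e3)


area_ratio = EARTH_SIMULATOR["sqft"] / BLUE_GENE_L["sqft"]
power_ratio = EARTH_SIMULATOR["kw"] / BLUE_GENE_L["kw"]
bgl_efficiency = mflops_per_watt(BLUE_GENE_L)
es_efficiency = mflops_per_watt(EARTH_SIMULATOR)

print(f"Area ratio: {area_ratio:.1f}x, power ratio: {power_ratio:.1f}x")
print(f"Blue Gene/L: {bgl_efficiency:.0f} MFLOPS/W, "
      f"Earth Simulator: {es_efficiency:.1f} MFLOPS/W")
```

Because Blue Gene/L delivers roughly twice the performance at about 1/28th of the power, the gap in performance per watt is larger than either ratio alone suggests.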

© 2010 IBM Corporation

By 2011, the world will store 10X the data stored in 2006, but Internet-connected devices will grow by 2000X, from 500M to 1 trillion, and each will demand that someone "listen".












FIGURE 5.7: Human energy usage vs. activity levels (adult male) [52].



Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.



Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.
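The two captions can be reproduced with a very simple model. The sketch below assumes that power grows linearly from an idle floor to peak; both the linear shape and the 10% idle floor used for the energy-proportional server are assumptions of this example, while the 50% floor matches the "about half its full power" statement in the Figure 2 caption.

```python
def power_fraction(utilization: float, idle_fraction: float) -> float:
    """Power as a fraction of peak, assuming a linear rise from the idle floor."""
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization: float, idle_fraction: float) -> float:
    """Work per watt relative to its value at peak (utilization / power fraction)."""
    return utilization / power_fraction(utilization, idle_fraction)


for idle in (0.5, 0.1):
    row = ", ".join(
        f"u={u:.0%}: {relative_efficiency(u, idle):.0%}" for u in (0.1, 0.3, 0.5, 1.0)
    )
    print(f"idle floor {idle:.0%} -> relative efficiency {row}")
```

Under these assumptions a 10% idle floor gives more than 80% of peak efficiency at 30% utilization and more than 50% at 10% utilization, in line with the Figure 4 caption, whereas the 50% floor stays below half of peak efficiency at 30% utilization.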



FIGURE 5.3: An example benchmark result for SPECpower\_ssj2008; energy efficiency is indicated by bars, whereas power consumption is indicated by the line. Both are plotted for a range of utilization levels, with the average metric corresponding to the vertical dark line. The system has a single-chip 2.83 GHz quad-core Intel Xeon processor, 4 GB of DRAM, and one 7.2 k RPM 3.5" SATA disk drive.





Multi-Level Holistic Modeling of Computing Algorithms and Hardware


#### Multi-Level Model Equation Examples



**Notation**: $C_{xx}$: computation throughput (FLOPS); $P_{xx}$: power; $f$: clock frequency; $K_{xx}$: constant; $V_{xx}$: voltage; $R_{xx}$: resistance; $\eta_{xx}$: power supply efficiency; $\rho$: application active/idle ratio, etc.

### Power Distribution (J. Hamilton)





### **Cooling: Cold/Hot Aisles**





- CRAC = computer room air conditioning
  - Cold air goes through the servers and exits in the hot aisle
  - Cold aisles ~18-22 °C, hot aisles ~35 °C
  - CRAC units consume a significant amount of energy!

### **Energy Use in a DC**







FIGURE 5.1: LBNL survey of the power usage efficiency of 24 datacenters, 2007 (Greenberg et al.)

- Cooling infrastructure is a major contributor
  - Picture from a PUE=3 data center
  - Current datacenters: PUE: 1.2 to 2



### WSC design considerations: Request-level parallelism



- Instruction-level parallelism (ILP)
  - Pipelining, speculation, OOO, ...
- Data-level parallelism (DLP)
  - Vectors, GPUs, MMX, ...
- Thread-level parallelism (TLP)
  - Multithreading, multi-cores, ...
- Request-level parallelism (RLP)
  - Parallelism among multiple decoupled tasks
  - Web servers, "map-reduce", search, email, ...
  - Large-scale distributed systems (clusters, NOW, Grids)

## WSC design considerations: The datacenter is the computer







### Cost model: systems capex



#### Servers:

- 45,978 servers x \$1450 per server = \$66.7M CAPEX
- Depreciation: 3 years; cost of money = 5%
- Monthly OPEX: \$2000K

### Networking

- Rack switches: 1150 x \$4800; Array switches: 22 x \$300K; Layer3 switch: 2 x \$500K; Border routers: 2 x \$144.8K = \$13.41M CAPEX
- Depreciation: 4 years; cost of money = 5%
- Monthly OPEX: \$309K
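The monthly figures for servers and networking can be reproduced by treating each purchase as a loan repaid in equal monthly installments. The slide does not say how the 5% cost of money is applied, so the annuity formula below is an assumption of this sketch; it happens to match the quoted \$2000K and \$309K closely.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Level monthly payment that repays `principal` over `years` at `annual_rate`."""
    monthly_rate = annual_rate / 12.0
    months = years * 12
    return principal * monthly_rate / (1.0 - (1.0 + monthly_rate) ** -months)


# Capital outlays, using the unit counts and prices from the slide.
server_capex = 45_978 * 1_450
network_capex = 1_150 * 4_800 + 22 * 300_000 + 2 * 500_000 + 2 * 144_800

# Servers depreciate over 3 years, networking gear over 4, both at 5%.
server_monthly = monthly_payment(server_capex, 0.05, 3)
network_monthly = monthly_payment(network_capex, 0.05, 4)

print(f"Server CAPEX  ${server_capex / 1e6:.1f}M -> ${server_monthly / 1e3:,.0f}K per month")
print(f"Network CAPEX ${network_capex / 1e6:.2f}M -> ${network_monthly / 1e3:,.0f}K per month")
```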

### Cost model: opex costs



#### Power

- [=MegaWattsCriticalLoad\*AveragePowerUsage/1000\*PUE\*PowerCost\*24\*365/12]
- \$0.07/kWh; PUE = 1.45; average power use: 80%
- \$475K OPEX (monthly)

### People

- Security guards: 3 x 24x365x\$20 + Facilities: 1x24x365x\$30; Benefits multiplier: 1.3
- \$85K OPEX (monthly)
- Network bandwidth costs to internet
  - Varies by application and usage
- Vendor maintenance fees + sysadmins
  - Varies by equipment and negotiations
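The power and people figures can be checked with a few lines of arithmetic. The critical load is not stated on the slide; the 8 MW used below is an assumption of this sketch, chosen because it reproduces the quoted monthly power cost, and the variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365

# Power: critical load (assumed 8 MW) x average use x PUE x price per kWh.
critical_load_kw = 8_000      # assumption, not stated on the slide
average_power_use = 0.80
pue = 1.45
dollars_per_kwh = 0.07
power_monthly = (critical_load_kw * average_power_use * pue
                 * dollars_per_kwh * HOURS_PER_YEAR / 12)

# People: round-the-clock posts at an hourly wage, times the benefits multiplier.
security_yearly = 3 * HOURS_PER_YEAR * 20
facilities_yearly = 1 * HOURS_PER_YEAR * 30
benefits_multiplier = 1.3
people_monthly = (security_yearly + facilities_yearly) * benefits_multiplier / 12

print(f"Power:  ${power_monthly / 1e3:,.0f}K per month")
print(f"People: ${people_monthly / 1e3:,.0f}K per month")
```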

**Monthly Costs** 



3yr server, 4yr network, 10 yr infrastructure amortization

# Enterprise Vs WSC: a Cost Perspective



#### Enterprise computing approach

- Largest cost is people -- scales roughly with servers (~100:1 common)
- Enterprise interests focus on consolidation & utilization
  - Consolidate workload onto fewer, larger systems
  - Large SANs for storage & large routers for networking

#### Internet-scale services approach

- Largest cost is server H/W
  - Typically followed by cooling, power distribution, power
  - Networking varies from very low to dominant depending upon service
  - People costs under 10% & often under 5% (>1000:1 server:admin)
- Services interests centered around work-done-per-\$ (or watt)

#### Observations

- People costs shift from top to nearly irrelevant.
- Focus instead on work done /\$ & work done/watt



### Datacenter at The Dalles, Oregon

- Moderate climate, cheap hydroelectric power, near internet backbone fiber
- 75000 square feet

Google's data center at The Dalles, OR



### MS Quincy Datacenter

- 470k sq feet (10 football fields)
- Next to a hydro-electric generation plant
  - At up to 40 MegaWatts, \$0.02/kWh is better than \$0.15/kWh
  - That's equal to the power consumption of 30,000 homes
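To see what the price difference means at this scale, the sketch below computes the yearly electricity bill at both rates. It assumes the full 40 MW is drawn around the clock, which is an upper bound and an assumption of this example rather than a figure from the slide.

```python
HOURS_PER_YEAR = 24 * 365

load_kw = 40_000          # "up to 40 MegaWatts", assumed to be drawn continuously
cheap_rate = 0.02         # $/kWh, next to the hydro-electric plant
expensive_rate = 0.15     # $/kWh, the comparison price on the slide
homes = 30_000

energy_kwh = load_kw * HOURS_PER_YEAR
cheap_bill = energy_kwh * cheap_rate
expensive_bill = energy_kwh * expensive_rate
yearly_savings = expensive_bill - cheap_bill
kw_per_home = load_kw / homes

print(f"Yearly bill at $0.02/kWh: ${cheap_bill / 1e6:.1f}M")
print(f"Yearly bill at $0.15/kWh: ${expensive_bill / 1e6:.1f}M")
print(f"Difference: ${yearly_savings / 1e6:.1f}M per year")
print(f"Implied average draw per home: {kw_per_home:.2f} kW")
```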





- Rack switch = 48-port ethernet 1Gig switch
  - Commodity switch >= \$30 per port
    - Infiniband ~= \$500/port
  - One Switch per two racks
  - 40 server ports; 2-8 uplink ports
    - Oversubscription ratio
    - Programmer burden
  - Bandwidth within rack is same irrespective of sender/receiver
- Array switch
  - More expensive: 10X more BW = 100X more \$
  - High-end switches are feature-rich (mgmt, inspection, CAMs, FPGAs)
  - 480 1Gbit links, few 10Gbit ports to datacenter routers
  - Manage oversubscription carefully
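The oversubscription ratio of a rack switch is the bandwidth its servers can offer divided by the bandwidth of its uplinks. The sketch below evaluates it for the 40 server ports and 2-8 uplink ports quoted above; treating the uplinks as 1 Gbit ports, like the server ports, is an assumption of this example.

```python
def oversubscription(server_ports: int, uplink_ports: int,
                     server_gbps: float = 1.0, uplink_gbps: float = 1.0) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth for one switch."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)


for uplinks in (8, 4, 2):
    ratio = oversubscription(40, uplinks)
    # If every server sends out of the rack at once, each gets 1/ratio of its link.
    per_server_mbps = 1000.0 / ratio
    print(f"{uplinks} uplinks: {ratio:.0f}:1 oversubscription, "
          f"about {per_server_mbps:.0f} Mbit/s per server leaving the rack")
```

Inside the rack every pair of servers sees the full 1 Gbit/s, so the ratio only bites for traffic that crosses the uplinks, which is where the programmer burden mentioned above comes from.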

## WSC Storage Hierarchy: A Programmer's Perspective





### Interesting observations

- Remote memory is often faster than local disk
- Bandwidth bottlenecks

|                             | Local  | Rack    | Array     |
|-----------------------------|--------|---------|-----------|
| DRAM Latency (microseconds) | 0.1    | 100     | 300       |
| Disk Latency (microseconds) | 10,000 | 11,000  | 12,000    |
| DRAM Bandwidth (MB/sec)     | 20,000 | 100     | 10        |
| Disk Bandwidth (MB/sec)     | 200    | 100     | 10        |
| DRAM Capacity (GB)          | 16     | 1,040   | 31,200    |
| Disk Capacity (GB)          | 2,000  | 160,000 | 4,800,000 |
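The observation that remote memory is often faster than local disk can be made concrete with a latency-plus-transfer estimate built from the table. This is a rough sketch: the function name is made up for this example, and the model ignores queueing and protocol overhead.

```python
# (latency in microseconds, bandwidth in MB/s), taken from the table above.
HIERARCHY = {
    ("DRAM", "Local"): (0.1, 20_000),
    ("DRAM", "Rack"): (100, 100),
    ("DRAM", "Array"): (300, 10),
    ("Disk", "Local"): (10_000, 200),
    ("Disk", "Rack"): (11_000, 100),
    ("Disk", "Array"): (12_000, 10),
}


def fetch_time_us(n_bytes: int, medium: str, level: str) -> float:
    """Latency plus transfer time in microseconds (bytes / (MB/s) is already in us)."""
    latency_us, bandwidth_mb_s = HIERARCHY[(medium, level)]
    return latency_us + n_bytes / bandwidth_mb_s


for size, label in ((1_000, "1 KB"), (1_000_000, "1 MB")):
    for medium, level in (("DRAM", "Rack"), ("DRAM", "Array"), ("Disk", "Local")):
        t = fetch_time_us(size, medium, level)
        print(f"{label} from {level.lower()} {medium}: {t:,.0f} us")
```

For small reads, DRAM elsewhere in the rack or array beats the local disk by a wide margin; for a 1 MB read, rack DRAM is still ahead, but array DRAM falls behind the local disk because the 10 MB/s array bandwidth dominates. That is the bandwidth bottleneck listed above.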




#### **Useful Numbers**

#### Courtesy of Jeff Dean, Google



| Operation                          | Time           |
|------------------------------------|----------------|
| L1 cache reference                 | 0.5 ns         |
| Branch mispredict                  | 5 ns           |
| L2 cache reference                 | 7 ns           |
| Mutex lock/unlock                  | 25 ns          |
| Main memory reference              | 100 ns         |
| Compress 1K bytes with Zippy       | 3,000 ns       |
| Send 2K bytes over 1 Gbps network  | 20,000 ns      |
| Read 1 MB sequentially from memory | 250,000 ns     |
| Round trip within same datacenter  | 500,000 ns     |
| Disk seek                          | 10,000,000 ns  |
| Read 1 MB sequentially from disk   | 20,000,000 ns  |
| Send packet CA->Europe->CA         | 150,000,000 ns |

# Useful Back of the Envelope Math





- How long to generate image results page (30 thumbnails)?
- Design 1: Read serially, thumbnail 256K images on the fly
  - 30 seeks \* 10 ms/seek + 30 \* 256K / 30 MB/s = 560 ms
- Design 2: Issue reads in parallel
  - 10 ms/seek + 256K read / 30 MB/s = 18 ms
  - (Ignores variance, so really more like 30-60 ms, probably)
- Lots of other options
  - Caching (single images? whole sets of thumbnails?)
  - Pre-computing thumbnails
  - ... Back of the envelope helps identify most promising...
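The two estimates above can be written out as a small calculation. The sketch below uses the slide's 10 ms seek and 30 MB/s disk throughput; reading "256K" as 256,000 bytes is an assumption of this example, and like the slide it ignores variance.

```python
SEEK_MS = 10.0
DISK_MB_PER_S = 30.0
IMAGE_BYTES = 256_000   # "256K" taken as decimal kilobytes for this estimate
THUMBNAILS = 30

# Time to stream one image off disk, in milliseconds.
read_ms = IMAGE_BYTES / (DISK_MB_PER_S * 1e6) * 1e3

# Design 1: one disk does all 30 seeks and reads back to back.
design1_ms = THUMBNAILS * (SEEK_MS + read_ms)

# Design 2: 30 reads issued in parallel, so only one seek and one read are on the critical path.
design2_ms = SEEK_MS + read_ms

print(f"Per-image read: {read_ms:.1f} ms")
print(f"Design 1 (serial):   {design1_ms:.0f} ms")
print(f"Design 2 (parallel): {design2_ms:.1f} ms")
```

The roughly 30x gap comes entirely from taking the seeks off the critical path, which is why the estimate is useful before any code is written.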

| Server delay (ms) | Increased time to next click (ms) | Queries/user | Any clicks/user | User satisfaction | Revenue/user |
|-------------------|-----------------------------------|--------------|-----------------|-------------------|--------------|
| 50                |                                   |              |                 |                   |              |
| 200               | 500                               |              | -0.3%           | -0.4%             |              |
| 500               | 1200                              |              | -1.0%           | -0.9%             | -1.2%        |
| 1000              | 1900                              | -0.7%        | -1.9%           | -1.6%             | -2.8%        |
| 2000              | 3100                              | -1.8%        | -4.4%           | -3.8%             | -4.3%        |

Figure 6.12 Negative impact of delays at Bing search server on user behavior [Brutlag and Schurman 2009].

- 10,000 processors with 4 GB per server => the following numbers of unrecoverable errors in 3 years of operation [IBM study]
  - Parity only: about 90,000; one unrecoverable failure every 17 minutes
  - ECC only: about 3500; one unrecoverable or undetected failure every 7.5 hours
  - Chipkill: about 6; one unrecoverable/undetected failure every 2 months
  - 10,000-server Chipkill system = same error rate as a 17-server ECC system
- Schroeder 2009: Google WSC error rates
  - Average DIMM had 4000 correctable errors and 0.2 uncorrectable errors per year
  - With Chipkill, for one third of the servers, one memory error is corrected every 2.5 hours
  - With parity only, one third of the machines would spend 20% of their time rebooting (5-minute reboot time)
  - In 2000, Google did consistency checking in software for DRAM errors; with cost-effective DRAM error checking available, this moves to hardware
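The failure intervals in the IBM study follow from dividing three years of operation by the number of failures. The sketch below does that division; the variable names are illustrative. For Chipkill the same arithmetic gives about six months between failures, which is longer than the two months quoted above.

```python
YEARS = 3
HOURS = YEARS * 365 * 24
SERVERS = 10_000

# Unrecoverable (or undetected) failures over three years, per protection scheme.
failures = {"parity": 90_000, "ecc": 3_500, "chipkill": 6}

parity_interval_min = HOURS * 60 / failures["parity"]
ecc_interval_hours = HOURS / failures["ecc"]
chipkill_interval_months = YEARS * 12 / failures["chipkill"]

# Size of an ECC-only system with the same failure count as 10,000 Chipkill servers.
equivalent_ecc_servers = SERVERS * failures["chipkill"] / failures["ecc"]

print(f"Parity:   one failure every {parity_interval_min:.1f} minutes")
print(f"ECC:      one failure every {ecc_interval_hours:.1f} hours")
print(f"Chipkill: one failure every {chipkill_interval_months:.0f} months")
print(f"Equivalent ECC system: about {equivalent_ecc_servers:.0f} servers")
```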

# Example 3-tier App: WebMail



- May include thousands of machines, petabytes of data, and billions of users
- 1<sup>st</sup> tier: protocol processing
  - Typically stateless
  - Use a load balancer
- 2<sup>nd</sup> tier: application logic
  - Often caches state from 3<sup>rd</sup> tier
- 3<sup>rd</sup> tier: data storage
  - Heavily stateful
  - Often includes bulk of machines




# **Example: Google Cluster Environment**





Figure: machines 1 through N, together with a GFS master and the Chubby lock service.

- 1000s of machines, typically in few configurations
- File system (GFS) + Cluster scheduling system are core services
- Typically 100s to 1000s of active jobs
  - Some w/1 task, some w/1000s
  - Mix of batch and low-latency, user-facing production jobs



#### **Example: Google File System**





- Distributed file system using server disks
  - Master provides a naming service
  - Clients access data directly
- Replication support for availability & throughput
  - E.g. replicate across racks to survive node/switch failures

## **Cascade System Architecture**



- Globally addressable memory with unified addressing architecture
- Configurable network, memory, processing and I/O
- Heterogeneous processing across node types, and within MVP nodes
- Can adapt at configuration time, compile time, run time



**Increasingly Complex Application Requirements** 

**Earth Sciences Example** 





International Intergovernmental Panel on Climate Change, 2004, as updated by Washington, NCAR, 2005

Increased complexity and number of components lend themselves well to a variety of processing technologies

- Similar trends in astrophysics, nuclear engineering, CAE, etc.
  - Higher resolution, multi-scale, multi-science

#### 3D Node: Processor + Orthogonal Memory Chips



- Capacity
  - 8-32 memory chips @ 1 GB each = 8-32 GB per node
- Bandwidth
  - 5 $\mu$m pitch wires (10 $\mu$m per diff signal), 15 mm edge $\Rightarrow$ 1500 signals per memory chip
  - Need to keep signaling rates to < 10 Gbps with memory periphery transistors
  - Assume 512 bits/dir @ 6.25 Gbps, packetized protocol, 80% read efficiency
    - ⇒ 320 GB/s read bandwidth per memory chip (1.28W at 0.5 pJ/bit)
    - ⇒ 2.5-10 TB/s read bandwidth per node with 8-32 memory chips
  - Could nicely feed a 5-10 TF node
  - Probably still too much power in memory chips to support this...
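The bandwidth and power figures in this list chain together and can be rechecked in a few lines. The sketch below takes a per-signal rate of 6.25 Gbps, which is the value that reproduces the quoted 320 GB/s and 1.28 W; the variable names are illustrative.

```python
# Wiring: one differential signal needs two 5 um pitch wires along a 15 mm chip edge.
EDGE_MM = 15.0
UM_PER_DIFF_SIGNAL = 10.0
signals_per_chip = round(EDGE_MM * 1000 / UM_PER_DIFF_SIGNAL)

# Read bandwidth per memory chip.
BITS_PER_DIRECTION = 512
GBPS_PER_SIGNAL = 6.25   # rate that reproduces the 320 GB/s on the slide
READ_EFFICIENCY = 0.80
PJ_PER_BIT = 0.5

read_gb_per_s = BITS_PER_DIRECTION * GBPS_PER_SIGNAL * READ_EFFICIENCY / 8
read_power_w = read_gb_per_s * 8e9 * PJ_PER_BIT * 1e-12

# Node bandwidth for the smallest and largest memory stacks.
node_tb_per_s = {chips: chips * read_gb_per_s / 1000 for chips in (8, 32)}

print(f"Signals per chip edge: {signals_per_chip}")
print(f"Read bandwidth per chip: {read_gb_per_s:.0f} GB/s at {read_power_w:.2f} W")
for chips, bandwidth in node_tb_per_s.items():
    print(f"{chips} chips: {bandwidth:.2f} TB/s per node")
```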

# **Example Board Architecture**



- Can treat as 16 nodes for highest aggregate memory bandwidth
- Could combine into 2, 4, 8 or 16-node "super-nodes"
  - Flat addressing, latency and bandwidth
  - Hashed to avoid bank conflicts
  - Would still want compiler to exploit locality within a single node
    - Either via explicit local segments or via caching (possibly in main memory)
- Inter-node signaling shown using conservative technology extrapolations
  - Could also consider high-bandwidth on-board technologies (quilting, capacitive coupling, optics?, etc.) to boost super-node bandwidth even further



#### One Last Exascale Challenge (4)

- Need to build systems for tomorrow's applications
  - Irregular, dynamic, sparse, heterogeneous....
  - Codes that don't exhibit locality, or that have limited per-thread concurrency
  - Need to start, stop, move and synchronize computation efficiently
  - Let's not solve the scaling problem for the easy apps and declare success
  - "Leave no application behind"



### Key Challenges to Get to the Zettascale

- I accept that CMOS won't get there due to power and other reasons.
- New computing technologies will likely require new architectures, new execution models and new programming models
  - Exploitation of locality will be key
  - Very likely to involve massive threading and lightweight thread migration
- Architects need to understand the technological sandbox within the next dozen years or so...
- Absolutely must have better programming models where humans don't have to coordinate all the data distribution and communication
  - Would be nice if those were the same programming models used at Exascale
- Need to have much more sophisticated and automated tools for performance and correctness analysis
  - Presumably involving pervasive introspection
- I am an optimist. I think we will get to zettaflop computing using some interesting post-CMOS technology by ~2030. It will look different than any of us imagine today. Good occasion to retire.

FEC 2007