

- Fast memory technology is more expensive per bit than slower memory
- Solution: organize memory system into a hierarchy
  - Entire addressable memory space available in largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in smaller memories
  - Gives the illusion of a large, fast memory being presented to the processor



(a) Memory hierarchy for server

- Aggregate peak bandwidth grows with # cores:
  - Intel Core i7 can generate two references per core per clock
  - Four cores and 3.2 GHz clock
    - 25.6 billion 64-bit data references/second +
    - 12.8 billion 128-bit instruction references
    - = 409.6 GB/s!
  - DRAM bandwidth is only 6% of this (25 GB/s)
  - Requires:
    - Multi-port, pipelined caches
    - Two levels of cache per core
    - Shared third-level cache on chip
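The 409.6 GB/s figure can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers from the bullets above. The constant names are illustrative, and the rate of one instruction reference per core per clock is inferred from the 12.8 billion figure rather than stated on the slide.

```python
# Peak demand bandwidth for the four-core, 3.2 GHz example above.
CORES = 4
CLOCK_HZ = 3.2e9
DATA_REFS_PER_CLOCK = 2   # 64-bit data references per core per clock
DATA_REF_BYTES = 8        # 64 bits
INSTR_REF_BYTES = 16      # 128 bits

data_refs_per_s = CORES * CLOCK_HZ * DATA_REFS_PER_CLOCK  # 25.6 billion/s
instr_refs_per_s = CORES * CLOCK_HZ                       # 12.8 billion/s (inferred: one per core per clock)

peak_bytes_per_s = (data_refs_per_s * DATA_REF_BYTES
                    + instr_refs_per_s * INSTR_REF_BYTES)
dram_fraction = 25e9 / peak_bytes_per_s                   # 25 GB/s DRAM bandwidth

print(f"peak demand: {peak_bytes_per_s / 1e9:.1f} GB/s")
print(f"DRAM supplies about {dram_fraction:.0%} of that")
```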



Metrics: Avg access time = (hit rate)(hit time) + (miss rate)(miss time)

Avg power = (active devices)(avg dynamic power)

- When a word is not found in the cache, a miss occurs:
  - Fetch word from lower level in hierarchy, requiring a higher latency reference
  - Lower level may be another cache or the main memory
  - Also fetch the other words contained within the block
    - Takes advantage of spatial locality
  - Place block into cache in any location within its set, determined by address
    - block address MOD number of sets
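The placement rule can be sketched as a small helper. The function name `place_block` is hypothetical. It splits a block address into the set index (block address MOD number of sets) and the remaining high-order bits, which serve as the tag.

```python
def place_block(block_address: int, num_sets: int) -> tuple[int, int]:
    """Return (set_index, tag) for a block address."""
    set_index = block_address % num_sets   # block address MOD number of sets
    tag = block_address // num_sets        # remaining high-order bits
    return set_index, tag


# Block address 22 (binary 10110) in a cache with 8 sets:
set_index, tag = place_block(22, 8)
print(f"set {set_index:03b}, tag {tag:02b}")
```

With one block per set this is a direct-mapped cache. With a single set (`num_sets = 1`) every block maps to set 0, which is the fully associative case.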



- Miss rate
  - Fraction of cache accesses that result in a miss
- Causes of misses
  - Compulsory
    - First reference to a block
  - Capacity
    - Blocks discarded and later retrieved
  - Conflict
    - Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
      - Note that speculative and multithreaded processors may execute other instructions during a miss
        - Reduces performance impact of misses

Cache handles miss while accessing other data/instructions

+ Coherency + context switching

- 1. Do more work than we do Reads and Writes
- 2. Re-use things by keeping them handy



#### Memory Reference Patterns



## Exploiting the Memory Hierarchy



6.004 - Spring 2009 4/2/09 L15 - Memory Hierarchy

(Figure labels: static RAM "CACHE"; RAM "MAIN MEMORY"; disk "SWAP SPACE")

#### Typical Memory Hierarchy: Everything is a Cache for Something Else





time to R/W data + cache hit time

- Need to define an average access time
  - Since some will be fast and some slow

Access time = hit time + miss rate x miss penalty

- The hope is that the hit time will be low and the miss rate low since the miss penalty is so much larger than the hit time
- Average Memory Access Time (AMAT)
  - Formula can be applied to any level of the hierarchy
    - · Access time for that level
  - Can be generalized for the entire hierarchy
    - Average access time that the processor sees for a reference
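As a rough illustration, the formula can be applied one level at a time, with the miss penalty of each level being the average access time of the level below it. The function names and the numbers in the example (1-cycle L1 with 5% misses, 10-cycle L2 with 20% local misses, 100-cycle memory) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def amat_hierarchy(levels, memory_time):
    """Average access time seen by the processor.

    levels: list of (hit_time, local_miss_rate), ordered from L1 outward.
    memory_time: access time of the last level, which always hits.
    """
    penalty = memory_time
    # Work from the level nearest memory back up toward the processor.
    for hit_time, miss_rate in reversed(levels):
        penalty = amat(hit_time, miss_rate, penalty)
    return penalty


# Hypothetical two-level hierarchy, times in cycles.
example = amat_hierarchy([(1, 0.05), (10, 0.20)], memory_time=100)
print(f"{example:.2f} cycles")   # 1 + 0.05 * (10 + 0.20 * 100)
```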

#### How Processor Handles a Miss

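$$(\text{hit rate})(\text{hit time}) + (\text{miss rate})(\text{miss time}) = (1 - MR)\,T_{hit} + MR\,(T_{access} + T_{hit})$$

$$= T_{hit}\left[(1 - MR) + MR\right] + MR \cdot T_{access} = T_{hit} + MR \cdot T_{access} \qquad (T_{access} = \text{miss penalty})$$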

What's important? Overall access time, averaged over all levels and all instructions.

- Assume that cache access occurs in 1 cycle
  - A hit is great, and the basic pipeline is fine
  - CPI penalty = miss rate x miss penalty

- A miss stalls the pipeline (for an instruction or data miss)
  - Stall the pipeline (you don't have the data it needs)
  - Send the address that missed to the memory
  - Instruct main memory to perform a read and wait
  - When access completes, return the data to the processor
  - Restart the instruction COMPLETE

How To Build A Cache?

Generally Turing Tape, move L/R, Copy region, But w/ distance cost

- Big question: locating data in the cache
  - I need to map a large address space into a small memory
- How do I do that?
  - Can build fully associative lookup in hardware, but complex
  - Need to find a simple but effective solution
  - Two common techniques:
    - Direct Mapped
    - Set Associative
- Further questions
  - Block size (crucial for spatial locality): how big?
  - Replacement policy (crucial for temporal locality): when, what?
  - Write policy (writes are always more complex)

#### Associativity: Parallel Lookup



#### Direct Mapped Cache



- Valid bit: 1 = present, 0 = not present
  - Initially 0

E.G.

8-blocks, 1 word/block, direct mapped

• (Initial state)

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000   | N | ?   | ?    |
| 001   | N | ?   | ?    |
| 010   | N | ?   | ?    |
| 011   | N | ?   | ?    |
| 100   | N | ?   | ?    |
| 101   | N | ?   | ?    |
| 110   | N | ?   | ?    |
| 111   | N | ?   | ?    |

- Start-up?
  - Boot: how?
  - What's in SRAM?
- Process swap?
  - System actions?

3 bits for index (ignore bits for byte offset into word)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

Compulsory/Cold Miss NOW

| Index | V | Tag | Data / Instr      |
|-------|---|-----|-------------------|
| 000   | N |     |                   |
| 001   | N |     |                   |
| 010   | N |     |                   |
| 011   | N |     |                   |
| 100   | N |     |                   |
| 101   | N |     |                   |
| 110   | Y | 10  | Mem[10110] = 1234 |
| 111   | N |     |                   |

- mixed cache? data cache? instruction cache?

Can we avoid cold misses?

!VALID → miss: memory fetch, stall, load, restart

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 26        | 11 010      | Miss     | 010         |

LW \$3, \$ (\$4)

11010 \$4

Compulsory/Cold Miss

| Index | V | Tag | Data              |
|-------|---|-----|-------------------|
| 000   | N |     |                   |
| 001   | N |     |                   |
| 010   | Y | 11  | Mem[11010] = ABCD |
| 011   | N |     |                   |
| 100   | N |     |                   |
| 101   | N |     |                   |
| 110   | Y | 10  | Mem[10110] = 1234 |
| 111   | N |     |                   |
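The two accesses traced above can be reproduced with a small model of the 8-block, 1 word/block direct-mapped cache. This is only a sketch: the class name `DirectMappedCache` is made up for illustration, and it tracks valid bits and tags but not the data itself.

```python
class DirectMappedCache:
    """Direct-mapped cache with one word per block (tags only, no data)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks   # valid bits, initially 0
        self.tags = [None] * num_blocks

    def access(self, word_addr):
        """Return (hit, index, tag); on a miss, fill the line."""
        index = word_addr % self.num_blocks    # low-order address bits
        tag = word_addr // self.num_blocks     # high-order address bits
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit, index, tag


cache = DirectMappedCache(8)
for addr in (22, 26):
    hit, index, tag = cache.access(addr)
    outcome = "hit" if hit else "miss"
    print(f"{addr} = {addr:05b}: {outcome}, index {index:03b}, tag {tag:02b}")
```

Both accesses are cold misses, filling lines 110 and 010 with tags 10 and 11. Repeating either address afterwards would hit.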

!VALID → fetch, stall, load, restart



#### Direct Mapped Problems: Conflict misses

- Two blocks that are used concurrently and map to same index
  - Only one can fit in the cache, regardless of cache size
  - No flexibility in placing 2<sup>nd</sup> block elsewhere
- Thrashing
  - If accesses alternate, one block will replace the other before reuse
  - No benefit from caching (worse than no cache)

```
Same Tags

LW (1100110)

LW (0101110)

SW (1100110)

SW (0101110)
```

a[]

b[]

• Consider the following example code:

$$1\text{k} \times 8\text{B cache} = 2^{13}\text{B} = 8\text{kB}$$

- Arrays a, b, and c will tend to conflict in small caches
- Code will get cache misses with every array access (3 per loop)
- Spatial locality savings from blocks will be eliminated
- How can the severity of the conflicts be reduced?

Any stride that is a multiple of 2^10 × 2^3 bytes causes the same problem: indices are identical, but tags differ.
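The effect can be seen in a small model of the 1k × 8B direct-mapped cache. Everything specific here is an assumption made for illustration: the base addresses, the 4-byte element size, and the access order a[i], b[i], c[i] are not taken from the slide. Three arrays whose bases differ by multiples of the cache size are compared with the same arrays shifted so that their indices no longer coincide.

```python
BLOCK_BYTES = 8
NUM_BLOCKS = 1024
CACHE_BYTES = BLOCK_BYTES * NUM_BLOCKS   # 2**13 B = 8 kB
ELEM_BYTES = 4                           # assumed element size (2 elements per block)


def count_misses(bases, n=100):
    """Misses for a loop touching element i of each array, i = 0..n-1."""
    tags = [None] * NUM_BLOCKS           # one tag per cache line
    misses = 0
    for i in range(n):
        for base in bases:
            block = (base + ELEM_BYTES * i) // BLOCK_BYTES
            index, tag = block % NUM_BLOCKS, block // NUM_BLOCKS
            if tags[index] != tag:       # empty line or different block
                misses += 1
                tags[index] = tag
    return misses


# Hypothetical bases: spaced by multiples of the cache size, so indices collide.
aligned = [0x10000, 0x10000 + CACHE_BYTES, 0x10000 + 3 * CACHE_BYTES]
# Same arrays shifted by 1 kB and 2 kB, so they land on different indices.
staggered = [aligned[0], aligned[1] + 1024, aligned[2] + 2048]

print("aligned:  ", count_misses(aligned), "misses in 300 accesses")
print("staggered:", count_misses(staggered), "misses in 300 accesses")
```

In the aligned case every access misses, so the second element of each block never benefits from the fetch. In the staggered case only the first touch of each block misses.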



How can we fix this?

Programmer's mistake?

How to make system crawl, worst case?

Let's have a contest!

8/e ×8B

# Larger Block Size → n words / Block



Assume fixed:

- Bandwidth to memory
- Total cache size: # blocks = total size / block size

(Figure: plots against block size)

# Fully-assoc vs. Direct-mapped

#### Fully-associative N-line cache:

- N tag comparators, registers used for tag/data storage (\$\$\$)
- Location A might be cached in any one of the N cache lines; no restrictions!
- Replacement strategy (e.g., LRU) used to pick which line to use when loading new word(s) into cache

PROBLEM: Cost!

implement,

#### Direct-mapped N-line cache:

- 1 tag comparator, SRAM used for tag/data storage (\$)
- Location A is cached in a specific line of the cache determined by its address; address "collisions" possible
- Replacement strategy not needed: each word can only be cached in one specific cache line

PROBLEM: Contention!

cheap! Fast data access on hits

#### Cost vs Contention

two observations...

- 1. Probability of collision diminishes with cache size...
  - ... so let's build HUGE direct-mapped caches, using cheap SRAM!
- 2. Contention mostly occurs between independent "hot spots"
  - Instruction fetches vs stack frame vs data structures, etc
  - Ability to simultaneously cache a few (2? 4? 8?) hot spots eliminates most collisions
  - ... so let's build caches that allow each location to be stored in some restricted set of cache lines, rather than in exactly one (direct mapped) or every line (fully associative).





Insight: an N-way set-associative cache affords modest parallelism

- parallel lookup (associativity): restricted to small set of N lines
- modest parallelism deals with most contention at modest cost
- can implement using N direct-mapped caches running in parallel

#### N-way Set Associative

- Compromise between direct-mapped and fully associative
  - Each memory block can go to one of N entries in cache
    - Each "set" can store N blocks; a cache contains some number of sets
  - For fast access, all blocks in a set are searched in parallel
- How to think of an N-way associative cache with X sets
  - 1st view: N direct mapped caches each with X entries
    - Caches searched in parallel
    - Need to coordinate on data output and signaling hit/miss
  - 2<sup>nd</sup> view: X fully associative caches, each with N entries





#### E.G.

Compare 4-block caches

- Direct mapped, 2-way set associative, fully associative
- Block access sequence: 0, 8, 0, 6, 8 (3 different block addresses)





Fully associative (no index)

| Block address | Low 2 address bits (index field, unused here) | Hit/miss | Cache content after access |        |        |
|---------------|-----------------------------------------------|----------|----------------------------|--------|--------|
| 0             | 00                                            | miss     | Mem[0]                     |        |        |
| 8             | 00                                            | miss     | Mem[0]                     | Mem[8] |        |
| 0             | 00                                            | hit      | Mem[0]                     | Mem[8] |        |
| 6             | 10                                            | miss     | Mem[0]                     | Mem[8] | Mem[6] |
| 8             | 00                                            | hit      | Mem[0]                     | Mem[8] | Mem[6] |
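The comparison can be replayed for all three organizations with one small simulator. This is a sketch, assuming LRU replacement within each set. The function name `simulate` is hypothetical. Setting `ways` to 1 gives a direct-mapped cache, and setting it to the number of blocks gives a fully associative one.

```python
from collections import OrderedDict


def simulate(block_addrs, num_blocks, ways):
    """Return a 'hit'/'miss' list for an LRU set-associative cache."""
    num_sets = num_blocks // ways
    # One ordered dict per set; keys are resident blocks, oldest first.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for block in block_addrs:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
            results.append("hit")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = None
            results.append("miss")
    return results


sequence = [0, 8, 0, 6, 8]
for name, ways in (("direct mapped", 1),
                   ("2-way set associative", 2),
                   ("fully associative", 4)):
    print(f"{name:22s}", simulate(sequence, num_blocks=4, ways=ways))
```

Under these assumptions the direct-mapped cache misses on all five accesses, the 2-way cache misses on four, and the fully associative cache misses on three, which matches the table above.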





- 8-way is (almost) as effective as fully-associative
- rule of thumb: N-line direct-mapped == N/2-line 2-way set assoc.

#### Another Job mix

- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

Diminishing returns?

#### Replacement Methods

- Which line do you replace on a miss?
- Direct Mapped
  - Easy, you have only one choice
  - Replace the line at the index you need
- N-way Set Associative
  - Need to choose which way to replace
  - Random (choose one at random)
  - Least Recently Used (LRU) (the one used least recently)
    - Often difficult to calculate, so people use approximations; the line chosen is often just "not recently used" rather than the true least recently used



C. Kozyrakis

EE 108b Lecture 12

Handling of WRITES

Workload:

- How many reads?
- How many writes?
- How many read-after-write?

Observation: Most (90+%) of memory accesses are READs. How should we handle writes? Issues:

Write-through: CPU writes are cached, but also written to main memory (stalling the CPU until write is completed). Memory always holds "the truth".

Write-behind: CPU writes are cached; writes to main memory may be buffered, perhaps pipelined. CPU keeps executing while writes are completed (in order) in the background.

Write-back: CPU writes are cached, but not immediately written to main memory. Memory contents can be "stale".


Our cache thus far uses write-through.

Can we improve write performance?



(Figure labels: Mem, cache; no stall)


- Interesting observation
  - Processor does not need to "wait" until the store completes



- Memory controller slowly "drains" buffer to memory
- Write Buffer: a first-in-first-out buffer (FIFO)
  - Typically holds a small number of writes
  - Can absorb small bursts as long as the long term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM
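The behavior of the write buffer can be illustrated with a toy cycle-by-cycle model. It is a sketch under simplifying assumptions: the processor issues at most one store per cycle, the memory retires one buffered write every `drain_interval` cycles, and a stall is counted for each cycle in which a store waits for a free slot. The function and parameter names are hypothetical.

```python
from collections import deque


def run_write_buffer(write_cycles, capacity, drain_interval):
    """Return how many cycles the processor stalls on a full write buffer.

    write_cycles: sorted cycle numbers at which the processor wants to store.
    capacity: number of writes the FIFO can hold.
    drain_interval: cycles the memory needs to retire one buffered write.
    """
    pending = deque(write_cycles)
    buffer = deque()          # FIFO of buffered writes, oldest on the left
    next_drain = None         # cycle at which the oldest write completes
    stalls = 0
    cycle = 0
    while pending or buffer:
        # Memory side: retire the oldest write once it has completed.
        if buffer and cycle == next_drain:
            buffer.popleft()
            next_drain = cycle + drain_interval if buffer else None
        # Processor side: try to place the next store in the buffer.
        if pending and pending[0] <= cycle:
            if len(buffer) < capacity:
                buffer.append(pending.popleft())
                if next_drain is None:
                    next_drain = cycle + drain_interval
            else:
                stalls += 1   # buffer full: the store has to wait
        cycle += 1
    return stalls


print(run_write_buffer([0, 1], capacity=2, drain_interval=10))       # burst fits
print(run_write_buffer([0, 1], capacity=1, drain_interval=10))       # burst too large
print(run_write_buffer([0, 20, 40], capacity=1, drain_interval=10))  # slow, steady writes
```

A two-entry buffer absorbs the two-store burst without stalling, while a one-entry buffer makes the second store wait until the first has drained. Widely spaced stores never stall, even with a single entry.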



write-back w/ dirty replacement

# Write Miss: Typical Choices




Selected by OS at some coarse granularity (e.g., 4KB)

#### Be Careful, Even with Write Hits

- Reading from a cache
  - Read tags and data in parallel
  - If it hits, return the data, else go to lower level
- Writing a cache can take more time
  - First read tag to determine hit/miss (access 1)
  - Then overwrite data on a hit (access 2)
  - Otherwise, you may overwrite dirty data or write the wrong cache way
  - In short: 1. Read tag (stall or no stall) 2. Write data
- Can you ever access tag and write data in parallel?

#### **Splitting Caches**

IMEM

DMEM



#### Multilevel Caches

- Primary (L1) caches attached to CPU (IMEM, DMEM)
  - Small, but fast
  - Focusing on hit time rather than hit rate
- Level-2 (L2) cache services misses from primary cache
  - Larger, slower, but still faster than main memory
  - Unified instruction and data (why?)
  - Focusing on hit rate rather than hit time (why?)
- Main memory services L-2 cache misses
  - Some high-end systems include L-3 cache

### E.G. w/o L2

- Given
  - CPU base CPI = 1, clock rate 4GHz
    - 1 cycle = 1/(4GHz) = 0.25 ns
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
    - Miss penalty = 100 ns / (0.25 ns/cycle) = 400 cycles

- With just a primary (L1) cache
  - Miss penalty = 100ns/0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

$$CPI = (98\%)(1 \text{ cycle for hit}) + (2\%)(400 \text{ cycle stall} + 1 \text{ cycle})$$

$$= 0.98 + 0.02(400) + 0.02(1) = 1 + 0.02(400) = 9$$

#### E.G. w/ L2

- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns/0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI =  $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$
- Performance ratio = 9/3.4 = 2.6

$$CPI = \frac{\#\text{ cycles}}{N \text{ instructions}} = \frac{1}{N}\left[(N_{1} + N_{2} + N_{3})(1) + (N_{2} + N_{3})(20) + N_{3}(400)\right]$$

$$N_{1} = 98\% N \qquad N_{3} = \frac{1}{2}\% N \qquad N_{2} = N - (N_{1} + N_{3}) \implies (N_{2} + N_{3}) = N - N_{1} = 2\% N$$

$$= \left[N(1) + 2\% N(20) + \frac{1}{2}\% N(400)\right] / N = 1 + 0.02(20) + 0.005(400) = 3.4$$
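Both worked examples can be reproduced numerically. The sketch below uses only the values given above; the constant names are illustrative.

```python
CLOCK_GHZ = 4.0
CYCLE_NS = 1 / CLOCK_GHZ           # 0.25 ns per cycle
BASE_CPI = 1.0
L1_MISS_RATE = 0.02                # misses per instruction
GLOBAL_MISS_RATE = 0.005           # misses that go all the way to main memory

MEM_PENALTY = 100 / CYCLE_NS       # 100 ns main memory -> 400 cycles
L2_PENALTY = 5 / CYCLE_NS          # 5 ns L2 access     -> 20 cycles

# Primary cache only: every L1 miss pays the main-memory penalty.
cpi_l1_only = BASE_CPI + L1_MISS_RATE * MEM_PENALTY

# With L2: every L1 miss pays the L2 penalty, and the fraction that
# also misses in L2 pays the main-memory penalty on top of that.
cpi_with_l2 = (BASE_CPI
               + L1_MISS_RATE * L2_PENALTY
               + GLOBAL_MISS_RATE * MEM_PENALTY)

print(f"CPI, L1 only: {cpi_l1_only:.1f}")
print(f"CPI, with L2: {cpi_with_l2:.1f}")
print(f"performance ratio: {cpi_l1_only / cpi_with_l2:.1f}")
```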



|                             | Intel Nehalem P6 Quad | AMD Opteron X4 |
|-----------------------------|-----------------------|----------------|
| L1 caches (per core)        | L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a<br>L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles<br>L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 3 cycles |
| L2 unified cache (per core) | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time 9 cycles |
| L3 unified cache (shared)   | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a | 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 38 cycles |

n/a: data not available

64B blocks = 16 32-bit words = 8 64-bit words

#### Interface Signals



See (LC3-based cache projects):

http://pages.cs.wisc.edu/~karu/courses/cs552/spring2009/wiki/index.php/Main/CacheModule

http://www.ece.ncsu.edu/muse/courses/ece406spr09/labs/proj2/proj2\_spr09.pdf

#### Cache Controller FSM




#### SRAM

- Requires low power to retain bit
- Requires 6 transistors/bit

#### DRAM

- Must be re-written after being read
- Must also be periodically refreshed
  - Every ~ 8 ms
  - Each row can be refreshed simultaneously
- One transistor/bit
- Address lines are multiplexed:
  - Upper half of address: row access strobe (RAS)
  - Lower half of address: column access strobe (CAS)
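The multiplexed addressing can be sketched as a simple bit split. This is an illustration only: the function name is hypothetical, and it assumes an even number of address bits so that the row and column halves have equal width.

```python
def split_dram_address(addr: int, addr_bits: int) -> tuple[int, int]:
    """Split a cell address into (row, column).

    The upper half of the bits is sent first with RAS,
    the lower half second with CAS, over the same pins.
    """
    half = addr_bits // 2
    row = addr >> half                   # upper half of the address
    column = addr & ((1 << half) - 1)    # lower half of the address
    return row, column


# A 64K-bit chip has 2**16 cells: 8 row bits and 8 column bits.
row, column = split_dram_address(0xABCD, addr_bits=16)
print(f"row 0x{row:02X}, column 0x{column:02X}")
```

Sharing the pins this way halves the number of address lines the chip needs, at the cost of sending the address in two steps.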

#### Some optimizations:

- Multiple accesses to same row
- Synchronous DRAM
  - Added clock to DRAM interface
  - Burst mode with critical word first
- Wider interfaces
- Double data rate (DDR)
- Multiple banks on each DRAM device

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Four transfers per cycle






- DIMMs: small boards with multiple DRAM chips connected in parallel
  - Functions as a higher capacity, wider interface DRAM chip
  - Easier to manipulate, replace, ...



| Production year | Chip size | DRAM Type | Row access strobe (RAS),<br>slowest DRAM (ns) | Row access strobe (RAS),<br>fastest DRAM (ns) | Column access strobe (CAS) /<br>data transfer time (ns) | Cycle<br>time (ns) |
|-----------------|-----------|-----------|----------------------|----------------------|-------------------------------------------------------|----------------------|
| 1980            | 64K bit   | DRAM      | 180                  | 150                  | 75                                                    | 250                  |
| 1983            | 256K bit  | DRAM      | 150                  | 120                  | 50                                                    | 220                  |
| 1986            | 1M bit    | DRAM      | 120                  | 100                  | 25                                                    | 190                  |
| 1989            | 4M bit    | DRAM      | 100                  | 80                   | 20                                                    | 165                  |
| 1992            | 16M bit   | DRAM      | 80                   | 60                   | 15                                                    | 120                  |
| 1996            | 64M bit   | SDRAM     | 70                   | 50                   | 12                                                    | 110                  |
| 1998            | 128M bit  | SDRAM     | 70                   | 50                   | 10                                                    | 100                  |
| 2000            | 256M bit  | DDR1      | 65                   | 45                   | 7                                                     | 90                   |
| 2002            | 512M bit  | DDR1      | 60                   | 40                   | 5                                                     | 80                   |
| 2004            | 1G bit    | DDR2      | 55                   | 35                   | 5                                                     | 70                   |
| 2006            | 2G bit    | DDR2      | 50                   | 30                   | 2.5                                                   | 60                   |
| 2010            | 4G bit    | DDR3      | 36                   | 28                   | 1                                                     | 37                   |
| 2012            | 8G bit    | DDR3      | 30                   | 24                   | 0.5                                                   | 31                   |

#### **DRAM Generations & Trends**

| Year | Capacity | \$/GB     |
|------|----------|-----------|
| 1980 | 64Kbit   | \$1500000 |
| 1983 | 256Kbit  | \$500000  |
| 1985 | 1Mbit    | \$200000  |
| 1989 | 4Mbit    | \$50000   |
| 1992 | 16Mbit   | \$15000   |
| 1996 | 64Mbit   | \$10000   |
| 1998 | 128Mbit  | \$4000    |
| 2000 | 256Mbit  | \$1000    |
| 2004 | 512Mbit  | \$250     |
| 2007 | 1Gbit    | \$50      |



| Standard | Clock rate (MHz) | M transfers per second | DRAM name | MB/sec/DIMM   | DIMM name |
|----------|------------------|------------------------|-----------|---------------|-----------|
| DDR      | 133              | 266                    | DDR266    | 2128          | PC2100    |
| DDR      | 150              | 300                    | DDR300    | 2400          | PC2400    |
| DDR      | 200              | 400                    | DDR400    | 3200          | PC3200    |
| DDR2     | 266              | 533                    | DDR2-533  | 4264          | PC4300    |
| DDR2     | 333              | 667                    | DDR2-667  | 5336          | PC5300    |
| DDR2     | 400              | 800                    | DDR2-800  | 6400          | PC6400    |
| DDR3     | 533              | 1066                   | DDR3-1066 | 8528          | PC8500    |
| DDR3     | 666              | 1333                   | DDR3-1333 | 10,664        | PC10700   |
| DDR3     | 800              | 1600                   | DDR3-1600 | 12,800        | PC12800   |
| DDR4     | 1066–1600        | 2133-3200              | DDR4-3200 | 17,056–25,600 | PC25600   |
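The MB/sec/DIMM column follows from the transfer rate. The sketch below assumes an 8-byte (64-bit) DIMM data path. That width is not stated in the table, but it is consistent with the DDR through DDR3 rows. The names are illustrative.

```python
DIMM_WIDTH_BYTES = 8   # assumed 64-bit data path per DIMM


def dimm_bandwidth_mb_s(mega_transfers_per_s):
    """Peak MB/s per DIMM = M transfers/s x bytes per transfer."""
    return mega_transfers_per_s * DIMM_WIDTH_BYTES


for name, mt_s in (("DDR266", 266), ("DDR2-533", 533),
                   ("DDR3-1066", 1066), ("DDR3-1600", 1600)):
    print(f"{name:10s} {dimm_bandwidth_mb_s(mt_s):6d} MB/s")
```

The DIMM names in the table are rounded versions of these figures; for example, DDR2-533 gives 4264 MB/s and is sold as PC4300.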

#### DDR:

- DDR2
  - Lower power (2.5 V -> 1.8 V)
  - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
- DDR3
  - 1.5 V
  - 800 MHz
- DDR4
  - 1-1.2 V
  - 1600 MHz

#### Graphics memory:

- Achieve 2-5× bandwidth per DRAM vs. DDR3
  - Wider interfaces (32 vs. 16 bit)
  - Higher clock rate
    - Possible because they are attached via soldering instead of socketed DIMM modules



- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique

# Increasing Memory Bandwidth



## Six basic cache optimizations:

- Larger block size (→ spatial locality)
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time
- Giving priority to read misses over writes (more reads than writes)
  - Reduces miss penalty
- Avoiding address translation in cache indexing
  - Reduces hit time