# Cache Optimizations

- Small and simple first level caches
  - Critical timing path:
    - addressing tag memory, then
    - comparing tags, then
    - selecting the correct way (setting the output mux)
  - Direct-mapped caches can overlap tag compare and transmission of data
  - Lower associativity reduces power because fewer cache lines are accessed





- To improve hit time, predict the way to pre-set mux
  - Mis-prediction gives longer hit time
  - Prediction accuracy
    - Above 90% for two-way
    - Above 80% for four-way
    - I-cache has better accuracy than D-cache
  - First used on MIPS R10000 in mid-90s
  - Used on ARM Cortex-A8
- Extend to predict block as well
  - "Way selection"
  - Increases mis-prediction penalty
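
The trade-off between prediction accuracy and hit time can be sketched as a weighted average. This is a minimal illustration, not a model of any specific processor: the function name and the 1-cycle and 2-cycle latencies are assumptions chosen for the example, while the 90% and 80% accuracies are the lower bounds quoted above.

```python
def average_hit_time(accuracy, predicted_hit_cycles, mispredicted_hit_cycles):
    """Expected hit time when a fraction `accuracy` of hits use the predicted way."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (accuracy * predicted_hit_cycles
            + (1.0 - accuracy) * mispredicted_hit_cycles)


# Assumed latencies (hypothetical): 1 cycle when the predicted way is right,
# 2 cycles when the mux has to be re-steered after a mis-prediction.
two_way = average_hit_time(0.90, 1, 2)   # accuracy bound quoted for two-way
four_way = average_hit_time(0.80, 1, 2)  # accuracy bound quoted for four-way
```

Under these assumed latencies the four-way case pays a slightly higher average hit time than the two-way case, because it mis-predicts more often.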



## Pipeline cache access to improve bandwidth

- Examples:
  - Pentium: 1 cycle
  - Pentium Pro through Pentium III: 2 cycles
  - Pentium 4 through Core i7: 4 cycles
- Increases branch mis-prediction penalty
- Makes it easier to increase associativity



Pipelined access delivers 1 word of data to the processor per cycle, at the cost of pipeline fill/drain overhead and pipeline hazard overhead.

- Organize cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2

## Interleave banks according to block address

| Bank 0 block address | Bank 1 block address | Bank 2 block address | Bank 3 block address |
|----------------------|----------------------|----------------------|----------------------|
| 0                    | 1                    | 2                    | 3                    |
| 4                    | 5                    | 6                    | 7                    |
| 8                    | 9                    | 10                   | 11                   |
| 12                   | 13                   | 14                   | 15                   |

**Figure 2.6** Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get byte addressing.
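
The interleaving described in the Figure 2.6 caption can be sketched as a modulo mapping from block address to bank. This is an illustrative sketch: the function names are hypothetical, and the bank count and block size are the ones stated in the caption.

```python
BLOCK_BYTES = 64  # block size assumed in the Figure 2.6 caption
NUM_BANKS = 4     # four-way interleaving


def bank_of(block_address, num_banks=NUM_BANKS):
    """Sequential block addresses rotate through the banks."""
    return block_address % num_banks


def byte_address(block_address, block_bytes=BLOCK_BYTES):
    """Convert a block address to the byte address of the block's first byte."""
    return block_address * block_bytes


def blocks_per_bank(num_blocks, num_banks=NUM_BANKS):
    """List the block addresses that land in each bank."""
    banks = [[] for _ in range(num_banks)]
    for block in range(num_blocks):
        banks[bank_of(block, num_banks)].append(block)
    return banks


layout = blocks_per_bank(16)
```

Because consecutive blocks fall in different banks, a sequential access stream spreads its requests evenly and the banks can be accessed simultaneously.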



#### General Principle: Keep working to hide latency

- Can we find other work to do?
- Do we have a mechanism for that?
- Multiple threads (switching), out-of-order execution, loop unrolling

## **Nonblocking Caches**

- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
- L2 must support this
- In general, processors can hide L1 miss penalty but not L2 miss penalty








Memory bandwidth used: 64 B / 36 ns ≈ 1.8 GB/s (roughly 2 GB/s)





How many concurrent requests can we handle at most? How many do we need to handle if we want to use all of our bandwidth?

#### Two concurrent misses supported

About 1/5 of the 16 GB/s the memory can deliver is used (2 × 64 B / 36 ns ≈ 3.6 GB/s)

$R_0$ exits the pipe

Maximum concurrent misses we can support? At steady state, the maximum request rate we can fulfill is 1 req / 4 ns, so the required bandwidth = (1 req / 4 ns)(64 B / req) = 16 GB/s

Pipelined memory

As long as queues are steady, n > 0!

What if each request takes exactly 36 ns of latency?

$$
36\ \text{ns/req} = n\ \text{steps down the pipeline} \times 4\ \text{ns/step} \quad\Rightarrow\quad n = 9
$$

Pipeline contents: $R_n, \ldots, R_1 \rightarrow R_0$, with $n = 9$
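
The arithmetic in these notes can be checked with a short calculation. This is a sketch using only the figures given above (64 B per request, 36 ns latency, one request every 4 ns); the function names are hypothetical.

```python
BLOCK_BYTES = 64        # bytes returned per request
LATENCY_NS = 36         # time for one request to travel down the memory pipeline
ISSUE_INTERVAL_NS = 4   # a new request can start every 4 ns


def bandwidth_gb_per_s(bytes_per_request, ns_per_request):
    """Bytes per nanosecond is numerically equal to GB/s (10^9 B per 10^9 ns)."""
    return bytes_per_request / ns_per_request


def pipeline_steps(latency_ns, issue_interval_ns):
    """Number of requests in the pipeline at once when it is kept full."""
    return -(-latency_ns // issue_interval_ns)  # ceiling division


one_at_a_time = bandwidth_gb_per_s(BLOCK_BYTES, LATENCY_NS)      # one miss per 36 ns
fully_pipelined = bandwidth_gb_per_s(BLOCK_BYTES, ISSUE_INTERVAL_NS)
steps = pipeline_steps(LATENCY_NS, ISSUE_INTERVAL_NS)
```

With a single request in flight the memory delivers a little under 2 GB/s. Keeping every pipeline step occupied raises that to the full 16 GB/s.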

## Collision (same line)



- We want minimum latency (36 ns)
  - We have to be able to handle 18 concurrent misses.
- We want to utilize full memory bandwidth
  - We need a mechanism that can still find other work to do, even though 18 instructions are queued waiting for data.

- Critical word first
  - Request missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart
  - Request words in normal order
  - Send missed word to the processor as soon as it arrives
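
The difference between the two policies can be sketched by listing the order in which words arrive and counting transfers until the processor can resume. This is illustrative only: the function names are hypothetical, and the wrap-around order after the missed word is an assumption, since the notes only say that the missed word is requested first.

```python
def critical_word_first_order(missed_word, words_per_block):
    """Missed word first, then the rest of the block (assumed wrap-around order)."""
    return [(missed_word + i) % words_per_block for i in range(words_per_block)]


def early_restart_order(words_per_block):
    """Words are requested in normal order, starting from word 0."""
    return list(range(words_per_block))


def transfers_until_restart(order, missed_word):
    """Word transfers that complete before the processor gets the missed word."""
    return order.index(missed_word) + 1


# Example: an 8-word block where the processor missed on word 5.
cwf = critical_word_first_order(5, 8)
er = early_restart_order(8)
cwf_wait = transfers_until_restart(cwf, 5)
er_wait = transfers_until_restart(er, 5)
```

In this example, critical word first lets the processor restart after one transfer, while early restart still waits for words 0 through 5 to arrive.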



- When storing to a block that is already pending in the write buffer, update write buffer
- Reduces stalls due to full write buffer
- Does not apply to I/O addresses
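
A merging write buffer can be sketched as a small list of block-sized entries. This is a simplified model, not a description of real hardware: the class name, the 64-byte block size, and the `is_io` flag are assumptions introduced for illustration.

```python
class MergingWriteBuffer:
    """Toy write buffer that merges stores to a block that is already pending."""

    def __init__(self, capacity, block_bytes=64):
        self.capacity = capacity        # number of entries (assumed parameter)
        self.block_bytes = block_bytes  # assumed block size
        self.entries = []               # each: {"block", "is_io", "data"}

    def write(self, address, value, is_io=False):
        """Return True if the store was buffered, False if the buffer is full."""
        block, offset = divmod(address, self.block_bytes)
        if not is_io:
            # Merge into a pending entry for the same block, if one exists.
            for entry in self.entries:
                if entry["block"] == block and not entry["is_io"]:
                    entry["data"][offset] = value
                    return True
        if len(self.entries) >= self.capacity:
            return False  # a real processor would stall here
        # I/O stores always take their own entry so each one stays separate.
        self.entries.append({"block": block, "is_io": is_io,
                             "data": {offset: value}})
        return True


buf = MergingWriteBuffer(capacity=2)
results = [buf.write(0, 1), buf.write(8, 2), buf.write(64, 3),
           buf.write(16, 4), buf.write(128, 5)]
```

In this usage, the stores to addresses 0, 8, and 16 share one entry, so the buffer only reports full when a third distinct block arrives.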



#### Loop Interchange

- Swap nested loops to access memory in sequential order
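
The effect of interchanging loops can be sketched by recording the memory offsets each loop order touches. This is illustrative and assumes a row-major array, as in C; the function names are hypothetical.

```python
def flat_index(row, col, num_cols):
    """Offset of element [row][col] in a row-major array."""
    return row * num_cols + col


def column_first_traversal(num_rows, num_cols):
    """Original nest: the inner loop walks down a column (stride = num_cols)."""
    return [flat_index(i, j, num_cols)
            for j in range(num_cols)
            for i in range(num_rows)]


def row_first_traversal(num_rows, num_cols):
    """Interchanged nest: the inner loop walks along a row (stride = 1)."""
    return [flat_index(i, j, num_cols)
            for i in range(num_rows)
            for j in range(num_cols)]


def strides(indices):
    """Distance between consecutive accesses."""
    return [b - a for a, b in zip(indices, indices[1:])]


before = column_first_traversal(2, 3)
after = row_first_traversal(2, 3)
```

Both orders touch the same elements, but only the interchanged nest visits them with a stride of one, so every word of a cache block is used before the block is replaced.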



#### Blocking

- Instead of accessing entire rows or columns, subdivide matrices into blocks
- Requires more memory accesses but improves locality of accesses
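
Blocking can be sketched with a matrix multiply that works on `block`-wide strips instead of whole rows and columns. This is a minimal sketch for square matrices stored as lists of lists; the function names and the block size used below are assumptions, and Python is used only to show the loop structure, not to measure cache behaviour.

```python
def matmul_naive(a, b, n):
    """Reference version: each result element sweeps a full row and column."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, n, block):
    """Blocked version: accumulate partial sums one sub-block at a time."""
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):
        for kk in range(0, n, block):
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    total = 0
                    for k in range(kk, min(kk + block, n)):
                        total += a[i][k] * b[k][j]
                    c[i][j] += total  # extra read-modify-write of c
    return c


n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
blocked = matmul_blocked(a, b, n, block=2)
```

The blocked version revisits `c[i][j]` once per `kk` strip, which is where the additional memory accesses come from. In exchange, the `block` × `block` piece of `b` is reused while it is still resident in the cache.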



- Fetch two blocks on miss (include next sequential block)
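
The benefit of fetching the next sequential block can be sketched by counting demand misses over a stream of block addresses. This is a deliberately simple model: it assumes an unbounded cache and that the prefetched block always arrives before it is needed, and the function name is hypothetical.

```python
def count_misses(block_addresses, prefetch_next=False):
    """Count demand misses; optionally fetch block + 1 on every miss."""
    cached = set()  # assumption: nothing is ever evicted
    misses = 0
    for block in block_addresses:
        if block not in cached:
            misses += 1
            cached.add(block)
            if prefetch_next:
                cached.add(block + 1)  # next sequential block comes along
    return misses


sequential = list(range(8))
without_prefetch = count_misses(sequential)
with_prefetch = count_misses(sequential, prefetch_next=True)
```

For a purely sequential stream the prefetch halves the number of demand misses under these assumptions. A stream that never touches the next block gains nothing and wastes the extra bandwidth.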



[Figure: Pentium 4 hardware prefetching]

> no exceptions (faults)!

- Insert prefetch instructions before data is needed
- Non-faulting: prefetch doesn't cause exceptions
- Register prefetch
  - Loads data into register
- Cache prefetch
  - Loads data into cache
- Combine with loop unrolling and software pipelining



**Figure H.1** A software-pipelined loop chooses instructions from different loop iterations, thus separating the dependent instructions within one iteration of the original loop. The start-up and finish-up code will correspond to the portions above and below the software-pipelined iteration.
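
The structure described in the Figure H.1 caption can be sketched with a loop that adds a constant to every element, split into load, add, and store steps. This is an illustrative sketch with hypothetical function names, and the choice of loop is an assumption made for the example. Python has no real instruction scheduling, so the code only shows how steps from three different iterations end up in one loop body.

```python
def add_constant_plain(values, constant):
    """Original loop: load, add, and store all belong to the same iteration."""
    out = list(values)
    for i in range(len(out)):
        loaded = out[i]             # load
        summed = loaded + constant  # add (depends on the load)
        out[i] = summed             # store (depends on the add)
    return out


def add_constant_software_pipelined(values, constant):
    """Each pass stores iteration i, adds iteration i+1, loads iteration i+2."""
    out = list(values)
    n = len(out)
    if n < 3:
        return add_constant_plain(values, constant)  # too short to pipeline
    # Start-up code: fill the software pipeline.
    loaded = out[0]
    summed = loaded + constant
    loaded = out[1]
    # Software-pipelined iteration: the three steps are now independent.
    for i in range(n - 2):
        out[i] = summed             # store for iteration i
        summed = loaded + constant  # add for iteration i + 1
        loaded = out[i + 2]         # load for iteration i + 2
    # Finish-up code: drain the software pipeline.
    out[n - 2] = summed
    out[n - 1] = loaded + constant
    return out


data = [10, 20, 30, 40, 50]
pipelined = add_constant_software_pipelined(data, 1)
```

The start-up and finish-up sections correspond to the portions above and below the software-pipelined iteration. Combined with prefetching, the load step is the natural place to issue a prefetch for a later iteration.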

