# Performance

# How fast can we clock?

- Single-cycle MIPS performance
- General pipelining
- MIPS pipe, LW
- MIPS performance, piped vs non-piped
- Arrays, piped
- Hazards



A clock edge changes the PC output.
How long before we can clock the PC and the register file again? What is the critical path?

#### Single Cycle Processor Performance

- Functional unit delay
  - Memory: 200ps
  - ALU and adders: 200ps
  - Register file: 100 ps

| Instruction<br>Class | Instruction memory | Register<br>read | ALU<br>operation | Data<br>memory | Register<br>write | Total |
|----------------------|--------------------|------------------|------------------|----------------|-------------------|-------|
| R-type               | 200                | 100              | 200              |                | 100               | 600   |
| load                 | 200                | 100              | 200              | 200            | 100               | 800   |
| store                | 200                | 100              | 200              | 200            |                   | 700   |
| branch               | 200                | 100              | 200              |                |                   | 500   |
| jump                 | 200                |                  |                  |                |                   | 200   |

Max delay = $T_{clock}$ = 800 ps, so $\frac{1}{T_{clock}} = \frac{1}{0.8 \, \text{ns}} = 1.25 \, \text{GHz}$

- CPU clock cycle = 800 ps = 0.8 ns (1.25 GHz)
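The table can be reproduced with a few lines of Python. This is a minimal sketch: the delays and the units used by each instruction class are taken from the list and table above, while the dictionary and function names are made up for illustration.

```python
# Functional-unit delays in picoseconds (from the list above).
DELAY_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Units each instruction class passes through (from the table above).
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

def path_delay_ps(instr_class):
    """Total delay of one instruction class: sum of the units it uses."""
    return sum(DELAY_PS[unit] for unit in PATHS[instr_class])

def clock_period_ps():
    """A single fixed clock must cover the slowest instruction class."""
    return max(path_delay_ps(c) for c in PATHS)

def clock_rate_ghz():
    """Convert a period in ps to a rate in GHz (1000 ps = 1 ns)."""
    return 1000.0 / clock_period_ps()

for c in PATHS:
    print(f"{c:7s} {path_delay_ps(c)} ps")
print("clock period:", clock_period_ps(), "ps")
print("clock rate:", clock_rate_ghz(), "GHz")
```

The load path sets the clock, so every other instruction class wastes part of each cycle.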

C. Kozyrakis

EE108h Winter 2010 Lecture 8


What if we make control more complex? Set clock by opcode?

| Instruction<br>Class | Instruction<br>memory | Register<br>read | ALU<br>operation | Data<br>memory | Register<br>write | Total<br>(ps) | Clock period<br>(ns) |
|----------------------|-----------------------|------------------|------------------|----------------|-------------------|---------------|----------------------|
| R-type               | 200                   | 100              | 200              |                | 100               | 600           | 0.6                  |
| load                 | 200                   | 100              | 200              | 200            | 100               | 800           | 0.8                  |
| store                | 200                   | 100              | 200              | 200            |                   | 700           | 0.7                  |
| branch               | 200                   | 100              | 200              |                |                   | 500           | 0.5                  |
| jump                 | 200                   |                  |                  |                |                   | 200           | 0.2                  |

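What is the speedup? Taking 1.25 GHz as the fixed-clock rate and 1.6 GHz as the average rate with a per-opcode clock:

$$S_{\text{new/old}} = \frac{n \, \text{CPI} \, (1/CR_{\text{old}})}{n \, \text{CPI} \, (1/CR_{\text{new}})} = \frac{CR_{\text{new}}}{CR_{\text{old}}} = \frac{1.6}{1.25} = 1.28$$

Worth it?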


Problem:

- Each functional unit used once per cycle
- Most of the time it is sitting waiting for its turn.
  - Well it is calculating all the time, but it is waiting for valid data
- There is no parallelism in this arrangement
- Making instructions take more cycles can make machine faster!
  - Each instruction takes roughly the same time
    - While the CPI is much worse, the clock freq is much higher

Overlap execution of multiple instructions at the same time

- Different instructions will be active at the same time
- This is called "Pipelining"
- We will look at a 5 stage pipeline
  - Modern machines (Core 2) have order 15 cycles/instruction

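Which matters, CPI or CR?

$$\text{Perf} = \frac{n}{t} = \frac{n}{n \, \text{CPI} \, (1/CR)} = \frac{1}{\text{CPI}} \, CR$$

Can we win? If each instruction takes $k$ times as many cycles and the clock runs $k$ times faster:

$$\text{Perf} \rightarrow \frac{n}{n \, (k \, \text{CPI}) \, (1/(k \, CR))}$$

No change?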

# Pipelining Load (LW)

- Load instruction takes 5 stages
  - Five independent functional units work on each stage
    - Each functional unit used only once for a single inst.
  - Another load can start as soon as 1st finishes IF stage
  - Each load still takes 5 cycles to complete
  - The throughput, however, is much higher

With all stages busy, 1 instruction exits per cycle.





**ADD R1, R2, R3**

SUB R4, R5, R1

Required delay before using written data: SUB reads R1, which ADD writes.

# Fix delay? -> negedge FF for Regfile

Positive edge-triggered FF: output changes on rising clock



Negative edge-triggered FF: output changes on falling clock





Written data available for read in same cycle.



[Figure: pipelined execution diagram. Each instruction goes through instruction fetch, register read, and result write, with successive instructions overlapped.]

MIPS: LW \$2, 15( \$1 ) [ op | rs | rt | off ]

[op | io | io | oii ]

LC4: LDR R2, R1, #15

[ op | SR1 | DR | off ]
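To make the MIPS field layout concrete, here is a small sketch that packs and unpacks `LW $2, 15($1)`. The field widths (6/5/5/16 bits) and the `lw` opcode `0x23` come from the MIPS32 instruction set, not from these notes, and the function names are hypothetical.

```python
LW_OPCODE = 0x23  # MIPS32 opcode for lw (assumed from the ISA, not from the notes)

def encode_i_type(op, rs, rt, offset):
    """Pack [ op(6) | rs(5) | rt(5) | off(16) ] into one 32-bit word."""
    assert 0 <= op < 64 and 0 <= rs < 32 and 0 <= rt < 32
    return (op << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)

def decode_i_type(word):
    """Split a 32-bit word back into its fields; the offset is sign-extended."""
    off = word & 0xFFFF
    if off & 0x8000:
        off -= 0x10000
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "off": off,
    }

# LW $2, 15($1): base register rs = $1, destination rt = $2, offset = 15
word = encode_i_type(LW_OPCODE, rs=1, rt=2, offset=15)
print(f"{word:#010x}")      # 0x8c22000f
print(decode_i_type(word))
```

Because every field sits at a fixed bit position, the register numbers can be sent to the register file before the opcode has been fully decoded.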

#### Instruction fetch:

PC++ ==> PC
Instruction ==> Instr





#### **Execute:**

ID/EX EX/MEM

PC ==> ADD ==> PC

SR1out ==> ALU
offset ==> ALU
==> Res

DR ==> DR

CTL ==> CTL





- Use a Main Control unit to generate signals during RF/ID Stage
  - Control signals for EX
    - (ExtOp, ALUSrc, ...) used 1 cycle later
  - Control signals for Mem
    - (MemWr, Branch) used 2 cycles later
  - Control signals for WB
    - (MemtoReg, RegWr) used 3 cycles later

Use pipeline registers to carry the control signals forward; alternatively, pass the OP field along and decode it in each stage as needed.
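The idea of carrying control signals in the pipeline registers can be sketched as follows. The signal values are illustrative, `RegWr` is assumed to be the register-file write enable used in WB, and the dictionary-based registers and function names are hypothetical.

```python
# Control bits grouped by the stage that consumes them (values are illustrative).
CONTROL = {
    "lw":  {"EX":  {"ExtOp": 1, "ALUSrc": 1},
            "MEM": {"MemWr": 0, "Branch": 0},
            "WB":  {"MemtoReg": 1, "RegWr": 1}},
    "sw":  {"EX":  {"ExtOp": 1, "ALUSrc": 1},
            "MEM": {"MemWr": 1, "Branch": 0},
            "WB":  {"MemtoReg": 0, "RegWr": 0}},
    "nop": {"EX":  {"ExtOp": 0, "ALUSrc": 0},
            "MEM": {"MemWr": 0, "Branch": 0},
            "WB":  {"MemtoReg": 0, "RegWr": 0}},
}

EMPTY = {"ID/EX": {}, "EX/MEM": {}, "MEM/WB": {}}

def main_control(op):
    """Decode once, in the RF/ID stage."""
    return CONTROL[op]

def clock(regs, decoded):
    """One clock edge: each pipe register keeps only what later stages need."""
    return {
        "ID/EX":  dict(decoded),
        "EX/MEM": {s: regs["ID/EX"][s] for s in ("MEM", "WB") if s in regs["ID/EX"]},
        "MEM/WB": {s: regs["EX/MEM"][s] for s in ("WB",) if s in regs["EX/MEM"]},
    }

regs = EMPTY
for cycle, op in enumerate(["lw", "nop", "nop"], start=1):
    regs = clock(regs, main_control(op))
    print(cycle, {name: sorted(groups) for name, groups in regs.items()})
```

After the third edge, the `lw` write-back signals sit in MEM/WB, which matches "used 3 cycles later". Passing the opcode along instead would replace the signal groups with a single field and a small decoder per stage.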



- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages

### Pipeline Performance
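Using the stage times assumed above, a rough comparison of single-cycle and pipelined execution time might look like this. It is a sketch that ignores hazards; mapping register read to ID and register write to WB, and the function names, are assumptions.

```python
# Stage times in picoseconds: 100 ps for register read/write, 200 ps otherwise.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_time_ps(n):
    """Every instruction takes one long cycle covering all five stages."""
    return n * sum(STAGE_PS.values())

def pipelined_time_ps(n):
    """Clock is set by the slowest stage; the pipe needs (stages - 1) cycles to fill."""
    period = max(STAGE_PS.values())
    return (len(STAGE_PS) + n - 1) * period

for n in (3, 1_000_000):
    t_single = single_cycle_time_ps(n)
    t_pipe = pipelined_time_ps(n)
    print(f"n={n}: {t_single} ps vs {t_pipe} ps, speedup {t_single / t_pipe:.2f}x")
```

For three instructions this gives 2400 ps against 1400 ps. For long programs the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.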



- MIPS ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step

  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage
  - Alignment of memory operands
    - Memory access takes only one cycle (vs. 2 for misaligned => stall)

orthogonality

#### But Something Is Fishy Here

- If dividing it into 5 parts made the clock faster (800 ps → 200 ps)
  - And the effective CPI is still one
- Then dividing it into 10 parts would make the clock even faster (200 ps → 100 ps)?
  - And wouldn't the CPI still be one?
- Then why not go to twenty cycles?
- Really two issues
  - Some things really have to complete in a cycle
    - Find next PC from current PC
  - CPI is not really one.
    - Sometimes you need the results of a previous instruction that is not done

An instruction does not complete in 1 cycle. What if a later instruction needs its result before that latency has elapsed?

We cannot subdivide every operation.

### Can Pipelining Lead to an Arbitrarily Short Clock Cycle?

- Min clock cycle = longest combinatorial delay + FF setup + clock skew
- Pipelining reduces the combinatorial delay
  - Less work per pipeline stage
  - Ideally, N stages reduce delay to 1/N
  - Best you can achieve is clock cycle = FF setup + clock skew
    - Diminishing returns from ever longer pipelines...
- Imbalance between stages also reduces benefits from subdividing
- Even if you could continuously improve clock frequency
  - Power consumption ∝ frequency
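The diminishing returns can be seen numerically. The sketch below assumes perfectly balanced stages, the 800 ps of logic from the single-cycle load path, and a hypothetical 50 ps of FF setup plus clock skew (that overhead figure is not from the notes).

```python
LOGIC_PS = 800     # total combinatorial work, from the single-cycle load path
OVERHEAD_PS = 50   # hypothetical FF setup + clock skew, chosen only for illustration

def min_clock_ps(n_stages, logic_ps=LOGIC_PS, overhead_ps=OVERHEAD_PS):
    """Ideal case: logic splits evenly, but the per-stage overhead does not shrink."""
    return logic_ps / n_stages + overhead_ps

def clock_speedup(n_stages):
    """Clock-rate gain relative to the unpipelined (1-stage) design."""
    return min_clock_ps(1) / min_clock_ps(n_stages)

for n in (1, 5, 10, 20, 100):
    print(f"{n:3d} stages: {min_clock_ps(n):6.1f} ps, speedup {clock_speedup(n):5.2f}x")
```

With these numbers the speedup can never reach 850/50 = 17, no matter how many stages are added, and real pipelines fall further short because stages are not balanced and CPI grows.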

# Dependencies and Hazards

- Hazards: situations that prevent starting the next instruction in the next cycle
  - Wasted cycles, CPI > 1
- Hazards are due to dependencies between instructions
  - Two instructions share resources or data
  - Pipelining may lead to overlapping their execution
- Types of hazards
  - Structural hazard (resource conflict)
    - Two instructions need to use the same piece of hardware at the same time (e.g. 2 refs to data memory)
  - Data hazard
    - Instruction depends on result of instruction still in the pipeline
  - Control hazard
    - Instruction fetch depends on the result of instruction in pipeline

#### Structural Hazard

- Simple example: MIPS pipeline with a single unified memory
  - No separate instruction & data memories
- Load/store requires data access to memory
  - Instruction fetch would have to stall for that cycle
  - Would cause a pipeline "bubble"



- Dependencies are a property of your program (always there)
- Dependencies may lead to hazards on a specific pipeline







## STALLS + Performance

Suppose 40% of instructions cause 3-bubble stalls.

$$T_{\text{w/ stalls}} = (n \text{ instructions}) \left[ (60\%)(1 \text{ cycle}) + (40\%)(1 \text{ cycle} + 3 \text{ bubbles}) \right] (1/CR)$$

$$T_{\text{w/o stalls}} = (n \text{ instructions}) \left[ (100\%)(1 \text{ cycle}) \right] (1/CR)$$

$$S_{\text{w/ over w/o}} = \frac{T_{\text{w/o stalls}}}{T_{\text{w/ stalls}}} = \frac{1 \cdot 1}{0.6 \cdot 1 + 0.4 \cdot 4} = \frac{1}{2.2} < \frac{1}{2}$$
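The same stall arithmetic, written as a small sketch so other stall fractions and bubble counts can be tried. The function and parameter names are hypothetical.

```python
def effective_cpi(stall_fraction, bubbles_per_stall, base_cpi=1.0):
    """Average cycles per instruction when a fraction of instructions add bubbles."""
    clean = (1 - stall_fraction) * base_cpi
    stalled = stall_fraction * (base_cpi + bubbles_per_stall)
    return clean + stalled

def speedup_with_stalls(stall_fraction, bubbles_per_stall, base_cpi=1.0):
    """Speed of the stalling pipeline relative to the ideal one (same n, same clock)."""
    return base_cpi / effective_cpi(stall_fraction, bubbles_per_stall, base_cpi)

print(round(effective_cpi(0.4, 3), 3))        # 2.2
print(round(speedup_with_stalls(0.4, 3), 3))  # 0.455
```

A value below 1 means a slowdown: with these numbers, stalls cost more than half of the ideal pipelined throughput.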

#### How to Stall the Pipeline, OR How to Insert a NOP or Bubble

- You discover the need to stall when 2<sup>nd</sup> instruction is in ID stage
  - Idea: repeat its ID stage until hazard resolved; let all instructions ahead of it move forward; stall all instructions behind it
1. Force control values in the ID/EX register to a NOP instruction
   - As if you fetched or \$0, \$0, \$0
   - When it propagates to EX, MEM and WB on following cycles, nothing will happen (nop = no-operation)
2. Prevent update of the PC and IF/ID register
   - The using instruction is decoded again
   - The following instruction is fetched again
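The two steps above can be sketched as a toy model of the pipeline front end. This is only an illustration: instructions are plain strings, the PC counts instructions rather than bytes, and the `stall` flag stands in for whatever hazard-detection logic drives it.

```python
NOP = "nop"

def step(state, program, stall):
    """One clock edge for PC, IF/ID and ID/EX."""
    pc, if_id = state["pc"], state["IF/ID"]
    if stall:
        # Write enables off: PC and IF/ID hold their values, so the instruction
        # in ID is decoded again and the same instruction is fetched again.
        # A NOP is forced into ID/EX in its place.
        return {"pc": pc, "IF/ID": if_id, "ID/EX": NOP}
    fetched = program[pc] if pc < len(program) else NOP
    return {"pc": pc + 1, "IF/ID": fetched, "ID/EX": if_id}

program = ["lw $2, 15($1)", "add $4, $2, $3", "sub $5, $6, $7"]
state = {"pc": 0, "IF/ID": NOP, "ID/EX": NOP}
for stall in (False, False, True, False):
    state = step(state, program, stall)
    print(state)
```

On the stalled edge the `add` stays in IF/ID and a bubble enters ID/EX behind the `lw`; on the next edge execution resumes with nothing lost.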

0 -> WE





NOP

OR, set WE = 0 (1 AND 1 = 1, 0 AND 0 = 0)



#### Better Than Stalling

Send the data directly to the ALU input? Do the register write later, as if R1 had been read?

(new feedback path)



RAW reg-mode?

Data available next tick.

Forwarding (feedback) works.





LW?

Why not forward Dmem.out? Data memory followed by the ALU in one cycle would slow the clock down to 400 ps?!?

Forward from WB instead, and insert a NOP (bubble).

## LW stall





Forwarding Control: ID/EX.(rs, rt) =? (EX/MEM.rd or MEM/WB.rd)
===> Set MUXes

# multiple feedback at once?



Feedback paths to ALU go to both inputs. Hazard detection sets MUXes: Opcode needed in pipe stage registers for detection.
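A sketch of the forwarding comparison for one ALU input is shown below. The `reg_write` check and the exclusion of register 0 are standard refinements assumed here and not stated in the notes; the function name, field names, and return labels are hypothetical. Checking EX/MEM first is one way to answer the "multiple feedback at once" question, since the most recent result wins.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick the MUX setting for one ALU input of the instruction in EX.

    src_reg: ID/EX.rs or ID/EX.rt
    ex_mem, mem_wb: dicts with 'rd' (destination register) and 'reg_write'
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"   # newest value: result of the previous instruction
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"   # value from two instructions back
    return "REG"          # no hazard: use the register file output

# ADD R1, R2, R3 is now in MEM; SUB R4, R5, R1 is in EX and reads R5 and R1.
ex_mem = {"rd": 1, "reg_write": True}
mem_wb = {"rd": 7, "reg_write": True}
print(forward_select(5, ex_mem, mem_wb))  # REG
print(forward_select(1, ex_mem, mem_wb))  # EX/MEM
```

The same function is evaluated once per ALU input, which is why the feedback paths go to both MUXes.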

