

Figure labels: data signal used; functional unit 1; wait



- Problem
  - Each functional unit is used once per cycle
  - Most of the time it is sitting waiting for its turn
    - Well, it is calculating all the time, but it is waiting for valid data
  - There is no parallelism in this arrangement
- Making instructions take more cycles can make the machine faster!
  - Each instruction takes roughly the same time
    - While the CPI is much worse, the clock frequency is much higher

Overlap execution of multiple instructions at the same time

- Different instructions will be active at the same time
- This is called "Pipelining"

C. Kozyrakis

- We will look at a 5 stage pipeline
  - Modern machines (Core 2) take on the order of 15 cycles per instruction from start to finish

$$\operatorname{Perf} = \frac{n}{T} = \frac{n}{n \cdot \operatorname{CPI} \cdot (1/\operatorname{CR})} = \frac{\operatorname{CR}}{\operatorname{CPI}}$$

## Sequential Laundry





Pipelined laundry takes 3.5 hours for 4 loads
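
The 3.5 hour figure can be reproduced with a small timing model. This is only a sketch: the stage times (30 min wash, 40 min dry, 20 min fold) are an assumption taken from the usual textbook version of this example and do not appear in the text above, and the helper `finish_time` is a hypothetical name. The model also assumes a finished load can wait in a basket until the next stage is free.

```python
def finish_time(stage_minutes, loads):
    """Minutes until the last load leaves the last stage of the pipeline."""
    free_at = [0] * len(stage_minutes)   # time at which each stage becomes free
    done = 0
    for _ in range(loads):
        ready = 0                        # time at which this load is ready for the next stage
        for i, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[i])   # wait for both the load and the stage
            ready = start + minutes
            free_at[i] = ready
        done = ready
    return done

STAGES = [30, 40, 20]    # assumed wash, dry, fold times in minutes
LOADS = 4

sequential = LOADS * sum(STAGES)         # one load at a time, start to finish
pipelined = finish_time(STAGES, LOADS)   # loads overlap in different stages

print(sequential / 60, "hours sequential")
print(pipelined / 60, "hours pipelined")
```

Under these assumptions the sequential schedule takes 6 hours and the pipelined one 3.5 hours. Each load still needs 90 minutes on its own, and the 40 minute dryer sets the rate at which loads finish.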


What does Amdahl's Law say?

## Pipelining Lessons





- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" pipeline and time to "drain" it reduces speedup
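
The effect of fill and drain time on speedup can be made concrete with a short calculation. This is a sketch for an idealized pipeline with perfectly balanced stages and no stalls; the function names are hypothetical.

```python
def pipelined_cycles(tasks, stages):
    """Cycles to push `tasks` through a balanced pipeline: fill, then one task per cycle."""
    return stages + tasks - 1

def speedup(tasks, stages):
    """Speedup over running each task through all stages before starting the next."""
    unpipelined_cycles = tasks * stages
    return unpipelined_cycles / pipelined_cycles(tasks, stages)

for n in (1, 4, 100, 1000):
    print(f"{n:5d} tasks on 5 stages: speedup {speedup(n, 5):.2f}")
```

With a single task there is no gain at all, and the speedup only approaches the number of stages as the workload grows, which is the fill and drain cost in the last bullet.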





F & G are "idle", just holding their outputs stable while H performs its computation

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers:
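
A worked version of these numbers, as a sketch. The circuit topology is an assumption, since the figure did not survive: F and G are taken to work in parallel on the same input, with both feeding H. The 2-stage pipeline is assumed to place one register between {F, G} and H and one at the output.

```python
F_NS, G_NS, H_NS = 15, 20, 25   # propagation delays from the text, in ns

# Unpipelined: the input must pass through the slower of F/G, then H.
unpiped_latency = max(F_NS, G_NS) + H_NS        # ns
unpiped_throughput = 1 / unpiped_latency        # results per ns

# 2-stage pipeline with ideal zero-delay registers:
# the clock period is set by the slowest stage.
clock_period = max(max(F_NS, G_NS), H_NS)       # ns
piped_latency = 2 * clock_period                # two stages, one clock each
piped_throughput = 1 / clock_period             # one result per clock

print(f"unpipelined: latency {unpiped_latency} ns, throughput 1/{unpiped_latency} per ns")
print(f"2-stage:     latency {piped_latency} ns, throughput 1/{clock_period} per ns")
```

Under these assumptions latency gets slightly worse (45 ns to 50 ns) while throughput improves from 1/45 to 1/25 results per ns.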

(CR) = 200ps





# A pipelining methodology

### Step 1:

Draw a line that crosses every output in the circuit, and mark the endpoints as terminal points.

### Step 2:

Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction. These lines demarcate pipeline stages.

Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline.
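
The methodology can be checked mechanically. The sketch below is illustrative only: the element names, delays, connections, and stage assignment are hypothetical, and `analyze` is not from the slides. It verifies that no connection crosses a stage line backwards, then derives the clock period (the longest combinational path inside any one stage) and the latency.

```python
from functools import lru_cache

# Hypothetical circuit: element delays in ns, connections, and chosen stage per element.
DELAY_NS = {"A": 4, "B": 3, "C": 8, "D": 4, "E": 2, "F": 5}
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "F")]
STAGE = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2, "F": 2}

def analyze(delay, edges, stage):
    """Return (clock period, latency) for a staged circuit, or raise if ill-formed."""
    for src, dst in edges:
        if stage[dst] < stage[src]:
            raise ValueError(f"connection {src}->{dst} crosses a stage line backwards")

    # Only connections that stay inside one stage form combinational paths.
    same_stage_preds = {name: [] for name in delay}
    for src, dst in edges:
        if stage[src] == stage[dst]:
            same_stage_preds[dst].append(src)

    @lru_cache(maxsize=None)
    def settle(name):
        """Time after the clock edge at which this element's output is valid."""
        return delay[name] + max((settle(p) for p in same_stage_preds[name]), default=0)

    clock = max(settle(name) for name in delay)
    num_stages = max(stage.values()) + 1    # the last line crosses every output
    return clock, num_stages * clock

clock_ns, latency_ns = analyze(DELAY_NS, EDGES, STAGE)
print(f"clock period {clock_ns} ns, throughput 1/{clock_ns} per ns, latency {latency_ns} ns")
```

In this made-up circuit the 8 ns element sits alone in its stage, so it sets the clock period by itself, and three stages give a latency of 24 ns.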

#### STRATEGY:

Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS), isolating them from other delays.

Figure: example circuit with inputs, outputs, element delays of 2 to 8 ns, and cut-set lines; annotated T = 1/8 ns, L = 24 ns.











|         | LATENCY | THROUGHPUT |
|---------|---------|------------|
| 0-pipe: | 2+1+1   | 1/4        |
| 1-pipe: | 4       | 1/4        |
| 2-pipe: | 2+2     | 1/2        |
| 3-pipe: | 2+2+2   | 1/2        |

#### OBSERVATIONS:

- A 1-pipeline improves neither L nor T.
- T is improved by breaking long combinational paths, allowing a faster clock.
- Too many stages cost L, don't improve T.
- Back-to-back registers are often required to keep pipeline well-formed.

## Pipelined Components



Pipelined systems can be hierarchical:

- Replacing a slow combinational component with a k-pipe version may increase clock frequency
- Must account for new pipeline stages in our plan

Split A into 2 stages with T = 1? How?

# Circuit Interleaving

We can simulate a pipelined version of a slow component by replicating the critical element and alternating inputs between the various copies.

> This is a simple 2-state FSM that alternates between 0 and 1 on each clock
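
A behavioral sketch of interleaving, assuming a slow component that needs two clocks per result. The function `interleaved` and its parameters are hypothetical names, and `None` is used here as an "empty" marker, so it cannot be a real input value.

```python
def interleaved(inputs, slow_fn, copies=2):
    """Feed one input per clock into `copies` replicas of a slow component."""
    latches = [None] * copies    # input currently held by each replica
    outputs = []
    select = 0                   # FSM state: which replica is served this clock
    # Extra empty clocks at the end drain the results still in flight.
    for x in list(inputs) + [None] * copies:
        if latches[select] is not None:
            # This replica has had `copies` clocks to finish its computation.
            outputs.append(slow_fn(latches[select]))
        latches[select] = x      # hand it the next input
        select = (select + 1) % copies
    return outputs

print(interleaved([1, 2, 3, 4, 5], lambda v: v * v))
```

Each replica sees a new input only every second clock, yet the combined circuit accepts one input and produces one result per clock, in order, just like a 2-stage pipelined version of the component.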





Wash = 30 min, Dry = 2 × 30 min

















Throughput = 2/30 = 1/15 load/min



parallel pipes w/interleave

We can combine interleaving and pipelining with parallelism.
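
The laundry throughput numbers can be checked with exact fractions. This sketch assumes the wash and dry times noted above (30 and 60 minutes); the counts of washers, dryers, and parallel pipelines are assumptions, since the figure is missing, and `throughput` is a hypothetical helper.

```python
from fractions import Fraction

def throughput(stages, parallel_pipes=1):
    """Loads per minute. `stages` is a list of (minutes per load, interleaved copies)."""
    # Each pipeline runs at the rate of its slowest stage; copies multiply a stage's rate.
    per_pipe = min(Fraction(copies, minutes) for minutes, copies in stages)
    return parallel_pipes * per_pipe

one_dryer = throughput([(30, 1), (60, 1)])                      # dryer is the bottleneck
two_dryers = throughput([(30, 1), (60, 2)])                     # interleaved dryers
two_pipes = throughput([(30, 1), (60, 2)], parallel_pipes=2)    # plus parallelism

print(one_dryer, two_dryers, two_pipes)
```

Interleaving two dryers brings the slow stage up to the washer's rate of 1/30 load/min, and running two such pipelines side by side gives 2/30 = 1/15 load/min.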



#### Control Structure Alternatives



### Control Structure Taxonomy



### Data Dependency Graphs and Pipelining

- Multiple jobs
- High throughput
- Is parallelism as high as possible?
- Is timing data dependent?





### Here's our combinational multiplier:



#### What's its propagation delay?

Naive (but valid) bound:

- O(n) additions
- O(n) time for each addition
- Hence O(n²) time required

#### On closer inspection:

- Propagation only toward left, bottom
- Hence longest path bounded by length + width of array: O(n+n) = O(n)
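
The O(n) bound can be checked on a simplified model of the array. This is a sketch, not the actual multiplier: it assumes an n × n grid in which each cell drives only its left neighbor and the cell below it, and counts the cells on the longest path.

```python
def longest_path_cells(n):
    """Cells on the longest path through an n x n array with left/down propagation."""
    # depth[r][c] = number of cells on the longest path that ends at cell (r, c)
    depth = [[1] * n for _ in range(n)]
    for r in range(n):                      # top to bottom
        for c in range(n - 1, -1, -1):      # right to left
            best = 0
            if r > 0:
                best = max(best, depth[r - 1][c])   # signal arriving from above
            if c < n - 1:
                best = max(best, depth[r][c + 1])   # signal arriving from the right
            depth[r][c] = best + 1
    return max(max(row) for row in depth)

for n in (4, 8, 16):
    print(f"n={n}: longest path {longest_path_cells(n)} cells, array has {n * n} cells")
```

In this model the longest path visits 2n - 1 cells, which grows linearly even though the array contains n² cells.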

#### Breaking O(n) combinational paths



both ways as systolic

#### Multiplier Cookbook: Chapter 4



# Can Pipelining Lead to an Arbitrarily Short Clock Cycle?

- Min clock cycle = longest combinatorial delay + FF setup + clock skew
- Pipelining reduces the combinatorial delay
  - Less work per pipeline stage
  - Ideally N stages reduce delay to 1/N
  - Best you can achieve is clock cycle = FF setup + clock skew
    - Diminishing returns from ever longer pipelines...
- Imbalance between stages also reduces benefits from subdividing
- Even if you could continuously improve clock frequency
  - Power consumption rises in proportion to frequency
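
The diminishing returns can be seen by evaluating the clock cycle formula for deeper and deeper pipelines. The numbers are hypothetical: 800 ps of total logic delay split evenly across the stages, and 50 ps of combined FF setup and clock skew per stage.

```python
def clock_cycle_ps(logic_ps, stages, overhead_ps):
    """Ideal clock cycle: an even share of the logic delay plus fixed register overhead."""
    return logic_ps / stages + overhead_ps

LOGIC_PS = 800      # assumed total combinatorial delay
OVERHEAD_PS = 50    # assumed FF setup + clock skew

base = clock_cycle_ps(LOGIC_PS, 1, OVERHEAD_PS)
for n in (1, 2, 5, 10, 20, 100):
    cycle = clock_cycle_ps(LOGIC_PS, n, OVERHEAD_PS)
    print(f"{n:3d} stages: cycle {cycle:6.1f} ps, clock speedup {base / cycle:4.1f}x")
```

With these assumed values the cycle never drops below 50 ps, so the clock speedup can never reach 850 / 50 = 17x, however many stages are added.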

#### 1-cycle MIPS processor

- Harvard Architecture (two memories)
- Clocking PC initiates cycle



#### Single Cycle Processor Performance

Functional unit delay

- Memory: 200ps

- ALU and adders: 200ps

- Register file: 100 ps

ps = $10^{-12}$ s

| Instruction class | Instruction memory (ps) | Register read (ps) | ALU operation (ps) | Data memory (ps) | Register write (ps) | Total (ps) |
|-------------------|-------------------------|--------------------|--------------------|------------------|---------------------|------------|
| R-type            | 200                     | 100                | 200                |                  | 100                 | 600        |
| load              | 200                     | 100                | 200                | 200              | 100                 | 800        |
| store             | 200                     | 100                | 200                | 200              |                     | 700        |
| branch            | 200                     | 100                | 200                |                  |                     | 500        |
| jump              | 200                     |                    |                    |                  |                     | 200        |

CPU clock cycle = 800 ps = 0.8 ns (1.25 GHz)

Max delay = $T_{clock}$

What if we let the clock period vary by opcode?

Instruction mix:

- 45% ALU
- 25% loads
- 10% stores
- 15% branches
- 5% jumps

| Instruction class | Instruction memory (ps) | Register read (ps) | ALU operation (ps) | Data memory (ps) | Register write (ps) | Total (ps) | Total (ns) |
|-------------------|-------------------------|--------------------|--------------------|------------------|---------------------|------------|------------|
| R-type            | 200                     | 100                | 200                |                  | 100                 | 600        | 0.6        |
| load              | 200                     | 100                | 200                | 200              | 100                 | 800        | 0.8        |
| store             | 200                     | 100                | 200                | 200              |                     | 700        | 0.7        |
| branch            | 200                     | 100                | 200                |                  |                     | 500        | 0.5        |
| jump              | 200                     |                    |                    |                  |                     | 200        | 0.2        |

Average CPU clock cycle = $(0.6) \times 45\% + (0.8) \times 25\% + (0.7) \times 10\% + (0.5) \times 15\% + (0.2) \times 5\%$ = 0.625 ns (1.6 GHz)


What is the speedup of the new design over the old one?
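
The weighted average and the resulting speedup can be checked directly from the table and the instruction mix. This is a small sketch; the variable names are hypothetical, and the speedup is taken as the old cycle time divided by the new average cycle time, assuming the instruction count is unchanged.

```python
# Per-class delay in ps (table totals) and share of the instruction mix in percent.
CLASSES = {
    "R-type": (600, 45),
    "load":   (800, 25),
    "store":  (700, 10),
    "branch": (500, 15),
    "jump":   (200, 5),
}

fixed_cycle_ps = max(delay for delay, _ in CLASSES.values())                # slowest class sets the clock
average_cycle_ps = sum(delay * pct for delay, pct in CLASSES.values()) / 100

speedup = fixed_cycle_ps / average_cycle_ps

print(f"fixed clock:    {fixed_cycle_ps} ps")
print(f"variable clock: {average_cycle_ps} ps on average")
print(f"speedup:        {speedup:.2f}x")
```

The variable clock averages 625 ps against 800 ps for the fixed clock, a speedup of 1.28.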

# Pipelining Load (lw)

- Load instruction takes 5 stages
  - Five independent functional units work on each stage
    - Each functional unit used only once
  - Another load can start as soon as the 1<sup>st</sup> finishes the IF stage
  - Each load still takes 5 cycles to complete
  - The throughput, however, is much higher
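
The overlap can be drawn as a cycle-by-cycle chart. This is a sketch: only the IF stage is named in the text, so the other stage names (ID, EX, MEM, WB) are assumed from the usual 5-stage MIPS pipeline, and `schedule` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # assumed names for the five stages

def schedule(num_loads):
    """One row per load, one column per cycle; each load starts one cycle after the previous."""
    total_cycles = num_loads + len(STAGES) - 1
    rows = []
    for load in range(num_loads):
        row = ["."] * total_cycles
        for offset, name in enumerate(STAGES):
            row[load + offset] = name      # this load occupies stage `name` in this cycle
        rows.append(row)
    return rows

for i, row in enumerate(schedule(4)):
    print(f"lw {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each row is still five cycles long, but once the pipeline is full one load completes every cycle, and in the middle columns every stage is busy with a different load.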



- 1 job exits per cycle (CPI: 1 job per 1 cycle)
- T = 5 cycles
- Each stage busy



