**Q.** At right is the multiplier unit of a 4-bit machine (4-bit x and y). MULT has its own local clock, **CK**, to control its shift-add cycles, which is started by sys\_clk if the current opcode is MULT. In each **CK** cycle, the adder performs a partial sum, x shifts left, y shifts right, and the partial sum clocks into the 8-bit result register, **S**. MULT is considered to start operating just after x and y are clocked into their respective registers, and finishes just after the final result is clocked into **S**. The ADDER is a 4-stage, ripple-carry adder with a delay per stage of 4 ps. The MUX has a delay of 4 ps.

Suppose this machine's ALU consists of NAND, ADD, and MULT units, and an ALU\_MUX, with these delays:

| Unit     | Delay |
|----------|-------|
| NAND     | 6 ps  |
| ADD      | 25 ps |
| ALU\_MUX | 6 ps  |

MULT unit



**Q.** In MULT, the product will be in **S** after how many **CK** cycles?

4 cycles. Each CK tick performs one add and stores the partial sum in S. Once the last bit of y has shifted out, MULT is complete.
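The cycle count can be checked with a small behavioral model of the shift-add loop. This is only a sketch: the figure is not reproduced here, so the code assumes the MUX passes x to the adder when the low bit of y is 1 and passes 0 otherwise. The function name `shift_add_multiply` is illustrative.

```python
def shift_add_multiply(x, y, bits=4):
    """Behavioral model of an unsigned shift-add multiplier.

    Assumption (not taken from the figure): the MUX selects x when the
    low bit of y is 1, and 0 otherwise. Returns (product, ck_cycles).
    """
    mask = (1 << bits) - 1
    x &= mask
    y &= mask
    s = 0                      # result register S, 2*bits wide
    cycles = 0
    for _ in range(bits):      # one CK cycle per bit of y
        if y & 1:              # MUX: add x only when y's low bit is set
            s += x
        x <<= 1                # x shifts left
        y >>= 1                # y shifts right
        cycles += 1
    return s & ((1 << (2 * bits)) - 1), cycles


product, cycles = shift_add_multiply(13, 11)
print(product, cycles)
```

For 4-bit operands the loop always runs four times, regardless of the operand values, which matches the 4-cycle answer.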

**Q.** All ALU instructions take **one sys\_clk clock cycle**. Supposing the clocks run as fast as possible, and the ALU determines the **sys\_clk** and **CK** rates, what is the machine's **sys\_clk** rate, CR?

MULT is the slowest unit, so it determines CR. The time from a CK tick until S's inputs are ready is (4 adder stages) $\times$ (4 ps/stage) $+$ 4 ps MUX delay $= 20$ ps. The total ALU delay for MULT is (4 CK ticks) $\times$ (20 ps/CK tick) $= 80$ ps, plus 6 ps for ALU\_MUX, giving 86 ps, which is sys\_clk's cycle time. $CR = \frac{1}{86\ \text{ps}} = \frac{1}{86} \times 10^{12}\ \text{cycles/sec} \approx 11.6\ \text{GHz}$.
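The arithmetic can be reproduced in a few lines. This is a sketch using only the delays stated in the problem; the variable names are illustrative.

```python
# Delays from the problem statement, in picoseconds.
ADDER_STAGES = 4
STAGE_DELAY_PS = 4
MUX_DELAY_PS = 4
ALU_MUX_DELAY_PS = 6
CK_TICKS_PER_MULT = 4

# One CK period: ripple through the adder, then through the MUX.
ck_period_ps = ADDER_STAGES * STAGE_DELAY_PS + MUX_DELAY_PS

# A full MULT takes four CK periods, then passes through ALU_MUX.
sys_clk_period_ps = CK_TICKS_PER_MULT * ck_period_ps + ALU_MUX_DELAY_PS

# 1 / (1 ps) = 1e12 Hz = 1e3 GHz.
clock_rate_ghz = 1e3 / sys_clk_period_ps

print(f"CK period      = {ck_period_ps} ps")
print(f"sys_clk period = {sys_clk_period_ps} ps")
print(f"CR             = {clock_rate_ghz:.1f} GHz")
```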



**Q.** Claire notices that the MULT unit could be pipelined without changing its hardware except its clocking: Eliminate CK and clock MULT with sys\_clk instead. A MULT instruction would then take 4 sys\_clk cycles to finish. Claire claims that if we also decrease sys\_clk's cycle time, the machine's overall performance will improve. What delay now determines the CR? What is the new CR? What is the CR speedup?

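The delays for NAND and ADD are (6 + 6) = 12 ps and (25 + 6) = 31 ps. MULT's delay per cycle is now (16 + 4 + 6) = 26 ps (adder, MUX, ALU\_MUX). So ADD is now the slowest path and determines CR. The new CR is $\frac{1}{31\ \text{ps}} \approx 32.3\ \text{GHz}$, and $S_{CR} = \frac{CR_{new}}{CR_{old}} = \frac{1/31}{1/86} = \frac{86}{31} \approx 2.77$ (about $2\tfrac{3}{4}$).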
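As a check, the critical path can be found by comparing each unit's delay plus the ALU\_MUX delay. This sketch assumes MULT's per-cycle delay is the 20 ps CK period derived earlier (16 ps adder plus 4 ps MUX); the dictionary names are illustrative.

```python
ALU_MUX_PS = 6
OLD_PERIOD_PS = 86  # sys_clk period of the original machine

# Per-cycle delay of each unit, in ps. MULT now does one shift-add step
# per sys_clk cycle (assumed: 4 stages * 4 ps + 4 ps MUX = 20 ps).
unit_delay_ps = {"NAND": 6, "ADD": 25, "MULT": 4 * 4 + 4}

# Every result also passes through ALU_MUX.
path_delay_ps = {unit: d + ALU_MUX_PS for unit, d in unit_delay_ps.items()}

critical_unit = max(path_delay_ps, key=path_delay_ps.get)
new_period_ps = path_delay_ps[critical_unit]
cr_speedup = OLD_PERIOD_PS / new_period_ps

print(path_delay_ps)
print(f"critical path: {critical_unit}, {new_period_ps} ps")
print(f"CR speedup: {cr_speedup:.2f}")
```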

**Q.** Suppose a job mix of 10% NAND instructions, 50% ADDs, and the rest MULTs. What are the average CPIs for the old machine and Claire's new machine? What is the speedup of the new machine relative to the old machine?

$$\overline{CPI}_{old} = \frac{\text{total cycles}}{n\ \text{instructions}} = \left[n_{ADD} \left(\frac{1\ \text{cycle}}{ADD}\right) + n_{NAND} \left(\frac{1\ \text{cycle}}{NAND}\right) + n_{MULT} \left(\frac{1\ \text{cycle}}{MULT}\right)\right] \frac{1}{n}$$

$$= \left(\frac{n_{ADD}}{n} + \frac{n_{NAND}}{n} + \frac{n_{MULT}}{n}\right) \left(1\ \text{cycle/instr.}\right) = 1\ \text{cycle/instr.}$$

$$\overline{CPI}_{new} = \left[n_{ADD} \left(\frac{1\ \text{cycle}}{ADD}\right) + n_{NAND} \left(\frac{1\ \text{cycle}}{NAND}\right) + n_{MULT} \left(\frac{4\ \text{cycles}}{MULT}\right)\right] \frac{1}{n}$$

$$= \frac{n_{ADD}}{n}(1) + \frac{n_{NAND}}{n}(1) + \frac{n_{MULT}}{n}(4) = 0.5 + 0.1 + 0.4 \times 4 = 2.2\ \text{cycles/instr.}$$

$$S_{new\text{-}old} = \frac{T_{old}}{T_{new}} = \frac{n \times \overline{CPI}_{old} \times \left(\frac{1}{CR_{old}}\right)}{n \times \overline{CPI}_{new} \times \left(\frac{1}{CR_{new}}\right)} = \frac{1}{2.2} \times S_{CR} = \frac{1}{2.2} \times \frac{86}{31} \approx 1.26$$

So the new machine is roughly 25% faster.
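The same calculation, written as a weighted sum over the job mix, is sketched below. The cycle times of 86 ps and 31 ps are the values found in the earlier answers; the function name `average_cpi` is illustrative.

```python
OLD_PERIOD_PS = 86
NEW_PERIOD_PS = 31

# Job mix: fraction of instructions of each type.
mix = {"NAND": 0.10, "ADD": 0.50, "MULT": 0.40}

# sys_clk cycles per instruction on each machine.
cycles_old = {"NAND": 1, "ADD": 1, "MULT": 1}
cycles_new = {"NAND": 1, "ADD": 1, "MULT": 4}


def average_cpi(mix, cycles):
    """Average CPI = sum over instruction types of fraction * cycles."""
    return sum(mix[op] * cycles[op] for op in mix)


cpi_old = average_cpi(mix, cycles_old)
cpi_new = average_cpi(mix, cycles_new)

# Time per instruction = CPI * cycle time; speedup = T_old / T_new.
speedup = (cpi_old * OLD_PERIOD_PS) / (cpi_new * NEW_PERIOD_PS)

print(f"CPI old = {cpi_old:.2f}, CPI new = {cpi_new:.2f}")
print(f"speedup = {speedup:.2f}")
```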

**Q.** What job mix gives the maximum speedup? The minimum? What are the speedups?
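One way to explore this question is to sweep the fraction of MULT instructions, since NAND and ADD both take one cycle on either machine and only the MULT fraction changes the new CPI. The sketch below assumes the 86 ps and 31 ps cycle times from the earlier answers; `speedup` and `break_even` are illustrative names.

```python
OLD_PERIOD_PS = 86
NEW_PERIOD_PS = 31
MULT_CYCLES_NEW = 4


def speedup(mult_fraction):
    """Speedup of the new machine for a mix with the given MULT fraction."""
    cpi_old = 1.0
    cpi_new = (1.0 - mult_fraction) + MULT_CYCLES_NEW * mult_fraction
    return (cpi_old * OLD_PERIOD_PS) / (cpi_new * NEW_PERIOD_PS)


for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"MULT fraction {f:.2f}: speedup {speedup(f):.2f}")

# MULT fraction at which both machines take the same time per instruction.
break_even = (OLD_PERIOD_PS / NEW_PERIOD_PS - 1) / (MULT_CYCLES_NEW - 1)
print(f"break-even MULT fraction: {break_even:.2f}")
```

Because the new CPI grows linearly with the MULT fraction, the speedup falls monotonically, so the extremes occur at the no-MULT and all-MULT mixes.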

**Q.** Claire suggests that by adding hardware to MULT, it can be pipelined so that multiple MULTs can execute simultaneously. To get started, a one-cycle implementation of MULT is given below. Add pipeline registers where appropriate. Give the speedup for the job mixes above.



The MIPS instruction set is designed to accommodate pipelining. The LC3 instruction set is very similar. In fact, it is almost a 16-bit version of MIPS with a very reduced instruction set. We will reduce the LC3 instruction set even further to only LD, BR, and ADD. (These instructions are specified in our lecture notes "Lec-1c-hardware.pdf", and in our archive "projects/LC3trunk/docs/LC3-3-PP-Append-A.pdf".) The one-cycle MIPS implementation (see Patterson & Hennessy, Chapter 4) has two memories: IMEM (for instructions) and DMEM (for data). Below we show a similar implementation for the LC3 ISA. It differs from MIPS in that the ALU is used to produce the BR target address. (Although we are not including any store instructions, a path for them is provided, shown in orange.) Delays are: IMEM and DMEM, 200 ps; RegFile, 100 ps; ALU, 150 ps; Writeback, 100 ps.

**Q.** Provide pipeline registers to implement pipelining for the LC3 circuit shown below.

**Q.** For the instruction mix (50% ADD, 30% LD, 20% BR), calculate the new CPI and speedup. You can ignore pipe fill and drain overhead.

**Q.** What problem do you see arising if we tried to include LDI in our implementation?



| Mnemonic | Operands      |
|----------|---------------|
| ADD      | DR, SR1, SR2  |
| ADD      | DR, SR1, imm5 |
# Encodings

Register form (ADD DR, SR1, SR2):

| Bits 15-12 | 11-9 | 8-6 | 5 | 4-3 | 2-0 |
|------------|------|-----|---|-----|-----|
| 0001       | DR   | SR1 | 0 | 00  | SR2 |

Immediate form (ADD DR, SR1, imm5):

| Bits 15-12 | 11-9 | 8-6 | 5 | 4-0  |
|------------|------|-----|---|------|
| 0001       | DR   | SR1 | 1 | imm5 |

# Operation

if (bit[5] == 0) DR = SR1 + SR2; else DR = SR1 + SEXT(imm5); setcc();
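The ADD operation above can be sketched as a decode-and-execute function. This is an illustrative model, not part of the LC3 tools: the names `sext`, `setcc`, and `execute_add` and the representation of the register file as a Python list are assumptions, and results are wrapped to 16 bits.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def setcc(result):
    """Return the condition code for a 16-bit result: 'N', 'Z', or 'P'."""
    if result & 0x8000:
        return "N"
    return "Z" if result == 0 else "P"


def execute_add(instr, regs):
    """Execute one ADD instruction; update regs in place, return the cc."""
    assert ((instr >> 12) & 0xF) == 0b0001, "not an ADD opcode"
    dr = (instr >> 9) & 0x7
    sr1 = (instr >> 6) & 0x7
    if ((instr >> 5) & 1) == 0:
        operand = regs[instr & 0x7]          # register form: SR2
    else:
        operand = sext(instr & 0x1F, 5)      # immediate form: SEXT(imm5)
    regs[dr] = (regs[sr1] + operand) & 0xFFFF
    return setcc(regs[dr])


regs = [0] * 8
regs[2], regs[3] = 5, 7
cc_reg = execute_add(0x1283, regs)   # ADD R1, R2, R3
r1_after_reg = regs[1]
cc_imm = execute_add(0x127F, regs)   # ADD R1, R1, #-1
r1_after_imm = regs[1]
print(r1_after_reg, cc_reg, r1_after_imm, cc_imm)
```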

#### LDI DR, LABEL

# Encoding

| Bits 15-12 | 11-9 | 8-0       |
|------------|------|-----------|
| 1010       | DR   | PCoffset9 |

# Operation

DR = mem[mem[PC<sup>†</sup> + SEXT(PCoffset9)]]; setcc();
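A behavioral sketch of LDI makes the double indirection explicit. The names `execute_ldi` and `sext` and the dictionary-based memory are illustrative assumptions, and `incremented_pc` stands for the already-incremented PC. The function also returns the number of data-memory reads it performed.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def execute_ldi(instr, incremented_pc, mem, regs):
    """Execute LDI; update regs in place and return the data reads used."""
    assert ((instr >> 12) & 0xF) == 0b1010, "not an LDI opcode"
    dr = (instr >> 9) & 0x7
    pointer_addr = (incremented_pc + sext(instr & 0x1FF, 9)) & 0xFFFF
    pointer = mem[pointer_addr]   # first data-memory read: fetch the address
    regs[dr] = mem[pointer]       # second data-memory read: fetch the value
    return 2


mem = {0x3003: 0x4000, 0x4000: 42}
regs = [0] * 8
reads = execute_ldi(0xA602, 0x3001, mem, regs)   # LDI R3, offset +2
print(regs[3], reads)
```

Each LDI reads data memory twice, which is worth keeping in mind when mapping it onto a datapath that has a single DMEM access per instruction.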

#### LD DR, LABEL

### Encoding

| Bits 15-12 | 11-9 | 8-0       |
|------------|------|-----------|
| 0010       | DR   | PCoffset9 |

### Operation

DR = mem[PC<sup>†</sup> + SEXT(PCoffset9)]; setcc();

#### BRn LABEL, BRz LABEL, BRp LABEL, BR<sup>†</sup> LABEL, BRzp LABEL, BRnp LABEL, BRnz LABEL, BRnzp LABEL

### Encoding

| Bits 15-12 | 11 | 10 | 9 | 8-0       |
|------------|----|----|---|-----------|
| 0000       | n  | z  | p | PCoffset9 |

### Operation

if ((n AND N) OR (z AND Z) OR (p AND P)) PC = PC<sup>‡</sup> + SEXT(PCoffset9);
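The BR operation above can be sketched in the same style. The names `execute_br` and `sext` are illustrative, the condition code is assumed to be held as one of the strings "N", "Z", or "P", and `incremented_pc` stands for the already-incremented PC.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def execute_br(instr, incremented_pc, cc):
    """Return the next PC for a BR instruction given the condition code."""
    assert ((instr >> 12) & 0xF) == 0b0000, "not a BR opcode"
    n = (instr >> 11) & 1
    z = (instr >> 10) & 1
    p = (instr >> 9) & 1
    taken = (n and cc == "N") or (z and cc == "Z") or (p and cc == "P")
    if taken:
        return (incremented_pc + sext(instr & 0x1FF, 9)) & 0xFFFF
    return incremented_pc


# BRz with PCoffset9 = -2, evaluated with the Z flag set and with P set.
pc_taken = execute_br(0x05FE, 0x3005, "Z")
pc_not_taken = execute_br(0x05FE, 0x3005, "P")
print(hex(pc_taken), hex(pc_not_taken))
```

Note that the target address is a sum of the PC and a sign-extended offset, which is why the ALU can be reused to compute it.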