- 1. Definitions
- 2. Goals
- 3. Types



A system, generally:

- component = internally highly connected, not easily separable, functionally distinct
- communication = relatively sparse connections between components

Hierarchy of:

- Structure: grouping of components
- Purpose: why, what-for

Surfaces are logical and/or time-like and/or space-like.

Boundaries are drawn arbitrarily, by us, for some specific purpose.

- Slow communication between components (physical distance)
- Less common dependency, more autonomy (no shared, common )
- Many components ≈ peers

(single component)



e.g. single core Matrix mult
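
To make the single-core case concrete, here is a minimal sketch of matrix multiplication as one instruction stream working on one data element at a time. The function name `matmul` and the list-of-rows representation are illustrative assumptions, not something the notes prescribe.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows, one scalar multiply-add at a time."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * m for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(m):      # columns of b
            for p in range(k):  # one multiply-add per step
                c[i][j] += a[i][p] * b[p][j]
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Every iteration of the innermost loop is independent of the other `(i, j)` pairs, which is what the parallel variants later in the notes exploit.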



SISD



SIMD



Scaling

ASSUME:

$$W_{1} = W_{2} = W = W_{fixed} + W_{improvable} = (1-f)W + fW$$

$M_{2}$ is $n$ times faster than $M_{1}$ at doing $W_{improvable} = fW$:

$$V_{fixed} = V_{improvable} = V \qquad \text{for } M_{1}$$

$$V_{fixed} = V, \quad V_{improvable} = nV \qquad \text{for } M_{2}$$

Speedup of $M_{2}$ over $M_{1}$:

$$S_{2-1} = \frac{V_{2}}{V_{1}} = \frac{W_{2}/T_{2}}{W_{1}/T_{1}} = \frac{T_{1}}{T_{2}}$$

$$= \frac{(1-f)W/V + fW/V}{(1-f)W/V + fW/(nV)}$$

$$= \frac{1}{(1-f) + \frac{f}{n}}$$

$$= \frac{n}{n(1-f) + f} \xrightarrow{f \to 1} \frac{n}{0+1} = n$$

E.g. grid computation, 10% serial ($f = 0.9$), 1 processor → 4 processors ($n = 4$):

$$S_{2-1} = \frac{4}{4(1-0.9) + 0.9} = \frac{4}{1.3} \approx \frac{4}{4/3} = 3 < 4$$
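
The speedup formula and the grid example can be checked numerically. This is a small sketch; the function name `amdahl_speedup` is an assumption, with `f` the improvable fraction of the work and `n` the factor by which that fraction is sped up, as in the notes.

```python
def amdahl_speedup(f, n):
    """Speedup when a fraction f of the work runs n times faster."""
    if not 0.0 <= f <= 1.0 or n <= 0:
        raise ValueError("need 0 <= f <= 1 and n > 0")
    return 1.0 / ((1.0 - f) + f / n)


# Grid example: 10% serial (f = 0.9), 4 processors.
print(round(amdahl_speedup(0.9, 4), 2))  # 3.08

# With a very large n the serial 10% dominates and the speedup approaches 1/(1-f).
print(amdahl_speedup(0.9, 10**9))
```

The exact value for the grid example is 4/1.3 ≈ 3.08, which the notes round to 3.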

Caveat: serial work for M1 < serial work for M4 (coordination, dividing data)

$$W_{1} = W_{serial} + W_{parallel} = W \qquad \Rightarrow \qquad T_{1} = \frac{W}{V}$$

With $n$ processors, add an overhead $hW$, where $h$ is a fraction of the total work $W$:

$$W_{2} = W_{serial} + hW + W_{parallel}$$

$$T_{2} = \frac{W_{serial} + hW}{V} + \frac{W_{parallel}}{V \cdot n}$$

$$= \frac{(1 - f)W + hW}{V} + \frac{fW}{V \cdot n} \qquad \Rightarrow \qquad S_{2-1} = \frac{T_{1}}{T_{2}} = \frac{1}{(1 - f) + h + f/n}$$
Here h contributes to the serial portion. Could contribute to parallel part.
$$h = h(n)$$
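
To see what an overhead that grows with $n$ does to the speedup, here is a sketch with the overhead charged to the serial portion. The overhead model `h_log` (logarithmic growth with a constant `c`) is purely hypothetical and chosen only for illustration; the notes say nothing about the shape of $h(n)$.

```python
import math


def speedup_with_overhead(f, n, h):
    """S = 1 / ((1 - f) + h + f/n), with serial overhead h as a fraction of W."""
    return 1.0 / ((1.0 - f) + h + f / n)


def h_log(n, c=0.01):
    """Hypothetical overhead model: grows with log2(n), zero for one processor."""
    return c * math.log2(n)


for n in (1, 4, 16, 64, 256, 1024):
    print(n, round(speedup_with_overhead(0.9, n, h_log(n)), 2))
```

With this made-up $h(n)$ the speedup first rises and then falls again, once the growing overhead outweighs the shrinking $f/n$ term.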


Strong scaling

But suppose $h(n) < 0$, with $h \in (-1, 0]$?

Less serial work needed w/ n processors

$$S_{2-1} = \frac{1}{(1-f+h) + f/n} \xrightarrow{h \to (f-1)} \frac{n}{f} = O(n)$$

$$W_{parallel} \rightarrow fW + hW$$

Less parallel work needed w/ n processors

$$S_{2-1} = \frac{1}{(1-f) + (f+h)/n} \xrightarrow{h \to -f} \frac{1}{1-f}$$
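
The two limits for negative $h$ can be checked numerically. This is a sketch; the function names are assumptions, and the values `f = 0.9`, `n = 64` are arbitrary example inputs.

```python
def speedup_serial_saving(f, n, h):
    """Negative h reduces the serial part: S = 1 / ((1 - f + h) + f/n)."""
    return 1.0 / ((1.0 - f + h) + f / n)


def speedup_parallel_saving(f, n, h):
    """Negative h reduces the parallel part: S = 1 / ((1 - f) + (f + h)/n)."""
    return 1.0 / ((1.0 - f) + (f + h) / n)


f, n = 0.9, 64
print(speedup_serial_saving(f, n, h=f - 1.0))  # serial part gone: n/f
print(speedup_parallel_saving(f, n, h=-f))     # parallel part gone: 1/(1-f)
```

Removing serial work lets the speedup grow linearly in $n$, while removing parallel work still leaves it capped by the serial fraction.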

Weak scaling

$$W = W_{fixed} + n \cdot W_{improvable} \qquad (\text{or generally } \Gamma(n) \cdot W_{improvable})$$

$$T_{2} = \frac{W_{fixed}}{V} + \frac{n \cdot W_{improvable}}{nV}$$
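
A sketch of the resulting scaled speedup follows. It assumes that $M_{1}$ would run the same scaled problem entirely at speed $V$, and that $M_{2}$ runs the improvable part at $nV$. Under those assumptions the ratio $T_{1}/T_{2}$ simplifies to $(1-f) + nf$, the form usually called Gustafson's law. Here `f` is the improvable fraction at $n = 1$, and the function name is hypothetical.

```python
def weak_scaling_speedup(f, n):
    """Scaled speedup T1/T2 when the improvable work grows n-fold with n processors."""
    w_fixed, w_improvable, v = 1.0 - f, f, 1.0      # normalise W = 1 and V = 1 at n = 1
    t1 = (w_fixed + n * w_improvable) / v           # M1: whole scaled problem at speed V
    t2 = w_fixed / v + n * w_improvable / (n * v)   # M2: improvable part at speed n*V
    return t1 / t2


print(round(weak_scaling_speedup(0.9, 4), 2))  # 3.7
```

Unlike the fixed-size case, the speedup here keeps growing with $n$ because the fixed part becomes a smaller share of the scaled work.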

In general, all above effects can play a part (+ or - ).

- --- serial work for coordination, initialization
- --- parallel work for parallel portion, problem remapping
- --- communication overhead and resources
- --- memory and cache bandwidth effects
- --- cache coherency







Ω 8×4 interconnect





also, can broadcast:







Packet overhead

Header [~8 B], cargo [4-64 B]; HT: transaction / data link / physical
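
The cost of the header can be quantified as the fraction of transmitted bytes that is cargo. This sketch uses the header and cargo sizes quoted above and ignores any link- or physical-layer overhead. The function name `payload_efficiency` is an assumption.

```python
HEADER_BYTES = 8  # approximate header size quoted in the notes


def payload_efficiency(cargo_bytes, header_bytes=HEADER_BYTES):
    """Fraction of the transmitted bytes that is cargo rather than header."""
    return cargo_bytes / (cargo_bytes + header_bytes)


for cargo in (4, 16, 64):
    print(cargo, round(payload_efficiency(cargo), 2))  # 0.33, 0.67, 0.89
```

Small packets pay proportionally more for the header: a 4 B cargo uses only a third of the bytes for data, a 64 B cargo almost nine tenths.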







Low-latency response





non-shared memory, message passing
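
A minimal sketch of the message-passing style: the worker touches no shared variables and communicates only through explicit messages. Python threads and `queue.Queue` are used here only to keep the example self-contained. Real non-shared-memory systems would pass messages between processes or machines. The names `worker`, `inbox` and `outbox` are illustrative.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive numbers until the None sentinel arrives, then send back their sum."""
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:
            break
        total += msg
    outbox.put(total)


inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
for x in (1, 2, 3):
    inbox.put(x)      # send work as messages
inbox.put(None)       # sentinel: no more messages
t.join()
result = outbox.get()
print(result)  # 6
```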

Thread parallelism, process parallelism, task parallelism

Threads





Shared memory: synchronization through R/W
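
A small sketch of threads synchronizing through reads and writes to shared memory. Several threads increment one shared counter, and a lock makes each read-modify-write step atomic. The function name and the thread and increment counts are arbitrary assumptions.

```python
import threading


def count_with_threads(n_threads=4, increments=10_000):
    """Have n_threads increment one shared counter, guarded by a lock."""
    shared = {"counter": 0}
    lock = threading.Lock()

    def work():
        for _ in range(increments):
            with lock:                  # serialise the read-modify-write
                shared["counter"] += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]


print(count_with_threads())  # 40000
```

Without the lock, two threads could read the same old value and one update would be lost.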





- --- Copy/Save/Restore state
- --- Shadow registers, renaming
- --- TLB content (separate page tables? or shared?)
  - --- hardware switch
  - --- TID, thread ID labeled

--- MULTIPLE THREADS from MULTIPLE PROCESSES

--- PID + TID

--- Larger state to consider (page tables, file and IO tables and buffers)

OS policy not known to the HW designer?



[Figures: execution over time; (Simultaneous) (Hyper) Threading]



### Single threaded execution

#### Multi-tasking

- ---- Multiple concurrent execution (not simultaneous)
- ---- Memory shared but separate (virtual)
- ---- CPU time-multiplexed --- cooperatively, pre-emptively, IO
- ---- Process context switching drain/fill (pipes, caches, TLB, ...)
- ---- Extract ILP from single stream
- ---- Unused issue slots
  - --- pipeline bubbles/stalls



Credits: Introduction to Multithreading, Superthreading and Hyperthreading By Jon Stokes

## SMP, Symmetric Multi-Processing

- ---- Context switch per CPU
- ---- Simultaneous execution --- multiple programs/processes/threads
- ---- ILP extracted per process
  - --- double silicon resources
  - --- same NOP density
    - ==> Could speedup be > 2?



Single-threaded SMP

### Multi-Threaded (Superthreading)

- ---- Concurrent process scheduling
  - --- process context switching
  - --- cooperative, pre-emptive, ...
- ---- Single process, multiple thread execution
- ---- Time-multiplexed thread scheduling --- from same thread
- ---- Instructions issue from single thread
  - --- thread context switch
    - -- in HW
    - -- per stage
    - -- across stages
- ---- Execution slots filled
  - --- due to stalled threads
  - --- filled from non-stalled threads
  - --- Lower density of NOPs



## SMT, Simultaneous Multi-(Hyper)-Threading

- ---- Concurrent Processes --- context switching
- ---- Thread context switching
  - --- independently on different pipes
  - --- issue from multiple threads simultaneously
- ---- Average ILP = 2.5, empirically
  - --- max single-thread issue = 4 (here)
  - --- combined ILP ==> 4
- ---- Logical Processors == 2
- ---- Lowest NOP-density
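
The difference in NOP density between the three schemes can be sketched with a toy issue-slot model. Everything here is an assumption made for illustration: the per-cycle ready-instruction counts are invented, and the superthreading policy (pick the thread with more ready instructions each cycle) is just one possible choice. Only the issue width of 4 comes from the notes.

```python
WIDTH = 4  # issue slots per cycle (max single-thread issue in the notes)

# Hypothetical number of ready instructions per cycle for two threads (0 = stalled).
THREAD_A = [3, 0, 2, 4, 0, 1, 3, 2]
THREAD_B = [2, 3, 0, 1, 4, 2, 0, 3]


def utilization(issued_per_cycle):
    """Fraction of issue slots filled; NOP density is 1 minus this."""
    return sum(issued_per_cycle) / (WIDTH * len(issued_per_cycle))


def single_threaded(a):
    """Only thread A issues; stalls leave whole cycles empty."""
    return [min(x, WIDTH) for x in a]


def superthreaded(a, b):
    """One thread per cycle: issue from whichever thread has more ready."""
    return [min(max(x, y), WIDTH) for x, y in zip(a, b)]


def smt(a, b):
    """Both threads may issue in the same cycle, up to the issue width."""
    return [min(x + y, WIDTH) for x, y in zip(a, b)]


print(utilization(single_threaded(THREAD_A)))          # 0.46875
print(utilization(superthreaded(THREAD_A, THREAD_B)))  # 0.75
print(utilization(smt(THREAD_A, THREAD_B)))            # 0.84375
```

In this toy trace superthreading hides the stalled cycles, and SMT additionally fills slots that a single thread leaves unused within a cycle.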

Modification of existing O-O-O (out-of-order) CPU => 10% added cost, p > 2





64-bit result

What for? 16-bit sound DSP? Intel MMX => added to ISA: larger

1) vectors (more elements)
2) elements (more bits)
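
To illustrate what a packed 64-bit result means for 16-bit data, here is a sketch that emulates a lane-wise add of four 16-bit elements held in one 64-bit integer. It resembles a wraparound packed-word add of the kind MMX provides, but it is an emulation in plain Python, not MMX code. The helper names `pack`, `unpack` and `packed_add` are assumptions.

```python
LANES, BITS = 4, 16       # four 16-bit elements in one 64-bit word
MASK = (1 << BITS) - 1


def pack(values):
    """Pack four unsigned 16-bit values into one 64-bit integer, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * BITS)
    return word


def unpack(word):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]


def packed_add(x, y):
    """Lane-wise add with wraparound; no carry crosses a lane boundary."""
    return pack([(a + b) & MASK for a, b in zip(unpack(x), unpack(y))])


a = pack([1, 2, 3, 0xFFFF])
b = pack([10, 20, 30, 1])
print(unpack(packed_add(a, b)))  # [11, 22, 33, 0]
```

One operation on the 64-bit word does the work of four 16-bit additions, which is the appeal for workloads such as 16-bit audio samples.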

