## Cache: a problem

- Aggregate peak bandwidth grows with # cores:
  - Intel Core i7 can generate two data references per core per clock
  - Four cores and 3.2 GHz clock
    - 25.6 billion 64-bit data references/second +
    - 12.8 billion 128-bit instruction references/second
    - $= 409.6 \, \text{GB/s!}$
  - DRAM bandwidth is only 6% of this (25 GB/s); the other 94% has to be served on chip (Amdahl)
- Requires:
  - Multi-port, pipelined caches
  - Two levels of cache per core
  - Shared third-level cache on chip

Per core: 3.2 GHz × ((2 data refs @ 8 B) + (1 instr ref @ 16 B)) ≈ 100 GB/sec. × 4 cores: 400 GB/sec.

Programs ignore this. Can we help? Maybe. Is there locality?

#### Memory Reference Patterns

Pick a time window size w. In time span w, are there:

- Multiple references to nearby addresses: Spatial Locality
- Repeated references to a set of locations: Temporal Locality

Take advantage of behavior patterns, if stable patterns last long enough (?)

Size of locality depends on w:

- w ==> total execution time: everything is local
- w ==> one instruction time: a single address is local

Trade-off:

- Short time ===> Small set
- Long time ===> Large set
- Larger gap in access time ===> hide latency

#### **Technology Tradeoffs**

- Large set, many bits ===> Bad: (Bandwidth, Latency), Good: (\$, Area, Watts) per bit
- Small set, few bits ===> Good: (Bandwidth, Latency), Bad: (\$, Area, Watts) per bit
- small w -> fast set turnover -> more bandwidth (low latency)
- large w -> slow set turnover -> less bandwidth (high latency)

#### We hope

- Most changes in S0 refer to items in S1
- Most changes in S1 refer to items in S2
- etc. ... less bandwidth required, latency overlapped or hidden

## Exploiting the Memory Hierarchy

#### Approach 1 (Cray, others): Expose Hierarchy

- Registers, Main Memory, Disk each available as storage alternatives
- Tell programmers: "Use them cleverly"
- Programs do manage cache effects

#### Approach 2: Hide Hierarchy

- Programming model: SINGLE kind of memory, single address space.
- Machine AUTOMATICALLY assigns locations to fast or slow memory, depending on usage patterns.
- Programs do not take cache effects into account; they hope for the best.

Who manages moving data:

| Transfer | Managed by |
|---|---|
| Register loading/unloading | compiler |
| L1, L2, L3 | cache controllers |
| Memory/disk | OS software, disk controllers |

(6.004 - Spring 2009)

Transfer a block at a time:

- latency for 1st word
- remainder at bandwidth rate, hopefully
- Block size varies from level to level (2X)
- Pay delay for block transfer, but what if the other words are never used?

#### Miss rate

Fraction of cache accesses that result in a miss.

Causes of misses (miss_i = not found in level i):

- Compulsory: first reference to a block ⇒ no choice, 1st reference (prefetch?)
- Capacity: blocks discarded and later retrieved ⇒ couldn't keep in cache, but wanted to
- Conflict: program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

$$\text{CPI}_{\text{penalty}}\ (\text{cycles}) = MR \cdot T_{\text{penalty}}\ (\text{sec}) \cdot CR\ (\text{cycles/sec})$$

$$HR = (1 - MR) = \frac{N_{hit}}{N_{access}}$$

Metrics:

$$\text{AMAT} = (\text{hit rate})(\text{hit time}) + (\text{miss rate})(\text{miss time})$$

$$= (1 - MR)\,T_{hit} + MR\,(T_{access} + T_{hit})$$

$$= T_{hit} + MR \cdot T_{access}$$

where $T_{access}$ is the miss penalty. AMAT can be w.r.t. global performance or level i performance. What's important?

## How Processor Handles a Miss

**Hit**

- Assume that cache access occurs in 1 cycle: no processor stall
- Hit is great, and basic pipeline is fine
- CPI penalty = miss rate × miss penalty (= 0 when every access hits)

**Miss**

A miss stalls the pipeline (for an instruction or data miss):

- Stall the pipeline (you don't have the data it needs): processor frozen
- Send the address that missed to the memory
- Instruct main memory to perform a read and wait
- When access completes, return the data to the processor (load)
- Restart the instruction: unfreeze processor, continue, hit L1

We can generalize: **A Turing Machine Tape**. R/W head moves L or R, copy a region at a time. Cost is proportional to distance and size of region copied.
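The AMAT and CPI-penalty metrics defined above can be tried out with a few lines of Python. This is only a minimal sketch: the function names `amat` and `cpi_penalty` and the numbers in the usage lines are illustrative assumptions, not values from the notes at this point.

```python
def amat(t_hit, miss_rate, t_penalty):
    """Average memory access time: T_hit + MR * T_penalty.

    t_hit and t_penalty must use the same unit (cycles or ns).
    Equivalent to (1 - MR) * T_hit + MR * (T_penalty + T_hit).
    """
    return t_hit + miss_rate * t_penalty


def cpi_penalty(miss_rate, t_penalty_sec, clock_rate_hz):
    """Extra cycles per access: MR * T_penalty (sec) * CR (cycles/sec)."""
    return miss_rate * t_penalty_sec * clock_rate_hz


# Illustrative numbers (assumed): 1-cycle hit, 5% miss rate, 100-cycle penalty.
print(amat(t_hit=1, miss_rate=0.05, t_penalty=100))

# Illustrative numbers (assumed): 2% miss rate, 100 ns penalty, 4 GHz clock.
print(cpi_penalty(miss_rate=0.02, t_penalty_sec=100e-9, clock_rate_hz=4e9))
```

Keeping the hit time and the penalty in the same unit is the main thing to watch; the second function does the seconds-to-cycles conversion explicitly through the clock rate.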
#### **Cache Organization and Methods**

Big Memory, Small Cache ===> Block Mapping (how to place blocks in cache)

- Associative: anything goes anywhere, check contents (contains address); complex + expensive (area, power)
- Direct Mapped: like a Reg File, but words are blocks; simple + fast, but too restrictive placement?
- **Set Associative**: hybrid of Associative and Direct Mapped

#### **Some Block Parameters**

- How big? Spatial locality captured by fetching neighboring data/instructions.
- Replace what when? Working set captures temporal locality.
- Writing, when, where? Change locally or globally, maintain correct program behavior.

Placement:

- Location in cache determined by (main) memory address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)

We use TAG bits to identify which block. But what about at startup?

- Content is random
- Boot process initializes valid bit (V = 0)

8 blocks, initial state:

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000 | N | ? | ? |
| 001 | N | ? | ? |
| 010 | N | ? | ? |
| 011 | N | ? | ? |
| 100 | N | ? | ? |
| 101 | N | ? | ? |
| 110 | N | ? | ? |
| 111 | N | ? | ? |

#### **Example:** DM, 32-bit address, byte-addressable, 1-word blocks (32-bit word = 4-byte block)

Assumptions

- 32-bit address
- 4 Kbyte cache
- 1024 blocks, 1 word/block

#### Steps

1. Use index to read V, tag from cache
2. Compare read tag with tag from address
3. If match, return data & hit signal
4. Otherwise, return miss

Need only compare the upper 20 bits as tag; the index bits are the same for any item in the same slot.

Access sequence:

- `LW R1, <address = 1100110>`
- `LW R2, <address = 0101110>`
- `SW R3, <address = 1100110>`
- `SW R4, <address = 0101110>`
- `LW R5, <address = 1100110>`

#### Thrashing

Each access evicts something needed later, or causes a miss. Worse than no cache!
Thrashing can happen at any level or type of caching:

- Direct Mapped: Conflicts (as above)
- Fully Associative: Capacity
- e.g., Virtual Memory Page Thrashing

Consider the following example code:

```
double a[8192], b[8192], c[8192];   /* 8 K x 64 b each */

void vector_sum() {
    int i;
    for (i = 0; i < 8192; i++)
        c[i] = a[i] + b[i];
}
```

- Arrays a, b, and c will tend to conflict in small caches
- Code will get cache misses with every array access (3 per loop)
- Spatial locality savings from blocks will be eliminated
- How can the severity of the conflicts be reduced?

How can we fix this?

- Bigger cache? How big?
  - Each array is $2^{16}$ B, so index + offset > 17 bits (recall, c also)
  - size $\geq 2^{18}$ B = 256 kB, 15-bit index
- How to make the system crawl, worst case? Let's have a contest!
- Programmer's mistake? How much is the programmer responsible for?
- Portable code, different architectures?
- Irregular data layouts a solution?
- Compiler's responsibility?

## Block Size Effects

8 kB cache, 32-bit address.

- Each cache line = [tag bits][data block bits]
- Total cache size = (# lines) × (# tag bits + # data bits)
- Storage overhead = (total # tag bits) / (total # data bits)

One-word blocks:

$$8\ \text{kB} = (2^{10}\ \text{blocks}) \times (1\ \text{word/block}) \times (8\ \text{B/word}), \qquad k = 10, \quad b = 3$$

$$\text{tag} = 32 - (10 + 0 + 3) = 19\ \text{bits}$$

$$\text{overhead} = \frac{19}{2^{6}\ \text{bits/block}} \approx \frac{1}{3}$$

vs. 16-word blocks (4 word-select bits):

$$8\ \text{kB} = (2^{6}\ \text{blocks}) \times (2^{4}\ \text{words/block}) \times (8\ \text{B/word}), \qquad k = 6, \quad b = 3$$

$$\text{tag} = 32 - (6 + 4 + 3) = 19\ \text{bits}$$

$$\text{overhead} = \frac{19}{2^{4} \times 2^{6}\ \text{bits/block}} = \frac{19}{1024} \approx \frac{1}{50}$$

Amortized latency per word ===> 1/16, with 1/50 overhead, provided there is spatial locality.

## Block Size vs. Performance

- Larger block sizes take advantage of spatial locality
- They also incur a larger miss penalty, since it takes longer to transfer the block into the cache
- Large blocks can also increase the average time or the miss rate
- Tradeoff in selecting block size
- Average Access Time = Hit Time × (1 − MR) + Miss Time × MR, where Miss Time = Hit Time + Miss Penalty

Averaged over a selection of programs: your performance may be different.
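Returning to the `vector_sum` loop, a small simulation makes the conflict and block-size effects countable. This is a sketch under stated assumptions: the function name `count_misses` is hypothetical, the three arrays are assumed to sit back to back starting at address 0, the cache is direct mapped, and the store to `c[i]` is assumed to allocate a line on a miss.

```python
def count_misses(cache_bytes, block_bytes, n=8192, elem_bytes=8):
    """Count misses for c[i] = a[i] + b[i] on a direct-mapped cache."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines                  # None plays the role of V = 0
    array_bytes = n * elem_bytes
    bases = [0, array_bytes, 2 * array_bytes]  # a, b, c back to back (assumed layout)
    misses = 0
    for i in range(n):
        for base in bases:                     # touch a[i], b[i], then c[i]
            block = (base + i * elem_bytes) // block_bytes
            index = block % num_lines          # (block address) modulo (#blocks)
            tag = block // num_lines
            if tags[index] != tag:             # empty line or wrong block: miss
                misses += 1
                tags[index] = tag
    return misses


total = 3 * 8192
for cache_kb, block in [(64, 64), (256, 64), (256, 8)]:
    m = count_misses(cache_kb * 1024, block)
    print(f"{cache_kb:3d} kB cache, {block:2d}-B blocks: {m} misses of {total} accesses")
```

With a 64 kB cache every one of the three accesses per iteration evicts a line the next access needs. At 256 kB the conflicts disappear and only the first touch of each 64-byte block misses. Shrinking the blocks to one 8-byte word removes the spatial-locality benefit again, even though nothing conflicts.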
## Fully-assoc. vs. Direct-mapped

#### Fully-associative N-line cache:

- N tag comparators, registers used for tag/data storage (\$\$\$)
- Location A might be cached in any one of the N cache lines: no restrictions!
- Replacement strategy (e.g., LRU) used to pick which line to use when loading new word(s) into cache
- PROBLEM: Cost

#### Direct-mapped N-line cache:

- 1 tag comparator, SRAM used for tag/data storage (\$)
- Location A is cached in a specific line of the cache determined by its address; address "collisions" possible
- Replacement strategy not needed: each word can only be cached in one specific cache line
- PROBLEM: Contention!

### Cost vs Contention

Two observations...

- Probability of collision diminishes with cache size.
- ... so let's build HUGE direct-mapped caches, using cheap SRAM!

Insight: an N-way set-associative cache affords modest parallelism

- parallel lookup (associativity): restricted to a small set of N lines
- modest parallelism deals with most contention at modest cost
- can implement using N direct-mapped caches, running in parallel

## Set Associative Cache

### E.G.

- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
- Block access sequence: 0, 8, 0, 6, 8 (3 different block addresses)

Fully associative (the whole block address is the tag, no index; any block can be used):

| Time | Block address | Low 2 bits | Hit/miss | Cache content after access |
|------|---------------|------------|----------|----------------------------|
| 1 | 0 | 00 | miss | Mem[0] |
| 2 | 8 | 00 | miss | Mem[0], Mem[8] |
| 3 | 0 | 00 | hit | Mem[0], Mem[8] |
| 4 | 6 | 10 | miss | Mem[0], Mem[8], Mem[6] |
| 5 | 8 | 00 | hit | Mem[0], Mem[8], Mem[6] |

Higher associativity ===> bigger tags (overhead?)

Associativity vs. miss rate:

- 8-way is (almost) as effective as fully-associative
- Rule of thumb: N-line direct-mapped == N/2-line 2-way set assoc.
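The three 4-block organizations can be compared on the access sequence 0, 8, 0, 6, 8 with a short simulation. This is a sketch, not a hardware model: the function name `simulate` is hypothetical, replacement is assumed to be true LRU, and each set stores whole block addresses rather than separate tag bits.

```python
from collections import OrderedDict


def simulate(block_addresses, num_blocks, ways):
    """Return a list of 'hit'/'miss' outcomes for an LRU set-associative cache.

    ways == 1 gives a direct-mapped cache; ways == num_blocks gives a
    fully associative one.
    """
    num_sets = num_blocks // ways
    # One OrderedDict per set, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    outcomes = []
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)         # mark as most recently used
            outcomes.append("hit")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)    # evict the least recently used block
            lines[block] = True
            outcomes.append("miss")
    return outcomes


sequence = [0, 8, 0, 6, 8]
for name, ways in [("direct mapped", 1), ("2-way", 2), ("fully associative", 4)]:
    result = simulate(sequence, num_blocks=4, ways=ways)
    print(f"{name:18s} {result}  misses = {result.count('miss')}")
```

The fully associative run reproduces the hit/miss column of the table above. The direct-mapped run shows blocks 0 and 8 repeatedly evicting each other at index 0, which is the contention problem described earlier.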
## A different Job mix

- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
- MR:
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

Diminishing returns? 2.5% improvement, is that significant? What's the metric?

- Compare (MR × Miss Penalty) == actual improvement
- performance / \$ ? If the \$ increment is small ==> bigger N.

Ratio of average access times when the miss rate changes by a factor $x$ (with $T_h$ the hit time and $T_p$ the miss time):

$$\frac{T_h (1 - MR) + MR \, T_p}{T_h (1 - x\,MR) + x\,MR \, T_p} = \frac{T_h + MR\,(T_p - T_h)}{T_h + x\,MR\,(T_p - T_h)}$$

#### Replacement Methods

- Which line do you replace on a miss?
- Direct Mapped
  - Easy, you have only one choice
  - Replace the line at the index you need
- N-way Set Associative
  - Need to choose which way to replace
  - Random (choose one at random)
  - Least Recently Used (LRU) (the one used least recently)
    - Often difficult to calculate, so people use approximations. Often these are really "not recently used": wasn't used since I last looked

C. Kozyrakis, EE 108b, Lecture 12

## Handling of WRITES

What's our workload?

- How many READS
- How many WRITES
- How many READS after WRITES

Observation: Most (90+%) of memory accesses are READs. How should we handle writes?

Issues:

- Write-through: CPU writes are cached, but also written to main memory (stalling the CPU until the write is completed). Memory always holds "the truth".
- Write-behind: CPU writes are cached; writes to main memory may be buffered, perhaps pipelined. CPU keeps executing while writes are completed (in order) in the background.
- Write-back: CPU writes are cached, but not immediately written to main memory. Memory contents can be "stale".

Our cache thus far uses write-through. Can we improve write performance?
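As a rough way to compare the write policies above, one can count how many writes reach main memory for the same access stream. The sketch below makes several assumptions: the function name `memory_writes` is hypothetical, the cache is direct mapped with write-allocate, write-through counts one memory write per store, write-back counts one block write per dirty eviction, and dirty lines still in the cache at the end are not counted.

```python
def memory_writes(accesses, num_lines, policy):
    """Count writes to main memory for a direct-mapped, write-allocate cache.

    accesses: list of (op, block_address) pairs with op "R" or "W".
    policy:   "write-through" or "write-back".
    """
    tags = [None] * num_lines
    dirty = [False] * num_lines
    writes = 0
    for op, block in accesses:
        index, tag = block % num_lines, block // num_lines
        if tags[index] != tag:                       # miss: replace the line
            if policy == "write-back" and dirty[index]:
                writes += 1                          # dirty victim goes to memory
            tags[index] = tag
            dirty[index] = False
        if op == "W":
            if policy == "write-through":
                writes += 1                          # every store reaches memory
            else:
                dirty[index] = True                  # memory is stale until eviction
    return writes


# Ten stores to block 0, then a read of block 4, which maps to the same line.
stream = [("W", 0)] * 10 + [("R", 4)]
for policy in ("write-through", "write-back"):
    print(policy, memory_writes(stream, num_lines=4, policy=policy))
```

The counts are not directly comparable in bytes: a write-through write is one word, while a write-back eviction transfers a whole block.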
- Interesting observation: the processor does not need to "wait" until the store completes (no stall)

#### Write Through

- Replacement: easy, clobber the line (memory always updated ==> consistent)
- Memory bandwidth: high, every write goes to memory (as if not using a cache), but only 1-word writes
- Processor: stalls on every write
- Simple, cheap

#### **Write Back**

- Memory inconsistent until replacement (but, multi-processors?); need dirty bit
- Memory bandwidth: lower load, multiple writes to a cache block; n-word writes (blocks), but block writes are pipelined and efficient
- Processor: stalls for a write only when a dirty block is replaced

Write buffer:

- Use a write buffer between cache and memory
  - Processor writes data into the cache and the write buffer
  - Memory controller slowly "drains" buffer to memory
- Write buffer: a first-in-first-out buffer (FIFO)
  - Typically holds a small number of writes
  - Can absorb small bursts as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM

## write-through w/ buffer, Read Miss?

Where should we look for data?

- in buffer?
- in memory?
- how do we search the buffer? Stall if not empty?
- write-back w/ dirty replacement

## Write Miss - Typical Choices

### Be Careful, Even with Write Hits

- Reading from a cache
  - Read (tag, data)
  - If it hits, return the data, else go to lower level
- Writing a cache can take more time
  - First read (tag) (stall or no stall)
  - Then overwrite data on a hit (access 2)
  - Otherwise, you may overwrite dirty data or write the wrong cache way
- Can you ever access tag and write data in parallel? (write-through?)

### **Splitting Caches**

### Multilevel Caches

- Primary (L1) caches attached to CPU (IMEM, DMEM)
  - Small but fast
  - Focusing on hit time rather than hit rate
- Level-2 cache services misses from primary cache
  - Larger, slower but still faster than main memory
  - Unified instruction and data (why?)
  - Focusing on hit rate rather than hit time (why?)
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

## E.G. w/o L2

- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns

$$1\ \text{cycle} \rightarrow \frac{1}{4\,\text{G}}\ \text{sec} = 0.25\ \text{ns}$$

$$\text{miss penalty} = 100\ \text{ns} \left( \frac{1\ \text{cycle}}{\frac{1}{4}\ \text{ns}} \right) = 400\ \text{cycles}$$

- With just a primary (L1) cache
  - Miss penalty = 100ns/0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

$$\overline{\text{CPI}} = (98\%)(1\ \text{cycle for hit}) + (2\%)(400\ \text{cycle stall} + 1\ \text{cycle})$$

$$= 0.98 + 0.02(400) + 0.02(1) = 1 + 0.02(400) = 9$$

### E.G. w/ L2

- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns/0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6

With $N_1$ instructions that hit in L1, $N_2$ that miss in L1 but hit in L2, and $N_3$ that miss in both:

$$\text{CPI} = \frac{\#\ \text{cycles}}{N\ \text{instructions}} = \left(\frac{1}{N}\right) \left[ (N_1 + N_2 + N_3)(1) + (N_2 + N_3)(20) + N_3(400) \right]$$

$$N_1 = 98\%\,N, \qquad N_3 = \tfrac{1}{2}\%\,N, \qquad N_2 = N - (N_1 + N_3) \implies (N_2 + N_3) = N - N_1 = 2\%\,N$$

$$= \left[ N(1) + 2\%\,N(20) + \tfrac{1}{2}\%\,N(400) \right] / N = 1 + 0.02(20) + 0.005(400) = 3.4$$

All: 64-B blocks, write-back/allocate.

| | Intel Nehalem P6 Quad | AMD Opteron X4 |
|---|---|---|
| L1 I-cache (per core) | 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a | 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles |
| L1 D-cache (per core) | 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 3 cycles |
| L2 unified cache (per core) | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time 9 cycles |
| L3 unified cache (shared) | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a | 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 38 cycles |

n/a: data not available

### Interface Signals

#### Cache Controller FSM

#### See, LC3-based cache projects:

- http://pages.cs.wisc.edu/~karu/courses/cs552/spring2009/wiki/index.php/Main/CacheModule
- http://www.ece.ncsu.edu/muse/courses/ece406spr09/labs/proj2/proj2 spr09.pdf

# Memory Technologies

#### SRAM

- Requires low power to retain bit
- Requires 6 transistors/bit

#### DRAM

- Must be re-written after being read
- Must also be periodically refreshed
  - Every ~ 8 ms
  - Each row can be refreshed simultaneously
- One transistor/bit
- Address lines are multiplexed:
  - Upper half of address: row access strobe (RAS)
  - Lower half of address: column access strobe (CAS)

### Some optimizations:

- Multiple accesses to same row
- Synchronous DRAM
  - Added clock to DRAM interface
  - Burst mode with critical word first
- Wider interfaces
- Double data rate (DDR)
- Multiple banks on each DRAM device

[Figure: DRAM read timing: row address, then data out in burst mode (consecutive words)]

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM (× 2 transfers per clock)
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Four transfers per cycle
  - DDR × 2 data bus (in, out)
- DRAM module (DIMM)
  - Functions as a higher capacity, wider interface DRAM chip
  - Easier to manipulate, replace, ...

#### **DRAM Generations & Trends**

| Production year | Chip size | DRAM type | RAS, slowest DRAM (ns) | RAS, fastest DRAM (ns) | CAS / data transfer time (ns) | Cycle time (ns) |
|-----------------|-----------|-----------|-----|-----|-----|-----|
| 1980 | 64K bit | DRAM | 180 | 150 | 75 | 250 |
| 1983 | 256K bit | DRAM | 150 | 120 | 50 | 220 |
| 1986 | 1M bit | DRAM | 120 | 100 | 25 | 190 |
| 1989 | 4M bit | DRAM | 100 | 80 | 20 | 165 |
| 1992 | 16M bit | DRAM | 80 | 60 | 15 | 120 |
| 1996 | 64M bit | SDRAM | 70 | 50 | 12 | 110 |
| 1998 | 128M bit | SDRAM | 70 | 50 | 10 | 100 |
| 2000 | 256M bit | DDR1 | 65 | 45 | 7 | 90 |
| 2002 | 512M bit | DDR1 | 60 | 40 | 5 | 80 |
| 2004 | 1G bit | DDR2 | 55 | 35 | 5 | 70 |
| 2006 | 2G bit | DDR2 | 50 | 30 | 2.5 | 60 |
| 2010 | 4G bit | DDR3 | 36 | 28 | 1 | 37 |
| 2012 | 8G bit | DDR3 | 30 | 24 | 0.5 | 31 |

RAS: row access strobe. CAS: column access strobe.

1980 → 2012: chip size × $2^{17}$ (64K bit → 8G bit), CAS/data transfer time ÷ 150 (75 ns → 0.5 ns), cycle time only about ÷ 8 (250 ns → 31 ns).

#### Improving DRAM bandwidth (other than faster cycle time)

| Standard | Clock rate (MHz) | M transfers per second | DRAM name | MB/sec/DIMM | DIMM name |
|----------|------------------|------------------------|-----------|---------------|-----------|
| DDR | 133 | 266 | DDR266 | 2128 | PC2100 |
| DDR | 150 | 300 | DDR300 | 2400 | PC2400 |
| DDR | 200 | 400 | DDR400 | 3200 | PC3200 |
| DDR2 | 266 | 533 | DDR2-533 | 4264 | PC4300 |
| DDR2 | 333 | 667 | DDR2-667 | 5336 | PC5300 |
| DDR2 | 400 | 800 | DDR2-800 | 6400 | PC6400 |
| DDR3 | 533 | 1066 | DDR3-1066 | 8528 | PC8500 |
| DDR3 | 666 | 1333 | DDR3-1333 | 10,664 | PC10700 |
| DDR3 | 800 | 1600 | DDR3-1600 | 12,800 | PC12800 |
| DDR4 | 1066–1600 | 2133–3200 | DDR4-3200 | 17,056–25,600 | PC25600 |

Bandwidth per DIMM: roughly × 10 from DDR266 to DDR4-3200.

- DDR:
  - DDR2
    - Lower power (2.5 V -> 1.8 V)
    - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  - DDR3
    - 1.5 V
    - 800 MHz
  - DDR4
    - 1-1.2 V
    - 1600 MHz
- Graphics memory:
  - Achieves 2-5 X bandwidth per DRAM vs. DDR3
    - Wider interfaces (32 vs. 16 bit)
    - Higher clock rate
    - Possible because the chips are soldered on instead of sitting in socketed DIMM modules
- Memory is susceptible to cosmic rays
  - Soft errors: dynamic errors
    - Detected and fixed by error correcting codes (ECC)
  - Hard errors: permanent errors
    - Use spare rows to replace defective rows
  - Chipkill: a RAID-like error recovery technique

## **Increasing Memory Bandwidth**

#### **Bus Cycle Timing, 4-word Access**

## Six basic cache optimizations:

- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time
- Giving priority to read misses over writes
  - Reduces miss penalty
- Avoiding address translation in cache indexing
  - Reduces hit time