## Memory Hierarchy Based on H&P, MJI, VA

#### Review: Major Components of a Computer



#### **Processor-Memory Performance Gap**



# The Memory Hierarchy Goal

 Fact: Large memories are slow and fast memories are small

- How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
  - With hierarchy
  - With parallelism

## **A Typical Memory Hierarchy**

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology



## **Memory Technology**

- Static RAM (SRAM)
  - 0.5ns 2.5ns, \$2000 \$5000 per GB
- Dynamic RAM (DRAM)
  - 50ns 70ns, \$20 \$75 per GB
- Magnetic disk
  - 5ms 20ms, \$0.20 \$2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

## **Principle of Locality**

- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - E.g., sequential instruction access, array data

## **Taking Advantage of Locality**

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

## **Memory Hierarchy Levels**



- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses
      - = 1 hit ratio
  - Then accessed data supplied from upper level

#### **Characteristics of the Memory Hierarchy**



## **Cache Size**



# **Cache Memory**

- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X<sub>1</sub>, ..., X<sub>n-1</sub>, X<sub>n</sub>

| X <sub>4</sub>   |
|------------------|
| X <sub>1</sub>   |
| X <sub>n-2</sub> |
|                  |
| X <sub>n-1</sub> |
| X <sub>2</sub>   |
|                  |
| X <sub>3</sub>   |

| X <sub>4</sub>   |
|------------------|
| X <sub>1</sub>   |
| X <sub>n-2</sub> |
|                  |
| X <sub>n-1</sub> |
| X <sub>2</sub>   |
| X <sub>n</sub>   |
| X <sub>3</sub>   |

- How do we know if the data is present?
- Where do we look?

b. After the reference to  $X_n$ 

a. Before the reference to  $X_n$ 

## **Block Size Considerations**

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks ⇒ fewer of them
    - More competition ⇒ increased miss rate
  - Larger blocks ⇒ pollution
- Larger miss penalty
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help

## **Increasing Hit Rate**

- Hit rate increases with cache size.
- Hit rate mildly depends on block size.



#### **Cache Misses**

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

# Static vs Dynamic RAMs



# Random Access Memory (RAM)



## **Six-Transistor SRAM Cell**



# Dynamic RAM (DRAM) Cell



## **Advanced DRAM Organization**

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs

## **DRAM Generations**

| Year | Capacity | \$/GB     |
|------|----------|-----------|
| 1980 | 64Kbit   | \$1500000 |
| 1983 | 256Kbit  | \$500000  |
| 1985 | 1Mbit    | \$200000  |
| 1989 | 4Mbit    | \$50000   |
| 1992 | 16Mbit   | \$15000   |
| 1996 | 64Mbit   | \$10000   |
| 1998 | 128Mbit  | \$4000    |
| 2000 | 256Mbit  | \$1000    |
| 2004 | 512Mbit  | \$250     |
| 2007 | 1Gbit    | \$50      |



## **Average Access Time**

- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate x Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - $\blacksquare$  AMAT = 1 + 0.05 × 20 = 2ns
    - 2 cycles per instruction

## **Performance Summary**

- When CPU performance increased
  - Miss penalty becomes more significant
- Can't neglect cache behavior when evaluating system performance

## **Multilevel Caches**

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

#### Interactions with Advanced CPUs

- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation

## **Virtual Memory**

- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

## **Disk Storage**

Nonvolatile, rotating magnetic storage



Chapter 6 — Storage and Other I/O Topics — 27

#### **Disk Sectors and Access**

- Each sector records
  - Sector ID
  - Data (512 bytes, 4096 bytes proposed)
  - Error correcting code (ECC)
    - Used to hide defects and recording errors
  - Synchronization fields and gaps
- Access to a sector involves
  - Queuing delay if other accesses are pending
  - Seek: move the heads
  - Rotational latency
  - Data transfer
  - Controller overhead

## Disk Access Example

#### Given

- 512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
- Average read time
  - 4ms seek time
    - $+ \frac{1}{2} / (15,000/60) = 2$ ms rotational latency
    - + 512 / 100 MB/s = 0.005 ms transfer time
    - + 0.2ms controller delay
    - = 6.2 ms
- If actual average seek time is 1ms
  - Average read time = 3.2ms

#### **Disk Performance Issues**

- Manufacturers quote average seek time
  - Based on all possible seeks
  - Locality and OS scheduling lead to smaller actual average seek times
- Smart disk controller allocate physical sectors on disk
  - Present logical sector interface to host
  - SCSI, ATA, SATA
- Disk drives include caches
  - Prefetch sectors in anticipation of access
  - Avoid seek and rotational delay

## Flash Storage

- Nonvolatile semiconductor storage
  - 100x 1000x faster than disk
  - Smaller, lower power, more robust
  - But more \$/GB (between disk and DRAM)





## Flash Types

- NOR flash: bit cell like a NOR gate
  - Random read/write access
  - Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
  - Denser (bits/area), but block-at-a-time access
  - Cheaper per GB
  - Used for USB keys, media storage, ...
- Flash bits wears out after 1000's of accesses
  - Not suitable for direct RAM or disk replacement
  - Wear leveling: remap data to less used blocks

## Virtual vs. Physical Address

- Processor assumes a certain memory addressing scheme:
  - A block of data is called a virtual page
  - An address is called virtual (or logical) address
- Main memory may have a different addressing scheme:
  - Memory address is called physical address
  - MMU translates virtual address to physical address
  - Complete address translation table is large and is kept in main memory
  - MMU contains TLB (translation lookaside buffer), which is a small cache of the address translation table → address translation can create its own hit or miss

## **Page Fault Penalty**

- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Smart replacement algorithms

## **Memory Protection**

- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)

## The Memory Hierarchy

#### **The BIG Picture**

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

## **Virtual Machines**

- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance comptuers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMWare
  - Microsoft Virtual PC

## Multilevel On-Chip Caches

Intel Nehalem 4-core processor



Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache

## 3-Level Cache Organization

|                             | Intel Nehalem                                                                                           | AMD Opteron X4                                                                                             |
|-----------------------------|---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| L1 caches (per core)        | L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a L1 D-cache: 32KB, 64-byte | L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles L1 D-cache: 32KB, 64-byte      |
|                             | blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a                                | blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles                                     |
| L2 unified cache (per core) | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a                 | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a                   |
| L3 unified cache (shared)   | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a                         | 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles |

n/a: data not available

## **Concluding Remarks**

- Fast memories are small, large memories are slow
  - We really want fast, large memories ⊗
  - Caching gives this illusion ©
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
- Memory system design is critical for multiprocessors

## **Additional:**

- 512MB-400MHz DDR2-400 PC2-3200, 200p SODIMM, 1.8v Laptop memory upgrade. \$11.95
- 4GB DDR3 204Pin PC3-8500 CL7 128x8 DDR3-1066 Sodimm Laptop Memory Upgrade \$34.95
- DDR3 memory provides a reduction in power consumption of 30% compared to DDR2 modules due to DDR3's 1.5 V supply voltage, compared to DDR2's 1.8 V or DDR's 2.5 V.
- SDRAM latency refers to delays in transmitting data between the CPU and SDRAM. SDRAM latency is often measured in memory bus clock cycles.
- RAM speeds are given by the four numbers above, commonly in the format "tCAS-tRCD-tRP-tRAS". For example, latency values given as 2.5-3-3-8 would indicate tCAS=2.5, tRCD=3, tRP=3, tRAS=8.
- CAS latency of 9 at 1000 MHz (DDR3-2000) is 9 ns, while CAS latency of 7 at 667 MHz (DDR3-1333) is 10.5 ns.
- (CAS / Frequency (MHz))  $\times$  1000 = X ns