



Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display

- Logically, each transistor acts as a switch
- Combined to implement logic functions (gates)
  - AND, OR, NOT

CS270 - Fall Semester 2016


- Combined to build higher-level structures
  - Multiplexer, decoder, register, memory ...
  - Adder, multiplier ...
- Combined to build simple processor
  - LC-3
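
As a minimal sketch of how gates combine into a higher-level structure, the Python below models AND, OR, and NOT as functions on 0/1 values and composes them into a 1-bit full adder and a ripple-carry adder. The function names and the 4-bit default width are illustrative assumptions, not taken from the slides.

```python
def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return 1 - a

def XOR(a, b):
    # XOR built only from the three basic gates: (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))

def full_adder(a, b, carry_in):
    """Return (sum_bit, carry_out) for three 1-bit inputs."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out

def ripple_add(x, y, width=4):
    """Add two unsigned integers bit by bit; return (sum, carry_out)."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add(5, 6))   # (11, 0)
print(ripple_add(9, 8))   # (1, 1): 17 does not fit in 4 bits
```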









### Propagation Delay

- Each gate has a propagation delay, typically a fraction of a nanosecond (10<sup>-9</sup> sec).
- Delays accumulate along the chain of gates a signal has to pass through.
- The clock frequency of a processor is limited by the longest combinational path between storage elements: the delay of that path sets the minimum cycle time.
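
The relationship between path delay and clock frequency can be sketched as follows. The gate delays and the two paths are made-up values for illustration, and the model is simplified: it ignores setup time and clock skew at the storage elements.

```python
def path_delay_ns(gate_delays_ns):
    """Total delay of one chain of gates: the individual delays add up."""
    return sum(gate_delays_ns)

def max_clock_mhz(paths):
    """The slowest path sets the cycle time; frequency is its reciprocal."""
    cycle_time_ns = max(path_delay_ns(p) for p in paths)
    return 1000.0 / cycle_time_ns   # 1 / (time in ns) is GHz, so 1000 / ns is MHz

paths = [
    [0.2, 0.2, 0.2],              # three gates in series
    [0.2, 0.3, 0.2, 0.3, 0.25],   # five gates in series: the longest path
]
print(max_clock_mhz(paths))       # approximately 800 MHz for a 1.25 ns path
```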














#### **Truth Table (from circuit)**

Truth table for circuit on previous slide

| A | B | C | W | X | Y | Z |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 |
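
A truth table like this one can be generated by enumerating every input combination. The circuit itself is not shown in this section, so the gate structure below is an assumption: it is one set of Boolean expressions that happens to reproduce the listed rows, not necessarily the circuit from the slide.

```python
from itertools import product

def circuit(a, b, c):
    # Assumed gate structure (not given here); chosen to match the table rows.
    w = a & b          # W = A AND B
    x = b | c          # X = B OR C
    y = x & (1 - w)    # Y = X AND (NOT W)
    z = 1 - w          # Z = NOT W
    return w, x, y, z

print("A B C | W X Y Z")
for a, b, c in product((0, 1), repeat=3):
    w, x, y, z = circuit(a, b, c)
    print(a, b, c, "|", w, x, y, z)
```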

























#### **Combinational vs. Sequential**

- Combinational Circuit
  - always gives the same output for a given set of inputs
  - ex: adder always generates sum and carry, regardless of previous inputs
- Sequential Circuit
  - stores information
  - output depends on stored information (state) plus input, so a given input might produce different outputs, depending on the stored information
  - example: ticket counter
    - advances when you push the button
    - output depends on previous state
  - useful for building "memory" elements and "state machines"
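
The distinction can be sketched in a few lines of Python: a pure function stands in for a combinational circuit, and an object with a stored count stands in for the ticket counter. The names `half_adder` and `TicketCounter` are illustrative only.

```python
def half_adder(a, b):
    """Combinational: the output is a pure function of the current inputs."""
    return a ^ b, a & b          # (sum, carry)

class TicketCounter:
    """Sequential: the output depends on stored state as well as the input."""
    def __init__(self):
        self.count = 0           # the stored information (state)

    def push_button(self):
        self.count += 1          # advance on each button push
        return self.count

counter = TicketCounter()
print(half_adder(1, 1), half_adder(1, 1))             # same inputs, same outputs
print(counter.push_button(), counter.push_button())   # same input, different outputs
```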

















#### **Finite State Machine**

A description of a system with the following components:

1. A finite number of states
2. A finite number of external inputs
3. A finite number of external outputs
4. An explicit specification of all state transitions
5. An explicit specification of what determines each external output value

- Often described by a state diagram.
  - Inputs trigger state transitions.
  - Outputs are associated with each state (or with each transition).
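
The five components map directly onto a small table-driven program. The turnstile below is a hypothetical example (its states, inputs, and outputs are not from the slides); outputs are associated with each state.

```python
STATES = {"LOCKED", "UNLOCKED"}         # 1. finite set of states
INPUTS = {"coin", "push"}               # 2. finite set of external inputs
OUTPUTS = {"bar_locked", "bar_free"}    # 3. finite set of external outputs

TRANSITIONS = {                         # 4. every (state, input) pair is listed
    ("LOCKED", "coin"): "UNLOCKED",
    ("LOCKED", "push"): "LOCKED",
    ("UNLOCKED", "coin"): "UNLOCKED",
    ("UNLOCKED", "push"): "LOCKED",
}

OUTPUT_OF = {                           # 5. what determines each output value
    "LOCKED": "bar_locked",
    "UNLOCKED": "bar_free",
}

def run(inputs, state="LOCKED"):
    """Return the output observed after each input is applied."""
    trace = []
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]
        trace.append(OUTPUT_OF[state])
    return trace

print(run(["push", "coin", "coin", "push"]))
# ['bar_locked', 'bar_free', 'bar_free', 'bar_locked']
```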

#### Mealy vs Moore state machines

- Moore: Outputs are based only on the current state
  - Each state labeled with an output
  - Outputs change only at the clock edge following an input change
  - Potentially simpler to conceptualize
  - Simpler to interconnect with other state machines
  - Every Moore machine is convertible to a Mealy machine
- Mealy: Outputs are based on the current state and the inputs
  - Each arc/transition labeled with an output
  - Tends to have fewer states
  - Outputs shown on transition arcs in state diagrams
  - Output changes in the same cycle as the input is received

https://en.wikipedia.org/wiki/Mealy\_machine
 https://en.wikipedia.org/wiki/Moore\_machine
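
To make the timing difference concrete, here is a sketch of one hypothetical task, detecting two consecutive 1 bits, written both ways. The task and the state encodings are assumptions chosen for illustration. The Mealy version needs two states and reacts in the same cycle; the Moore version needs three states and its output appears one cycle later.

```python
def mealy_detect_11(bits):
    """Two states: the state only remembers whether the previous bit was 1."""
    prev_was_one = False
    outputs = []
    for bit in bits:
        # Output is on the transition: it depends on state AND current input.
        outputs.append(1 if (prev_was_one and bit == 1) else 0)
        prev_was_one = (bit == 1)
    return outputs

def moore_detect_11(bits):
    """Three states: 0 = no 1 yet, 1 = one 1 seen, 2 = two or more 1s in a row."""
    next_state = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 2, (2, 0): 0, (2, 1): 2}
    output_of = {0: 0, 1: 0, 2: 1}         # output is attached to the state
    state = 0
    outputs = []
    for bit in bits:
        outputs.append(output_of[state])   # output seen during this cycle
        state = next_state[(state, bit)]   # state updates at the clock edge
    outputs.append(output_of[state])       # output after the final edge
    return outputs

bits = [0, 1, 1, 1, 0]
print(mealy_detect_11(bits))   # [0, 0, 1, 1, 0]
print(moore_detect_11(bits))   # [0, 0, 0, 1, 1, 0]
```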


































# The Memory Hierarchy: Key facts and ideas (1)

- Programs keep growing in size exponentially.
- Memory cost per bit
  - Faster technologies are expensive, slower ones are cheaper; the difference is orders of magnitude.
  - Over time, storage density goes up, driving the per-bit cost down.
- Locality in program execution
  - Code/data used recently will likely be needed again soon.
  - Code/data near a recently used item will likely be needed soon.



#### **Principle of Locality**

 Programs access a small proportion of their address space at any time


- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data
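
Spatial locality can be illustrated by counting how often consecutive accesses move to a different memory block. The sketch below assumes a 64x64 array of 8-byte elements stored in row-major order and 64-byte blocks; all of these sizes are illustrative choices.

```python
def block_switches(addresses, block_size=64):
    """Count how many accesses land in a different block than the previous one."""
    switches, last_block = 0, None
    for addr in addresses:
        block = addr // block_size
        if block != last_block:
            switches += 1
            last_block = block
    return switches

ROWS, COLS, ELEM = 64, 64, 8   # assumed array shape and element size in bytes

# Byte address of element (r, c) in a row-major layout.
row_order = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
col_order = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

print(block_switches(row_order))   # 512: a new block only every 8 elements
print(block_switches(col_order))   # 4096: every access lands in a new block
```

Traversing in storage order reuses each fetched block eight times, while the strided traversal uses one element per block before moving on.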

#### **Cache Misses**

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

#### **Multilevel Caches**

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some systems now include L-3 cache
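
A rough sketch of how an access walks down the hierarchy: each level is tried in turn, and a miss at one level is serviced by the next. The latencies below are made-up round numbers, and the model assumes main memory always holds the block.

```python
# Assumed latencies in cycles, chosen only for illustration.
LEVELS = [("L1", 1), ("L2", 10), ("main memory", 100)]

def access_cycles(block, contents):
    """Walk down the hierarchy until the block is found; return (level, cycles).

    contents maps a cache level name to the set of blocks it currently holds.
    Main memory is assumed to hold every block.
    """
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency
        if name == "main memory" or block in contents.get(name, set()):
            return name, cycles
    raise AssertionError("unreachable: main memory always hits")

contents = {"L1": {0x10}, "L2": {0x10, 0x20}}
print(access_cycles(0x10, contents))   # ('L1', 1)
print(access_cycles(0x20, contents))   # ('L2', 11)
print(access_cycles(0x30, contents))   # ('main memory', 111)
```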


#### **The Memory Hierarchy**

#### **The BIG Picture**

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy


#### **Block Placement**

- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time
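
The three placement schemes differ only in how many sets the cache is divided into. The sketch below assumes the usual mapping (set index = block number modulo number of sets) and numbers the cache slots so that each set occupies consecutive slots; the function name and slot numbering are illustrative.

```python
def candidate_slots(block_number, num_blocks, ways):
    """Return the cache slots where a memory block may be placed.

    ways == 1           -> direct mapped
    1 < ways < blocks   -> n-way set associative
    ways == num_blocks  -> fully associative
    """
    num_sets = num_blocks // ways
    set_index = block_number % num_sets
    first = set_index * ways
    return list(range(first, first + ways))

# An 8-block cache and memory block 12 (both values are illustrative).
print(candidate_slots(12, 8, 1))   # [4]
print(candidate_slots(12, 8, 2))   # [0, 1]
print(candidate_slots(12, 8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```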

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 41

#### **Finding a Block**

| Associativity            | Location method                               | Tag comparisons |
|--------------------------|-----------------------------------------------|-----------------|
| Direct mapped            | Index                                         | 1               |
| n-way set<br>associative | Set index, then search entries within the set | n               |
| Fully associative        | Search all entries                            | #entries        |

- Hardware caches
  - Reduce comparisons to reduce cost
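
The tag-comparison counts in the table can be reproduced with a toy lookup. This sketch assumes an 8-entry cache, splits the block number into a set index and a tag, and reports the size of the selected set as the number of comparisons, since hardware checks all entries of a set in parallel. The helper names are hypothetical.

```python
def make_cache(num_blocks, ways):
    """num_blocks entries organised as sets of `ways` entries; None = empty."""
    return [[None] * ways for _ in range(num_blocks // ways)]

def install(sets, block_number):
    """Place a block in the first empty entry of its set (no replacement here)."""
    index, tag = block_number % len(sets), block_number // len(sets)
    entries = sets[index]
    entries[entries.index(None)] = tag

def lookup(sets, block_number):
    """Return (hit, tag_comparisons) for one lookup."""
    index, tag = block_number % len(sets), block_number // len(sets)
    entries = sets[index]
    # Every entry of the selected set is compared, so the comparator
    # count equals the set size.
    return tag in entries, len(entries)

for ways in (1, 2, 8):   # direct mapped, 2-way, fully associative
    cache = make_cache(8, ways)
    install(cache, 12)
    print(ways, lookup(cache, 12), lookup(cache, 20))
```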


#### Replacement


- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
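
Both policies can be sketched for a single cache set. The classes below are a software illustration only (real hardware tracks recency with dedicated bits); the class names and the fixed random seed are assumptions.

```python
from collections import OrderedDict
import random

class LRUSet:
    """One cache set with least-recently-used replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.entries = OrderedDict()          # oldest entry first

    def access(self, tag):
        """Return True on a hit; on a miss, install tag, evicting the LRU entry."""
        if tag in self.entries:
            self.entries.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.entries) == self.ways:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[tag] = None
        return False

class RandomSet:
    """One cache set with random replacement: no usage history is tracked."""
    def __init__(self, ways, seed=0):
        self.ways = ways
        self.entries = []
        self.rng = random.Random(seed)

    def access(self, tag):
        if tag in self.entries:
            return True
        if len(self.entries) == self.ways:
            self.entries.pop(self.rng.randrange(self.ways))  # evict any entry
        self.entries.append(tag)
        return False

trace = ["A", "B", "A", "C", "A", "B"]
lru = LRUSet(ways=2)
print([lru.access(t) for t in trace])
# [False, False, True, False, True, False]
```

The LRU set has to reorder its entries on every hit, which is the bookkeeping that becomes costly at high associativity; the random set keeps no history at all.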


#### **Write Policy**

- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
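
The difference in lower-level traffic can be sketched with a toy cache that holds a single block. The model assumes write-allocate (a write miss brings the block into the cache first) and counts only writes that reach the lower level; the function name is hypothetical.

```python
def count_lower_level_writes(write_trace, write_back):
    """Count writes reaching the lower level for a one-block cache.

    write_trace is a sequence of block tags being written.
    """
    cached, dirty, lower_writes = None, False, 0
    for tag in write_trace:
        if tag != cached:                 # miss: replace the cached block
            if write_back and dirty:
                lower_writes += 1         # write the old block back now
            cached, dirty = tag, False
        if write_back:
            dirty = True                  # update the upper level only
        else:
            lower_writes += 1             # update both levels on every write
    return lower_writes

trace = ["A", "A", "A", "A", "B"]
print(count_lower_level_writes(trace, write_back=False))   # 5
print(count_lower_level_writes(trace, write_back=True))    # 1 (B is still dirty in the cache)
```

The `dirty` flag is the extra state that write-back needs and write-through does not.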


#### **Sources of Misses**

- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size
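
These definitions suggest a simple way to classify the misses in an access trace: a miss on a block never seen before is compulsory; a miss that would also occur in a fully associative cache of the same total size is a capacity miss; any other miss is a conflict miss. The sketch below assumes LRU replacement in both caches and uses a made-up trace of block numbers.

```python
from collections import OrderedDict

def lru_misses(trace, num_sets, ways):
    """Return the set of trace positions that miss in an LRU cache."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = set()
    for pos, block in enumerate(trace):
        entries = sets[block % num_sets]
        if block in entries:
            entries.move_to_end(block)        # hit: refresh recency
            continue
        misses.add(pos)
        if len(entries) == ways:
            entries.popitem(last=False)       # evict least recently used
        entries[block] = None
    return misses

def classify_misses(trace, num_sets, ways):
    """Split the misses of a set-associative cache into the three categories."""
    actual = lru_misses(trace, num_sets, ways)
    fully_assoc = lru_misses(trace, 1, num_sets * ways)   # same total size
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    seen = set()
    for pos, block in enumerate(trace):
        if pos in actual:
            if block not in seen:
                counts["compulsory"] += 1     # first access to this block
            elif pos in fully_assoc:
                counts["capacity"] += 1       # misses even with full associativity
            else:
                counts["conflict"] += 1       # caused only by set competition
        seen.add(block)
    return counts

# Direct-mapped cache with 4 blocks; blocks 0 and 4 compete for the same entry.
print(classify_misses([0, 4, 0, 4, 1, 2, 3, 5, 0], num_sets=4, ways=1))
# {'compulsory': 6, 'capacity': 1, 'conflict': 2}
```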


| Design change          | Effect on miss rate        | Negative performance effect                                                                  |
|------------------------|----------------------------|----------------------------------------------------------------------------------------------|
| Increase cache size    | Decrease capacity misses   | May increase access time                                                                     |
| Increase associativity | Decrease conflict misses   | May increase access time                                                                     |
| Increase block size    | Decrease compulsory misses | Increases miss penalty. For very large block size, may increase miss rate due to pollution. |



|                             | Intel Nehalem | AMD Opteron X4 |
|-----------------------------|---------------|----------------|
| L1 caches (per core)        | L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a<br>L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles<br>L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles |
| L2 unified cache (per core) | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a |
| L3 unified cache (shared)   | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a | 2MB, 64-byte blocks, 32-way, replace block shared by few cores, write-back/allocate, hit time 32 cycles |

