The article is converted from a PPT which was written by me in 2013 used to train our team software engineers.

The Big Picture

inside cpu

CPU Basics

  • The computer’s CPU fetches, decodes, and executes program instructions.

  • The two principal parts of the CPU are the datapath and the control unit.

    • The datapath consists of an arithmetic-logic unit and storage units (registers) that are interconnected by a data bus that is also connected to main memory.
    • Various CPU components perform sequenced operations according to signals provided by its control unit.
    • The control unit determines which actions to carry out according to the values in a program counter register and a status register.

Clocks

  • Every computer contains at least one clock that synchronizes the activities of its components.

  • A fixed number of clock cycles are required to carry out each data movement or computational operation.

  • The clock frequency, measured in megahertz or gigahertz, determines the speed with which all operations are carried out.

  • Typical Intel and typical PIC clocks look like this:
    • 2 GHz clock has a cycle time of 0.5 nanoseconds.
    • 8 MHz clock has a cycle time of 0.125 microseconds.
  • One master clock has multiple frequencies used for various parts of the system.

ISA (Instruction Set Architecture)

  • Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.

  • ISA is the set of all instructions that the microprocessor can execute

inside cpu

  • The set of instructions executed by a modern processor may include:
    • Data transfer instructions
      • data movement (load, store, push, pop, move, swap - registers)
    • Data operation instructions
      • arithmetic and logical (negate, extend, add, subtract, multiply, divide, and, or, shift)
    • Program control instructions
      • control transfer (jump, call, trap - jump into the operating system, return - from call or trap, conditional branch)

Instruction Processing

  • The datapath based on data transfers required to perform instructions
  • A controller causes the right transfers to happen

inside cpu

Instruction Execution

inside cpu

Instruction Formats

In designing an instruction set, consideration is given to:

  • Instruction length.
    • Whether short, long, or variable.
  • Number of operands.
  • Number of addressable registers.
  • Memory organization.
    • Whether byte - or word addressable.
  • Addressing modes.
    • Choose any or all: direct, indirect or indexed

inside cpu

Addressing Modes

Instructions can specify many different ways to obtain their data

  • data in instruction
  • data in register
  • address of data in instruction
  • address of data in register
  • address of data computed from two or more values contained in the instruction and/or registers

On a RISC machine, arithmetic/logic instructions use only the first two of these ADDRESSING MODES

On a CISC machine, all addressing modes are generally available to all instructions

RISC vs. CISC

  • The believe that better performance would be obtained by reducing the number of instruction required to implement a program, lead to design of processors with very complex instructions (CISC)

    CISC – Complex Instruction Set Computers

  • As compiler technologies improved, researchers started to wonder if CISC architectures really delivered better performances than architectures with simpler instruction set

    RISC – Reduced Instruction Set Computers

inside cpu

  • Addition of two operands from memory, with result written in memory, in RISC and CISC architectures

  • Having an operation broken into small instructions (RISC) allows the compiler to optimize the code

    i.e. between the two LD instructions (memory is slow) the compiler can add some instructions that don’t need memory access

  • The CISC instruction has no option but to wait for its operands to come from the memory, potentially delaying other instructions

RISC Characteristics

  • One instruction per cycle
  • Register to register operations
  • Few, simple addressing modes
  • Few, simple instruction formats
  • Hardwired design (no microcode)
  • Fixed instruction format
  • More compile time/effort

Architecture Implementation

inside cpu

uArch

Pipelining

What Is Pipelining

  • Pipelining doesn’t help latency of single task, it helps throughput of entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Multiple tasks operating simultaneously
  • Potential speedup = Number pipe stages
  • Unbalanced lengths of pipe stages reduces speedup
  • Time to “fill” pipeline and time to “drain” it reduces speedup

Instruction-Level Pipelining

  • For every clock cycle, one small step is carried out, and the stages are overlapped.

inside cpu

  • S1. Fetch instruction.
  • S2. Decode opcode.
  • S3. Calculate effective address of operands.
  • S4. Fetch operands.
  • S5. Execute.
  • S6. Store result.

inside cpu

The Five Stages of Load

inside cpu

  • Ifetch: Instruction Fetch
    • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec: Calculate the memory address
  • Mem: Read the data from the Data Memory
  • Wr: Write the data back to the register file

Pipeline Hurdles

Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle.

  • Structural hazards: HW cannot support this combination of instructions
    • two different instructions use same h/w in same cycle
  • Data hazards: Instruction depends on result of prior instruction still in the pipeline
    • two different instructions use same storage
    • must appear as if the instructions execute in correct order
  • Control hazards: Pipelining of branches & other instructions that change the PC
    • one instruction affects which instruction is next
  • Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

Insert “Bubble” into the Pipeline

inside cpu

Data hazards

inside cpu

inside cpu

Predict Branch

  • Advantages
    • The main purpose of predication is to avoid jumps over very small sections of program code, increasing the effectiveness of pipelined execution and avoiding problems with the cache.
  • Disadvantages
    • Predication’s primary drawback is in increased encoding space.
  • 1-bit Prediction
  • 2-bit Prediction

inside cpu

Out-of-order execution

Processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program

Register renaming

inside cpu

Parallelism

inside cpu

  • Instruction Level Parallelism
    • Superscalar
    • VLIW
  • Data Level Parallelism
    • SIMD (Single Instruction Multiple Data)
    • MMX
  • Thread Level Parallelism
    • Multithreading
    • Multicore
    • Multiprocessor

Superscalar

  • Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
    • Improve these operations by executing them concurrently in multiple pipelines
    • Requires multiple functional units
    • Requires re-arrangement of instructions

inside cpu

Superscalar Execution

inside cpu

Superpipelined

  • Many pipeline stages need less than half a clock cycle
  • Double internal clock speed gets two tasks per external clock cycle

inside cpu

VLIW - Very long instruction word

  • Why
    • To overcome the difficulty of finding parallelism in machine-level object code.
    • In a VLIW processor, multiple instructions are packed together and issued in parallel to an equal number of execution units.
    • The compiler (not the processor) checks that there are only independent instructions executed in parallel.
  • characteristics
    • VLIW contains multiple primitive instructions that can be executed in parallel by functional units of a processor.
    • The compiler packs a number of primitive, non-interdependent instructions into a very long instruction word
    • Since multiple instructions are packed in one instruction word, the instruction words are much larger than CISC and RISC’s.

Multithreading

inside cpu

Cache

  • The goal of a cache in computing is to keep the expensive CPU as busy as possible by minimizing the wait for reads and writes to slower memory.

Memory Hierarchy

inside cpu

inside cpu

Cache

inside cpu

  • Processor does all memory operations with cache.
  • Miss – If requested word is not in cache, a block of words containing the requested word is brought to cache, and then the processor request is completed.
  • Hit – If the requested word is in cache, read or write operation is performed directly in cache, without accessing main memory.
  • Block – minimum amount of data transferred between cache and main memory.

Cache Design

The level’s design is described by four behaviors:

  • Block Placement:
    • Where could a new block be placed in the given level?
  • Block Identification:
    • How is a existing block found, if it is in the level?
  • Block Replacement:
    • Which existing block should be replaced, if necessary?
  • Write Strategy:
    • How are writes to the block handled?

Cache Line

inside cpu

Three Major Placement Schemes

inside cpu

Direct-Mapped Cache

inside cpu

Two-Way Set-Associative Cache

inside cpu

Replacement Strategies

  • Which existing block do we replace, when a new block comes in?

  • With a direct-mapped cache:
    • There’s only one choice! (Same as placement)
  • With a (fully- or set-) associative cache:
    • If any “way” in the set is empty, pick one of those
    • Otherwise, there are many possible strategies:
      • Random: Simple, fast, and fairly effective
      • FIFO
      • Least-Recently Used (LRU)

Writing to Memory

  • Cache and memory become inconsistent when data is written into cache, but not to memory – the cache coherence problem.
  • Strategies to handle inconsistent data:
    • Write-through -Write to memory and cache simultaneously always.
      • Write to memory is ~100 times slower than to (L1) cache.
    • Write-back
      • Write to cache and mark block as “dirty”.
      • Write to memory occurs later, when dirty block is cast-out from the cache to make room for another block

inside cpu

Cache Coherence

  • Also, before a DMA transfer, need to determine if information in main memory is up-to-date with information in cache. (write back protocol)
  • One solution is to always flush the cache by forcing the dirty data to be written back to memory before a DMA transfer takes place

  • MESI