- The Big Picture
- CPU Basics
- Clocks
- ISA (Instruction Set Architecture)
- uArch
This article was converted from a PPT deck that I wrote in 2013 to train our team's software engineers.
The Big Picture
CPU Basics
- The computer's CPU fetches, decodes, and executes program instructions.
- The two principal parts of the CPU are the datapath and the control unit.
- The datapath consists of an arithmetic-logic unit and storage units (registers) that are interconnected by a data bus that is also connected to main memory.
- Various CPU components perform sequenced operations according to signals provided by its control unit.
- The control unit determines which actions to carry out according to the values in a program counter register and a status register.
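As a rough mental model (not any real ISA), the fetch-decode-execute loop of a toy accumulator machine can be sketched in a few lines of C; the opcodes, instruction encoding, and memory size here are invented purely for illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy accumulator machine (not a real ISA). */
enum { OP_LOAD = 0, OP_ADD = 1, OP_STORE = 2, OP_HALT = 3 };

int main(void) {
    uint8_t mem[16] = {0};   /* unified instruction + data memory          */
    uint8_t acc = 0;         /* accumulator (datapath storage)             */
    uint8_t pc  = 0;         /* program counter (control unit state)       */

    /* Tiny program: acc = mem[10] + mem[11]; mem[12] = acc;
       Each instruction is encoded as (opcode << 4) | address.             */
    mem[0] = (OP_LOAD  << 4) | 10;
    mem[1] = (OP_ADD   << 4) | 11;
    mem[2] = (OP_STORE << 4) | 12;
    mem[3] = (OP_HALT  << 4);
    mem[10] = 7; mem[11] = 5;

    for (;;) {
        uint8_t inst = mem[pc++];       /* fetch: read instruction, bump PC */
        uint8_t op   = inst >> 4;       /* decode: split opcode ...         */
        uint8_t addr = inst & 0x0F;     /* ... and operand address          */
        if      (op == OP_LOAD)  acc = mem[addr];                 /* execute */
        else if (op == OP_ADD)   acc = (uint8_t)(acc + mem[addr]);
        else if (op == OP_STORE) mem[addr] = acc;
        else                     break;                           /* HALT    */
    }
    printf("mem[12] = %d\n", mem[12]);  /* prints 12 */
    return 0;
}
```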
Clocks
- Every computer contains at least one clock that synchronizes the activities of its components.
- A fixed number of clock cycles are required to carry out each data movement or computational operation.
- The clock frequency, measured in megahertz or gigahertz, determines the speed with which all operations are carried out.
- For example, typical Intel and typical PIC clocks:
- A 2 GHz clock has a cycle time of 0.5 nanoseconds.
- An 8 MHz clock has a cycle time of 0.125 microseconds.
- A single master clock is typically divided or multiplied to produce the different frequencies used by various parts of the system.
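The cycle times above follow directly from t = 1/f; a quick check in C with the two example frequencies hard-coded:

```c
#include <stdio.h>

int main(void) {
    double f_intel = 2.0e9;   /* 2 GHz */
    double f_pic   = 8.0e6;   /* 8 MHz */
    /* The cycle time is the reciprocal of the clock frequency. */
    printf("2 GHz -> %.3f ns per cycle\n", 1.0 / f_intel * 1e9);  /* 0.500 ns */
    printf("8 MHz -> %.3f us per cycle\n", 1.0 / f_pic   * 1e6);  /* 0.125 us */
    return 0;
}
```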
ISA (Instruction Set Architecture)
- Instruction Set Architecture is the structure of a computer that a machine-language programmer (or a compiler) must understand to write a correct (timing-independent) program for that machine.
- The ISA is the set of all instructions that the microprocessor can execute.
- The set of instructions executed by a modern processor may include:
- Data transfer instructions
- data movement (load, store, push, pop, move, swap registers)
- Data operation instructions
- arithmetic and logical (negate, extend, add, subtract, multiply, divide, and, or, shift)
- Program control instructions
- control transfer (jump; call; trap, i.e. a jump into the operating system; return from a call or trap; conditional branch)
Instruction Processing
- The datapath is designed around the data transfers required to perform instructions.
- A controller causes the right transfers to happen.
Instruction Execution
Instruction Formats
In designing an instruction set, consideration is given to:
- Instruction length.
- Whether short, long, or variable.
- Number of operands.
- Number of addressable registers.
- Memory organization.
- Whether byte- or word-addressable.
- Addressing modes.
- Choose any or all: direct, indirect, or indexed.
Addressing Modes
Instructions can specify many different ways to obtain their data
- data in instruction
- data in register
- address of data in instruction
- address of data in register
- address of data computed from two or more values contained in the instruction and/or registers
On a RISC machine, arithmetic/logic instructions use only the first two of these addressing modes.
On a CISC machine, all addressing modes are generally available to all instructions.
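Roughly how these modes show up from the C programmer's side can be sketched as follows; the mapping described in the comments is the generic pattern, not the output of any particular compiler:

```c
#include <stdio.h>

int table[4] = {10, 20, 30, 40};
int x = 5;

int main(void) {
    int r;
    int *p = &x;
    int i = 2;

    r = 7;         /* immediate: the data (7) is encoded in the instruction     */
    r = r + r;     /* register: both operands already live in registers         */
    r = x;         /* direct: the instruction carries the address of x          */
    r = *p;        /* indirect: a register holds the address of the data        */
    r = table[i];  /* indexed: address = base of table + i * sizeof(int)        */

    printf("%d\n", r);   /* 30 */
    return 0;
}
```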
RISC vs. CISC
- The belief that better performance would be obtained by reducing the number of instructions required to implement a program led to the design of processors with very complex instructions (CISC).
CISC – Complex Instruction Set Computers
- As compiler technology improved, researchers started to wonder whether CISC architectures really delivered better performance than architectures with simpler instruction sets.
RISC – Reduced Instruction Set Computers
- Consider the addition of two operands from memory, with the result written back to memory, in RISC and CISC architectures:
- Breaking an operation into small instructions (RISC) allows the compiler to optimize the code, e.g. between the two LD instructions (memory is slow) the compiler can schedule instructions that don't need memory access.
- The CISC instruction has no option but to wait for its operands to arrive from memory, potentially delaying other instructions.
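Spelled out side by side it looks roughly like this; the mnemonics in the comments are generic stand-ins, not the listing of any real CISC or RISC assembler:

```c
#include <stdio.h>

/* C source line:  c = a + b;   where a, b, and c all live in memory.
 *
 * A typical CISC encoding (one instruction, memory operands):
 *     ADD c, a, b        ; read a and b from memory, write the sum to c
 *
 * A typical RISC encoding (load/store architecture, register operands only):
 *     LD  r1, a          ; slow memory read
 *     LD  r2, b          ; slow memory read -- independent instructions can be
 *                        ;   scheduled between/around these loads by the compiler
 *     ADD r3, r1, r2     ; register-to-register
 *     ST  r3, c          ; write the result back to memory
 */
int a = 3, b = 4, c;

int main(void) {
    c = a + b;
    printf("c = %d\n", c);   /* 7 */
    return 0;
}
```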
RISC Characteristics
- One instruction per cycle
- Register to register operations
- Few, simple addressing modes
- Few, simple instruction formats
- Hardwired design (no microcode)
- Fixed instruction format
- More compile time/effort
Architecture Implementation
uArch
Pipelining
What Is Pipelining
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce the speedup.
- Time to "fill" the pipeline and time to "drain" it reduce the speedup.
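These fill/drain and balance effects are easy to quantify; a back-of-the-envelope sketch in C, with the stage count, instruction count, and stage time chosen arbitrarily for illustration:

```c
#include <stdio.h>

int main(void) {
    int    k = 5;       /* pipeline stages                      */
    int    n = 100;     /* instructions in the workload         */
    double t = 1.0;     /* time of the slowest stage (cycles)   */

    double unpipelined = n * k * t;        /* each instruction uses all k stages alone  */
    double pipelined   = (k + n - 1) * t;  /* k cycles to fill, then one result/cycle   */

    printf("speedup = %.2f (ideal limit = %d)\n", unpipelined / pipelined, k);
    return 0;
}
```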
Instruction-Level Pipelining
- For every clock cycle, one small step is carried out, and the stages are overlapped.
- S1. Fetch instruction.
- S2. Decode opcode.
- S3. Calculate effective address of operands.
- S4. Fetch operands.
- S5. Execute.
- S6. Store result.
The Five Stages of Load
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Register Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
Pipeline Hurdles
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions (two different instructions need the same hardware in the same cycle).
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (two different instructions use the same storage).
- It must appear as if the instructions execute in the correct order.
- Control hazards: pipelining of branches and other instructions that change the PC (one instruction affects which instruction is next).
- A common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" into the pipeline.
Insert “Bubble” into the Pipeline
Data hazards
Predict Branch
- Advantages
- The main purpose of predication is to avoid jumps over very small sections of program code, increasing the effectiveness of pipelined execution and avoiding problems with the cache.
- Disadvantages
- Predication’s primary drawback is in increased encoding space.
- 1-bit Prediction
- 2-bit Prediction
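A 2-bit predictor is essentially a saturating counter per branch that must mispredict twice in a row before it changes its mind; a minimal sketch with a single counter and a made-up outcome pattern:

```c
#include <stdio.h>

/* 2-bit saturating counter: 0,1 = predict not-taken; 2,3 = predict taken.
 * The counter must be wrong twice in a row before the prediction flips,
 * which is what makes it better than a 1-bit scheme on loop exits.        */
int main(void) {
    int counter = 2;                        /* start "weakly taken"            */
    int outcomes[] = {1,1,1,0,1,1,1,0};     /* taken=1; made-up loop pattern   */
    int correct = 0, total = 8;

    for (int i = 0; i < total; i++) {
        int predict_taken = (counter >= 2);
        if (predict_taken == outcomes[i]) correct++;
        /* Train the counter toward the actual outcome, saturating at 0 and 3. */
        if (outcomes[i] && counter < 3) counter++;
        if (!outcomes[i] && counter > 0) counter--;
    }
    printf("predicted %d of %d branches correctly\n", correct, total);
    return 0;
}
```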
Out-of-order execution
The processor executes instructions in an order governed by the availability of their input data, rather than by their original order in the program.
Register renaming
Parallelism
- Instruction Level Parallelism
- Superscalar
- VLIW
- Data Level Parallelism
- SIMD (Single Instruction Multiple Data); see the sketch after this list
- MMX
- Thread Level Parallelism
- Multithreading
- Multicore
- Multiprocessor
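For the SIMD item above, the idea is that one instruction operates on several data elements at once; a portable C sketch that mimics a 4-wide vector add (real code would use vendor intrinsics or rely on compiler auto-vectorization):

```c
#include <stdio.h>

#define N 8

int main(void) {
    float a[N] = {1,2,3,4,5,6,7,8};
    float b[N] = {8,7,6,5,4,3,2,1};
    float c[N];

    /* Scalar: one element per "instruction". */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SIMD-style: one operation conceptually covers 4 lanes at a time.
     * On real hardware the inner block would be a single vector instruction. */
    for (int i = 0; i < N; i += 4)
        for (int lane = 0; lane < 4; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];

    for (int i = 0; i < N; i++) printf("%.0f ", c[i]);   /* all 9s */
    printf("\n");
    return 0;
}
```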
Superscalar
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Improve these operations by executing them concurrently in multiple pipelines
- Requires multiple functional units
- Requires re-arrangement of instructions
Superscalar Execution
Superpipelined
- Many pipeline stages need less than half a clock cycle
- Double internal clock speed gets two tasks per external clock cycle
VLIW - Very long instruction word
- Why
- To overcome the difficulty of finding parallelism in machine-level object code.
- In a VLIW processor, multiple instructions are packed together and issued in parallel to an equal number of execution units.
- The compiler (not the processor) ensures that only independent instructions are executed in parallel.
- Characteristics
- VLIW contains multiple primitive instructions that can be executed in parallel by functional units of a processor.
- The compiler packs a number of primitive, non-interdependent instructions into a very long instruction word.
- Since multiple instructions are packed into one instruction word, the instruction words are much larger than those of CISC and RISC processors.
Multithreading
Cache
- The goal of a cache in computing is to keep the expensive CPU as busy as possible by minimizing the wait for reads and writes to slower memory.
Memory Hierarchy
Cache
- The processor does all memory operations through the cache.
- Miss: if the requested word is not in the cache, a block of words containing the requested word is brought into the cache, and then the processor request is completed.
- Hit: if the requested word is in the cache, the read or write operation is performed directly in the cache, without accessing main memory.
- Block: the minimum amount of data transferred between the cache and main memory.
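The cost of misses is usually summarized as the average memory access time (AMAT); the hit time, miss rate, and miss penalty below are made-up but representative values:

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;     /* cycles to access the cache              */
    double miss_penalty = 100.0;   /* extra cycles to fetch a block from DRAM */
    double miss_rate    = 0.03;    /* 3% of accesses miss                     */

    /* AMAT = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("average memory access time = %.1f cycles\n", amat);   /* 4.0 */
    return 0;
}
```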
Cache Design
The level’s design is described by four behaviors:
- Block Placement:
- Where could a new block be placed in the given level?
- Block Identification:
- How is an existing block found, if it is in the level?
- Block Replacement:
- Which existing block should be replaced, if necessary?
- Write Strategy:
- How are writes to the block handled?
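For a direct-mapped (or set-associative) level, block placement and identification come down to slicing the address into offset, index, and tag fields; a sketch assuming an invented geometry of 64-byte blocks and 256 sets:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Assumed geometry: 64-byte blocks (6 offset bits), 256 sets (8 index bits). */
    uint32_t addr        = 0x12345678;
    uint32_t offset_bits = 6, index_bits = 8;

    uint32_t offset = addr & ((1u << offset_bits) - 1);                 /* byte within block */
    uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1); /* which set         */
    uint32_t tag    = addr >> (offset_bits + index_bits);               /* identifies block  */

    /* Placement: the block can only go in set 'index'.
     * Identification: a hit means the stored tag in that set equals 'tag'
     * and the line's valid bit is set.                                  */
    printf("tag=0x%X index=0x%X offset=0x%X\n", tag, index, offset);
    return 0;
}
```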
Cache Line
Three Major Placement Schemes
Direct-Mapped Cache
Two-Way Set-Associative Cache
Replacement Strategies
- Which existing block do we replace when a new block comes in?
- With a direct-mapped cache:
- There’s only one choice! (Same as placement)
- With a (fully- or set-) associative cache:
- If any “way” in the set is empty, pick one of those
- Otherwise, there are many possible strategies:
- Random: Simple, fast, and fairly effective
- FIFO
- Least-Recently Used (LRU)
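Of the strategies above, LRU is the only one that needs per-way bookkeeping; a minimal sketch of one set of a 4-way associative cache (the way count and reference stream are made up):

```c
#include <stdio.h>

#define WAYS 4

/* One cache set with WAYS ways; tags[] records which block sits in each way,
 * last_used[] holds a logical timestamp for LRU bookkeeping.                 */
int tags[WAYS]      = {-1, -1, -1, -1};   /* -1 = empty way */
int last_used[WAYS] = { 0,  0,  0,  0};
int now = 0;

static void access_block(int tag) {
    now++;
    for (int w = 0; w < WAYS; w++) {
        if (tags[w] == tag) {                 /* hit: refresh this way's timestamp */
            last_used[w] = now;
            printf("tag %d: hit in way %d\n", tag, w);
            return;
        }
    }
    /* Miss: prefer an empty way; otherwise evict the least-recently-used way. */
    int victim = -1;
    for (int w = 0; w < WAYS; w++)
        if (tags[w] == -1) { victim = w; break; }
    if (victim < 0) {
        victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (last_used[w] < last_used[victim]) victim = w;
    }
    tags[victim] = tag;
    last_used[victim] = now;
    printf("tag %d: miss, placed in way %d\n", tag, victim);
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 5, 2};   /* made-up reference stream */
    for (int i = 0; i < 7; i++) access_block(refs[i]);
    return 0;
}
```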
Writing to Memory
- Cache and memory become inconsistent when data is written into cache, but not to memory – the cache coherence problem.
- Strategies to handle inconsistent data:
- Write-through
- Write to memory and to the cache simultaneously, always.
- A write to memory is ~100 times slower than a write to the (L1) cache.
- Write-back
- Write to the cache and mark the block as "dirty".
- The write to memory occurs later, when the dirty block is cast out from the cache to make room for another block.
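A write-back cache needs only a single extra state bit per block to defer the slow memory write; a minimal sketch of the dirty-bit bookkeeping, with one cached block and made-up data purely for illustration:

```c
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK 4                       /* words per cache block (made up) */

int  memory[BLOCK] = {1, 2, 3, 4};    /* backing store                   */
int  line[BLOCK];                     /* the one cached block            */
bool valid = false, dirty = false;

static void write_word(int i, int value) {
    if (!valid) { memcpy(line, memory, sizeof line); valid = true; }  /* fill on miss */
    line[i] = value;
    dirty = true;                     /* memory is now stale; no DRAM write yet */
}

static void evict(void) {
    if (valid && dirty) {             /* cast-out: copy the dirty block back */
        memcpy(memory, line, sizeof memory);
        dirty = false;
    }
    valid = false;
}

int main(void) {
    write_word(2, 99);                /* fast: hits the cache, marks it dirty */
    printf("before evict: memory[2] = %d\n", memory[2]);   /* still 3 */
    evict();
    printf("after  evict: memory[2] = %d\n", memory[2]);   /* now 99  */
    return 0;
}
```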
Cache Coherence
- Before a DMA transfer, we also need to determine whether the information in main memory is up to date with the information in the cache (write-back protocol).
- One solution is to always flush the cache by forcing the dirty data to be written back to memory before the DMA transfer takes place.
- MESI