# CIS 371 Computer Organization and Design Unit 8: Static and Dynamic Scheduling With contributions by Drew Hilton ## This Unit: Static & Dynamic Scheduling - Pipelining and superscalar review - Code scheduling - To reduce pipeline stalls - To increase ILP (insn level parallelism) - Two approaches - Static scheduling by the compiler - Dynamic scheduling by the hardware # Readings - P&H - Chapter 4.10 4.11 ## Pipelining Review - Increases clock frequency by staging instruction execution - "Scalar" pipelines have a best-case CPI of 1 - Challenges: - Data and control dependencies further worsen CPI - Data: With full bypassing, load-to-use stalls - Control: use branch prediction to mitigate penalty - Big win, done by all processors today - How many stages (depth)? - Five stages is pretty good minimum - Intel Pentium II/III: 12 stages - Intel Pentium 4: 22+ stages - Intel Core 2: 14 stages ## Pipeline Diagram | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |-------------------------------|---|---|---|----|----|----|---|---|---| | add \$3,\$2,\$1 | F | D | Х | M | W | | | | | | lw \$4,4(\$3) | | F | D | ΨX | М | ,W | | | | | addi \$6, <mark>\$4</mark> ,1 | | | F | D | d* | ₹X | М | W | | | sub \$8,\$3,\$1 | | | | F | d* | D | Х | М | W | - Use compiler scheduling to reduce load-use stall frequency - Like software interlocks, but for performance not correctness | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |-------------------------|---|---|---|----|------------|----|---|---|---| | add \$3,\$2,\$1 | F | D | Х | ΙМ | , W | | | | | | lw \$4,4(\$3) | | F | D | X | М | ,W | | | | | sub \$8,\$3,\$1 | | | F | D | <b>▼</b> X | М | W | | | | addi \$6, <b>\$4</b> ,1 | | | | F | D | ₹X | М | W | | ## Superscalar Pipeline Review - Execute two or more instruction per cycle - Challenges: - wide fetch (branch prediction harder, misprediction more costly) - wide decode (stall logic) - wide execute (more ALUs) - wide bypassing (more possibly bypassing paths) - Finding enough independent instructions (and fill delay slots) - How many instructions per cycle max (width)? - Really simple, low-power cores are still scalar (single issue) - Even low-power cores a dual-issue (Intel Atom) - Most desktop/laptop chips three-issue or four-issue - A few 5 or 6-issue chips have been built (IBM Power4, Itanium II) ## Superscalar Pipeline Diagrams - Ideal #### scalar ``` lw 0(r1) → r2 lw 4(r1) → r3 lw 8(r1) → r4 add r14,r15 → r6 add r12,r13 → r7 add r17,r16 → r8 lw 0(r18) → r9 ``` ``` 10 11 12 3 5 6 8 4 X D M W F D X Μ W X Μ W D Χ M W X M W D X М W F X Μ D W ``` ### 2-way superscalar ``` lw 0(r1) → r2 lw 4(r1) → r3 lw 8(r1) → r4 add r14,r15 → r6 add r12,r13 → r7 add r17,r16 → r8 lw 0(r18) → r9 ``` ``` 5 6 8 10 11 12 X M W X W X Μ W X Μ W F X Μ W X W D X Μ W ``` ## Superscalar Pipeline Diagrams - Realistic #### scalar ``` lw 0(r1) → r2 lw 4(r1) → r3 lw 8(r1) → r4 add r4,r5→ r6 add r2,r3→ r7 add r7,r6→ r8 lw 0(r8) → r9 ``` ``` 10 11 12 3 5 6 8 X Μ W D M W X Μ W d* D X М W D X Μ W X Μ W F X Μ W ``` ### 2-way superscalar ``` lw 0(r1) → r2 lw 4(r1) → r3 lw 8(r1) → r4 add r4,r5→ r6 add r2,r3→ r7 add r7,r6→ r8 lw 0(r8) → r9 ``` ``` 5 6 10 11 12 Χ M W W d* X Μ W d* X D W X M W d* D X Μ W ``` ## Code Scheduling - Scheduling: act of finding independent instructions - "Static" done at compile time by the compiler (software) - "Dynamic" done at runtime by the processor (hardware) - Why schedule code? - Scalar pipelines: fill in load-to-use delay slots to improve CPI - Superscalar: place independent instructions together - As above, load-to-use delay slots - Allow multiple-issue decode logic to let them execute at the same time ## Compiler Scheduling - Compiler can schedule (move) instructions to reduce stalls - Basic pipeline scheduling: eliminate back-to-back load-use pairs - Example code sequence: a = b + c; d = f e; - sp stack pointer, sp+0 is "a", sp+4 is "b", etc... #### Before **After** ld r2, 4(sp)ld r2,4(sp) $1d = \frac{73}{8} (sp)$ 1d r3,8(sp)add \r3, r2, r1 //stall $1d r \sqrt{3}, 16 (sp)$ //no stall st r1,0(sp)add r3,r2,r1 $1d r_{6}, 20 (sp)$ ld r5,16(sp) $st r1 \gg (sp)$ 1d r6, 20 (sp)sub r5, r6, r4 //stall sub r5, r6, r4 //no stall st r4,12(sp) st r4,12(sp) ## Compiler Scheduling Requires ### Large scheduling scope - Independent instruction to put between load-use pairs - + Original example: large scope, two independent computations - This example: small scope, one computation | <u>Before</u> | | <u>After</u> | | |-----------------------------|---------|----------------------------|---------| | ld r2,4(sp) | | ld r2,4(sp) | | | ld r3,8(sp)<br>add r3,r2,r1 | | ld <mark>r3</mark> ,8(sp) | | | add r3,r2,r1 | //stall | $add \frac{1}{r3}, r2, r1$ | //stall | | st r1,0(sp) | | st r1,0(sp) | | One way to create larger scheduling scopes? | ı | | | |---|--|--| ## Compiler Scheduling Requires ### Enough registers - To hold additional "live" values - Example code contains 7 different values (including sp) - Before: max 3 values live at any time → 3 registers enough - After: max 4 values live → 3 registers not enough ### <u>Original</u> #### Wrong! ``` ld r2, 4(sp) ld r2, 4(sp) ld r1,8(sp) 1d r1, 8 (sp) add \r1, r2, r1 //stall \rightarrow 1d r2,16(sp) st r1, 0 (sp) - add r1, r2, r1 // wrong r2 ld r2, 16 (sp) ld r1,20 (sp) ld r1,20(sp) st r1,0(sp) // wrong r1 sub r2, r1, r1 //stall sub r2, r1, r1 st r1,12(sp) st r1,12(sp) ``` ## Compiler Scheduling Requires ### Alias analysis - Ability to tell whether load/store reference same memory locations - Effectively, whether load/store can be rearranged - Example code: easy, all loads/stores use same base register (sp) - New example: can compiler tell that r8 != sp? - Must be conservative ``` Wrong(?) Before ld r2, 4(sp) ld r2,4(sp) ld r3,8(sp) ld r3,8(sp) add r3,r2,r1 //stall 1d r5,0(r8) //does r8==sp? st r1,0(sp) - add r3,r2,r1 1d r6,4(r8) //does r8+4==sp? ld r5,0(r8) st r1,0(sp) 1d r6, 4(r8) sub r5,r6,r4 //stall sub r5, r6, r4 st r4,8(r8) st r4,8(r8) ``` ## Code Example: SAXPY - **SAXPY** (Single-precision A X Plus Y) - Linear algebra routine (used in solving systems of equations) - Part of early "Livermore Loops" benchmark suite - Uses floating point values in "F" registers - Uses floating point version of instructions (ldf, addf, mulf, stf, etc.) ### **New Metric: Utilization** - **Utilization**: actual performance / peak performance - Important metric for performance/cost - No point to paying for hardware you will rarely use - Adding hardware usually improves performance & reduces utilization - Additional hardware can only be exploited some of the time - Diminishing marginal returns - Compiler can help make better use of existing hardware - Important for superscalar ### **SAXPY Performance and Utilization** | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | |------------------------|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----| | ldf X(r1)→f1 | F | D | X | М | W | | | | | | | | | | | | | | | | | mulf f0,f1→f2 | | F | D | d* | E* | E* | E* | E* | E* | W | | | | | | | | | | | | ldf Y(r1) <b>→</b> f3 | | | F | p* | D | X | M | W | | | | | | | | | | | | | | addf f2,f3 <b>→</b> f4 | | | | | F | D | d* | d* | d* | E+ | E+ | W | | | | | | | | | | stf f4→Z(r1) | | | | | | F | p* | p* | p* | D | X | M | W | | | | | | | | | addi r1,4 <b>→</b> r1 | | | | | | | | | | F | D | X | M | W | | | | | | | | blt r1,r2,0 | | | | | | | | | | | F | D | Χ | M | W | | | | | | | ldf X(r1)→f1 | | | | | | | | | | | | F | D | X | M | W | | | | | ### Scalar pipeline - Full bypassing, 5-cycle E\*, 2-cycle E+, branches predicted taken - Single iteration (7 insns) latency: 16–5 = 11 cycles - Performance: 7 insns / 11 cycles = 0.64 IPC - **Utilization**: 0.64 actual IPC / 1 peak IPC = 64% ### **SAXPY Performance and Utilization** | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | |---------------|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----| | ldf X(r1)→f1 | F | D | Χ | М | W | | | | | | | | | | | | | | | | | mulf f0,f1→f2 | F | D | d* | d* | E* | E* | E* | E* | E* | W | | | | | | | | | | | | ldf Y(r1)→f3 | | F | D | p* | X | М | W | | | | | | | | | | | | | | | addf f2,f3→f4 | | F | p* | p* | D | d* | d* | d* | d* | E+ | E+ | W | | | | | | | | | | stf f4→Z(r1) | | | F | p* | D | p* | p* | p* | p* | d* | X | M | W | | | | | | | | | addi r1,4⇒r1 | | | | | F | p* | p* | p* | p* | p* | D | X | M | W | | | | | | | | blt r1,r2,0 | | | | | F | p* | p* | p* | p* | p* | D | d* | X | M | W | | | | | | | ldf X(r1) →f1 | | | | | | | | | | | F | D | X | M | W | | | | | | - 2-way superscalar pipeline - Any two insns per cycle + split integer and floating point pipelines - + Performance: 7 insns / 10 cycles = 0.70 IPC - Utilization: 0.70 actual IPC / 2 peak IPC = 35% - More hazards → more stalls - Each stall is more expensive # Static (Compiler) Instruction Scheduling - Idea: place independent insns between slow ops and uses - Otherwise, pipeline stalls while waiting for RAW hazards to resolve - Have already seen pipeline scheduling - To schedule well you need ... independent insns - Scheduling scope: code region we are scheduling - The bigger the better (more independent insns to choose from) - Once scope is defined, schedule is pretty obvious - Trick is creating a large scope (must schedule across branches) - Compiler scheduling (really scope enlarging) techniques - Loop unrolling (for loops) ## Loop Unrolling SAXPY - Goal: separate dependent insns from one another - SAXPY problem: not enough flexibility within one iteration - Longest chain of insns is 9 cycles - Load (1) - Forward to multiply (5) - Forward to add (2) - Forward to store (1) - Can't hide a 9-cycle chain using only 7 insns - But how about two 9-cycle chains using 14 insns? - Loop unrolling: schedule two or more iterations together - Fuse iterations - Schedule to reduce stalls - Schedule introduces ordering problems, rename registers to fix ## Unrolling SAXPY I: Fuse Iterations - Combine two (in general K) iterations of loop - Fuse loop control: induction variable (i) increment + branch - Adjust (implicit) induction uses: constants → constants + 4 ``` ldf X(r1),f1 mulf f0,f1,f2 mulf ldf Y(r1),f3 addf f2,f3,f4 stf f4,Z(r1) addi r1,4,r1 blt r1,r2,0 ldf X(r1),f1 mulf f0,f1,f2 ldf Y(r1),f3 addf f2,f3,f4 stf f4,Z(r1) addi r1,4,r1 blt r1,r2,0 ldf Y(r1),f3 addf f2,f3,f4 stf f4,Z(r1) addi r1,4,r1 blt r1,r2,0 ``` ``` ldf X(r1),f1 mulf f0,f1,f2 ldf Y(r1),f3 addf f2,f3,f4 stf f4,Z(r1) ``` ``` ldf X+4(r1),f1 mulf f0,f1,f2 ldf Y+4(r1),f3 addf f2,f3,f4 stf f4,Z+4(r1) addi r1,8,r1 blt r1,r2,0 ``` ## Unrolling SAXPY II: Pipeline Schedule - Pipeline schedule to reduce stalls - Have already seen this: pipeline scheduling ``` ldf X(r1),f1 mulf f0,f1,f2 ldf Y(r1),f3 addf f2,f3,f4 stf f4,Z(r1) ldf X+4(r1),f1 mulf f0,f1,f2 ldf Y+4(r1),f3 addf f2,f3,f4 stf f4,Z+4(r1) addi r1,8,r1 blt r1,r2,0 ``` ``` ldf X(r1),f1 ldf X+4(r1),f1 mulf f0,f1,f2 mulf f0,f1,f2 ldf Y(r1),f3 ldf Y+4(r1),f3 addf f2,f3,f4 addf f2,f3,f4 stf f4,Z(r1) stf f4,Z+4(r1) addi r1,8,r1 blt r1,r2,0 ``` ## Unrolling SAXPY III: "Rename" Registers - Pipeline scheduling causes reordering violations - Rename registers to correct ``` ldf X(r1),f1 ldf X+4(r1),f1 mulf f0,f1,f2 mulf f0,f1,f2 ldf Y(r1),f3 ldf Y+4(r1),f3 addf f2,f3,f4 addf f2,f3,f4 stf f4,Z(r1) stf f4,Z+4(r1) addi r1,8,r1 blt r1,r2,0 ``` ``` ldf X(r1),f1 ldf X+4(r1),f5 mulf f0,f1,f2 mulf f0,f5,f6 ldf Y(r1),f3 ldf Y+4(r1),f7 addf f2,f3,f4 addf f6,f7,f8 stf f4,Z(r1) stf f8,Z+4(r1) addi r1,8,r1 blt r1,r2,0 ``` ## Unrolled SAXPY Performance/Utilization ``` 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 X M 1df X(r1) \rightarrow f1 ldf X+4(r1) \rightarrow f5 D X M W F D F*|F* F* F* W mulf f0, f1 \rightarrow f2 D | F* F* F* F* W mulf f0, f5 \rightarrow f6 1df Y(r1) \rightarrow f3 M \cdot W ldf Y+4(r1) \rightarrow f7 D d^{*}E+E+S^{*}W addf f2, f3 \rightarrow f4 F p* D E+ p* E+ W addf f6, f7 \rightarrow f8 D X M W stf f4 \rightarrow Z(r1) stf f8 \rightarrow Z+4(r1) D X M W D \mathsf{X} \mathsf{M} \mathsf{W} addi r1→8,r1 D X M W blt r1, r2, 0 ldf X(r1) \rightarrow f1 ``` - + Performance: 12 insn / 13 cycles = 0.92 IPC - + Utilization: 0.92 actual IPC / 1 peak IPC = 92% - + **Speedup**: (2 \* 11 cycles) / 13 cycles = 1.69 ## **Loop Unrolling Shortcomings** - Static code growth → more I\$ misses (limits degree of unrolling) - Needs more registers to hold values (ISA limits this) - Doesn't handle non-loops... - Doesn't handle recurrences (inter-iteration dependences) ``` for (i=0;i<N;i++) X[i]=A*X[i-1]; ``` blt r1, r2, 0 CIS 371 (Hilton/Roth/Martin): Scheduling ``` 1df X-4(r1), f1 1df X-4(r1), f1 mulf f0,f1,f2 mulf f0,f1,f2 stf f2,X(r1) stf f2,X(r1) addi r1,4,r1 mulf f0, f2, f3 blt r1, r2, 0 stf f3,X+4(r1) ldf X-4(r1), f1 addi r1,4,r1 mulf f0,f1,f2 blt r1, r2, 0 stf f2,X(r1) addi r1,4,r1 ``` Two mulf's are not parallel Other (more advanced) techniques help ### **Another Limitation: Branches** # loop: jz r1, not\_found ld [r1] -> r2 sub r1, r2 -> r2 jz r2, found Id[r1+4] -> r1jmp loop Aside: what does this code do? Legal to move load up past branch? ## Recap: Static Scheduling Limitations - Limited number of registers (set by ISA) - Scheduling scope - Example: can't generally move memory operations past branches - Inexact memory aliasing information - Often prevents reordering of loads above stores - Caches misses (or any runtime event) confound scheduling - How can the compiler know which loads will miss vs hit? - Can impact the compiler's scheduling decisions ### Can Hardware Overcome These Limits? ### Dynamically-scheduled processors - Also called "out-of-order" processors - Hardware re-schedules insns... - ...within a sliding window of VonNeumann insns - As with pipelining and superscalar, ISA unchanged - Same hardware/software interface, appearance of in-order - Increases scheduling scope - Does loop unrolling transparently - Uses branch prediction to "unroll" branches - Examples: - Pentium Pro/II/III (3-wide), Core 2 (4-wide), Alpha 21264 (4-wide), MIPS R10000 (4-wide), Power5 (5-wide) - Basic overview of approach (more information in CIS501) ## The Problem With In-Order Pipelines ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 addf f0,f1→f2 mulf f2,f3→f2 subf f0,f1→f4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 F D E+E+E+W F D d* d* E* E* E* E* E* W F p* p* D E+E+E+W ``` - What's happening in cycle 4? - mulf stalls due to data dependence - OK, this is a fundamental problem - subf stalls due to pipeline hazard - Why? subf can't proceed into D because addf is there - That is the only reason, and it isn't a fundamental one - Maintaining in-order writes to register file - Why can't subf go into D in cycle 4 and E+ in cycle 5? ## Out-of-order Pipeline ## Code Example Code: #### Raw insns ``` add r2,r3 r1 sub r2,r1 r3 mul r2,r3 r3 div r1,4 r1 ``` - "True" (real) & "False" (artificial) dependencies - Divide insn independent of subtract and multiply insns - Can execute in parallel with subtract - Many registers re-used - Just as in static scheduling, the register names get in the way - How does the hardware get around this? - Approach: (step #1) rename registers, (step #2) schedule ## Step #1: Register Renaming - To eliminate register conflicts/hazards - "Architected" vs "Physical" registers level of indirection - Names: r1,r2,r3 - Locations: p1,p2,p3,p4,p5,p6,p7 FreeList Original mapping: r1→p1, r2→p2, r3→p3, p4-p7 are "available" | MapTable | | | | | | | | | |------------|------------|------------|--|--|--|--|--|--| | r1 | r2 | r3 | | | | | | | | p1 | p2 | р3 | | | | | | | | <b>p</b> 4 | <b>p</b> 2 | р3 | | | | | | | | <b>p</b> 4 | p2 | <b>p</b> 5 | | | | | | | | <b>p</b> 4 | p2 | <b>p</b> 6 | | | | | | | ``` p4,p5,p6,p7 p5,p6,p7 p6,p7 ``` | add | r2,r3,r1 | |-----|------------| | sub | r2,r1/r3 | | | r2, r3, r3 | | div | r1,4,r1 | Original insns add p2,p3,p4 sub p2,p4,p5 mul p2,p5,p6 div p4,4,p7 Renamed insns - Renaming conceptually write each register once - + Removes **false** dependences - + Leaves true dependences intact! - When to reuse a physical register? After overwriting insn done ## Register Renaming Algorithm - Data structures: - maptable[architectural\_reg] → physical\_reg - Free list: get/put free register - Algorithm: at decode for each instruction: ``` insn.phys_input1 = maptable[insn.arch_input1] insn.phys_input2 = maptable[insn.arch_input2] insn.phys_to_free = maptable[arch_output] new_reg = get_free_phys_reg() insn.phys_output = new_reg maptable[arch_output] = new_reg ``` - At "commit" - Once all older instructions have committed, free register put free phys reg(insn.phys to free) xor r1 ^ r2 -> r3 add r3 + r4 -> r4 sub r5 - r2 -> r3 addi r3 + 1 -> r1 | r1 | p1 | |----|----| | r2 | p2 | | r3 | рЗ | | r4 | p4 | | r5 | p5 | Free-list xor **r1** ^ **r2** -> r3 add r3 + r4 -> r4 sub r5 - r2 -> r3 addi r3 + 1 -> r1 | <b></b> | xor | <b>p1</b> | ٨ | <b>p2</b> | -> | |---------|-----|-----------|---|-----------|----| |---------|-----|-----------|---|-----------|----| | r1 | <b>p1</b> | |----|-----------| | r2 | <b>p2</b> | | r3 | р3 | | r4 | p4 | | r5 | p5 | Free-list | <b></b> | xor | <b>p1</b> | ^ p2 -> | <b>p6</b> | |---------|-----|-----------|---------|-----------| |---------|-----|-----------|---------|-----------| | r1 | p1 | |----|----| | r2 | p2 | | r3 | рЗ | | r4 | p4 | | r5 | p5 | Free-list | <b>•</b> | xor | p1 | ^ p2 | -> | p6 | |----------|-----|----|------|----|----| |----------|-----|----|------|----|----| | r1 | p1 | |----|-----------| | r2 | p2 | | r3 | <b>p6</b> | | r4 | p4 | | r5 | p5 | Free-list | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | p4 | | r5 | p5 | Free-list | r1 | р1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | p4 | | r5 | р5 | Free-list | r1 | p1 | |----|-----------| | r2 | p2 | | r3 | p6 | | r4 | <b>p7</b> | | r5 | р5 | Free-list | r1 | p1 | |----|-----------| | r2 | <b>p2</b> | | r3 | p6 | | r4 | р7 | | r5 | <b>p5</b> | Free-list | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | р7 | | r5 | p5 | Free-list | r1 | p1 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | Free-list Map table Free-list | r1 | p1 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | **p9** p10 Map table Free-list | r1 | <b>p9</b> | |----|-----------| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | Free-list ## Out-of-order Pipeline ### Dispatch - Renamed instructions into ooo structures - Re-order buffer (ROB) - All instruction until commit - Issue Queue - Un-executed instructions - Central piece of scheduling logic - Content Addressable Memory (CAM) ### RAM vs CAM - Random Access Memory - Read/write specific index - Get/set value there - Content Addressable Memory - Search for a value (send value to all entries) - Find matching indices (use comparator at each entry) - Output: one bit per entry (multiple match) - One structure can have ports of both types ### RAM vs CAM: RAM RAM: read/write specific index CIS 371 (Hilton/Roth/Martin): Scheduling ### RAM vs CAM: CAM CAM: search for value ### Issue Queue - Holds un-executed instructions - Tracks ready inputs - Physical register names + ready bit - AND to tell if ready ## Dispatch Steps - Allocate IQ slot - Full? Stall - Read ready bits of inputs - Table 1-bit per preg - Clear ready bit of output in table - Instruction has not produced value yet - Write data in IQ slot xor p1 ^ p2 -> p6 add p6 + p4 -> p7 sub p5 - p2 -> p8 addi p8 + 1 -> p9 #### **Issue Queue** | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | p1 | у | |----|---| | p2 | У | | р3 | У | | p4 | У | | p5 | У | | p6 | У | | p7 | У | | p8 | У | | р9 | у | xor p1 ^ p2 -> p6 add p6 + p4 -> p7 sub p5 - p2 -> p8 addi p8 + 1 -> p9 #### **Issue Queue** | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | xor | p1 | у | p2 | y | p6 | 0 | | | | | | | | | | | | | | | | | | | | | | | | | | p1 | у | |----|---| | p2 | У | | рЗ | У | | p4 | У | | р5 | У | | p6 | n | | р7 | У | | p8 | У | | р9 | у | xor p1 ^ p2 -> p6 add p6 + p4 -> p7 sub p5 - p2 -> p8 addi p8 + 1 -> p9 #### **Issue Queue** | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|-----------|---|------|---|-----|-----| | xor | p1 | у | p2 | у | р6 | 0 | | add | <b>p6</b> | n | p4 | y | p7 | 1 | | | | | | | | | | | | | | | | | | p1 | у | |----|---| | p2 | У | | р3 | У | | p4 | У | | p5 | У | | p6 | n | | p7 | n | | p8 | У | | р9 | У | xor p1 ^ p2 -> p6 add p6 + p4 -> p7 sub p5 - p2 -> p8 addi p8 + 1 -> p9 #### **Issue Queue** | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | xor | p1 | у | p2 | у | р6 | 0 | | add | p6 | n | p4 | у | р7 | 1 | | sub | p5 | у | p2 | y | p8 | 2 | | | | | | | | | | p1 | у | |----|---| | p2 | У | | р3 | У | | p4 | У | | p5 | У | | p6 | n | | р7 | n | | p8 | n | | р9 | У | xor p1 ^ p2 -> p6 add p6 + p4 -> p7 sub p5 - p2 -> p8 addi p8 + 1 -> p9 #### **Issue Queue** | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | xor | p1 | у | p2 | у | р6 | 0 | | add | р6 | n | p4 | у | р7 | 1 | | sub | р5 | у | p2 | у | p8 | 2 | | addi | p8 | n | | у | p9 | 3 | | p1 | у | |----|---| | p2 | У | | рЗ | У | | p4 | У | | р5 | У | | p6 | n | | p7 | n | | р8 | n | | р9 | n | ### Out-of-order pipeline - Execution (ooo) stages - **Select** ready instructions - Send for execution - Wakeup dependents ### Dynamic Scheduling/Issue Algorithm - Data structures: - Ready table[phys\_reg] → yes/no (part of issue queue) - Algorithm at "schedule" stage (prior to read registers): ``` foreach instruction: if table[insn.phys_input1] == ready && table[insn.phys_input2] == ready then insn as "ready" select the oldest "ready" instruction table[insn.phys output] = ready ``` ### Issue = Select + Wakeup - **Select** N oldest, ready instructions - > "xor" is the oldest ready instruction below - > "xor" and "sub" are the two oldest ready instructions below - Note: may have resource constraints: i.e. load/store/fp | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | xor | p1 | у | p2 | у | р6 | 0 | | add | p6 | n | p4 | у | р7 | 1 | | sub | p5 | у | p2 | у | p8 | 2 | | addi | p8 | n | | у | р9 | 3 | Ready! Ready! ### Issue = Select + Wakeup - **Wakeup** dependent instructions - CAM search for Dst in inputs - Set ready - Also update ready-bit table for future instructions | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|-----------|---|------|---|-----|-----| | xor | p1 | у | p2 | у | p6 | 0 | | add | p6 | у | p4 | у | р7 | 1 | | sub | p5 | у | p2 | у | p8 | 2 | | addi | <b>p8</b> | у | | у | р9 | 3 | | у | |---| | | | y | | у | | у | | у | | y | | n | | y | | n | | | ### **Issue** - Select/Wakeup one cycle - Dependents go back to back - Next cycle: add/addi are ready: | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | | | | | | | | | add | p6 | у | p4 | у | р7 | 1 | | | | | | | | | | addi | p8 | у | | у | р9 | 3 | ### Register Read - When do instructions read the register file? - Option #1: after select, right before execute - (Not done at decode) - Read physical register (renamed) - Or get value via bypassing (based on physical register name) - This is Pentium 4, MIPS R10k, Alpha 21264 style - Physical register file may be large - Multi-cycle read - Option #2: as part of issue, keep values in Issue Queue - Pentium Pro, Core 2, Core i7 # Renaming review Everyone rename this instruction: mul r4 \* r5 -> r1 | r1 | p1 | |----|----| | r2 | p2 | | r3 | рЗ | | r4 | p4 | | r5 | р5 | Free-list # Dispatch Review #### Everyone dispatch this instruction: div p7 / p6 -> p1 | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | | | | | | | | | p1 | у | |----|---| | p2 | У | | р3 | У | | p4 | У | | р5 | У | | р6 | n | | р7 | У | | р8 | у | | р9 | У | | | | ### Select Review | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | add | р3 | у | p1 | у | p2 | 0 | | mul | p2 | n | p4 | у | р5 | 1 | | div | p1 | у | p5 | n | р6 | 2 | | xor | p4 | у | p1 | у | р9 | 3 | Determine which instructions are ready. Which will be issued on a 1-wide machine? Which will be issued on a 2-wide machine? # Wakeup Review | Insn | Inp1 | R | Inp2 | R | Dst | Age | |------|------|---|------|---|-----|-----| | add | р3 | у | p1 | у | p2 | 0 | | mul | p2 | n | p4 | у | р5 | 1 | | div | p1 | у | p5 | n | р6 | 2 | | xor | p4 | у | p1 | у | р9 | 3 | What information will change if we issue the add? # OOO execution (2-wide) # OOO execution (2-wide) # Multi-cycle operations - Multi-cycle ops (load, fp, multiply, etc) - Wakeup deferred a few cycles - Structural hazard? - Cache misses? - Speculative wake-up (assume hit) - Cancel exec of dependents - Re-issue later - Details: complicated, not important # Re-order Buffer (ROB) - All instructions in order - Two purposes - Misprediction recovery - In-order commit - Maintain appearance of in-order execution - Freeing of physical registers # Renaming revisited - Overwritten register - Freed at commit - Restore in map table on recovery - Branch mis-prediction recovery - > Also must be read at rename | r1 | p1 | |----|----| | r2 | p2 | | r3 | р3 | | r4 | p4 | | r5 | p5 | Free-list | - Γ | n2 ' | 1 | |-----|------|---| | L | PO . | J | | r1 | р1 | |----|----| | r2 | p2 | | r3 | р3 | | r4 | p4 | | r5 | р5 | | p6 | |-----| | р7 | | p8 | | p9 | | p10 | Free-list | г | ·~ ^ | 1 | |---|------|---| | П | D3 | Т | | | | | | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | p4 | | r5 | р5 | Free-list | [ | р3 | ] | |---|-----------|---| | [ | <b>p4</b> | ] | | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | p4 | | r5 | p5 | Free-list | [ | р3 | ] | |---|-----------|---| | [ | <b>p4</b> | ] | | r1 | p1 | |----|-----------| | r2 | p2 | | r3 | p6 | | r4 | <b>p7</b> | | r5 | p5 | Free-list | | p3 | ] | |---|-----------|---| | [ | <b>p4</b> | ] | | [ | p6 | ] | | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | р7 | | r5 | р5 | Free-list | [ | p3 | ] | |---|-----------|---| | [ | <b>p4</b> | ] | | [ | p6 | ] | | r1 | p1 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | Free-list | [ | р3 | ] | |---|-----------|---| | [ | <b>p4</b> | ] | | [ | p6 | ] | | [ | p1 | ] | | r1 | p1 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | Free-list | [ | р3 | ] | |---|----|---| | [ | p4 | ] | | [ | p6 | ] | | Γ | p1 | 1 | | r1 | <b>p9</b> | |----|-----------| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | Free-list #### **ROB** - ROB entry holds all info for recover/commit - Logical register names - Physical register names - Instruction types - Dispatch: insert at tail - Full? Stall - Commit: remove from head - Not completed? Stall ### Recovery - Completely remove wrong path instructions - Flush from IQ - Remove from ROB - Restore map table to before misprediction - Free destination registers bnz r1 loop xor r1 ^ r2 -> r3 add r3 + r4 -> r4 sub r5 - r2 -> r3 addi r3 + 1 -> r1 | bnz p1, loop | |-------------------| | xor p1 ^ p2 -> p6 | | add p6 + p4 -> p7 | | sub p5 - p2 -> p8 | | addi p8 + 1 -> p9 | | [ | | | |---|-----------|---| | [ | р3 | ] | | [ | <b>p4</b> | ] | | [ | p6 | ] | | [ | p1 | ] | | r1 | р9 | |----|----| | r2 | p2 | | r3 | р8 | | r4 | р7 | | r5 | р5 | bnz r1 loop xor r1 ^ r2 -> r3 add r3 + r4 -> r4 sub r5 - r2 -> r3 addi r3 + 1 -> r1 | bnz p1, loop | |-------------------| | xor p1 ^ p2 -> p6 | | add p6 + p4 -> p7 | | sub p5 - p2 -> p8 | | addi p8 + 1 -> p9 | | [ | | | |---|-----------|---| | [ | р3 | ] | | [ | <b>p4</b> | ] | | [ | p6 | ] | | [ | p1 | ] | | r1 | <b>p1</b> | |----|-----------| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | bnz r1 loop xor r1 ^ r2 -> r3 add r3 + r4 -> r4 sub r5 - r2 -> r3 | r1 | p1 | |----|----| | r2 | p2 | | r3 | p6 | | r4 | р7 | | r5 | p5 | bnz r1 loop xor r1 ^ r2 -> r3 add r3 + r4 -> r4 [ p3] [ p4] | r1 | p1 | |----|-----------| | r2 | p2 | | r3 | р6 | | r4 | <b>p4</b> | | r5 | p5 | bnz r1 loop xor r1 ^ r2 -> r3 [ p3] | r1 | p1 | |----|----| | r2 | p2 | | r3 | р3 | | r4 | p4 | | r5 | p5 | bnz r1 loop bnz p1, loop [ ] | r1 | p1 | |----|----| | r2 | p2 | | r3 | р3 | | r4 | p4 | | r5 | p5 | p6 p7 p8 p9 p10 #### What about stores - Stores: Write D\$, not registers - Can we rename memory? - Recover in the cache? #### What about stores - Stores: Write D\$, not registers - Can we rename memory? - Recover in the cache? - No (at least not easily) - Cache writes unrecoverable - Stores: only when certain - Commit #### Commit | L | ρs | J | |---|-----------|---| | [ | <b>p4</b> | ] | | [ | p6 | ] | | ſ | <b>p1</b> | 1 | [ ca ] - Commit: instruction becomes architected state - In-order, only when instructions are finished - Free overwritten register (why?) ### Freeing over-written register - P3 was r3 **before** xor - P6 is r3 after xor - Anything older than xor should read p3 - Anything younger than xor should p6 (until next r3 writing instruction - At commit of xor, no older instructions exist [ p3 ] [p4] [p6] [p1] | r1 | p9 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | | r1 | p9 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | | r1 | p9 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | | r1 | p9 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | | r1 | p9 | |----|----| | r2 | p2 | | r3 | p8 | | r4 | р7 | | r5 | p5 | - Standard style: large and cumbersome - Change layout slightly - Columns = stages (dispatch, issue, etc) - Rows = instructions - Content of boxes = cycles - For our purposes: issue/exec = 1 cycle - Ignore preg read latency, etc - Load-use, mul, div, and FP longer | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | | | | | | add p2 + p3 -> p4 | | | | | | xor p4 ^ p5 -> p6 | | | | | | ld [p7] -> p8 | | | | | 2-wide Infinite ROB, IQ, Pregs Loads: 3 cycles | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | | | | | add p2 + p3 -> p4 | 1 | | | | | xor p4 ^ p5 -> p6 | | | | | | ld [p7] -> p8 | | | | | Cycle 1: Dispatch Id and add | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | | | add p2 + p3 -> p4 | 1 | | | | | xor p4 ^ p5 -> p6 | 2 | | | | | ld [p7] -> p8 | 2 | | | | #### Cycle 1: - Dispatch xor and Id - 1st Ld issues -- also note WB cycle while you do this (Note: don't issue if WB ports full) | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | | | add p2 + p3 -> p4 | 1 | | | | | xor p4 ^ p5 -> p6 | 2 | | | | | ld [p7] -> p8 | 2 | 3 | 6 | | #### Cycle 3: - add and xor are not ready - 2nd load is- issue it | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | | | add p2 + p3 -> p4 | 1 | 5 | 6 | | | xor p4 ^ p5 -> p6 | 2 | | | | | ld [p7] -> p8 | 2 | 3 | 6 | | Cycle 4: Nothing Cycle 5: Add can issue | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | 6 | | add p2 + p3 -> p4 | 1 | 5 | 6 | | | xor p4 ^ p5 -> p6 | 2 | 6 | 7 | | | ld [p7] -> p8 | 2 | 3 | 6 | | #### Cycle 6: - 1st load can commit (oldest instruction and finished) - xor can issue | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | 6 | | add p2 + p3 -> p4 | 1 | 5 | 6 | 7 | | xor p4 ^ p5 -> p6 | 2 | 6 | 7 | | | ld [p7] -> p8 | 2 | 3 | 6 | | Cycle 7: Add can commit | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | Ld [p1] -> p2 | 1 | 2 | 5 | 6 | | add p2 + p3 -> p4 | 1 | 5 | 6 | 7 | | xor p4 ^ p5 -> p6 | 2 | 6 | 7 | 8 | | ld [p7] -> p8 | 2 | 3 | 6 | 8 | #### Cycle 8: • Commit xor and Id (2-wide: can do both at once) #### Dynamically Scheduling Memory Ops - Compilers must schedule memory ops conservatively - Options for hardware: - Don't execute any load until all prior stores execute (conservative) - Execute loads as soon as possible, detect violations (aggressive) - When a store executes, it checks if any later loads executed too early (to same address). If so, flush pipeline ``` Learn violations over time, selectively reorder (predictive) Before Wrong(?) 1d r2, 4(sp) 1d r2, 4(sp) ld r3,8(sp) 1d r3,8(sp) add r3,r2,r1 //stall \rightarrow 1d r5,0(r8) //does r8==sp? add r3,r2,r1 st r1, 0 (sp) ld r5,0(r8) 1d r6, 4(r8) //does r8+4==sp? st r1,0(sp) 1d r6, 4(r8) sub r5, r6, r4 //stall sub r5, r6, r4 st r4,8(r8) st r4,8(r8) ``` #### **Loads and Stores** | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | fdiv p1 / p2 ->p3 | 1 | 2 | 25 | | | st p4 -> [ p5 ] | 1 | 2 | 3 | | | st p3 -> [ p6 ] | 2 | | | | | ld [ p7 ] -> p8 | 2 | | | | #### Cycle 3: - Can Id [ p7 ] -> p8 execute? - Why or why not? #### Loads and Stores | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | fdiv p1 / p2 ->p3 | 1 | 2 | 25 | | | st p4 -> [ p5 ] | 1 | 2 | 3 | | | st p3 -> [ p6 ] | 2 | | | | | ld [ p7 ] -> p8 | 2 | | | | #### Aliasing (again) - p5 == p7? - p6 == p7? #### Loads and Stores | Instruction | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | fdiv p1 / p2 ->p3 | 1 | 2 | 25 | | | st p4 -> [ p5 ] | 1 | 2 | 3 | | | st p3 -> [ p6 ] | 2 | | | | | ld [ p7 ] -> p8 | 2 | | | | Suppose p5 == p7 and p6 != p7 Can ld execute now? ## Memory Forwarding - Stores write cache at commit - Commit is in-order, delayed by all instructions - Allows stores to be "undone" on branch mis-predictions, etc. - Loads read cache - Early execution of loads is critical - Forwarding - Allow store -> load communication before store commit - Conceptually like reg. bypassing, but different implementation - Why? Addresses unknown until execute #### Forwarding: Store Queue - Store Queue - Holds all in-flight stores - CAM: searchable by address - Age logic: determine youngest matching store older than load - Store execution - Write Store Queue - Address + Data - Load execution - Search SQ - Match? Forward - Read D\$ #### Load scheduling - Store->Load Forwarding: - Get value from executed (but not comitted) store to load - Load Scheduling: - Determine when load can execute with regard to older stores - Conservative load scheduling: - All older stores have executed - Some architectures: split store address / store data - Only require known address - Advantage: always safe - Disadvantage: performance (limits out-of-orderness) ``` Id [r1] -> r5 Id [r2] -> r6 add r5 + r6 -> r7 st r7 -> [r3] Id 4[r1] -> r5 Id 4[r2] -> r6 add r5 + r6 -> r7 st r7 -> 4[r3] // loop control here ``` With conservative load scheduling, what can go out of order? | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | | | | | ld [p2] -> p6 | 1 | | | | | add p5 + p6 -> p7 | | | | | | st p7 -> [p3] | | | | | | ld 4[p1] -> p8 | | | | | | ld 4[p2] -> p9 | | | | | | add p8 + p9 -> p4 | | | | | | st p4 -> 4[p3] | | | | | Suppose 2 wide, conservative scheduling. May issue 1 load per cycle. Loads take 3 cycles to complete. | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | | | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | | | | | | ld 4[p2] -> p9 | | | | | | add p8 + p9 -> p4 | | | | | | st p4 -> 4[p3] | | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | | | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | | | | | | st p4 -> 4[p3] | | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | | | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | Conservative load scheduling: can't issue ld4[p1]->p8 | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | 6 | 7 | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | | | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | | | st p7 -> [p3] | 2 | 7 | 8 | | | ld 4[p1] -> p8 | 3 | | | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | | | ld 4[p1] -> p8 | 3 | 8 | 11 | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 8 | 11 | | | ld 4[p2] -> p9 | 3 | 9 | 12 | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 8 | 11 | 12 | | ld 4[p2] -> p9 | 3 | 9 | 12 | | | add p8 + p9 -> p4 | 4 | 12 | 13 | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 8 | 11 | 12 | | ld 4[p2] -> p9 | 3 | 9 | 12 | 13 | | add p8 + p9 -> p4 | 4 | 12 | 13 | | | st p4 -> 4[p3] | 4 | 13 | 14 | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 8 | 11 | 12 | | ld 4[p2] -> p9 | 3 | 9 | 12 | 13 | | add p8 + p9 -> p4 | 4 | 12 | 13 | 14 | | st p4 -> 4[p3] | 4 | 13 | 14 | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 8 | 11 | 12 | | ld 4[p2] -> p9 | 3 | 9 | 12 | 13 | | add p8 + p9 -> p4 | 4 | 12 | 13 | 14 | | st p4 -> 4[p3] | 4 | 13 | 14 | 15 | Our 2-wide ooo processor may as well be 1-wide in-order! | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | 4 | 7 | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | - It would be nice if we could issue Id 4[p1]->p8 in c4. - Can we speculate and issue it then? ## **Load Speculation** - Speculation requires two things..... - Detection of mis-speculations - How can we do this? - Recovery from mis-speculations - Squash from offending load - Saw how to squash from branches: same method #### Load Queue - Detects load ordering violations - Load execution: Write address into LQ - Also note any store forwarded from - Store execution: Search LQ - Younger load with same addr? - Didn't forward from younger store? #### Store Queue + Load Queue - Store Queue: handles forwarding - Written by stores (@ execute) - Searched by loads (@ execute) - Read from to write data cache (@ commit) - Load Queue: detects ordering violations - Written by loads (@ execute) - Searched by stores (@ execute) - Both together - Allows aggressive load scheduling - Stores don't constrain load execution | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | 4 | 7 | | | ld 4[p2] -> p9 | 3 | | | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | - Aggressive load scheduling? - Issue ld 4[p1]->p8 in cycle 4 | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | | | ld [p2] -> p6 | 1 | 3 | 6 | | | add p5 + p6 -> p7 | 2 | | | | | st p7 -> [p3] | 2 | | | | | ld 4[p1] -> p8 | 3 | 4 | 7 | | | ld 4[p2] -> p9 | 3 | 5 | 8 | | | add p8 + p9 -> p4 | 4 | | | | | st p4 -> 4[p3] | 4 | | | | | | Disp | Issue | WB | Commit | |-------------------|------|-------|----|--------| | ld [p1] -> p5 | 1 | 2 | 5 | 6 | | ld [p2] -> p6 | 1 | 3 | 6 | 7 | | add p5 + p6 -> p7 | 2 | 6 | 7 | 8 | | st p7 -> [p3] | 2 | 7 | 8 | 9 | | ld 4[p1] -> p8 | 3 | 4 | 7 | 9 | | ld 4[p2] -> p9 | 3 | 5 | 8 | 10 | | add p8 + p9 -> p4 | 4 | 8 | 9 | 10 | | st p4 -> 4[p3] | 4 | 9 | 10 | 11 | Saves 4 cycles over conservative Actually uses ooo-ness #### Aggressive Load scheduling - Allows loads to issue before older stores - Increases out-of-orderness - + When no conflict, increases performance - Conflict => squash => worse performance than waiting - Some loads might forward from stores - Always aggressive will squash a lot - Can we have our cake AND eat it too? #### Predictive Load scheduling - Predict which loads must wait for stores - Fool me once, shame on you-- fool me twice? - Loads default to aggressive - Keep table of load PCs that have been caused squashes - Schedule these conservatively - + Simple predictor - Makes "bad" loads wait for all older stores is not so great - More complex predictors used in practice - Predict which stores loads should wait for #### Out of Order: Window Size - Scheduling scope = ooo window size - Larger = better - Constrained by physical registers (#preg) - ROB roughly limited by #preg = ROB size + #logical registers - Big register file = hard/slow - Constrained by issue queue - Limits number of un-executed instructions - CAM = can't make big (power + area) - Constrained by load + store queues - Limit number of loads/stores - CAMs - Active area of research: scaling window sizes - Usefulness of large window: limited by branch prediction - 95% branch mis-prediction rate: 1 in 20 branches, or 1 in 100 insn. #### Out of Order: Benefits - Allows speculative re-ordering - Loads / stores - Branch prediction - Schedule can change due to cache misses - Different schedule optimal from on cache hit - Done by hardware - Compiler may want different schedule for different hw configs - Hardware has only its own configuration to deal with ## Recap: Dynamic Scheduling - Dynamic scheduling - Totally in the hardware - Also called "out-of-order execution" (OoO) - Fetch many instructions into instruction window - Use branch prediction to speculate past (multiple) branches - Flush pipeline on branch misprediction - Rename to avoid false dependencies - Execute instructions as soon as possible - Register dependencies are known - Handling memory dependencies more tricky - "Commit" instructions in order - Anything strange happens before commit, just flush the pipeline - Current machines: 100+ instruction scheduling window ## Out of Order: Top 5 Things to Know - Register renaming - How to perform is and how to recover it - Commit - Precise state (ROB) - How/when registers are freed - Issue/Select - Wakeup: CAM - Choose N oldest ready instructions - Stores - Write at commit - Forward to loads via LQ - Loads - Conservative/aggressive/predictive scheduling - Violation detection #### Static vs Dynamic Scheduling - If we can do this in software... - ...why build complex (slow-clock, high-power) hardware? - + Performance portability - Don't want to recompile for new machines - + More information available - Memory addresses, branch directions, cache misses - + More registers available - Compiler may not have enough to schedule well - + Speculative memory operation re-ordering - Compiler must be conservative, hardware can speculate - But compiler has a larger scope - Compiler does as much as it can (not much) - Hardware does the rest #### This Unit: Code Scheduling - Pipelining and superscalar review - Code scheduling - To reduce pipeline stalls - To increase ILP (insn level parallelism) - Two approaches - Static scheduling by the compiler - Dynamic scheduling by the hardware - Up next: multiprocessing