| Instructor: | Prof. Milo Martin |
| --- | --- |
| Due: | Noon, Tuesday, December 7th (at the start of class) |
As you're also working on your projects (and many of the projects would benefit from being able to build on the simulation you write for this assignment), I've decided to allow you to do this assignment in groups (the same group as your project). Thus, you'll turn in one writeup in class (graphs and discussion question answers) and one set of files via blackboard (your code and debug trace). For the files turned in via blackboard, have just one group member submit them.
As this assignment is being done in groups, grace periods may not be used. Thus, no late assignments will be accepted.
Now that you've implemented register renaming (the first part of this two-part assignment), in this part of the assignment you will explore the impact of out-of-order execution (dynamic scheduling) on instruction-level parallelism. As a fully detailed simulation of such a core is beyond what could feasibly be done for an assignment, the model of the processor core described in this assignment is approximate in many ways (failing to capture many second-order effects), but it captures the most important first-order effects of instruction execution.
This assignment uses the same traces as all previous assignments (see Trace Format document).
As before, you'll write a program that reads in the trace and simulates different processor core configurations. You're welcome to write the program in whatever language you like; although we are going to ask you to turn in your source code (via blackboard), the end result of this assignment is the experimental results (the program source code is just a means to that end).
This assignment builds upon the code that you wrote in the previous assignment to generate a trace of renamed instructions. To simplify the simulation, we are going to assume a physical register file of 2048 registers (much larger than one would build in practice), allowing you to explore the performance of even very large instruction windows (1024 entries, say) without needing to worry about stalling due to lack of registers.
To simulate an (idealized) out-of-order processor core, we are going to model a reorder buffer (ROB) of finite size, but assume there are plenty of physical registers, instruction queue entries, load queue entries, store queue entries, and assume that an N-way superscalar machine can always fetch N instructions and execute any N instructions in a given cycle. All non-loads will have a single-cycle latency; loads have a three-cycle latency (two-cycle load-to-use penalty) and never miss in the cache. The pipeline is also simplified in that we are going to model a relatively short pipeline.
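To keep these model parameters in one place, here is a minimal Python sketch of how they might be encoded; the constant names are illustrative, and the ROB size and width are varied across the experiments and debug traces described below.

NUM_PHYSICAL_REGS = 2048   # large enough that renaming never stalls
LOAD_LATENCY = 3           # loads: three cycles (two-cycle load-to-use penalty)
OTHER_LATENCY = 1          # all non-loads: single cycle
N = 8                      # superscalar width: fetch/issue/commit up to N micro-ops per cycle
ROB_SIZE = 256             # varied from 8 to 1024 entries in the experiments below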
As before, we have generated test cases that allow you to check your simulator to make sure it is executing correctly (see below).
The key data structure for dynamic scheduling is the reorder buffer (ROB). It has a fixed number of entries, with one entry for each micro-op that has been fetched but not yet committed. The ROB is managed as a queue. At instruction fetch, micro-ops are enqueued at the tail of the ROB. At instruction commit, micro-ops are dequeued from the head of the ROB. Each entry of the ROB is an object (or struct) with all the information about a micro-op (a sketch of such an entry appears after the field lists below):
- SourceRegisters. The input (source) physical registers (up to three of them)
- RegisterReady. Is the corresponding input register ready? (up to three registers)
- RegisterDestinations. The output (destination) physical registers (up to two of them)
- isLoad. Is the micro-op a load operation?
- isStore. Is the micro-op a store operation?
- Address. If the memory operation is a load or a store, this field specifies the memory location it accesses (note: this is not the PC of the instruction, but the address loaded or stored by the micro-op).
- Age. The "age" of the instruction. For simplicity, this can be encoded as an integer value as to how many micro-ops have been fetched thus far in the simulation. In reality, this would be encoded in fewer bits, using wrap-around arithmetic when comparing the ages of two micro-ops.
- Issued. Has this instruction been issued?
As you fetch, issue, finish execution, and commit each micro-op, you'll want to record the cycle in which each of these events occurred. These fields wouldn't always exist in a real design, but we'll use them for various bookkeeping and for generating the debugging trace.
- FetchCycle. Cycle the micro-op was fetched.
- IssueCycle. Cycle the micro-op was issued.
- DoneCycle. Cycle the micro-op completed execution.
- CommitCycle. Cycle the micro-op committed.
You'll also want to include some information for the debugging traces:
- MacroOp name. The text string of the macro-op name.
- MicroOp name. The text string of the micro-op name.
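Putting the fields above together, a ROB entry might look like the following minimal Python sketch (the class name and default values are illustrative choices, not requirements):

class ROBEntry:
    """One reorder-buffer entry, using the fields described above."""
    def __init__(self):
        self.SourceRegisters = []        # input physical registers (up to three)
        self.RegisterReady = []          # is each corresponding input register ready?
        self.RegisterDestinations = []   # output physical registers (up to two)
        self.isLoad = False              # is this micro-op a load?
        self.isStore = False             # is this micro-op a store?
        self.Address = None              # address loaded/stored (memory ops only)
        self.Age = 0                     # number of micro-ops fetched before this one
        self.Issued = False              # has this micro-op been issued?
        self.FetchCycle = -1             # bookkeeping: cycle fetched
        self.IssueCycle = -1             # bookkeeping: cycle issued
        self.DoneCycle = -1              # bookkeeping: cycle execution completed
        self.CommitCycle = -1            # bookkeeping: cycle committed
        self.MacroOpName = ""            # for the debugging trace
        self.MicroOpName = ""            # for the debugging trace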
To prevent an instruction from flowing through the pipeline in a single cycle, we perform the pipeline stages in reverse order.
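For example, a per-cycle loop might look like the following sketch, assuming helper functions commit(), issue(), fetch(), and advance_cycle() that implement the stages described below (those names and the trace_exhausted flag are illustrative, not required):

CurrentCycle = 0
while not (trace_exhausted and ROB.isEmpty()):
    commit()         # dequeue completed micro-ops from the head of the ROB
    issue()          # select up to N of the oldest ready micro-ops to begin execution
    fetch()          # rename and enqueue up to N micro-ops at the tail of the ROB
    advance_cycle()  # CurrentCycle += 1; decrement positive scoreboard entries

Because commit and issue are processed before fetch within a cycle, a micro-op fetched in a given cycle cannot issue (or commit) until a later cycle.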
During commit, we look at the micro-op at the head of the ROB and commit (dequeue) it if it has completed executing. We can commit at most N instructions per cycle:
for (i = 0; i < N; i++):
    if ROB.isEmpty() or (ROB.head().DoneCycle > CurrentCycle):
        break  // in-order commit: stop at the first unfinished micro-op (or an empty ROB)
    ROB.head().CommitCycle = CurrentCycle
    ROB.dequeue()  // commit the micro-op (print its debug-trace line just before this)
When generating the output trace, the micro-op should be printed to the debug output right before it commits.
During issue, we select up to N of the oldest ready instructions to begin execution:
count = 0
foreach microop in ROB (iterating from the oldest to the youngest):
    if (microop.issued == false) and microop.isReady():
        microop.issued = true
        microop.IssueCycle = CurrentCycle
        microop.DoneCycle = CurrentCycle + latency
        foreach destination physical register in microop:
            set Scoreboard[physical register] = latency
        count = count + 1
        if (count == N):
            break  // stop looking for instructions to execute
A micro-op is considered ready when all of its source (input) registers are ready (for now, ignore memory dependencies). To track which registers are ready or not, a real processor has a ready table (one bit per physical register) and also ready bits in each entry in the instruction queue, which are updated during instruction wakeup. Although we could model this sort of wakeup operation in the simulator, it is easier to use a register scoreboard that records how many cycles until a register is ready (just as we did for the pipeline simulator, except acting on physical registers rather than architectural registers).
Thus, the second key data structure is the "scoreboard" table. This table has one entry per physical register. Each entry is an integer that indicates how many cycles until the register will be ready. A value of zero indicates the register is ready. A negative value indicates that the instruction that writes that physical register has not yet issued. A positive value indicates the number of cycles until the instruction that writes that register will generate its output value. At the start of the simulation, initialize all physical registers as ready (value of zero).
To determine if a register is ready, your code should just check whether the corresponding entry in the scoreboard is equal to zero. If all of the source physical registers are ready, then the instruction is ready to issue:
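As a summary of the scoreboard's lifecycle, here is a minimal Python sketch (the constant name is illustrative; the per-stage updates in the comments are the ones described in the fetch, issue, and end-of-cycle code below):

NUM_PHYSICAL_REGS = 2048
Scoreboard = [0] * NUM_PHYSICAL_REGS   # 0 = ready (initial state for all registers)

# At fetch/rename:   Scoreboard[dest] = -1        (producer has not yet issued)
# At issue:          Scoreboard[dest] = latency   (cycles until the value is ready)
# At end of cycle:   every positive entry is decremented by one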
bool isReady(MicroOp op):
    foreach physical_register in op's source physical registers:
        if physical_register is not ready:
            return false
    return true
At fetch, instructions are renamed and inserted into the ROB:
for (i = 0; i < N; i++):
    if ROB.isFull():
        break  // ROB is full
    1. Read the next micro-op from the trace
    2. Rename the micro-op (as per the previous assignment)
    3. Enqueue the micro-op into the ROB, filling in the fields as appropriate
    4. foreach valid destination physical register:
           set the register as "not ready" by setting the corresponding scoreboard entry to -1
At the end of each cycle, we need to advance to the next cycle:
CurrentCycle = CurrentCycle + 1
foreach preg in Scoreboard:
    if (Scoreboard[preg] > 0):
        Scoreboard[preg] = Scoreboard[preg] - 1  // decrement
The following experiments incrementally refine the simulation to be less idealized and explore the impact of those changes on performance.
The first experiment explores the amount of instruction-level parallelism for various ROB sizes. This result is going to be overly optimistic, as this experiment totally ignores memory dependencies and assumes perfect branch prediction. As noted above, loads are three-cycle instructions and all other instructions are single-cycle instructions. For these experiments, set N to 8 (the processor can fetch, rename, issue, execute, and commit up to eight micro-ops per cycle).
Vary the size of the ROB from 8 micro-ops to 1024 micro-ops in powers of 2 (16, 32, 64, 128, 256, 512, 1024). Run the simulations for each of these ROB sizes and record the micro-ops (instructions) per cycle (IPC). Plot the data as a line graph with the IPC on the y-axis and the log of the ROB size on the x-axis. In essence, the x-axis is a log scale. Otherwise, all the interesting data points are crunched into the far left of the graph and basically unreadable. This is basically the same thing we did with the branch predictor sizes and cache sizes, so this isn't anything new.
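If you use Python for your analysis, a minimal matplotlib sketch for this kind of plot might look like the following (the function and label names are illustrative; fill in the IPC values that your own simulator reports):

import matplotlib.pyplot as plt

ROB_SIZES = [8, 16, 32, 64, 128, 256, 512, 1024]

def plot_ipc_line(ipc_values, label):
    """Add one IPC-vs-ROB-size line to the current figure."""
    plt.plot(ROB_SIZES, ipc_values, marker='o', label=label)

def finish_graph(filename):
    plt.xscale('log', base=2)   # log scale keeps the small ROB sizes readable
    plt.xlabel('ROB size (micro-ops)')
    plt.ylabel('uIPC (micro-ops per cycle)')
    plt.legend()
    plt.savefig(filename)

# Example usage (ipc_exp1 would hold the eight IPC values from your simulations):
#   plot_ipc_line(ipc_exp1, 'no memory dependencies, perfect branch prediction')
#   finish_graph('graph_a.png')

Using base 2 for the log scale places the powers-of-two ROB sizes at even spacing along the x-axis.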
In the above experiment, the simulation ignored memory dependencies completely. For this experiment, implement conservative memory scheduling. That is, in addition to the requirement that all source registers must be ready, a load can issue only when all prior stores have completed (are "Done", as defined below):
bool isReady(MicroOp op):
    foreach physical_register in op's source physical registers:
        if physical_register is not ready:
            return false
    if op.isLoad:
        foreach op2 in ROB:
            if (op2.isStore) and (op2.Age < op.Age) and (op2.isNotDone()):
                return false
    return true
An instruction is "Done" if it has issued and completed execution (that is, op.issued is true and op.DoneCycle <= CurrentCycle); "NotDone" is just the negation.
Based on the above conservative memory scheduling, re-run the simulation for the same ROB sizes as the earlier experiment. Plot the data (IPC vs ROB size) on the same graph.
Neither of the above two experiments is representative of state-of-the-art memory scheduling. In this experiment, we'll use the fact that our trace already has memory addresses in it to implement perfect memory scheduling. Adjust the "isReady()" computation to include an address match:
bool isReady(MicroOp op):
    foreach physical_register in op's source physical registers:
        if physical_register is not ready:
            return false
    if op.isLoad:
        foreach op2 in ROB:
            if (op2.isStore) and (op2.Age < op.Age) and (op2.isNotDone()):
                if (op.Address == op2.Address):
                    return false
    return true
The only addition to the above algorithm is the check that delays a load only if the earlier (older) store is to the same address.
Note
We are actually still being a bit optimistic, because an 8-byte load might depend on a mis-aligned 8-byte store. As the addresses of these two memory operations wouldn't match, our simulations wouldn't capture that dependency. In addition, an 8-byte load might get its value from two 4-byte stores. This "partial forwarding" or "multiple forwarding" case is actually really nasty to get right (in simulation and even worse in real hardware), but it can actually happen in some codes, so real processors must handle it. We are ignoring it here.
As with the previous two experiments, run the code with the range of ROB sizes and plot the data on the same graph.
For the next experiment, we are going to consider realistic branch prediction. Simulate a gshare predictor with 2^15 (32,768) two-bit counters (8KB of state) and 15 bits of branch history. Perform the branch prediction during fetch/rename. In a real machine, we would only know that the branch was mis-predicted once it executes (just like with memory addresses, above). However, in our trace-based simulations, we know immediately whether the prediction was correct. If the branch prediction was correct, continue as before. If the branch prediction is incorrect, fetch is suspended (no more instructions are fetched) and the branch is marked in the ROB as "mispredicted" (you'll need to add another field to each entry in the ROB). At issue, check to see if the micro-op being "selected" is the mis-predicted branch micro-op. If so, set a "restart fetch counter" to the mis-prediction restart penalty. For these simulations, assume a reasonably small restart penalty of four cycles (this represents the time it takes to redirect fetch, access the instruction cache, and start renaming instructions again).
Thus, adjust fetch based on the following code:
for (i = 0; i < N; i++):
    if (FetchReady != 0):
        FetchReady = FetchReady - 1
        break
    if ROB.isFull():
        break  // ROB is full
    else:
        Fetch/rename a micro-op as described above
        ...
        if (fetched micro-op is a conditional branch):
            Make branch prediction
            if (fetched micro-op is a mis-predicted branch):
                FetchReady = -1  // Suspend fetch
                microop.isMisPredictedBranch = true
During micro-op select, if a micro-op is selected to execute, also perform the following actions:
...
if (microop.isMisPredictedBranch):
    assert(FetchReady < 0)
    FetchReady = latency + 4  // The mis-prediction restart penalty
...
Together, a mis-predicted branch will block fetch until 4 cycles after it completes execution.
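The fetch and select changes above assume you already have a working gshare predictor from the earlier branch-prediction assignment. In case a refresher helps, here is a minimal Python sketch of one possible gshare implementation with 2^15 two-bit counters and 15 bits of global history; the counter initialization and exact indexing details are assumptions, so match whatever you did previously:

HISTORY_BITS = 15
TABLE_SIZE = 1 << HISTORY_BITS        # 2^15 two-bit counters = 8KB of state
counters = [0] * TABLE_SIZE           # two-bit counters (0-3); initial value is an assumption
history = 0                           # 15-bit global branch history register

def predict_and_train(pc, actual_taken):
    """Return True if the prediction was correct, then update the predictor."""
    global history
    index = (pc ^ history) & (TABLE_SIZE - 1)    # gshare: PC XORed with the history
    predicted_taken = (counters[index] >= 2)     # counter values 2 and 3 predict taken
    # Train the two-bit saturating counter toward the actual outcome
    if actual_taken:
        counters[index] = min(counters[index] + 1, 3)
    else:
        counters[index] = max(counters[index] - 1, 0)
    # Shift the actual outcome into the history register
    history = ((history << 1) | (1 if actual_taken else 0)) & (TABLE_SIZE - 1)
    return predicted_taken == actual_taken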
As with the previous experiments, run the simulation with the branch predictor for the range of ROB sizes. Plot the data on the same graph as before, but labeled as "realistic branch prediction".
Note
Let me reemphasize that in a real machine, the hardware wouldn't know the branch was mis-predicted, so it would go on to fetch instructions down the "wrong path", inject them into the instruction queue and execute them speculatively. As we are doing this based on a trace, we know immediately that we mis-predicted the branch, thus we can approximate the performance of speculative execution without actually implementing recovery and such. In fact, fetching down wrong paths is actually really difficult to model using traces, as we only have the "correct" trace of instructions. This is a limitation of simple trace-based simulations. In contrast, a simulation that actually simulates the execution of the program can actually model the performance (and energy) impact of executing down the wrong path.
Using the data you collect for the above two experiments, calculate the "cost of each branch mis-prediction" for each of the ROB sizes. To do this, for each ROB size, subtract the total cycles of the perfect branch prediction simulation from the total cycles of the realistic branch prediction simulation. Then divide by the total number of branch mis-predictions (which will be the same for all ROB sizes). This gives the average number of cycles lost to each branch mis-prediction.
On a new graph, plot this data as a line graph with the ROB size on the x-axis. Properly label each graph.
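In other words, for each ROB size (variable names in this small sketch are illustrative):

def mispredict_cost(cycles_realistic_bp, cycles_perfect_bp, num_mispredictions):
    """Average cycles lost per branch mis-prediction for one ROB size."""
    return (cycles_realistic_bp - cycles_perfect_bp) / num_mispredictions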
For experiment #1 (unrealistic memory scheduling and perfect branch prediction):
Approximately how large must the ROB be to achieve an average uIPC (micro-ops per cycle) of four?
Approximately how large must the ROB be to achieve 95% of the performance of the same configuration with a 1024-entry ROB?
For experiment #2 (conservative memory scheduling and perfect branch prediction):
Approximately how large must the ROB be to achieve an average uIPC (micro-ops per cycle) of four?
Approximately how large must the ROB be to achieve 95% of the performance of the same configuration with a 1024-entry ROB?
For all the ROB sizes you simulated, what is the largest difference between unrealistic and conservative memory scheduling? Give your answer in the form: unrealistic memory scheduling is x% faster than conservative memory scheduling.
For experiment #3 (perfect memory scheduling and perfect branch prediction):
Approximately how large must the ROB be to achieve an average uIPC (micro-ops per cycle) of four?
Approximately how large must the ROB be to achieve 95% of the performance of the same configuration with a 1024-entry ROB?
For all the ROB sizes you simulated, what is the largest difference between unrealistic and perfect memory scheduling? Give your answer in the form: unrealistic memory scheduling is x% faster than perfect memory scheduling.
For experiment #4 (realistic branch prediction):
Approximately how large must the ROB be to achieve an average uIPC (micro-ops per cycle) of four? (Yes, this is a trick question.)
Approximately how large must the ROB be to achieve 95% of the performance of the same configuration with a 1024-entry ROB?
For all the ROB sizes you simulated, what is the largest difference between perfect and realistic branch prediction (assuming perfect memory scheduling)? Give your answer in the form: perfect branch prediction is x% faster than realistic branch prediction.
Comparison to contemporary processor:
The Intel Core i7 has a 128-entry ROB and can sustain four-wide execution. How does this compare to what your experiments would suggest is a good design point? (Is it far away or close to what your results suggest?)
One reason that Intel's design might be a bit different than what our results indicate could be due to our model simplifications of various sorts. Give at least two other key "big picture" reasons that Intel's actual design might be different from what our results suggest.
For experiment 5 (average cost per branch mis-prediction):
Why is the average cost per branch mis-prediction so much higher than the five-cycle delay to redirect fetch?
Why is the average cost per branch mis-prediction higher for larger ROBs?
We've made several output debug traces available to help you debug and test your code. The debug output trace looks like:
1: 0 1 2 2, r13 -> p50 [p13] | SET ADD
2: 0 1 2 2, r49 -> p49, r13 -> p51 [p50] | SET ADD_IMM
3: 0 1 4 4, r5 -> p5, r45 -> p52 [p45] | CMP LOAD
4: 0 4 5 5, r45 -> p52, r3 -> p3, r44 -> p53 [p44], r49 -> p54 [p49] | CMP SUB
5: 1 2 5 5, r5 -> p5, r3 -> p55 [p3] | MOV LOAD
6: 1 2 3 5, r0 -> p56 [p0] | SET ADD
7: 1 5 6 6, r49 -> p54, r0 -> p57 [p56] | SET ADD_IMM
8: 1 2 3 6, r12 -> p58 [p12] | XOR ADD
9: 2 6 7 7, r13 -> p51, r0 -> p57, r13 -> p59 [p51], r49 -> p60 [p54] | OR OR
10: 2 3 4 7 | JMP JMP_IMM
11: 4 5 8 8, r3 -> p55, r0 -> p61 [p57] | MOV LOAD
12: 5 8 9 9, r0 -> p61, r0 -> p61, r44 -> p62 [p53], r49 -> p63 [p60] | TEST AND
13: 5 9 10 10, r49 -> p63 | J JMP_IMM
14: 5 8 11 11, r0 -> p61, r7 -> p64 [p7] | MOV LOAD
The first column is the number of the micro-op. The next four columns are the cycles in which the micro-op was: fetched, issued, completed execution, and committed. The remaining columns are the same as the output of the rename trace from the previous assignment.
We've provided three test debug outputs for each of the four experiments. To make it easier to debug, we've provided: (1) debug traces with just a few ROB entries and single-issue, (2) a slightly larger ROB and four-way issue, and (3) a large ROB with full 8-wide execution. The number of physical registers also varies (but is always sufficient to not stall the pipeline) to test cases in which registers get reused. The names of the traces encode the exact settings of these parameters.
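For example, printing one debug-trace line right before commit might look like the following sketch, assuming a helper rename_string(op) that produces the rename columns (including the leading comma, or an empty string when the micro-op has no registers) in the same format as the previous assignment; that helper is not shown here:

def print_debug_line(op, micro_op_number):
    print("%d: %d %d %d %d%s | %s %s" % (
        micro_op_number,                 # running count of micro-ops in the trace
        op.FetchCycle, op.IssueCycle, op.DoneCycle, op.CommitCycle,
        rename_string(op),               # e.g. ", r13 -> p50 [p13]"
        op.MacroOpName, op.MicroOpName))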
Output Debug Trace. Via blackboard, turn in an output trace file for the GCC-1K trace file for an 8-way superscalar with a 256-entry ROB, 2048 physical registers, perfect memory dependence prediction, and the gshare branch predictor (basically, the configuration for the final experiment). I'm asking you to turn this in to ensure that you actually get the simulation working correctly. Except maybe for some whitespace differences, it should match up exactly with our reference trace.
Short answers. Turn in your answers to the above questions. I strongly urge you to type up your answers.
Graphs. Turn in printouts of the following graphs, one per sheet of paper (to make them big enough to read). Label the axes and give each line a descriptive name in a legend.
- Graph A [Experiment #1 through #4]: The graph has the IPC versus ROB sizes for the four experiments (one line per experiment). As stated above, the x-axis is a log scale to avoid most of the data points being squished into the left of the graph.
- Graph B [Experiment #4]: This graph has the average branch prediction cost (y-axis) for each of the ROB sizes (log-scale x-axis).
Source Code. Turn in your source code via blackboard.