| Instructor: | Prof. Milo Martin |
| --- | --- |
| Instructions: | This lab should be done in groups. |
In this lab, you will design and build a superscalar five-stage pipelined processor for the LC4 ISA. As described below, the pipeline is fully bypassed, includes a simple branch predictor, and can execute up to two instructions per cycle. This lab is broken down into three distinct parts.
Be sure to read this entire document before starting, especially the "Hints" section at the very end.
Designing a pipelined processor is significantly more complex than designing a single-cycle version of the same processor. So, the first step is creating a design document for the pipelined processor (which is described in Part B). Although Part C asks you to further extend your design, this initial design document needs to cover only the initial pipelined processor in Part B.
Before you begin writing any Verilog code, I'm asking you to first figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. Focus on the trickiest parts of the design and use this exercise to help identify and address implementation issues early.
You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mis-predictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage. Don't spend a lot of time polishing this document, but you should touch on at least all of the following issues:
Important
The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm requiring the document because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.
The next step is to implement a scalar five-stage pipelined processor. To improve the CPI, the pipeline is fully bypassed (including the NZP condition codes). In the next part, you'll be asked to extend your implementation to use a simple branch predictor and execute two instructions per cycle.
This processor will use the "standard" five-stage pipeline:
All instructions travel through all five stages.
In the standard five-stage fully bypassed pipeline, the only stall is a single-cycle load-to-use penalty. There are a few tricky cases to consider. First, you should only stall an instruction after the load if the subsequent instruction is actually dependent on the load (either by register value or NZP condition code):
```
LDR R1, R2, 5    // load R1 <- [R2+5]
ADD R3, R1, R4   // stall: add R3 <- R1+R4
```
Second, a load followed by a dependent store may or may not stall, depending on if the output of the load feeds the address generation (stall) or the data input of a store (no stall):
```
LDR R1, R2, 5    // load R1 <- [R2+5]
STR R1, R3, 6    // no stall: store R1 -> [R3+6]

LDR R1, R2, 5    // load R1 <- [R2+5]
STR R7, R1, 6    // stall: store R7 -> [R1+6]
```
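The stall conditions described above could be sketched as combinational logic along the following lines. All signal names here are illustrative assumptions, not names from the provided lab files:

```verilog
// Hypothetical load-to-use stall detection (signal names are illustrative).
// X_is_load:    instruction in execute is an LDR
// X_rd:         destination register of the instruction in execute
// D_rs, D_rt:   source registers of the instruction in decode
// D_uses_rs, D_uses_rt: whether decode's instruction actually reads each source
// D_is_store:   instruction in decode is an STR (its rt is store *data*)
// D_is_branch, X_writes_nzp: for the NZP dependence case
wire load_to_use_stall =
    X_is_load &&
    ((D_uses_rs && (D_rs == X_rd)) ||
     (D_uses_rt && (D_rt == X_rd) && !D_is_store) ||  // store data can be bypassed later
     (D_is_branch && X_writes_nzp));                  // BR depends on NZP set by the load
```

Note how the `!D_is_store` term implements the second tricky case: a store whose *data* input depends on the load need not stall, because the value can be bypassed into the memory stage.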
Third, you'll need to bypass the NZP condition codes (which are read only by BR instructions). The tricky case is that not all instructions write the condition codes, so you'll need to use the NZP bits based on the last instruction that writes the condition codes prior to the BR instruction.
Note: technically a "BRnzp" (or just "BR") instruction wouldn't need to stall for condition codes, as it is an unconditional branch. However, LC4 also has a PC-relative JMP that is used for unconditional branching by convention, so all "BR" instructions can be treated uniformly.
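Because the producing instruction may be one or more stages ahead of the BR by the time the branch resolves, the NZP bypass amounts to picking the youngest in-flight producer. A minimal sketch, with illustrative signal names:

```verilog
// Sketch: selecting the NZP bits for a BR resolving in execute
// (signal names are assumptions, not from the lab files).
// M_* / W_*: instructions currently in the memory / writeback stages.
wire [2:0] nzp_for_branch =
    M_writes_nzp ? M_nzp :    // youngest producer, one stage ahead
    W_writes_nzp ? W_nzp :    // else the producer two stages ahead
                   arch_nzp;  // else the architectural NZP register
```

The priority order matters: the memory-stage instruction is younger than the writeback-stage instruction, so its NZP bits must win when both write the condition codes.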
The register file you used in the single-cycle design is different from the register file used in the pipeline described in the CIS371 lecture. Unlike what we've been assuming in class, a write to the register file will not affect the register being read in the same cycle, so you will need to explicitly add another stage of "bypassing" around the register file.
Thus, when a register is read and written in the same cycle, the register file from your single-cycle design returns the old value. For this pipelined design, you want it to return the newly written value. More specifically, if the register file's write enable is asserted and the register being read matches the register being written, the read port should return the new value being written (not the old value).
One simple way to do this is to make a new module that contains the existing register file and the additional logic for this "local" bypassing. You'll also need to consider how this impacts the bypassing of the NZP condition codes.
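Such a wrapper module might look like the following sketch. The port names of the inner `lc4_regfile` are assumptions; match them to whatever your single-cycle register file actually uses:

```verilog
// Sketch of a wrapper adding write-to-read ("internal") bypass around
// the existing register file. Port names are illustrative.
module lc4_regfile_bypassed (
    input  wire        clk, gwe, rst,
    input  wire [2:0]  rssel, rtsel,    // read selects
    output wire [15:0] rsdata, rtdata,  // read data (with bypass)
    input  wire [2:0]  rdsel,           // write select
    input  wire [15:0] rddata,          // write data
    input  wire        rdwe             // write enable
);
    wire [15:0] rs_raw, rt_raw;

    // the existing (non-bypassed) register file from the single-cycle lab
    lc4_regfile regfile (.clk(clk), .gwe(gwe), .rst(rst),
                         .rssel(rssel), .rsdata(rs_raw),
                         .rtsel(rtsel), .rtdata(rt_raw),
                         .rdsel(rdsel), .rddata(rddata), .we(rdwe));

    // if the register being read is also being written this cycle,
    // return the new value instead of the stale one
    assign rsdata = (rdwe && (rssel == rdsel)) ? rddata : rs_raw;
    assign rtdata = (rdwe && (rtsel == rdsel)) ? rddata : rt_raw;
endmodule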
For branches, always predict "not taken" and thus the next PC is always predicted as "PC+1". This approach does not require an explicit branch predictor table, but it does require implementing the logic to handle branch mispredictions by squashing and restarting the pipeline. All control transfers (branches, etc.) are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty.
Use the file lc4_pipeline.v as the start of your pipelined design. Of course, you will likely want to retain the modules (such as the register file, ALU, and branch unit) as well as create additional modules (say, a branch predictor module). As before, labfiles.tar.gz contains this file, Verilog code for the memory and all of the devices, user constraints, test fixtures, and various trace and hex files. The CLK, RST, GWE and instruction and data memory interfaces are the same as before.
In Part C, your goal is to improve the CPI by adding two more advanced features: a (simple) branch predictor and two-way superscalar execution.
Branches are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty. The branch predictor is an eight-entry direct-mapped branch target buffer (BTB). Each of the eight entries in the BTB has both a 16-bit "tag" and a 16-bit "next PC" field.
Prediction. On an instruction fetch, index into the BTB using the lowest three bits of the PC. If the tag matches (the tag for that entry is equal to the PC) then the predicted next PC is whatever was in the "next PC" field. If the tag doesn't match, the predicted PC is just PC+1. Recall, the BTB is accessed during the fetch stage, and it is accessed for every instruction fetched.
Mis-prediction, recovery, and training. Control instructions resolve during the execute phase. If the predicted PC is wrong (doesn't match the actual PC), three things occur:
- The two instructions following the mis-predicted branch are squashed (nullified)
- Fetch is re-started the next cycle at the correct PC
- The branch predictor is trained. Using the lowest three bits of the mis-predicted PC to index into the BTB, the tag is set to the PC and the next PC field is set to the calculated PC.
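The prediction and training behavior described above could be captured in a small module like the following sketch. All names are assumptions; per the note below, you may instead reuse your register file RAM for the tag and target arrays:

```verilog
// Illustrative eight-entry direct-mapped BTB (module and signal names
// are assumptions, not from the lab files). Reset of the arrays is
// omitted for brevity; a stale tag simply causes a PC+1 prediction
// or a recoverable mis-prediction.
module lc4_btb (
    input  wire        clk, gwe, rst,
    input  wire [15:0] F_pc,         // PC being fetched
    output wire [15:0] F_pred_pc,    // predicted next PC
    // training interface, driven from execute on a mis-prediction
    input  wire        X_train,
    input  wire [15:0] X_pc,         // PC of the mis-predicted instruction
    input  wire [15:0] X_actual_pc   // computed correct next PC
);
    reg [15:0] tag    [7:0];
    reg [15:0] target [7:0];

    // prediction: index with the low three PC bits, check the full tag
    wire [2:0] f_idx = F_pc[2:0];
    wire       hit   = (tag[f_idx] == F_pc);
    assign F_pred_pc = hit ? target[f_idx] : (F_pc + 16'd1);

    // training: overwrite the entry indexed by the mis-predicted PC
    wire [2:0] x_idx = X_pc[2:0];
    always @(posedge clk) begin
        if (gwe && X_train) begin
            tag[x_idx]    <= X_pc;
            target[x_idx] <= X_actual_pc;
        end
    end
endmodule
```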
Note
Once you've completed Part B, you already have a working pipeline that always predicts PC+1. Once that is working correctly, getting the predictor integrated into the pipeline isn't so difficult.
Note
Notice that the register file module you already have is an eight-entry RAM with 16-bit values... so there is an opportunity for code re-use.
You'll also design and implement a superscalar extension, as described below.
Verifying that your designs are working correctly requires both "functionality" testing (does it calculate the right answer?) and "performance" testing (does your pipeline stall too much?).
The same test harness and testing traces we gave you in the previous lab will also work for testing the scalar pipeline, but with two twists.
First, the pipeline spreads the execution of an instruction over many stages, but the test fixture expects all the test_ signals to correspond to a single instruction. To handle this issue, you'll need to carry the relevant signals all the way into the "writeback" pipeline stage and assign them to the proper test_ signals.
Second, the pipelined processor doesn't complete an instruction each cycle (due to stalls and pipeline flushes). However, the test fixture we've given you already includes a test_stall signal. If this signal is zero (indicating no stall), the test fixture will check the various test_ signals versus the next line in the .trace file. When the test_stall signal is non-zero (indicating a stall), it ignores the cycle.
Part of having a "correct" pipeline is not stalling too much (and thus capturing all the performance benefits of the pipeline). To measure and understand your pipeline's performance, the test_stall signal is actually a two-bit value that indicates the source of the stall as follows:
The test fixture can then use these values to measure the source of stalls.
Note
If a stall occurred due to a load-use penalty, but the instruction ended up being squashed due to a branch mis-prediction, that stall cycle should be classified as a branch mis-prediction cycle.
The superscalar version of the pipeline executes up to two instructions per cycle using the same five pipeline stages. It fetches, decodes, and executes up to two instructions per cycle, but with the limitation that only one load or store can execute per cycle (as explained below). A summary of some of the considerations for each of the stages:
Fetch (F): To allow fetching of two instructions per cycle, we've modified the memory interface files to return two consecutive 16-bit instructions from memory. This means the fetch stage can fetch up to two instructions per cycle and pass them along to the decode stage. When the first instruction is predicted as a taken branch, the second instruction is not passed along to the decode stage.
Decode (D): The decode stage checks for dependencies between the two instructions and for load-to-use stalls. As described below, because only one load or store can occur per cycle, the stall logic must also stall the second of the two instructions if both are memory operations. Based on the results of the stall logic, the pipeline will pass along two, one, or zero instructions to the execute stage. The register file must also be extended to support up to four reads per cycle (two each for two instructions).
Execute (X): The execute stage will need an additional ALU and extra logic to support full superscalar bypassing.
Memory (M): The memory system supports only a single load or store per cycle, so your superscalar pipeline can execute only one memory operation per cycle; that is, if two memory instructions are fetched in the same cycle, the decode stage must stall the second of them. However, if at least one of the two instructions is a non-memory instruction and the two instructions are independent, both should execute.
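This structural constraint folds into the decode-stage stall logic as one more term. A fragment, with illustrative names (`B_depends_on_A` stands for whatever register/NZP dependence check you implement):

```verilog
// Sketch: decode-stage stall of the "B" (younger) slot
// (signal names are illustrative, not from the lab files).
wire both_mem = D_A_is_mem && D_B_is_mem;      // two loads/stores fetched together
wire stall_B  = both_mem || B_depends_on_A;    // hold B back one cycle; A proceeds
```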
Writeback (W): The register file must also be extended to support up to two writes per cycle (one each for two instructions). To preserve regular ISA semantics, if two instructions write the same register in the same cycle, the write of the second (or "younger") of the two instructions takes priority.
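The "younger write wins" rule can be enforced by suppressing the older instruction's write on a conflict. A minimal sketch with assumed signal names:

```verilog
// Sketch: resolving two same-cycle writes to one register (names assumed).
// W_A_* is the older instruction, W_B_* the younger; B wins on conflict.
wire conflict = W_A_we && W_B_we && (W_A_rd == W_B_rd);
wire A_we_eff = W_A_we && !conflict;  // suppress the older write
// W_B_we passes through unchanged; the register file's two write ports
// are then driven by A_we_eff and W_B_we.
```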
Branch prediction: One of the subtle issues is how to use branch prediction and how to determine the next address from which to fetch instructions the next cycle. The fetch stage should access the BTB for both instructions. If the first instruction is predicted taken, the second instruction should not be passed along to the decode stage.
Determining which address to fetch from the next cycle requires both the outcomes of the branch predictions and how many instructions the decode stage determined could execute. If no instructions executed (say, due to a load-to-use dependency), then the PC should not advance. If the decode stage determines that the first instruction can advance but the second one cannot, the "second" instruction should shift up to become the "first" instruction for the next cycle. Accordingly, the newly fetched instruction should be injected into the decode stage as the "second" instruction, and the second fetched instruction is dropped, likely to be re-fetched the following cycle. In the end, there are five different cases to consider for the next program counter; it could be: the same PC, PC+1, PC+2, or the predicted target of either the first or second instruction fetched.
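One possible priority ordering over those five cases is sketched below; all signal names are assumptions, and your own stall/prediction signals will differ:

```verilog
// Sketch: five-way next-PC selection (signal names are illustrative).
// stall_both / stall_B come from the decode-stage stall logic;
// predA_taken / predB_taken come from the BTB lookups for the two slots.
reg [15:0] next_pc;
always @(*) begin
    if (stall_both)        next_pc = pc;             // nothing advanced: same PC
    else if (predA_taken)  next_pc = pred_target_A;  // slot B is dropped
    else if (stall_B)      next_pc = pc + 16'd1;     // only A advanced
    else if (predB_taken)  next_pc = pred_target_B;
    else                   next_pc = pc + 16'd2;     // both advanced
end
```

Note the ordering encodes the document's rules: a predicted-taken first instruction drops the second slot, and a stalled second instruction means only one slot's worth of fetch progress.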
Note
I suggest you design your pipeline such that the instruction in the "first" (or "top" or "A") part of the pipeline is always the older instruction and that the instruction in the "second" (or "bottom" or "B") part of the pipeline is always the younger instruction. This approach makes it clear which instruction should have priority when calculating stalls, bypassing, and register writeback.
Note
I strongly suggest you adopt a naming convention to make it clear which wires belong to which instructions. For example, "X_A_PC" could be the PC of the "A" (first) instruction in the execute stage. "X_B_PC" would be the PC of the "B" (second) instruction in the execute stage.
Note
An instruction that reads the NZP bits is considered dependent on a previous instruction that writes the NZP bits, so such a pair of instructions cannot execute in the same cycle.
Testing. Testing a superscalar design requires checking two instructions per cycle. To allow you to do this, we've extended all the test_... signals from the original tester to include a test_A_... and a test_B_... for each signal sent to the testbench. The testbench will then check these signals against two, one, or zero of the instructions from the trace, depending on the values of test_A_stall and test_B_stall.
Stall cycle accounting. As with the regular pipeline, the superscalar testbench counts the number of stall cycles and the source of those stalls:
Use the same subset of Verilog as in the previous lab.
There will be three demos:
All group members should understand the entire design well enough to be able to answer any such questions. As before, all group members should be present at the demos.
This lab counts for a significant portion of your lab grade for the course. Rather than giving you a binary all-or-nothing trade-off, I am going to refine it a little bit:
The percentages above represent the maximum amount earned for the lab, including the design document, demos, and final writeup.
Stated another way, if your group chooses, you can stop after implementing the fully bypassed pipeline and earn up to half the points on the assignment. In fact, if your group chooses such a route prior to the design document, the design document need only describe the parts your group has decided to tackle.
Notice that getting test_alu + test_mem working for superscalar will give you a maximum of 95%.
For those groups behind in the demos, you can demo the above milestones late for 50% of their value. So if you demo a working pipeline with branch prediction (70%) on time and then demo the test_alu working on the superscalar (15%) late, then you have a maximum score of 70+(15*50%) = 77.5%. For the preliminary demos, the "late" demo is Wed. For the final demo, Wednesday is "on time" and Thursday is "late". The TAs will have a limited number of demo slots on Thursday to handle those that have not completed by Wednesday.
There will be three documents to turn in via BlackBoard.