| Instructor: | Prof. Milo Martin |
| --- | --- |
| Instructions: | This lab should be done in groups. |
In this lab, you will design and build a superscalar five-stage pipelined processor for the LC4 ISA. As described below, the pipeline is fully bypassed, includes a simple branch predictor, and can execute up to two instructions per cycle. This lab is broken down into three distinct parts.
Be sure to read this entire document before starting, especially the "Hints" section at the very end.
Designing a pipelined processor is significantly more complex than designing a single-cycle version of the same processor. So, the first step is creating a design document for the pipelined processor (which is described in Part B). Although Part C asks you to further extend your design, this initial design document needs to cover only the initial pipelined processor in Part B.
Before you begin writing any Verilog code, I'm asking you to first figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. Focus on the trickiest parts of the design and use this exercise to help identify and address implementation issues early.
You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mis-predictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage. Don't spend a lot of time polishing this document, but you should touch on at least all of the following issues:
Important
The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm requiring the document because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.
The next step is to implement a scalar five-stage pipelined processor. To improve the CPI, the pipeline is fully bypassed (including the NZP condition codes). In the next part, you'll be asked to extend your implementation to use a simple branch predictor and execute two instructions per cycle.
This processor will use the "standard" five-stage pipeline:
All instructions travel through all five stages.
In the standard five-stage fully bypassed pipeline, the only stall is a single-cycle load-to-use penalty. There are a few tricky cases to consider. First, you should only stall an instruction after the load if the subsequent instruction is actually dependent on the load (either by register value or NZP condition code):
```
LDR R1, R2, 5    // load R1 <- [R2+5]
ADD R3, R1, R4   // stall: add R3 <- R1+R4
```
Second, a load followed by a dependent store may or may not stall, depending on if the output of the load feeds the address generation (stall) or the data input of a store (no stall):
```
LDR R1, R2, 5    // load R1 <- [R2+5]
STR R1, R3, 6    // no stall: store R1 -> [R3+6]

LDR R1, R2, 5    // load R1 <- [R2+5]
STR R7, R1, 6    // stall: store R7 -> [R1+6]
```
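The stall conditions described above could be sketched as combinational logic along the following lines. All signal names here are illustrative assumptions, not names from the provided lab files:

```verilog
// Hypothetical load-to-use stall detection (signal names are illustrative).
// X_is_load:    instruction in execute is an LDR
// X_rd:         destination register of the instruction in execute
// D_rs, D_rt:   source registers of the instruction in decode
// D_uses_rs, D_uses_rt: whether decode's instruction actually reads each source
// D_is_store:   instruction in decode is an STR (its rt is store *data*)
// D_is_branch, X_writes_nzp: for the NZP dependence case
wire load_to_use_stall =
    X_is_load &&
    ((D_uses_rs && (D_rs == X_rd)) ||
     (D_uses_rt && (D_rt == X_rd) && !D_is_store) ||  // store data can be bypassed later
     (D_is_branch && X_writes_nzp));                  // BR depends on NZP set by the load
```

Note how the `!D_is_store` term implements the second tricky case: a store whose *data* input depends on the load need not stall, because the value can be bypassed into the memory stage.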
Third, you'll need to bypass the NZP condition codes (which are read only by BR instructions). The tricky case is that not all instructions write the condition codes, so you'll need to use the NZP bits based on the last instruction that writes the condition codes prior to the BR instruction.
Note: technically a "BRnzp" (or just "BR") instruction wouldn't need to stall for condition codes, as it is an unconditional branch. However, LC4 also has a PC-relative JMP that is used for unconditional branching by convention, so all "BR" instructions can be treated uniformly.
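Because the producing instruction may be one or more stages ahead of the BR by the time the branch resolves, the NZP bypass amounts to picking the youngest in-flight producer. A minimal sketch, with illustrative signal names:

```verilog
// Sketch: selecting the NZP bits for a BR resolving in execute
// (signal names are assumptions, not from the lab files).
// M_* / W_*: instructions currently in the memory / writeback stages.
wire [2:0] nzp_for_branch =
    M_writes_nzp ? M_nzp :    // youngest producer, one stage ahead
    W_writes_nzp ? W_nzp :    // else the producer two stages ahead
                   arch_nzp;  // else the architectural NZP register
```

The priority order matters: the memory-stage instruction is younger than the writeback-stage instruction, so its NZP bits must win when both write the condition codes.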
The register file you used in the single-cycle design is different from the register file used in the pipeline described in the CIS371 lecture. Unlike what we've been assuming in class, a write to the register file will not affect the register being read in the same cycle, so you will need to explicitly add another stage of "bypassing" around the register file.
Thus, when a register is read and written in the same cycle, the register file from your single-cycle design returns the old value. For this pipelined design, you want it to return the newly written value. More specifically, if the register file's write enable is asserted and the register being read matches the register being written, the read port should return the new value being written (not the old value).
One simple way to do this is to make a new module that contains the existing register file and the additional logic for this "local" bypassing. You'll also need to consider how this impacts the bypassing of the NZP condition codes.
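Such a wrapper module might look like the following sketch. The port names of the inner `lc4_regfile` are assumptions; match them to whatever your single-cycle register file actually uses:

```verilog
// Sketch of a wrapper adding write-to-read ("internal") bypass around
// the existing register file. Port names are illustrative.
module lc4_regfile_bypassed (
    input  wire        clk, gwe, rst,
    input  wire [2:0]  rssel, rtsel,    // read selects
    output wire [15:0] rsdata, rtdata,  // read data (with bypass)
    input  wire [2:0]  rdsel,           // write select
    input  wire [15:0] rddata,          // write data
    input  wire        rdwe             // write enable
);
    wire [15:0] rs_raw, rt_raw;

    // the existing (non-bypassed) register file from the single-cycle lab
    lc4_regfile regfile (.clk(clk), .gwe(gwe), .rst(rst),
                         .rssel(rssel), .rsdata(rs_raw),
                         .rtsel(rtsel), .rtdata(rt_raw),
                         .rdsel(rdsel), .rddata(rddata), .we(rdwe));

    // if the register being read is also being written this cycle,
    // return the new value instead of the stale one
    assign rsdata = (rdwe && (rssel == rdsel)) ? rddata : rs_raw;
    assign rtdata = (rdwe && (rtsel == rdsel)) ? rddata : rt_raw;
endmodule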
For branches, always predict "not taken" and thus the next PC is always predicted as "PC+1". This approach does not require an explicit branch predictor table, but it does require implementing the logic to handle branch mispredictions by squashing and restarting the pipeline. All control transfers (branches, etc.) are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty.
Use the file lc4_pipeline.v as the start of your pipelined design. Of course, you will likely want to retain the modules (such as the register file, ALU, and branch unit) as well as create additional modules (say, a branch predictor module). As before, labfiles.tar.gz contains this file, Verilog code for the memory and all of the devices, user constraints, test fixtures, and various trace and hex files. The CLK, RST, GWE and instruction and data memory interfaces are the same as before.
In Part C, your goal is to improve the CPI by adding two more advanced features: a (simple) branch predictor and two-way superscalar execution.
Branches are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty. The branch predictor is an eight-entry direct-mapped branch target buffer (BTB). Each of the eight entries in the BTB has both a 16-bit "tag" and a 16-bit "next PC" field.
Prediction. On an instruction fetch, index into the BTB using the lowest three bits of the PC. If the tag matches (the tag for that entry is equal to the PC) then the predicted next PC is whatever was in the "next PC" field. If the tag doesn't match, the predicted PC is just PC+1. Recall, the BTB is accessed during the fetch stage, and it is accessed for every instruction fetched.
Mis-prediction, recovery, and training. Control instructions resolve during the execute phase. If the predicted PC is wrong (doesn't match the actual PC), three things occur:
- The two instructions following the mis-predicted branch are squashed (nullified)
- Fetch is re-started the next cycle at the correct PC
- The branch predictor is trained. Using the lowest three bits of the mis-predicted PC to index into the BTB, the tag is set to the PC and the next PC field is set to the calculated PC.
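The prediction and training behavior described above could be captured in a small module like the following sketch. All names are assumptions; per the note below, you may instead reuse your register file RAM for the tag and target arrays:

```verilog
// Illustrative eight-entry direct-mapped BTB (module and signal names
// are assumptions, not from the lab files). Reset of the arrays is
// omitted for brevity; a stale tag simply causes a PC+1 prediction
// or a recoverable mis-prediction.
module lc4_btb (
    input  wire        clk, gwe, rst,
    input  wire [15:0] F_pc,         // PC being fetched
    output wire [15:0] F_pred_pc,    // predicted next PC
    // training interface, driven from execute on a mis-prediction
    input  wire        X_train,
    input  wire [15:0] X_pc,         // PC of the mis-predicted instruction
    input  wire [15:0] X_actual_pc   // computed correct next PC
);
    reg [15:0] tag    [7:0];
    reg [15:0] target [7:0];

    // prediction: index with the low three PC bits, check the full tag
    wire [2:0] f_idx = F_pc[2:0];
    wire       hit   = (tag[f_idx] == F_pc);
    assign F_pred_pc = hit ? target[f_idx] : (F_pc + 16'd1);

    // training: overwrite the entry indexed by the mis-predicted PC
    wire [2:0] x_idx = X_pc[2:0];
    always @(posedge clk) begin
        if (gwe && X_train) begin
            tag[x_idx]    <= X_pc;
            target[x_idx] <= X_actual_pc;
        end
    end
endmodule
```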
Note
Once you've completed Part B, you already have a working pipeline that always predicts PC+1. Once that is working correctly, getting the predictor integrated into the pipeline isn't so difficult.
Note
Notice that the register file module you already have is an eight-entry RAM with 16-bit values... so there is an opportunity for code re-use.
You'll also design and implement a superscalar extension, as described below.
Verifying that your designs are working correctly requires both "functionality" testing (does it calculate the right answer?) and "performance" testing (does your pipeline stall too much?).
The same test harness and testing traces we gave you in the previous lab will also work for testing the scalar pipeline, but with two twists.
First, the pipeline spreads the execution of an instruction over many stages, but the test fixture expects all the test_ signals to correspond to a single instruction. To handle this issue, you'll need to carry the relevant signals all the way into the "writeback" pipeline stage and assign them to the proper test_ signals.
Second, the pipelined processor doesn't complete an instruction each cycle (due to stalls and pipeline flushes). However, the test fixture we've given you already includes a test_stall signal. If this signal is zero (indicating no stall), the test fixture will check the various test_ signals versus the next line in the .trace file. When the test_stall signal is non-zero (indicating a stall), it ignores the cycle.
Part of having a "correct" pipeline is not stalling too much (and thus capturing all the performance benefits of the pipeline). To measure and understand your pipeline's performance, the test_stall signal is actually a two-bit value that indicates the source of the stall as follows:
The test fixture can then use these values to measure the source of stalls.
Note
If a stall occurred due to a load-use penalty, but the instruction ended up being squashed due to a branch mis-prediction, that stall cycle should be classified as a branch mis-prediction cycle.
The superscalar version of the pipeline executes up to two instructions per cycle using the same five pipeline stages. It fetches, decodes, and executes up to two instructions per cycle, but with the limitation that only one load or store can execute per cycle (as explained below). A summary of some of the considerations for each of the stages:
Fetch (F): To allow fetching of two instructions per cycle, we've modified the memory interface files to return two consecutive 16-bit instructions from memory. This means the fetch stage can fetch up to two instructions per cycle and pass them along to the decode stage. When the first instruction is predicted as a taken branch, the second instruction is not passed along to the decode stage.
Decode (D): The decode stage checks for dependencies between the two instructions and for load-to-use stalls. As described below, because only one load or store can occur per cycle, the stall logic must also stall the second of the two instructions if both are memory operations. Based on the results of the stall logic, the pipeline will pass along two, one, or zero instructions to the execute stage. The register file must also be extended to support up to four reads per cycle (two each for two instructions).
Execute (X): The execute stage will need an additional ALU and extra logic to support full superscalar bypassing.
Memory (M): The memory system supports only a single load or store per cycle, so your superscalar pipeline can execute only one memory operation per cycle; that is, if two memory instructions are fetched in the same cycle, the decode stage must stall the second of them. However, if at least one of the two instructions is a non-memory instruction and the two instructions are independent, both should execute.
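This structural constraint folds into the decode-stage stall logic as one more term. A fragment, with illustrative names (`B_depends_on_A` stands for whatever register/NZP dependence check you implement):

```verilog
// Sketch: decode-stage stall of the "B" (younger) slot
// (signal names are illustrative, not from the lab files).
wire both_mem = D_A_is_mem && D_B_is_mem;      // two loads/stores fetched together
wire stall_B  = both_mem || B_depends_on_A;    // hold B back one cycle; A proceeds
```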
Writeback (W): The register file must also be extended to support up to two writes per cycle (one each for two instructions). To preserve regular ISA semantics, if two instructions write the same register in the same cycle, the write of the second (or "younger") of the two instructions takes priority.
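The "younger write wins" rule can be enforced by suppressing the older instruction's write on a conflict. A minimal sketch with assumed signal names:

```verilog
// Sketch: resolving two same-cycle writes to one register (names assumed).
// W_A_* is the older instruction, W_B_* the younger; B wins on conflict.
wire conflict = W_A_we && W_B_we && (W_A_rd == W_B_rd);
wire A_we_eff = W_A_we && !conflict;  // suppress the older write
// W_B_we passes through unchanged; the register file's two write ports
// are then driven by A_we_eff and W_B_we.
```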
Branch prediction: One of the subtle issues is how to use branch prediction and how to determine the next address from which to fetch instructions the next cycle. The fetch stage should access the BTB for both instructions. If the first instruction is predicted taken, the second instruction should not be passed along to the decode stage.
Determining which address to fetch from the next cycle requires both the outcomes of the branch predictions and how many instructions the decode stage determined could execute. If no instructions executed (say, due to a load-to-use dependency), then the PC should not advance. If the decode stage determines that the first instruction can advance but the second one cannot, the "second" instruction should shift up to become the "first" instruction for the next cycle. Accordingly, the newly fetched instruction should be injected into the decode stage as the "second" instruction, and the second fetched instruction is dropped, likely to be re-fetched the following cycle. In the end, there are five different cases to consider for the next program counter; it could be: the same PC, PC+1, PC+2, or the predicted target of either the first or second instruction fetched.
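One possible priority ordering over those five cases is sketched below; all signal names are assumptions, and your own stall/prediction signals will differ:

```verilog
// Sketch: five-way next-PC selection (signal names are illustrative).
// stall_both / stall_B come from the decode-stage stall logic;
// predA_taken / predB_taken come from the BTB lookups for the two slots.
reg [15:0] next_pc;
always @(*) begin
    if (stall_both)        next_pc = pc;             // nothing advanced: same PC
    else if (predA_taken)  next_pc = pred_target_A;  // slot B is dropped
    else if (stall_B)      next_pc = pc + 16'd1;     // only A advanced
    else if (predB_taken)  next_pc = pred_target_B;
    else                   next_pc = pc + 16'd2;     // both advanced
end
```

Note the ordering encodes the document's rules: a predicted-taken first instruction drops the second slot, and a stalled second instruction means only one slot's worth of fetch progress.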
Note
I suggest you design your pipeline such that the instruction in the "first" (or "top" or "A") part of the pipeline is always the older instruction and that the instruction in the "second" (or "bottom" or "B") part of the pipeline is always the younger instruction. This approach makes it clear which instruction should have priority when calculating stalls, bypassing, and register writeback.
Note
I strongly suggest you adopt a naming convention to make it clear which wires belong to which instructions. For example, "X_A_PC" could be the PC of the "A" (first) instruction in the execute stage. "X_B_PC" would be the PC of the "B" (second) instruction in the execute stage.
Note
An instruction that reads the NZP bits is considered dependent on a previous instruction that writes the NZP bits, so such a pair of instructions cannot execute in the same cycle.
Testing. Testing a superscalar design requires checking two instructions per cycle. To allow you to do this, we've extended all the test_... signals from the original tester to include a test_A_... and a test_B_... for each signal sent to the testbench. The testbench will then check these signals against two, one, or zero of the instructions from the trace, depending on the values of test_A_stall and test_B_stall.
Stall cycle accounting. As with the regular pipeline, the superscalar testbench counts the number of stall cycles and the source of those stalls:
Use the same subset of Verilog as in the previous lab.
There will be three demos:
All group members should understand the entire design well enough to be able to answer any such questions. As before, all group members should be present at the demos.
This lab counts for a significant portion of your lab grade for the course. Rather than giving you a binary all-or-nothing trade-off, I am going to refine it a little bit:
The percentages above represent the maximum amount earned for the lab, including the design document, demos, and final writeup.
Stated another way, if your group chooses, you can stop after implementing the fully bypassed pipeline and earn up to half the points on the assignment. In fact, if your group chooses such a route prior to the design document, the design document need only describe the parts your group has decided to tackle.
Notice that getting test_alu + test_mem working for superscalar will give you a maximum of 95%.
For those groups behind in the demos, you can demo the above milestones late for 50% of their value. So if you demo a working pipeline with branch prediction (70%) on time and then demo the test_alu working on the superscalar (15%) late, then you have a maximum score of 70+(15*50%) = 77.5%. For the preliminary demos, the "late" demo is Wed. For the final demo, Wednesday is "on time" and Thursday is "late". The TAs will have a limited number of demo slots on Thursday to handle those that have not completed by Wednesday.
There will be three documents to turn in via BlackBoard.