Instructor: | Prof. Milo Martin |
---|---|
Due: | 3pm, Wednesday, April 1st (at the start of class) |
In this assignment you will explore the effectiveness of branch direction prediction (taken vs not taken) on an actual program. We've used Pin, a binary instrumentation tool, to generate a trace of branches and their outcomes. Your task will be use this representative trace to evaluate the effectiveness of a few simple branch prediction schemes.
To do this, you'll write a program that reads in the trace and simulates different branch prediction schemes and different predictor sizes. You can write the program in whatever language you like; although we are going to ask for a printout of your source code, to end result of this assignment are the experimental results (and not the program source code).
The trace we're giving you is a trace of 16 million conditional branches. These are the conditional branches from an execution of the program GCC (Gnu C Compiler) from the SPECint2000 benchmark suite. As unconditional branches are always taken, they are excluded from the trace. Each line of the trace file has two fields. Below are the first four lines of a trace file:
3086629576 T 3086629604 T 3086629599 N 3086629604 T
The first field is the address of the branch instruction. The second field is the character "T" or "N" for branch taken or not taken.
The trace is available as an 8.6MB gzip compressed file (branch-trace-gcc.trace.gz). If you wish to verify that the files downloaded correctly, the MD5 sum is 79d52827a9efbb5210a0190b976c20b9.
You can either (1) download the compressed file, uncompress it, and process the uncompressed file, or (2) process the compressed file directly. Under Linux this can be done by using zcat to pipe the uncompressed trace into a program reading from standard input. For example: zcat file.trace.gz | your-program. Another option is to use zlib (for C/C++) or java.util.zip (for Java) to open and process the compressed file directly.
To help you debug your predictors, we've provided some annotated prediction results for the first 200 branches for various predictors (see below). Comparing your results to these annotated outputs should help ensure your predictor is working properly.
Before looking at dynamic branch prediction, we are going to look at simple "static" branch prediction policies of "always predict taken" and "always predict not taken". Write a program to read in the trace and calculate the mis-prediction rate (that is, the percentage of conditional branches that were mis-predicted) for these two simple schemes.
Which of these two policies is more accurate (has few mis-predictions)?
Based on common programming idioms, what might explain the above result?
The simplest dynamic branch direction predictor is an array of 2n two-bit saturating counters. Each counter includes one of four values: strongly taken (T), weakly taken (t), weakly not taken (n), and strongly not taken (N).
Prediction. To make a prediction, the predictor selects a counter from the table using using the lower-order n bits of the instruction's address (its program counter value). The direction prediction is made based on the value of the counter.
Training. After each branch (correctly predicted or not), the hardware increments or decrements the corresponding counter to bias the counter toward the actual branch outcome (the outcome given in the trace file). As these are two bit saturating counters, decrementing a minimum counter or incrementing a maxed out counter should have no impact.
Initialization. Although initialization doesn't effect the results in any significant way, your code should initialize the predictor to "strongly not taken" so that your results match the example traces that we've given you.
Analyze the impact of predictor size on prediction accuracy. Write a program to simulate the bimodal predictor. Use your program to simulate varying sizes of bimodal predictor. Generate data for bimodal predictors with 22, 23, 24, 25 ... 220 counters. These sizes correspond to predictor index size of 2 bits, 3 bits, 4 bits, 5 bits, ... 20 bits. Generate a line plot of the data using MS Excel or some other graphing program. On the y-axis, plot "percentage of branches mis-predicted" (a metric in which smaller is better). On the x-axis plot the log of the predictor size (basically, the number of index bits). By plotting the predictor size in terms of number of index bits, the x-axis in essence becomes a log scale, which is what we want for this graph.
Answer the following questions base on the data you collected:
Given a large enough predictor, what is the best mis-prediction rate obtainable by the bimodal predictor?
How large must the predictor be to reduce the number of mis-predictions by approximately half as compared to the better of "always taken" and "always not taken"? Give the predictor size both in terms of number of counters as well as bytes.
At what point does the performance of the predictor pretty much max out? That is, how large does the predictor need to be before it basically captures almost all of the benefit of a much larger predictor.
From the previous data, you can see that neither bimodal or gshare is always best. To remedy this situation, we can use a hybrid predictor that tries to capture the best of both style of predicts. A tournament predictor consists of three tables. The first and second tables are just normal bimodal and gshare predictors. The third table is a "chooser" table that predicts whether the bimodal or gshare predictor will be more accurate.
The basic idea is that some branches do better with a bimodal predictor and some do better with a gshare predictor. So, the chooser is a table of two-bit saturating counters indexed by the low-order bits of the PC that determines which of the other two table's prediction to return. For the choose, the two-bit counter encodes: strongly prefer bimodal (B), weakly prefer bimodal (b), weakly prefer gshare (g), and strong prefer gshare (G).
Prediction. Access the chooser table using the low-order bits of the branch's program counter address. In parallel, the bimodal and gshare tables are access just as normal predictors and they both generate an independent prediction. Based on the result of the lookup in the chooser table, the final prediction is either the prediction from the bimodal predictor (if the choose counter indicates a preference for bimodal) or the prediction from the gshare predictor (otherwise).
Training. Both the gshare and bimodal predictors are trained on every branch using their normal training algorithm. The choose predictor is trained toward which of the two predictors was more accurate on that specific prediction:
As stated above, the bimodal and gshare tables are trained on every branch, totally independent of the chooser table.
Initialization. The chooser table is initialized to strongly prefer bimodal. The gshare and bimodal tables are initialized as normal.
Add support to your simulator for the tournament predictor by adding support for the chooser table. If your code was written in a modular way, you should be able to fully re-use your gshare and bimodal code (and thus avoid re-writing or replicating code).
Let's compare the tournament predictor's accuracy versus the data from its two constituent predictors (using the data from the previous question). For now, let's compare a 2n-counter bimodal (or gshare) versus a tournament predictor with three 2n-counter tables. (We'll make the comparison more fair in terms of a "same number of bits" comparison in a moment.) As above for the gshare predictor, the gshare component of the tournament predictor should use a history length equal to the number of index bits (log of the table size). The graph will have three lines total.
How does the tournament predictor's accuracy compare to bimodal and gshare? Is the tournament predictor successful in meeting its goal?
Does the tournament predictor improve the overall peak (best-case) accuracy? If so, why? If not, what are the benefits of the tournament predictor?
In the previous question, the tournament predictor was given the unfair advantage of having three times the storage. In this question, run another set of experimental data in which all the predictors at the same location on the x-axis have the same storage capacity. To do this, compare a 2n-counter predictor to a tournament predictor with the following three table sizes:
- Chooser table: 2n-2 counters
- Bimodal table: 2n-2 counters
- Gshare table: 2n-1 counters
As 2n-2 + 2n-2 + 2n-1 is equal to 2n, this becomes a fair "same size" comparison.
Add a line to the graph from the previous question with this new "fair tournament" data. The graph now has four lines: bimodal, gshare, tournament, tournament-fair.
Compare the old and new tournament predictor data. What was the impact of moving to the smaller (fairer) table sizes?
Once adjusted to be a fair-size comparison, does the tournament predictor succeed in its goal of being the best of bimodal and gshare?
Given a fixed transistor budget for a branch predictor (say, 4KBs), what predictor approach would you use?
Note
The idea of gshare and tournament (or "combining") branch predictors was proposed by Scott McFarling in his seminal paper published in 1993 entitled: Combining Branch Predictors. I encourage you to look at McFarling's paper. After having completed this assignment, I think you'll find the graphs and other results in this paper familiar.
We're provide you with three annotated results (one for each predictor type) from running the first 200 branches from the trace.
bimodal3.output: bimodal with 23 counters:
nNNNTNNN | b7fa3ae4 T | T correct 3
The columns are: table counter state, PC from input trace (but in hexadecimal rather than binary), branch outcome from input trace, prediction made, prediction result (correct/incorrect), running total of mis-predictions thus far.
gshare4-3.output: gshare with 24 counters and history length of three:
NNnNNntNnNNNNNNN NTN | b7fa3ae4 T | T correct 5
The columns are: table counter state, history register, PC from input trace (in hex), branch outcome from input trace, prediction made, prediction result (correct/incorrect), running total of mis-predictions thus far.
tournament3-bimodal3-gshare4-4.output: tournament predictor with a chooser table with 23 counters, a bimodal with 23 counters, and a gshare with 24 counters with a history length of four:
BBBBBBBB nNNNTNNN NNNNNnNNnNNNNNtN TNTN | b7fa3ae4 T | T correct 3
The columns are: chooser predictor table, bimodal predictor, gshare predictor table, gshare history register, PC from input trace (in hex), branch outcome from input trace, prediction made, prediction result (correct/incorrect), running total of mis-predictions thus far.
Short answers. Turn in your answers to the above questions. I strongly urge you to type up your answers.
Graphs. Turn in printouts of the following three graphs. One per sheet of paper (to make them big enough to read) and label axises and gives the lines descriptive names in a legend.
- Graph A [Questions #2, #3, and #5]: Include just one graph with all three lines: bimodal, gshare-8, and gshare-n.
- Graph B [Question #4]: Include the graph describe in question four.
- Graph C [Questions #6 and #7] This graph has four lines: bimodal, gshare-n, tournament, and tournament-fair.
Source Code. Print out your final source code that can simulate the various predictors.