Instructor: Prof. Milo Martin
Initial Data Due: Wed, March 3rd (emailed to me by 10am)
Final Data and Writeup Due: Friday, March 5th (under my door by 5pm).
In this assignment you will compare the performance of various lock implementations: Test-and-Test-and-Set (TTS), ticket lock, array lock, and a list-based queue lock. As with the previous assignment, you'll use the Penn Parallelism Primitives library.
The basic experimental setup is a computation that updates a counter selected at random from an array of counters. The array of N counters is large, and the counters are protected by an array of L locks (one lock for every N/L counters). Varying the number of locks increases or decreases the lock contention. The provided driver does some work both inside and outside the critical section, making the test a bit more realistic. Thus, even with a single lock, there will be some speedup on multiple cores.
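To make the setup concrete, here is a minimal sketch of the per-iteration work. The names below are illustrative assumptions; the real loop lives in driver-hw4.C and uses the PPP lock class selected at compile time rather than the std::mutex stand-in shown here.

    #include <cstdint>
    #include <cstdlib>
    #include <mutex>

    // Illustration only: std::mutex stands in for the lock selected via -DLOCKTYPE.
    const int N = 16384;                  // 16,384 32-bit counters (the default)
    const int L = 16;                     // number of locks, varied per experiment
    int32_t    counters[N];
    std::mutex locks[L];

    void one_iteration()
    {
      int c = rand() % N;                 // pick a counter at random
      // ... some work outside the critical section (keeps the test realistic) ...
      int l = c / (N / L);                // one lock per N/L consecutive counters
      locks[l].lock();                    // for Part 1, comment out acquire/release
      counters[c]++;                      // ... plus a little work inside ...
      locks[l].unlock();
    }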
Note: As we only have a few machines with more than four cores, most of the experiments performed below will be on the four-core machines in the core2-quad.q queue.
The first step is to run the driver code without any locking (just comment out the calls to acquire and release in the main work loop in driver-hw4.C). This will get the "wrong" answer when run with more than one thread, but it captures the best-case runtime as if the lock overhead were zero, while still capturing most of the locality and sharing misses caused by the counters themselves. We'll be using an array of 16,384 32-bit counters (the default). Record the runtime on one processor and on four processors (use the core2-quad.q queue). On subsequent graphs, normalize all data to the single-processor case, but also draw a horizontal line representing the four-core speedup (which is the best case achievable).
To keep all the runtimes tractable, the driver runs each thread for the specified number of iterations and displays the runtime in seconds and the runtime per iteration. The runtime-per-iteration number can be used as the base measurement (instead of seconds), allowing you to run a smaller number of iterations and avoid the long runtimes that result from significant lock contention.
Answer the following questions:
The next step is to use the simple test-and-test-and-set (TTS) lock provided for you in TTSLock.C and TTSLock.h. Vary the number of locks starting at 1 and increasing it by one until the performance levels off (likely at something around a dozen or more locks for four cores).
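For reference, a TTS acquire/release typically looks roughly like the sketch below. This is only an illustration of the idea: the provided TTSLock.h/TTSLock.C, which uses the PPP atomic<> wrapper and its pause support, is the code you should actually use.

    #include <atomic>

    // Rough sketch of a test-and-test-and-set lock (illustrative, not the provided code).
    class TTSLockSketch {
      std::atomic<int> held{0};
    public:
      void acquire() {
        for (;;) {
          while (held.load(std::memory_order_relaxed) != 0)     // "test": read-only spin
            __builtin_ia32_pause();                             // GCC builtin for x86 "pause"
          if (held.exchange(1, std::memory_order_acquire) == 0) // "...-and-set"
            return;                                             // got the lock
        }
      }
      void release() {
        held.store(0, std::memory_order_release);
      }
    };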
Graph the data. The y-axis is throughput normalized to the single-thread case (higher is better). The no-lock case data is displayed as a horizontal line across the top of the graph. The x-axis is the number of locks.
Answer the following questions:
For each of the lock variants below, answer the same three questions above and plot the data on the same graph.
Repeat Part 2 above, but using the ticket lock described in class.
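If it helps, here is a rough sketch of the idea using std::atomic and a GCC pause builtin rather than the PPP primitives; treat the version described in lecture as authoritative.

    #include <atomic>

    // Rough sketch of a ticket lock (illustrative only).
    class TicketLockSketch {
      std::atomic<unsigned> next_ticket{0};   // ticket dispenser
      std::atomic<unsigned> now_serving{0};   // ticket currently allowed in
    public:
      void acquire() {
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my_ticket)
          __builtin_ia32_pause();             // spin until it is our turn
      }
      void release() {
        // Only the lock holder writes now_serving, so a load-increment-store is enough.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
      }
    };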
Repeat Part 2 above, but with the array lock described in class (Anderson's lock). You may also find the following pseudo code helpful:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#anderson
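A rough sketch of the idea follows; slot padding, array sizing, and the PPP atomic<> wrapper are omitted or assumed here, so follow the pseudocode above and the lecture version for your implementation.

    #include <atomic>

    // Rough sketch of Anderson's array-based queue lock (illustrative only).
    class ArrayLockSketch {
      static const int MAX_THREADS = 128;
      std::atomic<bool>     has_lock[MAX_THREADS];  // one flag (slot) per waiter
      std::atomic<unsigned> next_slot{0};
    public:
      ArrayLockSketch() {
        has_lock[0].store(true);                    // slot 0 starts as the "go" slot
        for (int i = 1; i < MAX_THREADS; i++) has_lock[i].store(false);
      }
      unsigned acquire() {                          // returns my slot, needed by release()
        unsigned my_slot = next_slot.fetch_add(1) % MAX_THREADS;
        while (!has_lock[my_slot].load(std::memory_order_acquire))
          __builtin_ia32_pause();                   // spin on my own (ideally padded) flag
        has_lock[my_slot].store(false, std::memory_order_relaxed);  // reset slot for reuse
        return my_slot;
      }
      void release(unsigned my_slot) {
        has_lock[(my_slot + 1) % MAX_THREADS].store(true, std::memory_order_release);
      }
    };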
Repeat Part 2 above, but with the list-based queue lock described in class (CLH lock). You may also find the following pseudo code helpful:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#clh
You'll likely want to use a global __thread variable to hold the local pointer. Initialize it to NULL. To allocate a node, you have a few options: (1) you can initialize it in ppp.C right after each thread is created, or (2) you can check for NULL each time you use the lock and allocate a node if the pointer is NULL. Option #1 is more invasive but has lower runtime overhead than option #2. Either way, you'll want to reuse the nodes, thus avoiding frequent calls to malloc()/free().
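Putting that together, here is a rough sketch of a CLH lock using option #2 (a lazily allocated __thread node). The names and memory-ordering details are illustrative assumptions; follow the lecture version and the pseudocode above for your implementation.

    #include <atomic>
    #include <cstddef>

    struct CLHNode {
      std::atomic<bool> must_wait{true};
    };

    class CLHLockSketch {
      std::atomic<CLHNode*> tail;
    public:
      CLHLockSketch() : tail(new CLHNode) {
        tail.load()->must_wait.store(false);          // dummy node: lock starts free
      }
      void acquire();
      void release();
    };

    static __thread CLHNode* my_node = NULL;          // node I enqueue (lazily allocated)
    static __thread CLHNode* my_pred = NULL;          // node I spin on (reused on release)

    void CLHLockSketch::acquire() {
      if (my_node == NULL) my_node = new CLHNode;     // option #2: allocate on first use
      my_node->must_wait.store(true, std::memory_order_relaxed);
      my_pred = tail.exchange(my_node, std::memory_order_acq_rel);  // join the queue
      while (my_pred->must_wait.load(std::memory_order_acquire))
        __builtin_ia32_pause();                       // spin on predecessor's flag
    }

    void CLHLockSketch::release() {
      my_node->must_wait.store(false, std::memory_order_release);   // hand off the lock
      my_node = my_pred;                              // reuse predecessor's node, no malloc/free
    }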
Note: choose only one of Part 6, Part 7, or Part 8 to complete
The atomic template class automatically adds cache-block padding to each atomic variable. Repeat the above experiments but with padding disabled (just comment out the line in atomic.h). This packs the objects together, putting multiple TTS locks in the same cache block, both counters of the ticket lock in the same block, multiple slots of the array lock in the same block, and so on.
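For illustration, the padding amounts to something like the sketch below; the actual atomic.h template may differ in its details, and the 64-byte cache-block size is an assumption.

    #include <atomic>

    // Illustrative only: force each variable onto its own cache block so that,
    // e.g., two TTS locks or the two counters of a ticket lock cannot share one.
    template <typename T>
    struct PaddedAtomic {
      std::atomic<T> value;
      char pad[64 - sizeof(std::atomic<T>)];   // assumes a 64-byte cache block
    };

    // With padding:     PaddedAtomic<int> locks[2];  // each lock in its own block
    // Without padding:  std::atomic<int>  locks[2];  // both locks share one block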
Plot the new data, analyze the impact versus the results with padding, and discuss the results. As this is the simplest of the last three parts in terms of code and data collection, I expect a better analysis and prose explanation of the results than for the other options.
Note: choose only one of Part 6, Part 7, or Part 8 to complete
Explore the effectiveness of the "pause" instruction. The pause instruction can be disabled by commenting out the line in atomic.h. Re-run the experiments above (all the lock types), both with and without the "pause" instruction, on the eight-core Core 2 systems (the core2-oct.q queue) using 8 software threads. Is there a difference between the pause and pause-free versions? Explain why or why not.
Also repeat the experiments on the Core i7 box using 8 software threads (on the corei7.q queue), both with and without the "pause" instruction. Is the relative impact larger on the Core i7? Explain why or why not. (Recall the Core i7 is a multithreaded chip.)
Note: choose only one of Part 6, Part 7, or Part 8 to complete
Rerun the above experiments (the first five parts) on the 128-thread Niagara T2 machine. As the TTS lock with so many threads is likely to perform poorly, you will also need to implement a TTS variant with exponential backoff. The pseudo code can be found at:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#tas
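A rough sketch of the idea, following that pseudocode, is below; the delay constants are placeholders you would tune for the T2, and the busy-wait is written portably because the x86 pause instruction is not available on SPARC.

    #include <atomic>

    // Rough sketch of test-and-set with exponential backoff (illustrative only).
    class TTSBackoffLockSketch {
      std::atomic<int> held{0};
      static const int MIN_DELAY = 4, MAX_DELAY = 4096;   // placeholder tuning values
    public:
      void acquire() {
        int delay = MIN_DELAY;
        while (held.exchange(1, std::memory_order_acquire) != 0) {
          for (volatile int i = 0; i < delay; i++)
            ;                                // back off before retrying
          if (delay < MAX_DELAY) delay *= 2; // exponential growth, capped
        }
      }
      void release() { held.store(0, std::memory_order_release); }
    };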
As there is only one of these machines and it will take more runs to explore the space, I'm not asking everyone to do this. As others will be using the machine, please keep your runs short (as short as possible while still getting reasonable data, say ~30 seconds or less), and you don't need to run every data point between one lock and many locks; sample some points in between to avoid performing so many runs. I also suggest that you run with 127 software threads, leaving one hardware thread for the OS and background tasks. This will likely reduce the noise and thus allow for shorter runs.
Pause. Using the TTSLock code as an example, be sure to call the "pause" instruction during any busy spin wait.
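If you need to issue the pause hint in your own spin loops, a small wrapper along these lines works. This is only a sketch: the helper that atomic.h actually provides (and that TTSLock uses) may be named and structured differently.

    // Hedged sketch of a pause helper; the PPP atomic.h version may differ.
    static inline void spin_pause()
    {
    #if defined(__x86_64__) || defined(__i386__)
      __asm__ __volatile__("pause");   // hint to the core that this is a spin loop
    #else
      __asm__ __volatile__("");        // compiler barrier only on other ISAs
    #endif
    }

    // Typical use inside any busy wait:
    //   while (lock_is_held())
    //     spin_pause();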
All of the source code you'll need is available as PPPv2.tar.gz. See https://www.cis.upenn.edu/~cis534/ppp2.html for the most recent information on the Penn Parallelism Primitives.
I've added a way to select different lock types on the command line when compiling the code. For example, the following lines build x86 binaries for all the lock types:
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=TTSLock -o TTSLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=TicketLock -o TicketLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=ArrayLock -o ArrayLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=QueueLock -o QueueLock
This selection is done at compile time so that the lock code can be inlined and no virtual function calls (or other indirection) are on the critical path of lock acquire/release.
As before, to cross-compile for SPARC:
~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -m64 -std=c++0x -pthread -lrt -Wall -O3 ...
The driver takes the following command-line parameters:
For example:
./TTSLock --locks 16 --threads 4 --iterations 20000000
Data collection: This assignment requires only a little code, but lots of runs. I suggest you consider writing shell or Python scripts that both (1) set up the jobs to submit to SGE and (2) assist in analyzing the output, either by producing data files that are easy to import into Excel or by generating graphs directly using GNUPlot, jgraph, or another script-based graphing tool.
Eniac: Log into eniac.seas.upenn.edu to develop and test your code. You should collect your timing results on the dedicated machines (see https://www.cis.upenn.edu/~cis534/sge.html).
Individual work: Each student must submit their own writeup, but I encourage you to discuss this general problem with each other outside of class. Sharing of code is not allowed.
For the initial due date:
Print out the following to bring to class on the final due date:
It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but it is tricky code, and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening, it could be a bug in the code I wrote.
I haven't collected data for these locks on these systems. Thus, I don't know if the data is actually going to be interesting or not. It could be that four cores isn't enough to show much of anything. In that case, I might end up asking you all to collect the data for the first parts on the few larger machines we have.
After assigning and grading this assignment, here are a few things to consider correcting in future years: