Instructor: Prof. Milo Martin
Initial Data Due: Wed, March 3rd (emailed to me by 10am)
Final Data and Writeup Due: Friday, March 5th (under my door by 5pm).
In this assignment you will compare the performance of various lock implementations: Test-and-Test-and-Set (TTS), ticket lock, array lock, and a list-based queue lock. As with the previous assignment, you'll use the Penn Parallelism Primitives library.
The basic experimental setup is a computation that updates a counter selected at random from an array of counters. The array of N counters is large, and the counters are protected by an array of L locks (one lock for every N/L counters). Varying the number of locks increases or decreases the lock contention. The provided driver does some work both inside and outside the critical section, making the test a bit more realistic. Thus, even with a single lock, there will be some speedup on multiple cores.
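To make the setup concrete, here is a minimal sketch of the per-iteration work. The names below are illustrative assumptions; the real loop lives in driver-hw4.C and uses the PPP lock class selected at compile time rather than the std::mutex stand-in shown here.

    #include <cstdint>
    #include <cstdlib>
    #include <mutex>

    // Illustration only: std::mutex stands in for the lock selected via -DLOCKTYPE.
    const int N = 16384;                  // 16,384 32-bit counters (the default)
    const int L = 16;                     // number of locks, varied per experiment
    int32_t    counters[N];
    std::mutex locks[L];

    void one_iteration()
    {
      int c = rand() % N;                 // pick a counter at random
      // ... some work outside the critical section (keeps the test realistic) ...
      int l = c / (N / L);                // one lock per N/L consecutive counters
      locks[l].lock();                    // for Part 1, comment out acquire/release
      counters[c]++;                      // ... plus a little work inside ...
      locks[l].unlock();
    }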
Note: As we only have a few machines with more than four cores, most of the experiments performed below will be on the four-core machines in the core2-quad.q queue.
The first step is to run the driver code without any locking (just comment out the calls to acquire and release in the main work loop in driver-hw4.C). This will get the "wrong" answer when run with more than one thread, but it captures the best-case runtime as if the lock overhead were zero, while still capturing most of the locality and sharing misses caused by the counters themselves. We'll be using an array of 16,384 32-bit counters (the default). Record the runtime on one processor and on four processors (use the core2-quad.q queue). On subsequent graphs, normalize all data to the single-processor case, but also draw a horizontal line representing the four-core speedup (which is the best case achievable).
To keep all the runtimes tractable, the driver runs each thread for the specified number of iterations and displays the runtime in seconds and the runtime per iteration. The runtime-per-iteration number can be used as the base measurement (instead of seconds), allowing you to run a smaller number of iterations and avoid the long runtimes that result from significant lock contention.
Answer the following questions:
The next step is to use the simple test-and-test-and-set (TTS) lock provided for you in TTSLock.C and TTSLock.h. Vary the number of locks starting at 1 and increasing it by one until the performance levels off (likely at something around a dozen or more locks for four cores).
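For reference, a TTS acquire/release typically looks roughly like the sketch below. This is only an illustration of the idea: the provided TTSLock.h/TTSLock.C, which uses the PPP atomic<> wrapper and its pause support, is the code you should actually use.

    #include <atomic>

    // Rough sketch of a test-and-test-and-set lock (illustrative, not the provided code).
    class TTSLockSketch {
      std::atomic<int> held{0};
    public:
      void acquire() {
        for (;;) {
          while (held.load(std::memory_order_relaxed) != 0)     // "test": read-only spin
            __builtin_ia32_pause();                             // GCC builtin for x86 "pause"
          if (held.exchange(1, std::memory_order_acquire) == 0) // "...-and-set"
            return;                                             // got the lock
        }
      }
      void release() {
        held.store(0, std::memory_order_release);
      }
    };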
Graph the data. The y-axis is throughput normalized to the single-thread case (higher is better). The no-lock case data is displayed as a horizontal line across the top of the graph. The x-axis is the number of locks.
Answer the following questions:
For each of the lock variants below, answer the same three questions above and plot the data on the same graph.
Repeat Part 2 above, but using the ticket lock described in class.
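If it helps, here is a rough sketch of the idea using std::atomic and a GCC pause builtin rather than the PPP primitives; treat the version described in lecture as authoritative.

    #include <atomic>

    // Rough sketch of a ticket lock (illustrative only).
    class TicketLockSketch {
      std::atomic<unsigned> next_ticket{0};   // ticket dispenser
      std::atomic<unsigned> now_serving{0};   // ticket currently allowed in
    public:
      void acquire() {
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my_ticket)
          __builtin_ia32_pause();             // spin until it is our turn
      }
      void release() {
        // Only the lock holder writes now_serving, so a load-increment-store is enough.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
      }
    };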
Repeat Part 2 above, but with the array lock described in class (Anderson's lock). You may also find the following pseudo code helpful:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#anderson
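A rough sketch of the idea follows; slot padding, array sizing, and the PPP atomic<> wrapper are omitted or assumed here, so follow the pseudocode above and the lecture version for your implementation.

    #include <atomic>

    // Rough sketch of Anderson's array-based queue lock (illustrative only).
    class ArrayLockSketch {
      static const int MAX_THREADS = 128;
      std::atomic<bool>     has_lock[MAX_THREADS];  // one flag (slot) per waiter
      std::atomic<unsigned> next_slot{0};
    public:
      ArrayLockSketch() {
        has_lock[0].store(true);                    // slot 0 starts as the "go" slot
        for (int i = 1; i < MAX_THREADS; i++) has_lock[i].store(false);
      }
      unsigned acquire() {                          // returns my slot, needed by release()
        unsigned my_slot = next_slot.fetch_add(1) % MAX_THREADS;
        while (!has_lock[my_slot].load(std::memory_order_acquire))
          __builtin_ia32_pause();                   // spin on my own (ideally padded) flag
        has_lock[my_slot].store(false, std::memory_order_relaxed);  // reset slot for reuse
        return my_slot;
      }
      void release(unsigned my_slot) {
        has_lock[(my_slot + 1) % MAX_THREADS].store(true, std::memory_order_release);
      }
    };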
Repeat Part 2 above, but with the list-based queue lock described in class (CLH lock). You may also find the following pseudo code helpful:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#clh
You'll likely want to use a global __thread variable to hold the local pointer. Initialize it to NULL. To allocate a node, you have a few options: (1) you can initialize it in ppp.C right after each thread is created, or (2) you can check for NULL each time you use the lock and allocate a node if the pointer is NULL. Option #1 is more invasive but has lower runtime overhead than option #2. Either way, you'll want to reuse the nodes, thus avoiding frequent calls to malloc()/free().
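Putting that together, here is a rough sketch of a CLH lock using option #2 (a lazily allocated __thread node). The names and memory-ordering details are illustrative assumptions; follow the lecture version and the pseudocode above for your implementation.

    #include <atomic>
    #include <cstddef>

    struct CLHNode {
      std::atomic<bool> must_wait{true};
    };

    class CLHLockSketch {
      std::atomic<CLHNode*> tail;
    public:
      CLHLockSketch() : tail(new CLHNode) {
        tail.load()->must_wait.store(false);          // dummy node: lock starts free
      }
      void acquire();
      void release();
    };

    static __thread CLHNode* my_node = NULL;          // node I enqueue (lazily allocated)
    static __thread CLHNode* my_pred = NULL;          // node I spin on (reused on release)

    void CLHLockSketch::acquire() {
      if (my_node == NULL) my_node = new CLHNode;     // option #2: allocate on first use
      my_node->must_wait.store(true, std::memory_order_relaxed);
      my_pred = tail.exchange(my_node, std::memory_order_acq_rel);  // join the queue
      while (my_pred->must_wait.load(std::memory_order_acquire))
        __builtin_ia32_pause();                       // spin on predecessor's flag
    }

    void CLHLockSketch::release() {
      my_node->must_wait.store(false, std::memory_order_release);   // hand off the lock
      my_node = my_pred;                              // reuse predecessor's node, no malloc/free
    }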
Note: choose only one of Part 6, Part 7, or Part 8 to complete
The atomic template class automatically adds cache-block padding to each atomic variable. Repeat the above experiments but with padding disabled (just comment out the line in atomic.h). This packs the objects together, putting multiple TTS locks in the same cache block, both counters of the ticket lock in the same block, multiple slots of the array lock in the same block, and so on.
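For illustration, the padding amounts to something like the sketch below; the actual atomic.h template may differ in its details, and the 64-byte cache-block size is an assumption.

    #include <atomic>

    // Illustrative only: force each variable onto its own cache block so that,
    // e.g., two TTS locks or the two counters of a ticket lock cannot share one.
    template <typename T>
    struct PaddedAtomic {
      std::atomic<T> value;
      char pad[64 - sizeof(std::atomic<T>)];   // assumes a 64-byte cache block
    };

    // With padding:     PaddedAtomic<int> locks[2];  // each lock in its own block
    // Without padding:  std::atomic<int>  locks[2];  // both locks share one block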
Plot the new data, analyze the impact versus the results with padding, and discuss the results. As this is the simplest of the last three parts in terms of code and data collection, I expect a better analysis and prose explanation of the results than for the other options.
Note: choose only one of Part 6, Part 7, or Part 8 to complete
Explore the effectiveness of the "pause" instruction. The pause instruction can be disabled by commenting out the line in atomic.h. Re-run the experiments above (all the lock types), both with and without the "pause" instruction, on the eight-core Core 2 systems (the core2-oct.q queue) using 8 software threads. Is there a difference between the pause and pause-free versions? Explain why or why not.
Also repeat the experiments on the Core i7 box using 8 software threads (on the corei7.q queue), both with and without the "pause" instruction. Is the relative impact larger on the Core i7? Explain why or why not. (Recall the Core i7 is a multithreaded chip.)
Note: choose only one of Part 6, Part 7, or Part 8 to complete
Rerun the above experiments (the first five parts) on the 128-thread Niagara T2 machine. As the TTS lock with so many threads is likely to perform poorly, you will also need to implement a TTS variant with exponential backoff. The pseudo code can be found at:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#tas
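A rough sketch of the idea, following that pseudocode, is below; the delay constants are placeholders you would tune for the T2, and the busy-wait is written portably because the x86 pause instruction is not available on SPARC.

    #include <atomic>

    // Rough sketch of test-and-set with exponential backoff (illustrative only).
    class TTSBackoffLockSketch {
      std::atomic<int> held{0};
      static const int MIN_DELAY = 4, MAX_DELAY = 4096;   // placeholder tuning values
    public:
      void acquire() {
        int delay = MIN_DELAY;
        while (held.exchange(1, std::memory_order_acquire) != 0) {
          for (volatile int i = 0; i < delay; i++)
            ;                                // back off before retrying
          if (delay < MAX_DELAY) delay *= 2; // exponential growth, capped
        }
      }
      void release() { held.store(0, std::memory_order_release); }
    };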
As there is only one of these machines and it will take more runs to explore the space, I'm not asking everyone to do this. As others will be using the machine, please keep your runs short (as short as possible while still getting reasonable data, say ~30 seconds or less), and you don't need to run every data point between one lock and many locks; sample some points in between to avoid performing so many runs. I also suggest that you run with 127 software threads, leaving one hardware thread for the OS and background tasks. This will likely reduce the noise and thus allow for shorter runs.
Pause. Using the TTSLock code as an example, be sure to call the "pause" instruction during any busy spin wait.
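If you need to issue the pause hint in your own spin loops, a small wrapper along these lines works. This is only a sketch: the helper that atomic.h actually provides (and that TTSLock uses) may be named and structured differently.

    // Hedged sketch of a pause helper; the PPP atomic.h version may differ.
    static inline void spin_pause()
    {
    #if defined(__x86_64__) || defined(__i386__)
      __asm__ __volatile__("pause");   // hint to the core that this is a spin loop
    #else
      __asm__ __volatile__("");        // compiler barrier only on other ISAs
    #endif
    }

    // Typical use inside any busy wait:
    //   while (lock_is_held())
    //     spin_pause();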
All of the source code you'll need is available as PPPv2.tar.gz. See https://www.cis.upenn.edu/~cis534/ppp2.html for the most recent information on the Penn Parallelism Primitives.
I've added a way to select different lock types on the command line when compiling the code. For example, the following lines build x86 binaries for all the lock types:
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=TTSLock -o TTSLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=TicketLock -o TicketLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=ArrayLock -o ArrayLock
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver-hw4.C QueueLock.C -DLOCKTYPE=QueueLock -o QueueLock
This selection is done at compile time so that the lock code can be inlined and no virtual function calls (or other indirection) are on the critical path of lock acquire/release.
As before, to cross-compile for SPARC:
~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -m64 -std=c++0x -pthread -lrt -Wall -O3 ...
The driver takes the following command-line parameters:
For example:
./TTSLock --locks 16 --threads 4 --iterations 20000000
Data collection: This assignment requires only a little code, but lots of runs. I suggest you consider writing shell or Python scripts that both (1) set up the jobs to submit to SGE and (2) assist in analyzing the output, either by producing data files that are easy to import into Excel or by generating graphs directly using GNUPlot, jgraph, or another script-based graphing tool.
Eniac: Log into eniac.seas.upenn.edu to develop and test your code. You should collect your timing results on the dedicated machines (see https://www.cis.upenn.edu/~cis534/sge.html).
Individual work: Each student must submit their own writeup, but I encourage you to discuss this general problem with each other outside of class. Sharing of code is not allowed.
For the initial due date:
Print out the following to bring to class on the final due date:
It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but it is tricky code, and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening, it could be a bug in the code I wrote.
I haven't collected data for these locks on these systems. Thus, I don't know if the data is actually going to be interesting or not. It could be that four cores isn't enough to show much of anything. In that case, I might end up asking you all to collect the data for the first parts on the few larger machines we have.
After assigning and grading this assignment, here are a few things to consider correcting in future years: