Instructor: Prof. Milo Martin
Initial Data Due: Monday, February 22nd (emailed to me by 10am)
Final Data and Writeup Due: Friday, February 26th (under my door by 5pm).
Create parallel implementations of computations from Homework 2. The goal is to minimize overall runtime. The first two parts of the assignment are relatively concrete. The remaining parts are more open-ended; for those you'll be asked to make your own design choices, perform your own analysis, and describe them lucidly.
I'll provide some code for the low-level primitives you'll need. We'll call it the Penn Parallelism Primitives library.
Write a parallel implementation of Computation 1a (the n² double-nested loop version) from Homework 2. To start with, perform a static partitioning of work in which each core takes an equal share of the particles.
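The actual kernel comes from Homework 2; as a generic sketch of the static-partitioning idea, the following divides an O(n²) pairwise loop into contiguous, nearly equal ranges, one per thread. The particle data and the inverse-distance kernel here are stand-ins, not the Homework 2 code, and it uses std::thread rather than the PPP spawn primitives:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the Homework 2 kernel: each particle i accumulates a value
// from every other particle j (the O(n^2) double-nested loop).
void compute_range(const std::vector<double>& pos, std::vector<double>& out,
                   std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    double acc = 0.0;
    for (std::size_t j = 0; j < pos.size(); ++j)
      if (j != i) acc += 1.0 / std::abs(pos[i] - pos[j]);
    out[i] = acc;
  }
}

// Static partitioning: each of the T threads takes a contiguous,
// (nearly) equal share of the particles, decided before any work starts.
void compute_static(const std::vector<double>& pos, std::vector<double>& out,
                    unsigned num_threads) {
  const std::size_t n = pos.size();
  const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = std::min(n, begin + chunk);
    if (begin >= end) break;  // more threads than chunks
    threads.emplace_back(compute_range, std::cref(pos), std::ref(out), begin, end);
  }
  for (auto& th : threads) th.join();
}
```

Because the partition is fixed up front, this scheme has essentially no scheduling overhead, but it can load-imbalance if per-particle work varies.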
Calculate the runtimes for 200,000 particles on 1, 2, 4, and 8 cores of the Core 2 machines. Throughout this assignment, when using four cores or fewer, use the following qsub options: -q core2-quad.q -pe threaded 4; for 8 cores, use -q core2-oct.q -pe threaded 8.
Create two graphs. First, plot the runtimes (y-axis) versus cores (x-axis). Second, plot the speedup (y-axis) versus cores (x-axis). For speedup, higher is better. For the speedup graph, make sure that both axes are either linear or logarithmic (don't mix them). Plot a "perfect scalability" line (n speedup on n cores), which should be a straight line if done right.
Note: The speedups throughout this assignment should be measured against your fastest sequential version of the code, not the original non-optimized code from the first assignment.
Plot the speedup versus number of particles (up to 200,000 or larger) for four threads. What is the smallest number of particles (approximately) at which your code achieves at least a 2x speedup on 4 cores? The point of this question is to give you some feel for the smallest amount of computation that can reasonably be performed in parallel.
For the same computation as part 1 above, use a dynamic partitioning approach. Use a shared counter to distribute the work among the threads in chunks of particles as defined by a grain size parameter.
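The shared-counter scheme can be sketched as follows. This version uses std::atomic as a stand-in for the atomic fetch-and-add the PPP library provides (the library's actual interface may differ), and fills an `owner` array in place of the real per-particle computation:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dynamic partitioning: threads repeatedly claim the next chunk of `grain`
// particles from a shared counter until the work runs out. No particle is
// claimed twice because fetch_add hands out disjoint ranges.
void compute_dynamic(std::size_t n, std::size_t grain,
                     unsigned num_threads, std::vector<int>& owner) {
  std::atomic<std::size_t> next(0);
  auto worker = [&](int id) {
    for (;;) {
      std::size_t begin = next.fetch_add(grain);  // atomically claim a chunk
      if (begin >= n) break;                      // all work handed out
      std::size_t end = std::min(n, begin + grain);
      for (std::size_t i = begin; i < end; ++i)
        owner[i] = id;  // stand-in for the per-particle computation
    }
  };
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < num_threads; ++t)
    threads.emplace_back(worker, static_cast<int>(t));
  for (auto& th : threads) th.join();
}
```

The grain size trades off the two failure modes the questions below probe: a tiny grain means contention on the shared counter, while a huge grain degenerates toward static partitioning and its load imbalance.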
Plot the speedup versus grain size (from 1 to the number of particles) for four threads and 200,000 particles. As the interesting parts of the graph are where grain sizes are relatively small and relatively large, you'll want to collect more data points in those ranges.
What is the impact of too small a grain size? What is the impact of too large a grain size? A "reasonable" grain size is the smallest grain size that is within a few percent of "optimal" grain size. What is a reasonable grain size?
Plot speedup versus number of threads for 200,000 particles. Plot two lines: your dynamic version with a "reasonable" grain size along with the results from Part 1b. To ensure the baseline is independent of the grain size, be sure to normalize the runtimes to the static-partitioned version running on a single core, and not just the dynamically partitioned version running on a single core.
Parallelize computation 2 from Homework 2. Try a few implementations, measure and graph their speedups, and describe what worked well and/or what worked less well.
Note: you may wish to complete Part 5 first, and then you can treat Part 4 as a simpler 1-dimensional case of part 5's 2-dimensional computation. Or, you can use a different algorithm for this part to obtain the highest performance.
Re-run one of the previous parts on both the Core i7 and the Niagara T2 machines. Compare and contrast the machines (recall that our Core i7 machine is a single chip, has more memory bandwidth, and has more on-chip cache). Explore any challenges you encountered in achieving scalability with 128 threads on the Niagara T2 system. These machines are both hardware multithreaded, so you wouldn't really expect linear speedup beyond the number of cores (but you will hopefully observe some speedup).
Note: When using the Niagara T2, use one data set size to compare the performance from one thread to eight threads. Use a larger size to compare eight threads through 128 threads. The larger size likely won't complete on a single core within the CPU time limits set on the SGE queue for the machine.
The Penn Parallelism Primitives library is a collection of low-level shared-memory primitives for writing multicore software, created specifically for this course. It includes functions for spawning threads, atomic operations, simple barriers, etc.
See: https://www.cis.upenn.edu/~cis534/ppp.html
All of the source code you'll need is available as PPPv1.tar.gz.
To compile with these primitives:
g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C
To cross-compile for SPARC:
~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C
For the initial due date:
Print out the following to bring to class on the final due date:
Insightful writeup. There are few "right" answers for this assignment. And I don't actually know the "right" answers, anyway. The point of this assignment is the process of writing parallel code and showing whatever insight you gained from the process. I'm not looking for lots of prose, but lots of insight described tersely.
We'll discuss these insights in-class after the assignments are due.
Graphs. Include the graphs described above. Label the x-axis and y-axis. Give each graph a descriptive title.
Your code. Print out your final (fastest) versions of each of the computations.
Your fastest results. Your results for the fastest versions of Part 2, Part 3, and Part 4 on the eight-core Core 2 machine. Give both the raw runtime in seconds and the normalized speedup.
It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but it is tricky code, and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening, it could be a bug in the code I wrote.
There probably isn't any single "right" solution. If there is, I don't know what it is, as I haven't tried this out myself. It could be that the obvious way is the fastest way of doing it, or it could be more subtle.
After assigning and grading this assignment, a few things to consider correcting in future years: