This page presents a quick summary of key issues in performance tuning X10 applications, with the goal of helping users understand and improve the performance of programs written in X10 2.4. Significantly more detail on this topic can be found in two papers presented at the X10'11 workshop at PLDI. We strongly recommend that you at least skim these papers if you are interested in performance tuning X10 applications.
- A Performance Model for X10 Applications presents a general discussion of the X10 performance model and details of the Native X10 implementation (X10 compiled to C++).
- Compiling X10 to Java contains an in-depth discussion of Managed X10 performance (X10 compiled to Java).
The rest of this page covers the following topics:
- Best Practices for Performance Tuning
- Compiler and build options to maximize performance
- Configuring the X10 Runtime
- Selecting the right X10RT Implementation
- Tips and Tricks for Performance Analysis of X10 programs
- Implementation Limitations and Other Pitfalls
Best Practices for Performance Tuning
Functionality first, Performance second
It's always advisable to get the code working and thoroughly tested first, then worry about making it perform.
Use the right compiler flags (and the right compiler).
Be sure to compile your program with -O. Once you are sure the code is correct, compile with -O -NO_CHECKS for peak performance. For most programs, you will get better performance with x10c++, not x10c. The main exceptions are programs that are heavily object-oriented and/or have a very high allocation rate.
Worry about single-place performance first
Our experience has been that programs that perform poorly in a single place also don't scale. So before worrying about scaling, first make sure the single-place version of the code is reasonably efficient.
Get good scaling for a small number of places before attempting larger runs
Quite a bit of scaling performance work can be done with only a few X10 places. In particular, doing runs with 2, 4, and 8 places is usually sufficient to identify non-scalable communication patterns, excessive data transfer, and excessive creation of remote-references to heap allocated objects (which will become memory leaks due to the lack of distributed GC).
Scaling to a very large number of places
Scaling to hundreds or thousands of places requires more careful usage of X10 language features and more careful programming than scaling to a smaller number of places. For some examples of benchmarks that do scale, see the X10 version of the HPCC benchmarks and the UTS benchmark. They can be found at https://github.com/x10-lang/x10-benchmarks/tree/master/PERCS or as an optional download with each X10 release. These codes have been successfully run on large PowerPC clusters and on BlueGene systems. Notice that they very carefully avoid using many X10 language features, do not create very many remote references, and do not use clocks, whens, atomics, or futures. We are continually updating these codes to be "as nice as possible" within the current limitations of the X10 implementation, so spending some time understanding how this code works and why it is written the way it is will be very helpful if you want to scale other codes to large systems. As the X10 implementation matures, we expect it to become progressively easier to scale programs, but it currently requires care and some amount of persistence.
As a rule of thumb, so far most X10 programs that have scaled well to a large number of nodes have had the following control pattern:
- The main activity has a top-level finish, under which it creates one long-running top-level async at each place. Each top-level async may in turn use async/at operations fairly freely to communicate asynchronously with the other places.
- Bulk data transfers of primitive (non-pointer containing) data should utilize the asyncCopy methods of Rail and Array.
- It may be possible to have nested local finishes, but large numbers of nested distributed finishes probably need to be avoided due to limitations in the current distributed finish implementation.
One way to describe this control pattern is as SPMD augmented with active messages.
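A minimal sketch of this pattern in X10 is shown below; the helper methods localStep and receiveUpdate are hypothetical placeholders for application logic:

public class SpmdSketch {
    public static def main(args:Rail[String]) {
        // A single top-level finish governs global termination.
        finish {
            for (p in Place.places()) {
                // One long-running top-level async per place.
                at (p) async {
                    for (iter in 1..100) {
                        localStep(iter);
                        // Asynchronous active message to the next place;
                        // at ... async does not block the sender.
                        at (here.next()) async receiveUpdate(iter);
                    }
                }
            }
        }
    }
    static def localStep(iter:Long) { /* hypothetical local computation */ }
    static def receiveUpdate(iter:Long) { /* hypothetical message handler */ }
}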
Of course, one may be able to use more general control patterns on subsets of nodes within a larger computation. We have not deeply explored all of the possible combinations.
We are always interested in more sample X10 programs, especially ones that have been scaled up to run on large systems. If you have one, please consider contributing it back to the project!
Compiler and build options to maximize performance
The default compiler/build options are set to minimize compile time, not maximize performance!
To do performance evaluation of X10, you need to be sure to use the right set of flags. Using the default flags will result in significantly lower than peak performance because, as is standard with C++ compilers, the default set of options results in compiling with no optimizations.
There are two options you need to pass to either the x10c++ or x10c compiler for peak performance:
| Option | Semantics |
|---|---|
| -O | enable optimization |
| -NO_CHECKS | disable array bounds checking, null pointer checking, and place checking |
Depending on the kind of code you are running, -NO_CHECKS may or may not have a significant performance impact. For array-based code where the arrays have rank > 1, -NO_CHECKS is currently critical for maximizing performance, as the multi-dimensional array bounds checking code has not been designed for high performance.
You will also want to make sure that the X10 class libraries are compiled with the proper options. If you are using a pre-built binary release of X10, the standard library was compiled with -O, but not with -NO_CHECKS. For absolutely peak performance, you would need to rebuild the class libraries with -NO_CHECKS. However, since the largest impact of -NO_CHECKS is on arrays, and the basic array functions are inlined into the application code with -O, you may be able to simply compile the application code with -O -NO_CHECKS with minimal performance loss compared to recompiling the standard libraries as well.
Here is how to build the X10 standard libraries from source for absolute peak performance:
cd x10.dist
ant distclean
ant dist -Doptimize=true -DNO_CHECKS=true
Then invoke x10c++ to compile your application like this:
x10c++ -O -NO_CHECKS <....rest of command line...>
Properly configuring the X10 Runtime
The X10 runtime internally executes asyncs by scheduling them on a pool of worker threads within each place. By default, the X10 runtime creates only a single worker thread per place. To exploit multiple cores within a place, you must set the X10_NTHREADS environment variable to the desired number of worker threads. A good rule of thumb is to create one X10 worker thread per available core. For example, suppose you wanted to run an X10 program with two places on a machine with 8 cores and wanted the program to use all the available cores. Then you should set X10_NTHREADS=4 (2 places x 4 threads per place = 8 active cores). The X10 runtime will endeavor to keep the number of active worker threads in its pool at the requested value by dynamically adding/removing threads as needed. For more details and related issues please see the Runtime section of the Performance Model paper.
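For example, a sketch of the two-place, 8-core scenario above, assuming the sockets runtime (which reads the X10_NPLACES environment variable) and a hypothetical executable myProgram:

export X10_NTHREADS=4   # 4 worker threads per place
export X10_NPLACES=2    # 2 places x 4 threads = 8 active cores
./myProgram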
Selecting the right X10RT Implementation
The sockets implementation of X10RT is supported on all platforms, but multi-place programs using it may not perform as well as alternative transports (higher latency, lower bandwidth). If it is available for your platform, use pami instead of sockets. As a second choice, use the MPI-based implementation of X10RT.
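For example, the X10RT implementation is typically selected at compile time with the -x10rt flag (this assumes your X10 distribution was built with support for the requested transport):

x10c++ -x10rt pami -O -NO_CHECKS MyApp.x10 -o MyApp   # preferred, if available
x10c++ -x10rt mpi -O -NO_CHECKS MyApp.x10 -o MyApp    # second choice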
For more details, see X10RT Implementations
Tips and Tricks for Performance Analysis of X10 programs
Identifying sequential bottlenecks: use standard tooling for C/C++
Many profiling tools for C/C++ executables will "just work" with the executables produced by x10c++.
We have had decent luck using gperftools to find bottlenecks in X10 programs. First install gperftools on your machine. You can then give the x10c++ script the command-line argument -gpt to cause it to link your X10 executable with libprofiler. You then set the CPUPROFILE environment variable to an output file, run your program, and run pprof to process the results (as documented in the gperftools wiki).
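A sketch of a typical gperftools session (MyApp and the profile path are illustrative placeholders):

x10c++ -O -gpt MyApp.x10 -o MyApp      # link against libprofiler
CPUPROFILE=/tmp/myapp.prof ./MyApp     # run with CPU profiling enabled
pprof --text ./MyApp /tmp/myapp.prof   # summarize the collected profile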
Valgrind tools have also been used successfully, although it is typically necessary to build X10 with GC disabled (ant -DDISABLE_GC=true).
Many multi-place bottlenecks can be identified by enabling tracing options
The C++ native runtime (x10aux) layer includes several tracing options that can help you examine multi-place performance. These tracing options are available in all builds of the runtime:
- X10RT_RXTX Show the total bytes transmitted/received and the number of messages transmitted/received by each place. This information is printed at each place upon program termination. Some common experiments to try to see if you may have a serialization-related performance problem are:
- Run your program on two places and at several input sizes. Plot the bytes/messages sent as a function of input size. If there is super-linear growth, then you are very likely to have problems scaling the program to larger inputs. Also consider whether the numbers match your rough estimates of how much data should be moving between places. Note that there will be some nominal messaging/serialization related to initializing the X10 runtime that will happen for all programs. For example, in X10 2.4 enabling X10RT_RXTX on the empty program and running it on two places yields:
Place: 0 msg_rx: 32/4 msg_tx: 32/5
Place: 0 put_rx: 0(&0)/0 put_tx: 0(&0)/0
Place: 0 get_rx: 0(&0)/0 get_tx: 0(&0)/0
Place: 1 msg_rx: 40/6 msg_tx: 40/5
Place: 1 put_rx: 0(&0)/0 put_tx: 0(&0)/0
Place: 1 get_rx: 0(&0)/0 get_tx: 0(&0)/0
The first line indicates that Place 0 received 4 messages totaling 32 bytes and sent 5 messages totaling 32 bytes. Place 1 received 6 messages totaling 40 bytes and sent 5 messages totaling 40 bytes. The numbers don't exactly match because some additional messages are sent to coordinate the printing of the message counts.
- Run your program on the same input at 2, 4, and 8 places. Plot the messages/bytes sent as a function of the number of places. Also plot the messages/bytes sent divided by the number of places as a function of the number of places. A clear danger sign is if serialization costs are increasing faster than the number of places. It may be acceptable for serialization to grow linearly with the number of places, but it is usually not desirable. Programs that are likely to scale well will almost always show decreasing bytes/place when run on a constant input with an increasing number of places.
Some additional tracing options are available when the X10 runtime is built without optimization (and are therefore not available in pre-built X10 releases). All of the tracing options are enabled by setting environment variables. Some of the more useful additional tracing options are:
- X10_TRACE_SER Trace serialization (data sent between places by at operations). It also reports on all global references that are actually serialized to another place (thus causing their values to become uncollectable by the garbage collector, representing a space leak). The tracing can perturb performance, but can be useful for identifying communication bottlenecks and memory leaks due to serialization.
- X10_TRACE_X10RT Trace calls into X10RT.
- X10_TRACE_NET Shorthand for X10_TRACE_SER and X10_TRACE_X10RT.
All of these additional tracing options will print large amounts of output to stdout. The output will be prefixed by the place id. You will often want to redirect the output into a log file. A few typical ways to use these detailed logs are:
- Inspect serialization activity for specific X10 types or source level ats to identify unexpected field captures. This is unfortunately fairly tedious. You will often find the problem more easily by examining the source code of the program, injecting counters to discover which ats are executing frequently, and then carefully examining the source code of those ats to determine what is being serialized by their execution.
- Extract the messages related to GlobalRefs being serialized to identify excessive cross-place pointer creation and memory retention. Do this by grepping for RefLogger: in the logfile.
$ grep RefLogger: log.txt
0: SS: RefLogger: recording 0x20c99b0 as a new globally escaped reference
0: SS: RefLogger: 0x20c99b0 was already recorded as a globally escaped reference
0: SS: RefLogger: 0x20c99b0 was already recorded as a globally escaped reference
0: SS: RefLogger: recording 0x20c9820 as a new globally escaped reference
0: SS: RefLogger: 0x20c9820 was already recorded as a globally escaped reference
0: SS: RefLogger: 0x20c9820 was already recorded as a globally escaped reference
0: SS: RefLogger: recording 0x20c9500 as a new globally escaped reference
0: SS: RefLogger: 0x20c9500 was already recorded as a globally escaped reference
0: SS: RefLogger: 0x20c9500 was already recorded as a globally escaped reference
Implementation Limitations and Other Pitfalls
Several language features are not implemented as well as we would like in X10 2.4. There are also weaknesses in the compiler's ability to optimize common idioms that can significantly impact performance. Here are some of the most common issues to be aware of.
Language features to avoid using in performance critical code
- Don't use atomic in frequently executed code if you intend to run with multiple worker threads per place. Atomic blocks are implemented via a place-wide lock and therefore limit intra-place scalability.
- Don't use when at all. In addition to the scalability problems of atomic, when usually entails additional blocking operations and often results in excessive thread usage by the fork-join runtime.
- The performance of multi-dimensional general arrays (x10.regionarray.Array) is significantly reduced by array bounds checking code even with -O. You need to compile with -NO_CHECKS to get acceptable multi-dimensional array performance.
- The default distributed finish implementation does not scale beyond a moderate number of places (hundreds). The issue is that finish termination messages are all routed to the "root node" of the finish, and the resulting message traffic can overwhelm that node. This can be overcome by using pragmas to select specialized finish implementations that do scale to tens of thousands of places, as sketched after this list.
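A minimal sketch of using a pragma to select a specialized finish implementation; this assumes the x10.compiler.Pragma annotation and its FINISH_SPMD variant, which is appropriate when a finish creates exactly one async per place:

import x10.compiler.Pragma;

public class PragmaSketch {
    public static def main(args:Rail[String]) {
        // FINISH_SPMD selects a finish implementation specialized for
        // the one-async-per-place pattern, avoiding the root-node bottleneck.
        @Pragma(Pragma.FINISH_SPMD) finish {
            for (p in Place.places()) {
                at (p) async {
                    // ... long-running per-place work ...
                }
            }
        }
    }
}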
Be aware of async granularity.
As discussed in the Performance Model paper, the default runtime uses a fork-join based implementation and co-operative scheduling. This has a number of performance implications. Some of the most important ones to keep in mind are:
- Large numbers of fine-grained asyncs are likely to hurt performance. In the fork-join based runtime, each async created incurs somewhere around 50 to 100 machine instructions of overhead (allocation of a task object, pushing the task object onto a deque, and so forth). Therefore it is advisable that the bulk of the asyncs in your program represent non-trivial pieces of computation (at a minimum hundreds of machine instructions, ideally thousands). If a large number of small asyncs are created, the program will spend a large fraction of its total execution cycles on bookkeeping related to activity creation.
- However, in multi-place programs overly coarse asyncs can also hurt overall performance. In particular, the runtime will only automatically poll the network and respond to incoming messages (at's) in between executing asyncs. If all of the workers in a place are busy executing long-running asyncs, then there may be long delays in executing "short" at's from other places, resulting in those remote places sitting idle and poor load balancing. A workaround for this problem is to inject explicit calls to x10.lang.Runtime.poll() in the bodies of long-running asyncs to provide additional opportunities for the scheduler to process incoming messages from other places.
- In general, an at statement or expression will suspend the current worker thread until the execution of the at completes at the remote place. This can result in a large number of worker threads being created (to replace the threads suspended by the at). However, the compiler/runtime optimize the special case at (p) async S and implement it by immediately executing the sending portion of the implied message send on the current activity. So, when possible, consider using at (p) async S instead of just at (p) S. A sketch combining these last two points appears after this list.
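A minimal sketch illustrating the poll and at (p) async points above; doWork and applyUpdate are hypothetical application methods:

public class GranularitySketch {
    static def doWork(i:Long) { /* hypothetical coarse-grained local work */ }
    static def applyUpdate() { /* hypothetical remote update handler */ }

    static def longRunningTask(chunks:Long, p:Place) {
        for (i in 1..chunks) {
            doWork(i);
            // Let the scheduler service incoming messages while this
            // long-running async occupies a worker thread.
            x10.lang.Runtime.poll();
        }
        // at (p) async sends the message without suspending this worker,
        // unlike a plain at (p) { ... }.
        at (p) async { applyUpdate(); }
    }
}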
There is an experimental Cilk-style workstealing scheduler available in recent X10 releases (use x10c++ -WORK_STEALING=true to compile your program), however it is still in development and may have implementation limitations that prevent it from working on some input programs.
Capturing of "this" and other excessive serialization of objects
X10 makes it easy to capture large object graphs within an at body and serialize them to another place. Be aware that within an instance method it is easy to capture this in an at or closure without realizing it (it will be captured by accessing a field or calling an instance method). Part of tuning an application may entail manually rewriting code to grab just the parts of an object that the at actually needs, storing them into local variables, and then using the local variables in the at body, as sketched below. In some cases, you may also want to make some instance fields of a class transient to avoid them being serialized when their containing object is serialized.
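A minimal sketch of this rewrite (the Worker class and consume method are hypothetical):

class Worker {
    val data:Rail[Double] = new Rail[Double](1000);

    def process(p:Place) {
        // Referencing the field inside the at would capture `this` and
        // serialize the entire Worker, including all of data:
        //     at (p) { consume(this.data(0)); }
        // Copying just the needed value into a local val avoids that:
        val d0 = data(0);
        at (p) { consume(d0); }
    }

    static def consume(x:Double) { /* hypothetical */ }
}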
Serialization of very large object graphs
When an object graph is captured by an at, X10 serialization semantics require that a new isomorphic object graph be created at the destination place. This requires that the serialization code do enough bookkeeping of class instances to detect repeated references to the same object and faithfully recreate the shared structure at the destination. Internally, this is implemented by tracking objects that are being serialized in a hash table. For very large object graphs (tens or hundreds of thousands of objects) this can become a significant overhead.
A minor corollary to the semantics of object serialization is that it is more efficient to serialize instances of structs than instances of classes. Therefore, there may be some additional performance benefit to representing aggregate data as a struct where possible (when the data is immutable and subclassing is not needed).
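For example, a minimal sketch of a struct-based representation (Vec3 is hypothetical):

// Structs are serialized by value, with no object-identity bookkeeping.
public struct Vec3(x:Double, y:Double, z:Double) {
    public def this(x:Double, y:Double, z:Double) { property(x, y, z); }
    public def add(o:Vec3):Vec3 { return new Vec3(x+o.x, y+o.y, z+o.z); }
}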
Avoid specifying explicit types for local val variables
Type inference of local val declarations based on the initialization expression will often infer a more precise type than a human is willing to declare. Sometimes the precise type is critical to enabling optimization. Therefore in X10, one should either avoid declaring types on local val variables entirely or use the val x <: T = ... form, where the declared static type is taken as an upper bound on the inferred type rather than the exact type, as illustrated below.
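For example (a sketch assuming x10.regionarray.Region is imported; the comments are illustrative):

// Inferred: the precise type of the initializer, including rank and
// rectangularity constraints the optimizer can exploit.
val r1 = Region.make(1..10, 1..10);

// Explicitly declared as Region: the precise type is discarded.
val r2:Region = Region.make(1..10, 1..10);

// Upper-bound form: checked against Region, but r1-style precision is kept.
val r3 <: Region = Region.make(1..10, 1..10);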
For loop optimization over x10.regionarray.Region is overly sensitive to trivial details of how the loop is written
The compiler does a good job of converting some for loops over rectangular regions into counted for loop nests:
val r = Region.make(1..10, 1..10);
for ([i,j] in r) {
...compute based on i and j
}
val r = Region.make(1..10, 1..10);
val a = new Array[Double](r, ....);
for ([i,j] in a) {
    ... a(i,j) ...
}
However, very minor variations on the above loops will not perform as well due to weaknesses in the for loop expansion in the current compiler.
val r:Region(2) = Region.make(1..10, 1..10);
for ([i,j] in r) {
...compute based on i and j
}
val r = Region.make(1..10, 1..10);
val a = new Array[Double](r, ....);
for (p[i,j] in a) {
    ... a(i,j) ...
}

val r = Region.make(1..10, 1..10);
val a = new Array[Double](r, ....);
for (p in a) {
    ... a(p) ...
}