The X10 Performance Model

Programmers need an intuitive understanding of the performance characteristics of the core constructs of their programming language to be able to write applications with predictable performance. We will call this understanding a performance model for the language. Desirable characteristics of a performance model include simplicity, predictive ability, and stability across different implementations of the language. The performance model should abstract away all non-essential details of the language and its implementation, while still enabling reasoning about those details that do have significant performance impact. Languages with straightforward mappings of language constructs to machine instructions usually have fairly straightforward performance models. As the degree of abstraction provided by the language's constructs and/or the sophistication of its implementation increase, its performance model also tends to become more complex.

In this chapter, we present an abridged version4.1 of the performance model for X10 v2.3.1 focusing on those aspects that are most relevant for the HPC programmer. Although the rate of change of the X10 language has significantly decreased from earlier stages of the project, the language specification and its implementations are still evolving more rapidly than production languages such as C++ and Java. Therefore, we break this chapter into two logical sections: aspects that we believe are fundamental to the language itself and aspects that are more closely tied to specific choices embodied in the X10 v2.3.1 implementation and thus more likely to change in future versions. The second logical section is presented simultaneously with a discussion of some of the central implementation decisions embodied in the X10 v2.3.1 runtime system and compiler.


Fundamental X10 Performance Model

The core language model of X10 is that of a type-safe object-oriented language. Thus much of the core performance model is intended to be similar to that of Java. We believe the Java performance model is generally well-understood. Therefore in this section we focus on areas where the performance models for X10 and Java diverge or where X10 has new constructs that do not trivially map to Java constructs.

X10 Type System

The type systems of X10 and Java differ in three ways that have important consequences for the X10 performance model. First, although X10 classes are very similar to Java's, X10 adds two additional kinds of program values: functions and structs. Second, X10's generic type system does not have the same erasure semantics as Java's generic types do. Third, X10's type system includes constrained types: the ability to enhance type declarations with boolean expressions that more precisely specify the acceptable values of a type.

Functions in X10 can be understood by analogy to closures in functional languages or local classes in Java. They encapsulate a captured lexical environment and a code block into a single object such that the code block can be applied multiple times on different argument values. X10 does not restrict the lifetime of function values; in particular they may escape their defining lexical environment. Thus, the language implementation must ensure that the necessary portions of the lexical environment are available for the lifetime of the function object. In terms of the performance model, the programmer should expect that an unoptimized creation of a function value will entail the heap allocation of a closure object and the copying of the needed lexical environment into that object. The programmer should also expect that trivial usage of closures (closures that do not escape and are created and applied solely within the same code block) will be completely eliminated by the language implementation via inlining of the function body at the application site.
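
A minimal sketch of both behaviors (the names are illustrative): the function value scale captures the local val factor; an unoptimized creation of scale heap-allocates a closure object, whereas a trivial, non-escaping use like this one should be eliminated by inlining.

// 'scale' is a function value of type (Int)=>Int capturing 'factor'.
// Unoptimized, its creation heap-allocates a closure object; a trivial,
// non-escaping use like this one is expected to be inlined away.
val factor = 3;
val scale = (x:Int) => x * factor;
val y = scale(14);   // applies the function value; y == 42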

Structs in X10 are designed to be space-efficient alternatives to full-fledged classes. Structs may implement interfaces and define methods and fields, but do not support inheritance. Furthermore structs are immutable: a struct's instance fields cannot be modified outside of its constructor. This particular design point was chosen specifically for its implications for the performance model. Structs can be implemented with no per-object meta-data and can be freely inlined into their containing context (stack frame, containing struct/object, or array). Programmers can consider structs as user-definable primitive types, that carry none of the space or indirection overheads normally associated with objects.
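
For illustration, here is a hedged sketch of a user-defined struct (Vec2 is a hypothetical type, not part of the X10 library):

// Instances of Vec2 carry no object header or vtable; an Array[Vec2]
// stores the two doubles of each element inline, with no indirection.
struct Vec2 {
    val x:Double;
    val y:Double;
    def this(x:Double, y:Double) { this.x = x; this.y = y; }
    def add(that:Vec2):Vec2 { return Vec2(x + that.x, y + that.y); }
}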

The X10 generic type system differs from Java's primarily because it was designed to fully support the instantiation of generic types on X10 structs without losing any of the performance characteristics of structs. For example, x10.lang.Complex is a struct type containing two double fields; x10.util.ArrayList[T] is a generic class that provides a standard list abstraction implemented by storing elements of type T in a backing array that is resized as needed. In Java, a java.util.ArrayList<Complex> would have a backing array of type Object[] containing pointers to heap-allocated Complex objects. In contrast, the backing storage for X10's x10.util.ArrayList[Complex] is an array of inline Complex structs without any indirection or other object-induced space overheads. This design point has a number of consequences for the language implementations and their performance model. Many of the details are implementation-specific, so we defer them to Section 4.4 and to the paper by Takeuchi et al. [10]. However, one high-level consequence of this design is generally true: to implement the desired semantics, the language implementation's runtime type infrastructure must be able to distinguish between different instantiations of a generic class (since instantiations on different struct types will have different memory layouts).

Constrained types are an integral part of the X10 type system and therefore are intended to be fully supported by the runtime type infrastructure. Although we expect many operations on constrained types can be checked completely at compile time (and thus will not have a direct runtime overhead), there are cases where dynamic checks may be required. Furthermore, constrained types can be used in dynamic type checking operations (as and instanceof). We have also found that some programmers prefer to incrementally add constraints to their program, especially while they are still actively prototyping it. Therefore, the X10 compiler supports a compilation mode where, instead of rejecting programs that contain type constraints that cannot be statically entailed, it will generate code to check the non-entailed constraints at runtime (in effect, the compiler will inject a cast to the required constrained type). When required, these dynamic checks do have a performance impact. Therefore part of performance tuning an application as it moves from development to production is reducing the reliance on dynamic checking of constraints in frequently executed portions of the program.
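
As a hedged sketch (the method is hypothetical), the constraint below rules out a zero divisor; where the compiler can entail it statically there is no runtime cost, and otherwise, in the permissive compilation mode just described, a dynamic check is injected at the call site:

def div(a:Int, b:Int{self != 0}):Int {
    return a / b;   // safe by construction: b's type excludes zero
}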

Distribution

An understanding of X10's distributed object model is a key component of the performance model of any multi-place X10 computation. In particular, understanding how to control which objects are serialized as the result of an at can be critical to performance.

Intuitively, executing an at statement entails copying the necessary program state from the current place to the destination place. The body of the at is then executed using this fresh copy of the program state. The necessary program state is precisely defined by treating each upwardly exposed variable as a root of an object graph. Starting with these roots, the transitive closure of all objects reachable via properties and non-transient instance fields is serialized and an isomorphic copy is created in the destination place. Furthermore, if the at occurs in an instance method of a class or struct and the body of the at refers to an instance field or calls an instance method, this is also implicitly captured by the at and will be serialized. It is important to note that an isomorphic copy of the object graph is created even if the destination place is the same as the current place. This design point was chosen to avoid a discontinuity between running a program with a single place and with multiple places.
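
The following hedged sketch (Node is hypothetical) illustrates implicit capture: because the body of the at reads the field total, this is captured and the object graph reachable from it is deep-copied to place p, except for fields marked transient:

class Node {
    var total:Int = 0;
    transient var scratch:Array[Double] = null;  // excluded from serialization
    def show(p:Place) {
        at (p) {
            // reading 'total' captures 'this'; the copy at p sees total,
            // but its 'scratch' field is left at the default value
            Console.OUT.println("total = " + total);
        }
    }
}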

Serialization of the reachable object graph can be controlled by the programmer primarily through the injection of transient modifiers on instance fields and/or the use of GlobalRefs. It is also possible to have a class implement a custom serialization protocol4.2 to gain even more precise control. An X10 implementation may be able to eliminate or otherwise optimize some of this serialization, but it must ensure that any program-visible side-effects caused by user-defined custom serialization routines happen just as they would have in an unoptimized program. Thus, the potential of user-defined custom serialization makes automatic optimization of serialization behavior a fairly complex global analysis problem. Therefore, the base performance model for object serialization should not assume that the implementation will be able to apply serialization-reducing optimizations to complex object graphs with polymorphic or generic types.

The X10 standard library provides the GlobalRef, RemoteArray and RemoteIndexedMemoryChunk types as the primitive mechanisms for communicating object references across places. Because of a strong requirement for type safety, the implementation must ensure that once an object has been encapsulated in one of these types and sent to a remote place via an at, the object will be available if the remote place ever attempts to spawn an activity to return to the object's home place and access it. For the performance model, this implies that cross-place object references should be managed carefully as they have the potential for creating long-lived objects. Even in the presence of a sophisticated distributed garbage collector [6] 4.3, the programmer should expect that collection of cross-place references may take a significant amount of time and incur communication costs and other overheads.
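
A hedged sketch of a cross-place reference: the GlobalRef itself serializes cheaply, the referenced array stays at its home place, and dereferencing is only legal at home. Note that once a copy of ref has escaped to another place, data becomes uncollectable for as long as that copy may exist.

val data = new Array[Double](100);
val ref = GlobalRef[Array[Double]](data);
at (here.next()) {
    // 'ref' travels cheaply; 'data' itself is not serialized
    at (ref.home) {
        ref()(0) = 42.0;   // back home, so dereferencing 'ref' is legal
    }
}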

Closely related to the remote pointer facility provided by GlobalRef is the PlaceLocalHandle functionality. This standard library class provides a place local storage facility in which a key (the PlaceLocalHandle instance) can be used to look up a value, which may be different in different places. The library implementation provides a collective operation for key creation and for initializing the value associated with the key in each place. Creation and initialization of a PlaceLocalHandle is an inherently expensive operation as it involves a collective operation. On the other hand cross-place serialization of a PlaceLocalHandle value and the local lookup operation to access its value in the current place are relatively cheap operations.
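
A hedged sketch (assuming the make(PlaceGroup, initializer) factory of v2.3): the collective creation is paid once, after which serializing the handle and looking up the per-place value are cheap.

// One Cell[Int] is created in every place; plh is the shared key.
val plh = PlaceLocalHandle.make[Cell[Int]](PlaceGroup.WORLD, () => new Cell[Int](0));
at (here.next()) {
    plh()() = 7;   // cheap local lookup in this place, then update its Cell
}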

Async and Finish

The async and finish constructs are intended to allow the application programmer to explicitly identify potentially concurrent computations and to easily synchronize them to coordinate their interactions. The underlying assumption of this aspect of the language design is that by making it easy to specify concurrent work, the programmer will be able to express most or all of the useful fine-grained concurrency in their application. In many cases, they may end up expressing more concurrency than can be profitably exploited by the implementation. Therefore, the primary role of the language implementation is to manage the efficient scheduling of all the potentially concurrent work onto a smaller number of actually concurrent execution resources. The language implementation is not expected to automatically discover more concurrency than was expressed by the programmer. In terms of the performance model, the programmer should be aware that an async statement is likely to entail some modest runtime cost, but should think of it as being a much lighter weight operation than a thread creation.
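
A minimal fork-join sketch (doWork is hypothetical): each async is a lightweight task handed to the scheduler, and the enclosing finish waits for both, plus any activities they transitively spawn.

finish {
    async doWork(0);
    async doWork(1);
}
// both calls have terminated here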

As discussed in more detail in Section 4.3, the most general form of finish involves the implementation of a distributed termination algorithm. Although programmers can assume that the language implementation will apply a number of static and dynamic optimizations, they should expect that if a finish needs to detect the termination of activities across multiple places, then it will entail communication costs and latency that will increase with the number of places involved in the finish.

Exceptions

The X10 exception model differs from Java's in two significant ways. First, X10 defines a ``rooted'' exception model in which a finish acts as a collection point for any exceptions thrown by activities that are executing under the control of the finish. Only after all such activities have terminated (normally or abnormally) does the finish propagate exceptions to its enclosing environment by collecting them into a single MultipleExceptions object which it will then throw. Second, the current X10 language specification does not specify the exception semantics expected within a single activity. Current implementations of X10 assume a non-precise exception model that enables the implementation to more freely reorder operations and increases the potential for compiler optimizations.
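
A minimal sketch of the rooted model: both exceptions are collected by the enclosing finish and rethrown together only after all activities have terminated.

try {
    finish {
        async { throw new Exception("left failed"); }
        async { throw new Exception("right failed"); }
    }
} catch (me:MultipleExceptions) {
    Console.OUT.println(me);   // carries both collected exceptions
}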


X10 v2.3.1 Implementation Overview

X10 v2.3.1 is implemented via source-to-source compilation to another language, which is then compiled and executed using existing platform-specific tools. The rationale for this implementation strategy is that it allows us to achieve critical portability, performance, and interoperability objectives. More concretely, X10 v2.3.1 can be compiled to either C++ or Java. The resulting C++ program is compiled by a platform C++ compiler to produce an executable; the resulting Java program is compiled to class files and then executed on a cluster of JVMs. We term these two implementation paths Native X10 and Managed X10 respectively.

Portability is important because we desire implementations of X10 to be available on as many platforms (hardware/operating system combinations) as possible. Wide platform coverage both increases the odds of language adoption and supports productivity goals by allowing programmers to easily prototype code on their laptops or small development servers before deploying to larger cluster-based systems for production.

X10 programs need to be capable of achieving close to peak hardware performance on compute-intensive kernels. Therefore some form of platform-specific optimizing compilation is required; neither interpretation nor unoptimized compilation is sufficient. However, by taking a source-to-source compilation approach we can focus our optimization efforts on implementing a smaller set of high-level, X10-specific optimizations with significant payoff while still leveraging all of the classical and platform-specific optimizations found in optimizing C++ compilers and JVMs.

Finally, X10 needs to be able to co-exist with existing libraries and application frameworks. For scientific computing, these libraries are typically available via C APIs; therefore Native X10 is the best choice. However, for more commercial application domains existing code is often written in Java; therefore Managed X10 is also an essential part of the X10 implementation strategy.

Figure 4.1: X10 Compiler Architecture

The overall architecture of the X10 compiler is depicted in Figure 4.1. The compiler is composed of two main parts: an AST-based front-end and optimizer that parses X10 source code and performs AST-based program transformations, and C++/Java backends that translate the X10 AST into C++/Java source code and invoke a post-compilation process that uses either a C++ compiler to produce an executable binary or a Java compiler to produce bytecode.

Using source-to-source compilation to bootstrap the optimizing compilation of a new programming language is a very common approach. A multitude of languages are implemented via compilation to either C/C++ and subsequent post-compilation to native code or via compilation to Java/C# (source or bytecodes) and subsequent execution on a managed runtime with an optimizing JIT compiler. An unusual aspect of the X10 implementation effort is that it is pursuing both of these paths simultaneously. This decision has both influenced and constrained aspects of the X10 language design (consideration of how well/poorly a language feature can be implemented on both backends is required) and provided for an interesting comparison between the strengths and limitations of each approach. It also creates some unfortunate complexity in the X10 performance model because the performance characteristics of C++ and Java implementations are noticeably different.


X10 v2.3.1 Runtime

Figure 4.2: X10 Runtime Architecture

Figure 4.2 depicts the major software components of the X10 runtime. The runtime bridges the gap between X10 application code and low-level facilities provided by the network transports (PAMI etc.) and the operating system. The lowest level of the X10 runtime is X10RT which abstracts and unifies the capabilities of the various network layers to provide core functionality such as active messages, collectives, and bulk data transfer.

The core of the runtime is XRX, the X10 Runtime in X10. It implements the primitive X10 constructs for concurrency and distribution (async, at, finish, atomic, and when). The X10 compiler replaces these constructs with calls to the corresponding runtime services. The XRX runtime is primarily written in X10 on top of a series of low-level APIs that provide a platform-independent view of processes, threads, primitive synchronization mechanisms (e.g., locks), and inter-process communication. For instance, the x10.lang.Lock class is mapped to pthread_mutex (resp. java.util.concurrent.locks.ReentrantLock) by Native X10 (resp. Managed X10 ).

The X10 Language Native Runtime layer implements the object-oriented features of the sequential X10 language (dynamic type checking, interface invocation, memory management, etc.) and is written in either C++ (Native X10 ) or Java (Managed X10 ).

The runtime also provides a set of core class libraries that provide fundamental data types, basic collections, and key APIs for concurrency and distribution such as x10.util.Team for multi-point communication or x10.array.Array.asyncCopy for large data transfers.

In this section, we review the specifics of the X10 v2.3.1 runtime implementation focusing on performance aspects.

Distribution

The X10 v2.3.1 runtime maps each place in the application to one process.4.4 Each process runs the exact same executable (binary or bytecode).

Upon launch, the process for place 0 starts executing the main activity.

finish { main(args); }

Static fields.

Static fields are lazily initialized in each place when they are first accessed. Both X10 v2.3.1 backend compilers map static fields initialized with compile-time constants to static fields of the target language. Other static fields are mapped to method calls. The method checks whether the field has already been initialized in the current place, evaluates the initialization expression if it has not, and then returns the value of the field. Therefore, accessing a static field with a non-trivial initialization expression will be modestly more expensive in X10 than the corresponding operation in either C++ or Java.

X10RT.

The X10 v2.3.1 distribution comes with a series of pluggable libraries for inter-process communication referred to as X10RT libraries [11,12]. The default X10RT library--sockets--relies on POSIX TCP/IP connections. The standalone implementation supports SMPs via shared memory communication. The mpi implementation maps X10RT APIs to MPI [8]. Other implementations support various IBM transport protocols (DCMF, PAMI).

Each X10RT library has its own performance profile--latency, throughput, etc. For instance, the X10 v2.3.1 standalone library is significantly faster than the sockets library used on a single host.

The performance of X10RT can be tuned via the configuration of the underlying transport implementation. For instance, the mpi implementation honors the usual MPI settings for task affinity, fifo sizes, etc.

Teams.

The at construct only permits point-to-point messaging. The X10 v2.3.1 runtime provides the x10.util.Team API for efficient multi-point communication.

Multi-point communication primitives--a.k.a. collectives--provided by the x10.util.Team API are hardware-accelerated when possible, e.g., broadcast on BlueGene/P. When no hardware support is available, the Team implementation is intended to make a reasonable effort at minimizing communication and contention using standard techniques such as butterfly barriers and broadcast trees.
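
A hedged sketch of a Team operation (in v2.3, Team methods take the caller's role; for Team.WORLD the role is conventionally the place id):

finish for (p in Place.places()) at (p) async {
    x10.util.Team.WORLD.barrier(here.id);
    // no place proceeds past this point until all places have arrived
}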

AsyncCopy.

The X10 v2.3.1 tool chain implements at constructs via serialization. The captured environment gets encoded before transmission and is decoded afterwards. Although such an encoding is required to correctly transfer object graphs with aliasing, it has unnecessary overhead when transmitting immediate data, such as arrays of primitives.

As a workaround, the X10 v2.3.1 x10.array.Array class provides specific methods--asyncCopy--for transferring array contents across places with lower overhead. These methods guarantee the raw data is transmitted as efficiently as permitted by the underlying transport, with no redundant packing, unpacking, or copying. Hardware permitting, they initiate a direct copy from the source array to the destination array using RDMAs.4.5
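
A hedged sketch of asyncCopy usage (the exact v2.3 signatures should be checked against the x10.array.Array API docs):

val n = 1000;
val src = new Array[Double](n);
// allocate the destination remotely and bring back a remote reference to it
val dst = at (here.next()) new RemoteArray[Double](new Array[Double](n));
// the enclosing finish completes only once the transfer (an RDMA if the
// transport supports it) has finished
finish Array.asyncCopy(src, 0, dst, 0, n);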

Concurrency

The cornerstone of the X10 runtime is the scheduler. The X10 programming model requires the programmer to specify the place of each activity. Therefore, the X10 scheduler makes per-place decisions, leaving the burden of inter-place load balancing to the library writer and ultimately the programmer.

The X10 v2.3.1 scheduler assumes a symmetrical, fixed number of concurrent execution units (CPU cores) per process for the duration of the execution. This assumption is consistent with the HPCS context--job controllers typically assign concurrently running applications to static partitions of the available computing resources--but will be relaxed in subsequent releases of X10 .

Work-Stealing scheduler.

The X10 v2.3.1 scheduler belongs to the family of work-stealing schedulers [2,3] with a help-first scheduling policy [5]. It uses a pool of worker threads to execute activities. Each worker thread owns a double-ended queue of pending activities. A worker pushes one activity for each async construct it encounters. When a worker completes one activity, it pops the next activity to run from its deque. If the deque is empty, the worker attempts to steal a pending activity from the deque of a randomly selected worker.

Since each worker primarily interacts with its own deque, contention is minimal and only arises with load imbalance. Moreover, a thief tries to grab an activity from the top of the deque whereas the victim always pushes and pops from the bottom, further reducing contention.

In X10 v2.3.1, the thief initially chooses a victim at random and then inspects the deques of all workers in a cyclic manner until it manages to steal a pending activity.

The X10 scheduler borrows the deque implementation of Doug Lea's Fork/Join framework [7].
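
The following is illustrative pseudocode only, not the XRX source; Activity, deque, stealFrom, and randomVictim are stand-ins for runtime internals. It shows the shape of a help-first work-stealing worker:

def workerLoop() {
    while (true) {
        var a:Activity = deque.popBottom();           // own work first
        if (a == null) a = stealFrom(randomVictim()); // else steal from a victim's top
        if (a != null) a.run();
        // else idle: suspend or poll the network (see Life cycle below)
    }
}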

Life cycle.

A worker may be in one of four states:

  running: executing one activity,
  searching: looking for an activity to execute,
  suspended: because the activity it is running has executed a blocking construct, such as finish or when, or a blocking method, such as System.sleep or x10.util.Team.barrier,
  stopped: because there are already enough workers running or searching.

Suspended and stopped workers are idle. Starting with X10 v2.3.1, the runtime will mostly suspend excess idle workers, but even if the place is entirely inactive the runtime still requires one worker to be polling the network (i.e., busy waiting) to respond to incoming messages. We expect that, at least for some X10RT implementations, it will be possible in a later X10 release to eliminate the need for busy waiting entirely.

Cooperative scheduler.

The X10 v2.3.1 scheduler never preempts a worker running user code. The X10 v2.3.1 runtime is designed to enable the highest possible performance for MPI-like distributed X10 applications, where the programmer uses matching send and receive instructions to achieve total control over the communication and execution schedule. If user code never yields to the runtime, then pending activities (local or remote) are not processed. In other words, the runtime does not make any fairness guarantee.

A worker may yield either by executing a blocking statement or by invoking the Runtime.probe method. The latter executes all the pending activities at the time of the call before returning to the caller. This includes all the pending remote activities--activities spawned here from other places--and all the activities already in this worker's deque, but does not include activities in other workers' deques.

Parallelism.

The user can specify the number of workers in the pool in each place using the X10_NTHREADS environment variable.4.6 The X10 v2.3.1 scheduler may create additional threads during execution, but it strives to keep the number of non-idle workers close to the requested value.

As a result, the current scheduler guarantees the following properties, which are intended to hold for any X10 implementation.

  1. If there are $\texttt{X10\_NTHREADS}$ or more pending activities then there are $\texttt{X10\_NTHREADS}$ or more workers processing them, that is, running them or searching for them.
  2. If there are $n<\texttt{X10\_NTHREADS}$ workers running user code then there are $\texttt{X10\_NTHREADS}-n$ workers searching for pending activities.
  3. If there are $\texttt{X10\_NTHREADS}$ or more workers running then there are no workers spinning.

Property 1 is the goal of any work-stealing scheduler: assuming the effort of finding pending activities is negligible, parallel activities are processed in parallel using X10_NTHREADS parallel processing units.

Property 2 guarantees that available cores are used to find pending activities quickly.

Property 3 mitigates the penalty of busy waiting in the current implementation: spinning workers never get in the way of the application, provided the user ensures that X10_NTHREADS is at most the number of hardware cores available to the runtime in each place. For instance, if running 8 places on a 32-core node, X10_NTHREADS must not be larger than 4 workers per place.

Joining.

In order to minimize pool size adjustments, the scheduler implements one key optimization. If a worker blocks on a finish construct but its deque is not empty, it does not suspend but instead processes the pending activities from its deque. It eventually suspends only if its deque becomes empty or it reaches some other blocking construct (other than finish). By design, the pending activities that get processed by the worker in this phase must have been spawned from the blocked finish body. In the X10 v2.3.1 implementation, if the finish construct is still waiting for spawned activities to terminate when the deque becomes empty, the worker will not attempt to steal activities from others, as this would require carefully selecting activities that the finish construct is waiting for.

Thanks to this behavior, finish has much less scheduling overhead than other synchronization mechanisms, e.g., when constructs, and should be preferred when possible.

While this optimization is typically very effective at improving performance without observable drawbacks, it may lead to unbounded stack growth for pathological programs. Therefore, it may be disabled by setting the environment variable X10_NO_STEALS.4.7

Overhead.

For each async statement, the current worker must make work available to other workers. In the best implementation and best-case scenario (no contention), this requires at least one CAS instruction4.8 per async. As a result, async constructs should only be used to guard computations that require (significantly) more resources than a CAS.

The X10 v2.3.1 runtime also allocates one small heap object per async. Again, anything smaller than that should be executed sequentially rather than wrapped in an async. Moreover, memory allocation and garbage collection can become a bottleneck if vast numbers of activities are created concurrently. The runtime therefore exposes the Runtime.surplusActivityCount method, which returns the current size of the current worker's deque. Application and library code may invoke this method to decide whether or not to create more asynchronous activities, as in:

if (Runtime.surplusActivityCount() >= 3) m(); else async m();

Synchronization

Finish.

Within a place, one only needs to count activity creation and termination events to decide the completion of a finish construct. The story is different across places as inter-process communication channels are likely to reorder messages so that termination events may be observed ahead of the corresponding creation events. The X10 v2.3.1 implementation of finish keeps track of these events on an unambiguous, per-place basis.

In the worst-case scenario, with $p$ places, there will be $p$ counters in each place, that is, $p\times p$ counters for each finish. Moreover, there could be one inter-process message for each activity termination event. Messages could contain up to $p$ data elements.

In practice, however, far fewer counters, fewer messages, and smaller messages are necessary thanks to various optimizations embedded in the X10 v2.3.1 implementation. In particular, events are accumulated locally and only transmitted to the finish place when local quiescence is detected--all local activities for this finish have completed. Counters are allocated lazily. Messages use sparse encodings.

To complement these runtime mechanisms, the programmer may also specify finish pragmas, which inform the runtime system about the kind of concurrent tasks that the finish will wait for, as in:

@Pragma(Pragma.FINISH_ASYNC) finish at (p) async s;

The pragmas select more efficient implementations of distributed termination detection.

Currently, the runtime system supports five finish pragmas. Below, the ``home'' place of a finish is the place at which the finish statement was spawned. A ``remote'' place is any place other than home.

FINISH_ASYNC
A finish for a unique (possibly remote) async. This finish serves as an indicator at the origin for the completion of the async.
FINISH_LOCAL
All asyncs created under this finish (recursively) execute at home.
FINISH_SPMD
A finish whose remote activities have the property that any nested activities they spawn must themselves be nested within a finish. This corresponds to ``Single Program Multiple Data'' idioms (e.g., using MPI) where each place runs a single top-level activity. In X10, the nested finish blocks (containing asyncs and ats) may be used for (possibly overlapping) communication between places.

FINISH_HERE
A finish that does not count the remote creation or termination of activities (controlled by it).

For this scheme to be a correct implementation of finish, it must be the case that the number of remote creations and terminations of activities is equal, and these events happen before the count of local creation and termination events (for asyncs controlled by this finish) reaches $0$.

For example, FINISH_HERE is applicable to the following (often used) ``ping pong'' idiom:

val home=here;
finish
   at (other) async /*Ping*/ {
     S;
     at (home) async /*Pong*/
       S1;
   }
According to the semantics of finish, the statement will terminate when both Ping and Pong have terminated. A naive implementation would require the finish to be informed of the creation of Pong and the termination of Ping. Note, however, that both of these events are remote: we consider at (other) async to be a ``fused'' construct which starts at the home place and finishes at the remote place. The FINISH_HERE implementation will only count the creation of Ping and the termination of Pong (both of these are local events).

FINISH_DENSE
A scalable finish implementation for large place counts using indirect routes for control messages so as to reduce network contention at the expense of latency.

For now, neither the compiler nor the runtime makes any attempt at checking the validity of the pragma. Therefore a pragma, if misused, may result in the spurious (early) termination of the annotated finish or in a deadlock.


Uncounted Async.

There are some situations in which an async may be annotated with @Uncounted (from the x10.compiler package). This annotation tells the compiler and runtime not to perform any of the book-keeping necessary to ensure that the governing finish progresses only after this async has terminated.

There are two principal cases in which the use of Uncounted is recommended. First, the programmer may be able to establish that the lifetime of this async and all asyncs spawned in its dynamic extent is contained within the lifetime of the current activity. One situation in which this happens is when the async corresponds to a remote message send, and on the remote side the body of the message executes some local operations and responds with a message send to the originator. In the meantime, the originator sits waiting in a loop for the return message (e.g., by executing Runtime.probe()). This is a safe use of Uncounted (see chap:Unbalanced).
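
A hedged sketch of this first pattern (Uncounted lives in x10.compiler; other and the completion flag are illustrative):

val done = new Cell[Boolean](false);
val gdone = GlobalRef[Cell[Boolean]](done);
val home = here;
at (other) @Uncounted async {
    // ... local work at the remote place ...
    at (home) @Uncounted async { gdone()() = true; }
}
// no finish tracks the two asyncs; the originator spins, serving
// incoming messages, until the reply flips the flag
while (!done()) Runtime.probe();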

The second situation is one in which the async corresponds to a message send directly implemented in the hardware, and some other reasoning is used to establish that these messages complete in time (see chap:RA).

Atomic and When.

The X10 v2.3.1 implementation of the atomic construct uses a place-wide lock. The lock is acquired for the duration of the atomic section. The when construct is implemented using the same lock. Moreover, every suspended when statement is notified on every exit from an atomic section, irrespective of its condition.

The per-place lock effectively serializes all atomic operations in a place, whether or not they might interfere. This implementation does not scale well beyond a few worker threads. Similarly, the when implementation does not scale well beyond a few occurrences (distinct condition variables).

The X10 standard library provides various atomic classes and locks that enable better scaling. Both the collecting finish idiom and the x10.util.WorkerLocalStorage API may also be used to minimize contention.
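
A hedged sketch contrasting the two approaches (assuming the x10.util.concurrent.AtomicInteger class): the atomic block serializes on the per-place lock, while the atomic class uses a hardware atomic update and scales better under contention.

val counter = new Cell[Int](0);
finish for (i in 1..8) async {
    atomic counter() = counter() + 1;   // place-wide lock
}
val c = new x10.util.concurrent.AtomicInteger(0);
finish for (i in 1..8) async {
    c.incrementAndGet();                // CAS-based, no place-wide lock
}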


X10 v2.3.1 Compilation

When an application programmer writes X10 code that they are intending to execute using Native X10 , their base performance model should be that of C++. Unless discussed below, the expected performance of an X10 language construct in Native X10 is the same as the corresponding C++ construct.

Classes and Interfaces

X10 classes are mapped to C++ classes and the compiler directly uses the C++ object model to implement inheritance, instance fields, and instance methods. Interfaces are also mapped to C++ classes to support method overloading, but the X10 implements relationship is not implemented using the C++ object model. Instead, additional interface dispatch tables (akin to ITables in Java4.9) are generated by the X10 compiler ``outside'' of the core C++ object model. The motivation for this design decision was to stay within the simpler, single-inheritance subset of C++ that minimizes per-object space overheads and also preserves the useful property that a pointer to an object always points to the first word of the object and that no ``this pointer adjustment'' needs to be performed on assignments or during the virtual call sequence.

Non-interface method dispatch corresponds directly to a C++ virtual function call. Interface method dispatch will involve additional table lookups and empirically is 3 to 5 times slower than a virtual function call. C++ compilers typically do not aggressively optimize virtual calls, and will certainly not be able to optimize away the dispatch table lookup used to implement interface dispatch. Therefore, as a general rule, non-final and interface method invocations will not perform as well in Native X10 as they will in Managed X10 .

Unless specially annotated, all class instances will be heap allocated and fields/variables of class types will contain a pointer to the heap allocated object.

Primitives and Structs

The dozen X10 struct types that directly correspond to the built-in C primitive types (int, float, etc.) are implemented by directly mapping them to the matching primitive type. Any X10-level functions defined on these types are implemented via static inline methods. The performance characteristics of the C++ primitive types are exactly those of their X10 counterparts.

All other X10 structs are mapped to C++ classes. However, all of the methods of the C++ class are declared to be non-virtual. Therefore, the C++ class for a struct will not include a vtable word. Unlike object instances, struct instances are not heap allocated. They are instead embedded directly in their containing object or stack-allocated in the case of local variables. When passed as a parameter to a function, a struct is passed by value, not by reference. In C++ terms, a variable or field of some struct type S is declared to be of type S, not S*.

This implementation strategy optimizes the space usage for structs and avoids indirections. Programmers can correctly think of structs as taking only the space directly implied by their instance fields (modulo alignment constraints). However, passing structs, especially large structs, as method parameters or return values is significantly more expensive than passing/returning a class instance. In future versions of X10 we hope to be able to pass structs by reference (at the implementation level) and thus ameliorate this overhead.

Closures and Function Types

An X10 function type is implemented exactly the same as other X10 interface types. An X10 closure literal is mapped to a C++ class whose instance fields are the captured lexical environment of the closure. The closure body is implemented by an instance method of the C++ class. The generated closure class implements the appropriate function type interface. Closure instances are heap allocated. If the optimizer is able to propagate a closure literal to a program point where it is evaluated, the closure literal's body is unconditionally inlined. In many cases this means that the closure itself is completely eliminated as well.

Generics

Generic types in X10 are implemented by using C++'s template mechanism. Compilation of a generic class or struct results in the definition of a templatized C++ class. When the generic type is instantiated in the X10 source, a template instantiation happens in the generated C++ code.

The performance of an X10 generic class is very similar to that of a similar C++ templatized class. In particular, instantiation-based generics enable X10 generic types instantiated on primitives and structs to be space-efficient in the same way that a C++ template instantiated on a primitive type would be.

Memory Management

On most platforms Native X10 uses the Boehm-Demers-Weiser conservative garbage collector as its memory manager. A runtime interface to explicitly free an object is also available to the X10 programmer. The garbage collector is only used to automatically reclaim memory within a single place. The BDWGC does not yield the same level of memory management performance as that of the memory management subsystem of a modern managed runtime. Therefore, when targeting Native X10 the application programmer may need to be more conscious of avoiding short-lived objects and generally reducing the application's allocation rate.

Because the X10 v2.3.1 implementation does not include a distributed garbage collector, if a GlobalRef to an object is sent to a remote place, then the object (and therefore all objects that it transitively refers to) become uncollectable. The life-time of all multi-place storage must currently be explicitly managed by the programmer. This is an area of the implementation that needs further investigation to determine what mix of automatic distributed garbage collection and additional runtime interfaces for explicit storage control will result in the best balance of productivity and performance while still maintaining memory safety.

Other Considerations

In general, Native X10 inherits many of the strengths and weaknesses of the C++ performance model. C++ compilers may have aggressive optimization levels available, but rarely utilize profile-directed feedback. C++ compilers are generally ineffective at optimizing virtual function calls that are not statically bound. Overuse of object-oriented features, interfaces, and runtime type information is likely to reduce application performance more in Native X10 than it does in Managed X10.

The C++ compilation model is generally file-based, rather than program-based. In particular, cross-file inlining (from one .cc file to another) is performed fairly rarely and only at unusually high optimization levels. Since the method bodies of non-generic X10 classes are mostly generated into .cc files, this implies that they are not easily available to be inlined except within their own compilation unit (X10 file). Although this could be mitigated for small programs by generating the entire X10 application into a single .cc file, the single-file approach is not viable for the scale of applications we need Native X10 to support.

Final Thoughts

Clearly, the performance models described in this chapter are not the final and definitive X10 performance model. However, we do believe that the language specification and its implementations are well-enough understood that it is possible for significant X10 programs to be written and for programmers to obtain predictable and understandable performance behavior from those programs. As the X10 implementations continue to mature, we expect to be able to eliminate some of the less desirable features of the X10 v2.3.1 performance models.

We hope that the open discussion of our design decisions in implementing X10 and their implications for its performance will be useful to the X10 programmer community and to the broader research community that is engaged in similar language design and implementation projects.



Footnotes

... version4.1
based on the discussion in [4]
... protocol4.2
x10.io.CustomSerialization
... collector [6]4.3
which is not available in Native X10 v2.3.1
... process.4.4
The X10 v2.3.1 runtime may launch additional processes to monitor the application processes. These helper processes are idle most of the time.
... RDMAs.4.5
RDMA: remote direct memory access.
... variable.4.6
Some X10RT libraries may internally use additional threads for network management. See documentation.
...X10_NO_STEALS.4.7
The X10_NO_STEALS flag essentially turns deep stacks into large collections of mostly-idle threads with smaller stacks, avoiding stack overflow errors. But ultimately, this only matters to unscalable programs of little practical relevance.
... instruction4.8
CAS: compare-and-swap.
... Java4.9
see the description of ``searched ITables'' in Alpern et al. [1]
