Advanced Registration

Early registration deadline: May 16, 2011

(in no particular order)


Parallel Programming: Design of an Overview Class

Christoph von Praun

 

Summary

We designed an introductory parallel programming course at the bachelor level. The class differs from other courses in its structure: The course is organized along the {\em tiers of parallelism}~\cite{Scott:2009_MPEW}. The tiers categorize abstractions and concepts that a software developer can choose when crafting a parallel program. The tiers are from higher to lower abstraction levels: (1) automatic/implicit parallelism, e.g., parallel libraries; (2) deterministic parallelism, e.g., at the level of independent loops; (3) explicitly synchronized, e.g., shared memory with locks; (4) low-level concurrent programming with data races, e.g., lock-free data structures. The goal of the class is to introduce fundamental principles of parallel systems and to expose students to all tiers in the architecture of a parallel system. The course serves as a platform for further exploration in specialized classes.

The course has a significant share of lab sessions and programming projects. We chose the programming language X10 as the core technology and found that it facilitates the learning and rapid application of concepts at different abstraction layers and programming models. The language permits to specify common forms of parallelism, data sharing, distribution, and synchronization with succinct syntax and support for an eclipse-based IDE. We report on our first experience in teaching this course, which resulted in very positive student feedback.

 


A Performance Model for X10 Applications

David Grove, Olivier Tardieu, David Cunningham, Ben Herta, Igor Peshansky and Vijay Saraswat

 

Summary

To reliably write high performance code in any programming language, an application programmer must have some understanding of the performance characteristics of the language's core constructs. We call this understanding a performance model for the language. Performance models are simultaneously defined and informed by the collection of implementation choices embodied in the programming language's compiler and runtime system. In this paper, we describe selected aspects of the implementation of version 2.2 of the X10 programming language with the twin goals of establishing a useful performance model for the X10 application programmer and to document aspects of the implementation that we believe are more generally applicable to other language implementation efforts.

 


X10 on the Single-Chip Cloud Computer

Keith Chapman, Ahmed Hussein and Antony L. Hosking

 

Summary

The Single-Chip Cloud Computer (SCC) is an experimental pro- cessor created by Intel Labs. SCC is essentially a "cluster-on-a- chip", so X10 with its support for places and remote asynchronous invocations is a natural fit for programming this platform. We report here on our experience porting X10 to the SCC, and show perfor- mance and scaling results for representative X10 benchmark appli- cations. We compare results for our extensions to the SCC native messaging primitives in support of the X10 run-time, versus X10 on top of a prototype MPI API for SCC. The native SCC run-time exhibits better performance and scaling than the MPI binding. Scaling depends on the relative cost of computation versus communication in the workload used, since SCC is relatively underpowered for computation but has hardware support for message passing.

 


Compiling X10 to Java

Mikio Takeuchi, Yuki Makino, Kiyokuni Kawachiya, Hiroshi Horii, Toyotaro Suzumura, Toshio Suganuma and Tamiya Onodera

 

Summary

X10 is a new programming language for improving the software productivity in the multicore era by making the parallel/distributed programming easier. X10 programs are compiled into C++ or Java source codes, however, X10 supports various features not supported in Java. To handle them efficiently on Java, new compilation techniques become necessary. This paper discusses issues in translating X10-unique functions to Java, and provides our solutions. By using appropriate implementations, sequential execution performance has been improved by more than 7 times and now it is comparable to Java. Parallel execution performance has also been improved and the gap to Java Fork/Join performance is about 3 times when runs on a single place. Most of the results in this paper have already been incorporated in the latest X10 release 2.1.2. The compilation techniques described in this paper are useful for implementing other programming languages targeted for Java or other managed environments.

 


Object Initialization in X10

Yoav Zibin, David Cunningham, Igor Peshansky and Vijay Saraswat

 

Summary

X10 is an object oriented programming language with a sophisticated type system (constraints, class invariants, non-erased generics, closures) and concurrency constructs (asynchronous activities, multiple places, global references). Object initialization is a crosscutting concern that interacts with all these features in delicate ways that may cause type-, runtime-, and security- errors. This paper discusses possible designs for object initialization, and the “hardhat” design chosen and implemented in X10 version 2.1. Our implementation includes a fixed-point inter-procedural (intra-class) data-flow analysis that infers, for each method called during initialization, the set of fields that are read and those that are asynchronously and synchronously assigned.

 


Work-Stealing by Stealing States from Live Stack Frames of a Running Application

Vivek Kumar, Daniel Frampton, David Grove, Olivier Tardieu and Steve Blackburn

 

Summary

The use of a work stealing scheduler has become a popular approach for providing task parallelism. It is used in many modern parallel programming languages, such as Cilk and X10, which have emerged to address the concerns of parallel programming complexity on modern multicore architectures. There are various challenges in providing an efficient implementation of work-stealing, but in any implementation it must be possible for the thief to access the execution state required to per- form the stolen task. The natural way to achieve this is to save the necessary state whenever a producer creates stealable work. While the ability to provide some degree of parallelism may dominate performance at scale, it is common for the vast majority of potentially stealable work to never actually be stolen, but instead processed by the producer itself. This indicates that to further improve performance we should minimize the overheads incurred in making work available for stealing. We are not the only ones to make this observation, for example X10's current C++ work-stealing implementation stack-allocates state objects and lazily copies them to the heap to avoid unnecessary heap allocation during normal execution. In our context of a Java virtual machine, it is possible to extend this idea further and avoid stack allocating state objects, but instead allow thieves to ex- tract state directly from within stack frames of the producer. This is achieved by using state-map information provided by a cooperative runtime compiler, allowing us to drive down the cost of making state available for stealable work items. We discuss our design and preliminary findings for the implementation of our framework in- side X10 work-stealing runtime and the baseline compiler of Jikes RVM, a high-performance Java research virtual machine.

 


X10 implementation of Parallel Option Pricing with BSDE method

Hui Liu, Ying Peng, DaiZhen Wei and Bin Dai

 

Summary

Option pricing is one of the most important parts in the field of financial derivatives pricing and risk management. To promote precision of pricing, we use an parallel method based on BSDE(Backward Stochastic Differential Equation) model, which is a widely used numerical model in financial computing domain.BSDE model can improve the accuracy and effectiveness of option pricing. X10 is a high level high-performance programming language with new concurrency constructs, namely, places, asyncs, finish, atomic and clock. X10 also provides a rich array language which includes region, distributions and distributed arrays. In this paper, we present how to utilize X10's properties to implement option pricing with BSDE model. Experimental results show that the parallel program implemented in x10 can achieve a superior performance and nicer speedup. It can be widely used in financial area.

 


Distributed deductive databases, declaratively: The L10 logic programming language

Robert Simmons, Frank Pfenning and Bernardo Toninho

 

Summary

We present the preliminary design of L10, a rich forward-chaining (a.k.a. "bottom-up") logic programming language. L10 allows parallel computation to be explicitly specified through the use of *worlds*, a logically-motivated concept that has been used to describe distributed functional programming. An interpreter for L10 runs these logic programs on top of the infrastructure of the X10 programming language, and is responsible for mapping between L10's worlds and *places*, the related X10 construct for describing distributed computation.

 


Using the Cowichan Problems to Investigate the Programmability of X10 Programming System

Jeeva Paudel and Jose Nelson Amaral

 

Summary

In today's era of multicores and clustered architectures, high performance and high productivity are central concerns in the design of parallel programming languages that aim to solve large computational problems. X10 is a language based on state-of-the-art object-oriented programming ideas and claims to take advantage of their proven flexibility and ease- of-use to solve a wide spectrum of programming problems. The Cowichan problems is a set of computational problems that were designed to stress parallel programming environments and to assess their programmability. This paper uses Cowichan problems to assess the flexibility of X10.

 


GPU Programming in a High Level Language Compiling X10 to CUDA

David Cunningham, Rajesh Bordawekar and Vijay Saraswat

 

Summary

GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API.

While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model is typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low level and thus prone to the kinds of errors that one does not encounter in higher level languages. Programmers must explicitly deal with moving data back-and-forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single source programming model that can be used to write clean, efficient code for heterogeneous, multi-core and scale-out architectures.

The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10, and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80\% of hand-written CUDA kernels.

We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters.

 


Phaser Beams: Integrating Stream Parallelism with Task Parallelism

Jun Shirako, David Peixotto, Dragos Sbirlea and Vivek Sarkar

 

Summary

Current streaming languages place significant restrictions on the structure of parallelism that they support, and usually do not allow for dynamic task parallelism. In contrast, there are a number of task-parallel programming models that support dynamic parallelism but lack the ability to set up efficient streaming communications among dynamically varying sets of tasks. We address this gap by introducing Phaser Beams as a foundation for integrating stream parallelism with task parallelism. Phaser Beams builds on past work on accumulators and point-to-point synchronization in Habanero-Java (HJ) phasers, which in turn was derived from X10 clocks.

Phaser Beams introduce three key extensions relative to past work: 1) a bounded phaser that limits the maximum phase difference between a producer and a consumer synchronizing on that phaser and the buffer size needed to support streaming, 2) an extension to accumulators to work in non-barrier mode with bounded phasers for use in streaming, and 3) a dynamic cycle-detection algorithm to phaser registration to detect cyclic structures in a streaming program so as to enable efficient batching optimizations for acyclic structures. These extensions could easily be incorporated in a future version of X10.

Our preliminary Java-based implementation of Phaser Beams is restricted to the single node case, and the results obtained on three multicore SMPs are encouraging. As a calibration of the baseline performance of phaser synchronization, the performance of barriers in HJ phasers was found to be significantly faster than Java's CyclicBarrier and the Java and C++ implementations of X10's clocks, when measured using the BarrierBench benchmark.

We evaluate our Phaser Beams implementation using four streaming applications (FilterBank, FMRadio, BeamFormer from StreamIt benchmarks, and FacilityLocation to demonstrate combination of task and stream parallelism). The results are encouraging; the Java-based Phaser Beams implementation shows comparable performance to the C-based implementation of StreamIt.