Authoring User-Defined Domain Maps in Chapel∗

Bradford L. Chamberlain    Sung-Eun Choi    Steven J. Deitz    David Iten    Vassily Litvinov
Cray Inc.
chapel_info@cray.com

Abstract

One of the most promising features of the Chapel parallel programming language from Cray Inc. is its support for user-defined domain maps, which give advanced users control over how arrays are implemented. In this way, parallel programming experts can focus on details such as how array data and loop iterations are mapped to a target architecture's nodes, while end-users can benefit from their efforts when using Chapel's high-level global array operations. Chapel's domain maps also support control over finer-grained decisions like what memory layout to use when storing an array's elements on each node. In this paper, we provide an introduction to Chapel's user-defined domain maps and summarize the framework for specifying them.

1 Introduction

Chapel [13, 7] is a parallel programming language being developed by Cray Inc. with the goal of improving programmer productivity compared to conventional programming notations for HPC, like MPI, OpenMP, UPC, and CAF. Chapel's goal is to greatly improve upon the degree of programmability and generality provided by current technologies, while supporting performance and portability that is similar or better. Chapel has a portable implementation that is available under the BSD license and is being developed as an open-source project at SourceForge.¹

One of Chapel's most attractive features for improving productivity is its support for global-view arrays, which permit programmers to apply natural operations to an array even though its implementation may potentially span multiple distributed-memory nodes. Unlike previous languages that have supported global-view arrays (e.g., HPF, ZPL, UPC), Chapel allows advanced users to author their own parallel array implementations. This permits them to specify the array's distribution across nodes, its layout within a node's memory, its parallelization strategy, and other important details.

In Chapel, these user-defined array implementations are known as domain maps because they map a domain—Chapel's representation of an array's index set—down to the target machine. Domain maps that target a single shared memory segment are known as layouts while those that target multiple distinct memory segments are referred to as distributions. All of Chapel's domain maps are written within Chapel itself. As a result, they can be implemented using Chapel's control-oriented productivity features, including task parallelism, locality control, iterator functions, type inference, object-orientation, and generic programming.

In previous work, we introduced Chapel's philosophy for user-defined distributions and provided a very high-level view of the software framework used to specify them [10]. In this paper, we describe the framework at the next level of detail in order to provide a better picture of what is involved in authoring a domain map. For even more detail, the reader is referred to the documentation for domain maps within the Chapel release itself.

This paper is organized as follows: The next section provides a brief summary of our goals for supporting user-defined domain maps. Following that, we provide a summary of related work, contrasting our approach with previous efforts to support global-view arrays and user-defined distributions. Section 4 provides a brief introduction to Chapel to support and motivate the domain map framework. The detailed description of our framework for defining user-defined domain maps is given in Section 5. Section 6 describes our implementation status while Section 7 summarizes and lists future work.

∗ This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001.
¹ http://sourceforge.net/projects/chapel/

2 Goals of This Work

The overarching goal of this work is for the Chapel language to support a very general and powerful set of distributed global-view array types, while also providing advanced users with the ability to implement those arrays in whatever ways they see fit. We adopt a language-based solution both for the improved syntax it supports and to expose optimization opportunities to the compiler.
A well-written domain map should support a plug-and-play quality in which one distribution or layout can be substituted for another in a line or two of declaration code. Meanwhile, any global-view operations on the arrays themselves can remain unchanged, thereby insulating the specification of the parallel algorithm from its implementation details. As a simple motivating example, consider the following STREAM Triad kernel, which computes a scaled vector addition:

      const D = [1..n];        // the problem space
      var A, B, C: [D] real;   // the three vectors
      A = B + alpha * C;       // the computation

As written here, the domain D representing the problem space is declared without any domain map specification. As a result, it is implemented using Chapel's default layout. This causes arrays A, B, and C to be allocated using memory local to the current node. Moreover, the parallel computation itself will be implemented using the processor cores of that node. Thus, we have written a Chapel program suitable for desktop multicore programming.

To target a large-scale distributed-memory system, we can simply change the domain's value to include a distribution:

      const D = [1..n] dmapped Cyclic(startIdx=1);
      var A, B, C: [D] real;   // the three vectors
      A = B + alpha * C;       // the computation

Here, we specify that the domain's indices should be mapped to the target architecture (dmapped) using Chapel's standard Cyclic distribution. This will deal its indices out to the nodes on which the program is executing in a round-robin manner, starting from the specified index, 1. The array declaration and computation lines need not change since they are independent of the implementation details. Finally, we can switch to another distribution simply by changing the domain map. For example, a Block distribution could be specified as follows:

      const D = [1..n] dmapped Block([1..n]);
      var A, B, C: [D] real;   // the three vectors
      A = B + alpha * C;       // the computation

Because of this plug-and-play characteristic, Chapel supports a separation of roles: Expert HPC programmers can wrestle with the details of implementing efficient parallel data structures within the domain map framework itself. Meanwhile, parallel-aware end-users benefit from their efforts by writing parallel computations using Chapel's high-level notation, without having to understand (or even look at) the low-level implementation details. While this STREAM Triad example is quite simple, the same principle applies to more complex parallel operations as well, such as loops, stencil computations, reductions, etc.

Another important theme in this work is that all of Chapel's standard distributions and layouts will be implemented using the same framework that a typical Chapel programmer would use. We take this approach as a means of ensuring that users can write user-defined domain maps with good performance. It also safeguards against creating a performance cliff when moving from the set of standard domain maps to a user-defined one.

Finally, whereas previous distributed array capabilities have been fairly dimensional and static in nature, we want to ensure that our framework can support very general, holistic, and dynamic implementations of distributed index sets and arrays. Motivating examples include multidimensional recursive bisections, graph partitioning algorithms, and arrays supporting dynamic load balancing.

For a more detailed description of our motivating themes, as well as samples of distributions that our framework was designed to support, please refer to our previous paper [10].

3 Related Work

Chapel's global-view arrays are most closely related to those provided by ZPL [28, 29] and the High Performance Fortran family of languages [22, 23, 18, 11, 1]. Each of these languages supports the concept of an array type that is declared and operated upon as though it is a single logical array even though its implementation may distribute the elements among the disparate memories of multiple distributed-memory nodes. Chapel extends the support for dense and sparse rectangular arrays in HPF/ZPL to include associative arrays that map from arbitrary value types to array elements. Chapel also supports unstructured arrays that are designed for compactly computing over pointer-based data structures. Chapel supports a language concept for representing first-class index sets like ZPL's region but calls it a domain. As in ZPL, domains serve as the foundation for defining and operating on global-view arrays.

HPF and ZPL were both fairly restricted in terms of the distributions they supported. HPF supported a small set of regular, per-dimension distributions—Block, Cyclic, and Block-Cyclic—while ZPL traditionally supported only multidimensional Block distributions. In both cases, the distributions were defined by the language and implemented directly in its compiler and runtime. This permitted the language implementors to optimize for their distributions, yet did not provide any means of specifying more general distributions or data structures. In contrast, Chapel provides a general framework for defining distributed arrays and implements all of its standard distributions using the same framework that is available to end-users.
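In one dimension, the Block and Cyclic policies named above reduce to simple index-to-locale formulas. The following Python sketch models them using common formulations of those mappings; the function and parameter names are ours for illustration and are not part of Chapel or any of these languages:

```python
def cyclic_locale(i, start_idx, num_locales):
    # Cyclic: deal indices out round-robin, starting from start_idx.
    return (i - start_idx) % num_locales

def block_locale(i, lo, hi, num_locales):
    # Block: carve the bounding range lo..hi into contiguous,
    # roughly equal-sized chunks, one chunk per locale.
    n = hi - lo + 1
    return ((i - lo) * num_locales) // n

# Indices 1..8 mapped across 3 locales:
print([cyclic_locale(i, 1, 3) for i in range(1, 9)])   # [0, 1, 2, 0, 1, 2, 0, 1]
print([block_locale(i, 1, 8, 3) for i in range(1, 9)])  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Both policies answer the same question (which locale owns index i), which is what allows one distribution to be substituted for another without touching the array declarations or computations that use it.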
Subsequent work in both HPF and ZPL sought to improve upon their limitations. HPF-2 added support for indirect distributions that could support arbitrary mappings of data to processors [35, 20], though arguably at great cost in space and efficiency. Other extensions to HPF proposed support for distributed compressed sparse row (CSR) arrays by having the programmer write code in terms of distributed versions of the underlying 1D data vectors [32]. Late in the ZPL project, a lattice of distribution types was designed to extend ZPL's generality [15]; however, only a few of these distributions were ever implemented. None of these efforts provide as much generality and flexibility as Chapel's user-defined domain map framework, which is designed to support extremely general array implementations via a functional interface rather than a predefined mapping interface within the language itself.

Unified Parallel C (UPC), a parallel dialect of C, also supports language-based global-view arrays [17], yet it is limited to 1D arrays distributed in a block-cyclic manner. Unlike HPF and ZPL, UPC also supports global pointers that give users the ability to build their own general distributed data structures. This is similar to Chapel's support for programming at lower levels of the language [6], yet without the ability to build up global-view abstractions supported by the language's syntax and compiler. UPC also has the disadvantage of only supporting SPMD-style parallelism compared to Chapel's general multithreaded parallelism.

Unlike UPC, the two other traditional partitioned global address space (PGAS) languages, Co-Array Fortran (CAF) and Titanium, provide multidimensional arrays [25, 34]. However, by our definition their arrays are not global-view since users work with distributed arrays as an array of local per-process arrays rather than a single logical whole. Recent work by Numrich focuses on creating better abstractions for global arrays in CAF by building abstract data types that wrap co-arrays, providing a more global-view abstraction [26]. While this approach bears some similarity to ours, our distributions differ in that they are defined by the Chapel language and therefore known to the compiler and runtime, supporting analysis and optimization. We consider this knowledge key for providing very general operations on global-view arrays without severely compromising performance.

Within the HPC community, a number of interesting data distributions and partitioning schemes have been implemented throughout the years, the most successful of which have typically been libraries or standalone systems [3, 4, 21, 19, 14, 33]. Our goal is not to supplant such technologies, but rather to provide a framework in which they can be used to implement global-view data structures supporting language-based operations. If our approach is successful, not only should the domain map framework be rich enough to support all of these data structures, but Chapel itself should serve as an ideal language for implementing next-generation versions of these technologies.

[Figure 1: A notional diagram of Chapel's multiresolution design.]

4 Chapel Overview

This section provides a brief overview of Chapel as context for this work. For further information on Chapel, the reader is referred to other language overviews [13, 31, 7].

4.1 Language Overview

Chapel was designed with a multiresolution approach in which higher-level features such as parallel arrays and domain maps are built in terms of lower-level features that support task parallelism, locality control, iteration, and traditional language concepts [6]. The goal of this approach is to permit programmers to use high-level abstractions for productivity while also permitting them to drop down closer to the machine when required for reasons of control or performance. This is in stark contrast to previous global-view languages like HPF and ZPL in which users had little recourse if the supported set of arrays and distributions failed them. In particular, these languages did not provide any lower-level capabilities for dynamically creating new parallel tasks, nor for having the existing parallel tasks take divergent code paths, create arbitrary per-node data structures, and so forth.

In contrast, Chapel has been designed and implemented in a layered manner in which the higher-level features are implemented in terms of the lower-level ones. Figure 1 provides a notional view of Chapel's language layers, each of which is summarized in the following paragraphs.

Locality Features  At the lowest level of Chapel's concept stack are locality-oriented features. Their goal is to permit users to reason about the placement and affinity of data and tasks on a large-scale distributed-memory machine. The central concept in this area is the locale type.
A locale abstractly represents a unit of the target architecture that supports processing and storage. On conventional machines, a locale is typically defined to be a single node, its memory, and its multicore/SMP processors. Chapel programmers can reference a built-in array of locales that represents the compute resources on which their program is running. They can make queries about the compute resources via a number of methods on the locale type. In addition, they can control the placement of variables and tasks on the machine's nodes using on-clauses, either in an explicit or data-driven manner.

Base Language  On top of the locality layer sits the sequential base language with support for generic programming, static type inference, ranges, tuple types, CLU-style iterators [24], and a rich compile-time language for metaprogramming. The base language supports object-oriented programming (OOP), function overloading, and rich argument-passing capabilities including name-based argument matching and default values. All of these features are embedded in an imperative, block-structured language with support for fairly standard data types, expressions, and statements. Although our user-defined domain map framework could be implemented using existing OOP languages such as Java, C#, or C++, we believe that Chapel's base language features support the complexity of implementing domain maps far more productively.

Task Parallelism  The next level contains features for task parallelism to support the creation and synchronization of parallel tasks. Chapel tasks can be created in an unstructured manner using the begin keyword. Common cases for supporting groups of tasks are supported by structured cobegin and coforall statements. Chapel tasks synchronize in a data-oriented manner, either using synchronization variables that support full/empty semantics [2] or using transactional sections that can be implemented using hardware or software techniques [30]. Unlike most conventional parallel programming models that rely on a cooperating executables model of parallelism, Chapel tasks are very flexible, supporting dynamic and nested parallelism.

Data Parallelism and Domain Maps  Toward the top of Chapel's feature stack are data parallel concepts, which are based on Chapel's arrays (described in detail in the following subsection). Chapel's data parallel operations include parallel loops over arrays and iteration spaces, reduction and scan operations, and whole-array computations through the promotion of scalar functions and operators. The arrows in Figure 1 indicate an interdependent relationship between Chapel's domain maps and arrays in that all Chapel arrays are implemented in terms of domain maps, most of which are themselves implemented using simpler domains and arrays. To break this cycle, Chapel supports a default array layout that is implemented in terms of a C-style primitive data buffer.

4.2 Arrays, Domains, and Domain Maps

A Chapel array is a one-to-one mapping from an index set to a set of variables of arbitrary but homogeneous type. Chapel supports dense or strided rectangular arrays that provide capabilities like those of Fortran 90. It also supports associative arrays, whose indices are arbitrary values, in order to support dictionary or key-value collections. A third class of array supports unstructured data aggregates such as sets and unstructured graphs. All of these array types are designed to support sparse varieties in which a large subset of the indices map to an implicitly replicated value (often zero in practice).

Chapel arrays are defined using a domain—a first-class language concept representing an index set. Chapel's domains are a generalization of the region concept pioneered by the ZPL language [5]. In Chapel, domains can be named, assigned, and passed between functions. Domains support iteration, intersection, set-oriented queries, and operations for creating other domains. They are also used to declare, slice, and reallocate arrays. We refer to dense rectangular domains as regular due to the fact that their index sets can be represented using O(1) storage via bounds, striding, and alignment values per dimension. In contrast, other domain types are considered irregular since they typically require storage proportional to the number of indices.

The main benefit of introducing a domain-like concept into a parallel language is that it simplifies reasoning about the implementation and relative alignment of groups of arrays, both for the compiler and for users. Domains also support the amortization of overheads associated with storing and operating on arrays since multiple arrays can share a single domain.

Chapel domain maps specify the implementation of domains and their associated arrays in the Chapel language. If a domain map targets a single locale's memory, it is called a layout. If the domain map targets a number of locales, we refer to it as a distribution.

Layouts tend to focus on details like how a domain's indices or an array's elements are stored in memory, or how a parallel iteration over the domain or array should be implemented using local processor resources. Distributions specify those details as well, but also map the indices and elements to distinct locales. In particular, a distribution maps a complete index space—such as the set of all 2D 64-bit integer indices—to a user-specified set of target locales. When multiple domains share a single distribution, they are considered to be aligned since a given index will map to the same locale for each domain. Just as domains permit the amortization of overheads associated with index sets across multiple arrays, distributions support the amortization of overheads associated with distributing aligned index sets.
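The alignment property can be made concrete with a small executable model: two domains that consult one shared distribution object necessarily agree on which locale owns any index they have in common. The Python sketch below is our own schematic stand-in, not Chapel's dmap machinery:

```python
class CyclicDist:
    """Toy model of a shared 1D Cyclic distribution: a single mapping
    from the complete index space to a fixed set of target locales."""
    def __init__(self, start_idx, target_locales):
        self.start_idx = start_idx
        self.targets = target_locales

    def idx_to_locale(self, i):
        return self.targets[(i - self.start_idx) % len(self.targets)]

M = CyclicDist(start_idx=1, target_locales=["L0", "L1", "L2"])
D1 = range(0, 10)   # a domain over 0..9
D2 = range(1, 9)    # an aligned domain over 1..8

# Each domain's view of ownership is derived from the same shared map...
owner1 = {i: M.idx_to_locale(i) for i in D1}
owner2 = {i: M.idx_to_locale(i) for i in D2}
# ...so every common index lands on the same locale in both domains.
print(all(owner1[i] == owner2[i] for i in set(D1) & set(D2)))  # True
```

Because both ownership maps derive from the single object M, element-wise operations pairing arrays over D1 and D2 at a common index need no communication, which is the amortization benefit just described.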
   An array’s elements are mapped to locales according to              In practice, domain maps that represent distributions
its defining domain’s domain map. In this way, a single             tend to allocate additional descriptors on a per-locale basis
domain map can be used to declare several domains, while            in order to store state describing that locale’s portion of the
each domain can in turn define multiple arrays. Chapel also         larger data structure. This helps make the domain map scal-
supports subdomain declarations, which support semantic             able by avoiding the need for O(numIndices) storage in
reasoning about index subsets.                                      any single descriptor. To distinguish these descriptors, we
   As an example, the following Chapel code declares a              refer to the three primary descriptors as global descriptors
Cyclic distribution named M followed by a pair of aligned           and any additional per-locale descriptors as local descrip-
domains D1 and D2 followed by a pair of arrays for each             tors. It is worth noting that the Chapel compiler only knows
domain:                                                             about the global descriptors; any local descriptors are sim-
      const M = new dmap(new Cyclic(...));                          ply a specific type of state that a developer may choose to
                                                                    help represent the descriptor’s state.
      const D1 = [0..n+1] dmapped M,                                   The following three sections describe these descriptors,
            D2 = [1..n] dmapped M;
                                                                    their state, and their primary DSI routines in more detail.
      var A1, B1: [D1] int,                                         Following that, Section 5.4 describes other descriptor inter-
          A2, B2: [D2] real;                                        faces beyond the required set. For a more detailed technical
                                                                    description of the descriptors and current DSI routines, the
   Chapel’s domain maps do more than simply map in-
                                                                    interested reader is referred to technotes/README.dsi
dices and array elements to locales, however. They also
                                                                    in Chapel release’s documentation.
specify how indices and array elements should be stored
within each locale, and how to implement Chapel’s array
and domain operations on the data structures. In this sense,        5.1    Domain Map Descriptors
Chapel domain maps are recipes for implementing parallel,
distributed, global-view arrays. The next section provides          The domain map descriptor stores any state required to
more detail on how our framework supports user-defined              characterize the domain map as a whole. Examples might
domain maps.                                                        include whether the domain map uses a row- or column-
                                                                    major-order storage layout; the indices to be blocked be-
5    Domain Map Framework

Creating a user-defined domain map in Chapel involves writing a set of three descriptors that collectively implement Chapel's Domain map Standard Interface (or DSI for short). These DSI routines are invoked from the code generated by the Chapel compiler to implement an end-user's operations on global-view domains and arrays. The descriptors and DSI routines can be viewed as the recipe for implementing parallel and distributed arrays in Chapel. As such, they define how to map high-level operations like Chapel's forall loops down to the per-processor data structures and methods that are required to implement them on a distributed memory architecture.
   In practice, these descriptors are implemented using Chapel classes. Their methods implement the DSI routines using lower-level Chapel features such as iterators, task parallelism, and locality-oriented features. The classes themselves are generic with respect to characteristics like the domain's index type, the rank of the domain, and the array's element type.
   The three descriptors are used to represent the Chapel concepts of (1) domain map, (2) domain, and (3) array, respectively.

5.1    Domain Map Descriptors

A domain map descriptor can store whatever state it requires to characterize its mapping: for example, how indices are partitioned between locales for a block distribution; the start index for a cyclic or block-cyclic distribution; the block size to be used in a tiled layout or block-cyclic distribution; or the tree of cutting planes used for a multidimensional recursive bisection. For distributions, the global domain map descriptor will also typically store the set of locales that is being targeted.
   Since domain maps may be represented using arbitrary characteristics, the constructors for domain map descriptors are invoked explicitly by end-users in their Chapel programs. In our current implementation, this constructor must be wrapped in a new instance of the built-in dmap type, representing a domain map value.
   In practice, global distribution descriptors use local descriptors to store values that are specific to the corresponding locale. For example, local descriptors may store the subset of the index space owned by that locale; for irregular distributions like graph partitioning algorithms, they may store a subset of the complete distribution's state for the purposes of scalability.
   Figure 2 illustrates sample distribution descriptors. It shows a Chapel declaration that creates a new 1D instance of the Cyclic distribution, specifying the starting index as 1 and targeting locales 0, 1, and 2 (denoted in the example as L0, L1, and L2). The left column shows a conceptual view

               Figure 2: An illustration of the distribution descriptors for an instance of the Cyclic distribution.

of this distribution, illustrating that index 1 is mapped to L0 while all other 1D indices are cyclically mapped to the locales in both directions. The middle column shows the global descriptor's state, which stores the start index and the set of target locales. The right column shows the local descriptors corresponding to the three target locales, each of which stores the index subset owned by that locale.
   The key DSI routines required from a domain map descriptor are as follows:

Index Ownership   The domain map descriptor must support a method, dsiIndexToLocale(), which takes an index as an argument and returns the locale that owns the index. This is used to implement the idxToLocale query that Chapel users can make to determine where a specific index is stored. It is also used to implement other operators on domains and arrays.

Create Domain Descriptors   A domain map descriptor also serves as a factory class that creates new domain descriptors for each domain value created using its domain map. For example, for each rectangular domain variable in a Chapel program, the compiler will generate an invocation of dsiNewRectangularDom() on the corresponding domain map descriptor, passing it the rank, index type, and stridability parameters for the domain value. Similar calls are used to create new associative, unstructured, or sparse domains. Each routine returns a domain descriptor that serves as the runtime representation of that domain value. They also allocate any local domain descriptors, if applicable. Typically, each domain map descriptor will only support the creation of a single type of domain, such as rectangular or associative.

5.2    Domain Descriptors

A domain descriptor is used to represent each domain value in a Chapel program. As such, its main responsibility is to store a representation of the domain's index set. For layouts and regular domains, the complete index set representation is typically stored directly within the descriptor. For distributions of irregular domains that require O(numIndices) storage to represent the index set, a distribution will typically store only summarizing information in its global descriptor. The representation of the complete index set is spread between its associated local descriptors in order to achieve scalability.
   For both regular and irregular distributions, local domain descriptors are often used to store the locale's local index subset. These local indices are often represented using a non-distributed domain field of a matching type.
   Figure 3 shows a domain declaration that is mapped using the Cyclic distribution created in Figure 2. The left column illustrates that the domain value conceptually represents the indices 0 through 6. The center column shows that the global descriptor stores the complete index set as 0..6. Since this is a regular domain, the global descriptor can afford to store the complete index set since it only requires O(1) space. Finally, the local descriptors store the subset of indices owned by each locale, as defined by the distribution M. In practice, these descriptors use a domain field to represent the strided 1D index set locally using the default layout.
   The key DSI routines required of the domain descriptors are as follows:

Query/Modify Index Set   A domain descriptor must support certain methods that permit its index set to be queried and modified. To implement assignment of rectangular domains, the Chapel compiler generates a call to dsiGetIndices() on the source domain descriptor, passing the result to dsiSetIndices() on the target domain. These routines return and accept a tuple of ranges to represent the index set in an implementation-independent representation. This supports assignments between domains with distinct distributions or layouts.
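To make the domain map descriptor's role concrete, the following is a rough sketch of a global descriptor for a 1D cyclic mapping. The class name, fields, and method bodies here are illustrative assumptions, not the actual Cyclic implementation shipped with the Chapel release:

```chapel
// Sketch of a global domain map descriptor for a 1D cyclic mapping
// (names and bodies are illustrative assumptions only).
class CyclicSketch {
  const startIdx: int;           // start index for the cycle
  const targetLocs = Locales;    // the set of locales being targeted

  // Index Ownership: return the locale that owns a given index
  proc dsiIndexToLocale(ind: int) {
    return targetLocs[mod(ind - startIdx, targetLocs.size)];
  }

  // Factory routine: create the domain descriptor for a rectangular
  // domain declared using this domain map
  proc dsiNewRectangularDom(param rank: int, type idxType,
                            param stridable: bool) {
    // allocate a domain descriptor (plus any per-locale local
    // descriptors) and return it -- body omitted in this sketch
  }
}
```

An end-user would then wrap such a descriptor in a domain map value, e.g. `var M = new dmap(new CyclicSketch(startIdx=1));`, matching the dmap-wrapping requirement described earlier.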

Figure 3: An illustration of the domain descriptors for a domain D defining an index set that is mapped using the Cyclic
distribution M from Figure 2.
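In code, the declaration illustrated in Figure 3 might look like the following sketch (written in present-day Chapel syntax, which may differ slightly from the syntax used at the time of the figure):

```chapel
// A 1D domain mapped using the Cyclic domain map M from Figure 2;
// each local descriptor stores its locale's strided subset of 0..6.
var D: domain(1) dmapped M = {0..6};
```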

   Irregular domains support more general dsiAdd() and dsiRemove() methods that can be used to add or remove indices from the sets they represent. They also support a dsiClear() method that empties the index set.
   When a distribution uses local domain descriptors, these routines must also partition the new indices between the target locales and update the local representations appropriately.

Query Index Set Properties   Domain descriptors also support a number of methods that implement queries on the domain's index set. For example, dsiMember() queries whether or not its argument index is a member of the domain's index set. It is used for operations like array bounds checking and user membership queries. Another routine, dsiNumIndices(), is used to query the size of a domain's index set. Rectangular domains support additional queries to determine the bounds and strides of their dimensions.

Iterators   Domain descriptors must provide serial and parallel iterators that generate all of the indices described by their index set. The compiler generates invocations of these iterators to implement serial and parallel loops over domain values. Parallel iterators for distributions will typically be written such that each locale generates the local indices that it owns. Parallel iteration is a fairly advanced topic in Chapel due to its use of a novel leader/follower iterator strategy to support zippered parallel iteration, which is beyond the scope of this paper.

Create Array Descriptors   Domain descriptors serve as factories for array descriptors via the dsiBuildArray() method. This call takes the array's element type as its argument and is generated by the compiler whenever a new array variable is created. The dsiBuildArray() routine allocates storage for the array elements and returns the array descriptor that will serve as the runtime representation of the array. If applicable, dsiBuildArray() also allocates the local array descriptors which, in turn, allocate local array storage.

5.3    Array Descriptors

Each array value in a Chapel program is represented by an array descriptor at runtime. As such, its state must represent the collection of variables representing the array's elements. Since arrays require O(numElements) storage by definition, distributions will typically farm the storage for these variables out to the local descriptors, while layouts will typically store the array elements directly within the descriptor. The actual array elements are typically stored within a descriptor using a non-distributed array declared over a domain field from the corresponding domain descriptor.
   Figure 4 shows a Chapel array of integer values defined over the domain from Figure 3. The left column illustrates that an integer variable is allocated for each index in the domain. The middle column shows the global descriptor, which represents the array's element type but no data values. Instead, the array elements are stored in the local array descriptors. Each one allocates storage corresponding to the indices it owns according to the domain descriptors.
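Continuing the running Cyclic example, a global domain descriptor covering the routines above might be sketched as follows. All names and bodies are illustrative assumptions rather than code from the Chapel release:

```chapel
// Sketch of a global domain descriptor for a 1D regular domain
// (illustrative assumptions only).
class CyclicDomSketch {
  var wholeDom: domain(1);   // complete index set; O(1) to store
                             // for a regular domain

  // Query Index Set Properties
  proc dsiMember(ind: int) {
    return wholeDom.contains(ind);
  }
  proc dsiNumIndices {
    return wholeDom.size;
  }

  // Serial iterator over the index set; the parallel version would
  // typically have each locale yield the indices that it owns
  iter these() {
    for ind in wholeDom do yield ind;
  }

  // Create Array Descriptors: allocate element storage and return
  // the array descriptor (body omitted in this sketch)
  proc dsiBuildArray(type eltType) { }
}
```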


  Figure 4: An illustration of the array descriptors for an array A of integers defined in terms of the domain D from Figure 3.
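The array in Figure 4 corresponds to a declaration like the following sketch; because D is mapped using the Cyclic distribution, each element's storage lives in the local array descriptor on the locale that owns its index:

```chapel
// An integer array declared over the distributed domain D from
// Figure 3; element storage is spread across the local descriptors.
var A: [D] int;

// Element accesses like this one are implemented via the array
// descriptor's dsiAccess() method:
A[3] = 42;
```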

   Though not shown in these figures, a single distribution can support multiple domain descriptors, and each domain descriptor can in turn support multiple arrays. In this way, the overhead of representing an index set or domain map can be amortized across multiple data structures. This organization also gives the user and compiler a clear picture of the relationship and relative alignment between logically related data structures in a Chapel program.
   The following DSI routines must be implemented for array descriptors:

Array Indexing   The dsiAccess() method implements random access into the array, taking an index as its argument. It determines which array element variable the index corresponds to and returns a reference to it. In the most general case, this operation may require consulting the domain and/or domain map descriptors to locate the array element's locale and memory location.

Iterators   The array descriptor must provide serial and parallel iterators to generate references to its array elements. Invocations of these iterators are generated by the compiler to implement serial and parallel loops over the corresponding array. As with domains, the parallel iterator will typically yield each array element from the locale on which it is stored.

Slicing, Reindexing, and Rank Change   Chapel supports array slicing, reindexing, and rank change operators that can be used to refer to a subarray of values, potentially using a new index set. These are supported on array descriptors using the dsiSlice(), dsiReindex(), and dsiRankChange() methods, respectively. Each of these methods returns a new array descriptor whose variables alias the elements stored by the original array descriptor. In the case of reindexing and rank change, new domain and/or domain map descriptors may also need to be created to describe the new index sets and mappings.

5.4    Non-Required Descriptor Interfaces

In addition to the required DSI routines outlined in the preceding sections, Chapel's domain map descriptors can support two additional classes of routines: optional and user-oriented. Optional interface routines implement descriptor capabilities that are not required from a domain map implementation, but which, if supplied, can be used by the Chapel compiler to generate optimized code.
   User-oriented interface routines are ones that an end-user can manually invoke on any domains or arrays that they create using the domain map. These permit the domain map author to expose additional operations on a domain or array implementation that are not inherently supported by the Chapel language. The downside of relying on such routines, of course, is that they make client programs brittle with respect to changes in their domain maps, since the methods are not part of the standard DSI interface. This is counter to the goal of having domain maps support a plug-and-play characteristic; however, it also provides an important way for advanced users to extend Chapel's standard operations on domains and arrays. By definition, the Chapel compiler does not know about these interfaces and therefore will not generate implicit calls to them.
   We give some examples of optional interfaces in the paragraphs that follow. We anticipate that this list of optional interfaces will grow over time.

Privatization Interface   Most user-level domain and array operations are implemented via a method call on the
global descriptor. By default, the global descriptor is allocated on the locale where the task that encounters its declaration is running. If a DSI routine is subsequently invoked from a different locale than the one on which the descriptor is stored, communication will be inserted by the compiler. As an example, indexing into an array whose global descriptor is on a remote locale will typically require communication, even if the index in question is owned by the current locale.
   To reduce or eliminate such communications, Chapel's domain map framework supports an optional privatization interface on the global descriptors. If the interface is implemented, the Chapel compiler will allocate a privatized copy of the global descriptor on each locale and redirect any DSI invocations on that descriptor to the privatized copy on the current locale.
   Each of the global descriptor classes can support privatization independently of the others. A descriptor indicates whether it supports privatization by defining the compile-time method dsiSupportsPrivatization() to return true or false. If a descriptor supports privatization, it must provide the method dsiPrivatize() to create a copy of the original descriptor and the method dsiReprivatize() to update a copy when the original descriptor (or its privatized copy) changes. The Chapel implementation invokes these methods at appropriate times; for example, when a domain is assigned, its descriptors will be re-privatized.

Fast Follower Interface   A second optional interface in our current implementation is the fast follower interface. This interface supports optimized iteration over domains and arrays by eliminating overheads related to unnecessarily conservative runtime communication checks. To implement this interface, a domain map needs to support the ability to perform a quick alignment check before starting a loop to see whether or not it is aligned with the other data structures in question. As with any optional routines, failing to provide the interface does not impact the domain map's correctness or completeness, only the performance that may be achieved.

Other Optional Interfaces   As the Chapel compiler matures in terms of its communication optimizations, we anticipate that a number of other optional interfaces will be added. One important class of anticipated interfaces will support common communication patterns such as halo exchanges, partial reductions, and gather/scatter idioms. At present, these patterns are all implemented by the compiler in terms of the required DSI routines. In practice, this tends to result in very fine-grain, demand-driven communication, which is suboptimal for most conventional architectures. As we train the Chapel compiler to optimize such communication idioms in a user's code, it will target these optional interfaces when provided by the domain map author. In this manner we anticipate supporting communication optimizations similar to those in our previous work with ZPL [12, 9, 16].

6    Implementation Status

Our open-source Chapel compiler implements the domain map framework described in this paper and uses it extensively. As stated in Section 2, all of the standard layouts and distributions in our implementation have been written using this framework to avoid creating an unfair distinction between “built-in” and user-defined array implementations. This section describes our current uses of the domain map framework within the Chapel code base. Most of these distributions can be browsed within the current Chapel release in the modules/internal, modules/layouts, and modules/dists directories.

6.1    Default and Standard Layouts

Each of Chapel's domain types (rectangular, sparse, associative, and unstructured) has a default domain map that is used whenever the programmer does not specify one. By convention, these domain maps are layouts that target the locale on which the task evaluating the declaration is running. Our default rectangular layout stores array elements using a dense contiguous block of memory, arranging elements in row-major order. Its sparse counterpart stores domains using coordinate list (COO) format, and it stores arrays using a dense vector of elements. The default associative layout stores domains using an open addressing hash table with quadratic probing and rehashing to deal with collisions. As with the sparse layout, arrays are stored using a dense vector of values. Our unstructured domain map builds on the associative case by hashing on internal addresses of dynamically-allocated elements. Operations on these default layouts are parallelized using the processor cores of the current locale.
   In addition to our default layouts, we have implemented some standard layouts that are included in the Chapel code base and can be selected by a programmer. One is an alternative to the default sparse layout that uses Compressed Sparse Row (CSR) format to store a sparse domain's indices. Another is an experimental layout for rectangular arrays that targets a GPU's memory [27]. Over time, we plan to expand this set to include additional layouts for rectangular arrays (e.g., column-major order and tiled storage layouts) as well as additional sparse formats and hashing strategies for irregular domains and arrays.

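As a sketch of how a programmer might opt into one of these standard layouts (the module and class names below are assumptions made for illustration; see the release's modules/layouts directory for the actual ones):

```chapel
use LayoutCSR;   // hypothetical module providing a CSR layout

config const n = 8;

// A sparse subdomain stored in CSR format rather than the default
// COO layout, using the same dmap-wrapping pattern as a distribution:
const Dense = {1..n, 1..n};
var SpsD: sparse subdomain(Dense) dmapped new dmap(new CSR());
var SpsA: [SpsD] real;   // array stored as a dense vector of values
```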
6.2    Standard Distributions                                         benchmarks [8]. Since that time we have made additional
                                                                      performance and capability improvements that we plan to
For mapping to multiple locales, Chapel currently sup-
                                                                      showcase at SC11.
ports full-featured multidimensional Block, Cyclic, and
                                                                         At present, our team’s main focus is to expand Chapel’s
Replicated distributions for rectangular domains and ar-
                                                                      standard set of domain maps while improving the perfor-
rays. The Block and Cyclic distributions decompose the
                                                                      mance and scalability of our current ones. As alluded to
domain’s index set using traditional blocking and round-
                                                                      in Section 5.4, some of our performance optimization work
robin schemes per dimension, similar to HPF. The Repli-
                                                                      will involve expanding our set of optional DSI routines to
cated distribution maps the domain’s complete index set to
                                                                      reduce overheads for common communication patterns via
every target locale so that an array declared over the do-
                                                                      latency hiding and coarser data transfers. In other cases
main will be stored redundantly across the locales.
                                                                      we simply need to improve the Chapel compiler and do-
   We also have some other standard distributions that are            main map code to eliminate communication that is overly
currently under development. We have a multidimensional               conservative or semantically unnecessary. We also plan
block-cyclic distribution that implements the basic DSI               to continue improving our documentation of the domain
functionality, but is lacking some of the more advanced               map framework so that general Chapel users can start writ-
reindexing/remapping functions and doesn’t yet take ad-               ing domain maps independently. Finally, we need to im-
vantage of parallelism within a locale, only across locales.          prove the robustness and scalability of our descriptor priva-
We are also working on a dimensional distribution that                tization machinery which is currently too global and syn-
takes other distributions as its constructor arguments and            chronous to support loosely-coupled computations as well
applies them to the dimensions of a rectangular domain. In            as it should.
this way, one dimension might be mapped to the locales us-
ing the Block distribution while another is mapped using
Replicated. Both of these distributions are being imple-              Acknowledgments
mented to support global-view dense linear algebra algo-
rithms in a scalable manner.                                          The authors would like to thank David Callahan, Hans
   For irregular distributions, we have prototyped an asso-           Zima, and Roxana Diaconescu for their early contributions
ciative distribution to support global-view distributed hash          to Chapel and their role in helping refine our user-defined
tables. Our approach applies a user-supplied hash function            domain map philosophy. We would also like to thank all
to map indices to locales and then stores indices within              current and past Chapel contributors and users for their help
each locale using the default associative layout. We have             in developing the language foundations that have permitted
also planned for a variation on the Block distribution that           us to reach this stage.
uses the CSR layout on each locale in order to support
distributed sparse arrays for use in the NAS CG and MG
benchmarks.                                                           References
   Over time, we anticipate expanding this set of initial dis-
tributions to store increasingly dynamic and holistic dis-             [1] Eugene Albert, Kathleen Knobe, Joan D. Lukas, and
tributions such as multilevel recursive bisection, dynamically load-balanced distributions, and graph partitioning algorithms.

7     Summary

This paper describes Chapel's framework for user-defined domain maps, which provides a flexible means for users to implement their own parallel data structures in support of Chapel's global-view operations. We are currently using this domain map framework extensively, not only to implement standard distributions like Block and Cyclic, but also for our shared-memory parallel implementations of regular and irregular arrays. While this paper does not contain performance results, our domain map framework was used to obtain our 2009 HPC Challenge results, which demonstrated performance that was competitive with MPI and scaled reasonably for some simple benchmarks [8].

References

     Guy L. Steele, Jr. Compiling Fortran 8x array features for the Connection Machine computer system. In PPEALS ’88: Proceedings of the ACM/SIGPLAN conference on Parallel Programming: experience with applications, languages, and systems, pages 42–56. ACM Press, 1988.

 [2] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings of the 4th international conference on Supercomputing, ICS ’90, pages 1–6, New York, NY, USA, 1990. ACM.

 [3] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzaran, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In PPoPP ’06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and Practice of parallel programming, pages 48–57. ACM Press, March 2006.
 [4] E. Boman, K. Devine, L. A. Fisk, R. Heaphy, B. Hendrickson, C. Vaughan, U. Catalyurek, D. Bozdag, W. Mitchell, and J. Teresco. Zoltan 3.0: Parallel partitioning, load balancing and data-management services; user’s guide. Technical Report SAND2007-4748W, Sandia National Laboratories, Albuquerque, NM, 2007.

 [5] Bradford L. Chamberlain. The Design and Implementation of a Region-Based Parallel Language. PhD thesis, University of Washington, November 2001.

 [6] Bradford L. Chamberlain. Multiresolution languages for portable yet efficient parallel programming. http://chapel.cray.com/papers/DARPA-RFI-Chapel-web.pdf, October 2007.

 [7] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312, August 2007.

 [8] Bradford L. Chamberlain, Sung-Eun Choi, Steven J. Deitz, and David Iten. HPC Challenge benchmarks in Chapel. (Available from http://chapel.cray.com), November 2009.

 [9] Bradford L. Chamberlain, Sung-Eun Choi, and Lawrence Snyder. A compiler abstraction for machine independent parallel communication generation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, 1997.

[10] Bradford L. Chamberlain, Steven J. Deitz, David Iten, and Sung-Eun Choi. User-defined distributions and layouts in Chapel: Philosophy and framework. In HotPar ’10: Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism, June 2010.

[11] Barbara M. Chapman, Piyush Mehrotra, and Hans P. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31–50, 1992.

[12] Sung-Eun Choi and Lawrence Snyder. Quantifying the effects of communication optimizations. In Proceedings of the IEEE International Conference on Parallel Processing, 1997.

[13] Cray Inc., Seattle, WA. Chapel language specification. (Available at http://chapel.cray.com/).

[14] Alain Darte, John Mellor-Crummey, Robert Fowler, and Daniel Chavarría-Miranda. Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations. Journal of Parallel and Distributed Computing, 63(9):887–911, 2003.

[15] Steven J. Deitz. High-Level Programming Language Abstractions for Advanced and Dynamic Parallel Computations. PhD thesis, University of Washington, 2005.

[16] Steven J. Deitz, Bradford L. Chamberlain, Sung-Eun Choi, and Lawrence Snyder. The design and implementation of a parallel array operator for the arbitrary remapping of data. In Proceedings of the ACM Conference on Principles and Practice of Parallel Programming, 2003.

[17] Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick. UPC: Distributed Shared-Memory Programming. Wiley-Interscience, June 2005.

[18] Geoffrey Fox, Seema Hiranandani, Ken Kennedy, Charles Koelbel, Ulrich Kremer, Chau-Wen Tseng, and Min-You Wu. Fortran D language specification. Technical Report CRPC-TR 90079, Rice University, Center for Research on Parallel Computation, December 1990.

[19] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. An overview of the Trilinos project. ACM Trans. Math. Softw., 31(3):397–423, 2005.

[20] High Performance Fortran Forum. High Performance Fortran Language Specification Version 2.0, January 1997.

[21] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[22] Charles Koelbel and Piyush Mehrotra. Programming data parallel algorithms on distributed memory using Kali. In ICS ’91: Proceedings of the 5th international conference on Supercomputing, pages 414–423. ACM, 1991.

[23] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele, Jr., and Mary E. Zosel. The High Performance Fortran Handbook. Scientific and Engineering Computation. MIT Press, September 1996.

[24] B. Liskov, A. Snyder, R. Atkinson, and C. Schaffert. Abstraction mechanisms in CLU. Communications of the ACM, 20(8):564–576, August 1977.

[25] Robert W. Numrich and John Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1–31, 1998.

[26] Robert W. Numrich. A parallel numerical library for Co-array Fortran. Springer Lecture Notes in Computer Science, LNCS 3911:960–969, 2005.

[27] Albert Sidelnik, Bradford L. Chamberlain, Maria J. Garzaran, and David Padua. Using the high productivity language Chapel to target GPGPU architectures. Technical report, Department of Computer Science, University of Illinois Urbana-Champaign, April 2011. http://hdl.handle.net/2142/18874.

[28] Lawrence Snyder. Programming Guide to ZPL. MIT Press, Cambridge, MA, 1999.

[29] Lawrence Snyder. The design and development of ZPL. In HOPL III: Proceedings of the third ACM SIGPLAN conference on History of programming languages, pages 8-1–8-37, New York, NY, USA, 2007. ACM.

[30] Srinivas Sridharan, Jeffrey S. Vetter, Peter M. Kogge, and Steven J. Deitz. A scalable implementation of language-based software transactional memory for distributed memory systems. Technical Report FTGTR-2011-02, Oak Ridge National Laboratory, May 2011. http://ft.ornl.gov/pubs-archive/chplstm1-2011-tr.pdf.

[31] Chapel Team. Parallel programming in Chapel: The Cascade High-Productivity Language. http://chapel.cray.com/tutorials.html, November 2010.

[32] Manuel Ujaldon, Emilio L. Zapata, Barbara M. Chapman, and Hans P. Zima. Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Transactions on Parallel and Distributed Systems, 8(10), October 1997.

[33] David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In PPoPP ’01: Proceedings of the eighth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 24–33. ACM, 2001.

[34] Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamato, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phil Colella, and Alex Aiken. Titanium: A high-performance Java dialect. Concurrency: Practice and Experience, 10(11–13):825–836, September 1998.

[35] Hans Zima, Peter Brezany, Barbara Chapman, Piyush Mehrotra, and Andreas Schwald. Vienna Fortran — a language specification version 1.1. Technical Report NASA-CR-189629/ICASE-IR-21, Institute for Computer Applications in Science and Engineering, March 1992.