Runtime Locality Optimizations of Distributed Java Applications
16th Euromicro Conference on Parallel, Distributed and Network-Based Processing
Christian Hütter, Thomas Moschny
University of Karlsruhe
{huetter, moschny}@ipd.uni-karlsruhe.de
0-7695-3089-3/08 $25.00 © 2008 IEEE
DOI 10.1109/PDP.2008.76

Abstract

In distributed Java environments, locality of objects and threads is crucial for the performance of parallel applications. We introduce dynamic locality optimizations in the context of JavaParty, a programming and runtime environment for parallel Java applications. Until now, an optimal distribution of the individual objects of an application had to be found manually, which has several drawbacks. Based on a former static approach, we develop a dynamic methodology for automatic locality optimizations. By measuring processing and communication times of remote method calls at runtime, a placement strategy can be computed that maps each object of the distributed system to its optimal virtual machine. Objects are then migrated between the processing nodes in order to realize this placement strategy. We evaluate our approach by comparing the performance of two benchmark applications with manually distributed versions. It is shown that our approach is particularly suitable for dynamic applications where the optimal object distribution varies at runtime.

1. Introduction

Java enables developers to express concurrency and to create parallel applications by means of threads. Performance gains over a sequential solution can only be expected if the virtual machine is executed on a system with several processors. JavaParty [10] extends Java by a distributed runtime environment that consists of several Java virtual machines. The virtual machines are executed on the nodes of a cluster of workstations. Each virtual machine has its own address space, but can perform remote method invocations on other virtual machines. Thus, JavaParty allows for performance gains through parallelism in a distributed environment.

Solely distributing objects and threads over virtual machines is not sufficient for achieving performance gains. Since the placement of an object determines the processor of its methods, only methods of objects that reside on different machines can actually be executed in parallel. So we have two conflicting goals: On the one hand, groups of objects with frequent and expensive communication should be placed on the same node. On the other hand, objects should be distributed over the available processors to enable parallelism.

So far, JavaParty provides a mechanism to create remote objects on specific nodes of a cluster environment. The developer is responsible for distributing the individual objects and thus for distributing the activities to the processing nodes. Such a manual approach has several disadvantages. First, the object distribution is dependent on the specific topology for which the program is compiled. The distribution strategy must be adapted to each target platform. Second, manually specifying the location of every single object creation is tedious. Third, the optimal placement of objects often cannot be determined statically for dynamic applications where the optimal location of objects changes at runtime.

The work at hand focuses on the automatic generation of a distribution strategy for remote objects. The generation is based on runtime information of the distributed system. Thus, the programmer does not have to worry about a proper object distribution and can focus on the solution of the problem. Even if the initial object distribution generated by JavaParty is not optimal, the locality of the application is optimized at runtime.

In chapter 2 we give a brief overview of JavaParty. Chapter 3 discusses related work in the field of distributed Java applications. In chapter 4 we describe the design of our approach and explain some basic concepts that are necessary for further understanding.
Chapter 5 presents the implementation and discusses the problems we encountered. In chapter 6 we evaluate the effectiveness and efficiency of our work using two benchmark applications. Finally, chapter 7 concludes this paper.

2. JavaParty

JavaParty extends Java by a pre-processor and a runtime environment for distributed parallel programming in workstation clusters. It transparently adds remote objects to Java whose methods can be invoked from remote virtual machines. Programmers can use the keyword remote to indicate that a class should be remotely accessible. Instances of remote classes are called remote objects, regardless of which virtual machine they reside on. The runtime system offers a mechanism to migrate remote objects between machines.

Java Remote Method Invocation (RMI) [14] permits the creation of classes whose instances can be accessed remotely from other JVMs. JavaParty uses RMI as target and thus inherits some of its advantages, e.g. distributed garbage collection. It uses a special pre-processor to generate pure Java source code that is consistent with the RMI requirements. This approach hides the increased program complexity due to RMI constraints as well as the additional code for creation and access of remote objects.

JavaParty code is transformed into regular Java code plus RMI hooks. The resulting RMI portions are fed into the RMI compiler to generate stubs and skeletons. Since existing code might be using the original classes, handle objects are introduced that hide the RMI classes from the user. This approach maintains the Java object semantics such that the programmer can use remote objects just like normal Java objects.

3. Related work

This section gives an overview of existing systems for distributed execution of Java applications. The goal of these systems is to gain increased computational power while preserving Java's parallel programming paradigm. In [3], distributed runtime systems are categorized into cluster-aware VMs, compiler-based DSM systems, and systems using standard JVMs.

The first category consists of systems that use a non-standard JVM on each node to execute distributed applications. The most important examples of such systems are cJVM [2] and JESSICA2 [16]. Both approaches provide a complete single system image of a standard JVM. The advantage of using non-standard JVMs is increased efficiency due to the ability to access machine resources directly rather than through the JVM. A weakness of such systems is their lack of cross-platform compatibility.

cJVM aims at virtualizing a cluster and at obtaining high performance for regular Java applications. A number of optimization techniques are used to address caching, locality of execution and object placement. The smart proxy mechanism of cJVM can be used as a framework to implement different locality protocols. Currently, cJVM is unable to use a standard JIT compiler and does not implement a custom one.

JESSICA2 applies transparent Java thread migration to multi-threaded Java applications. The migration mechanism allows distributing threads among cluster nodes at runtime. To support shared object access, a global object space has been implemented. The system includes some important features, e.g. load balancing through thread migration, an adaptive home-migration protocol, and a custom JIT compiler.

Other systems compile the source or class files of a Java application into native machine code. Both Hyperion [1] and Jackal [15] support standard Java and do not change its programming paradigm. The usage of a custom source or byte code compiler has the disadvantage that such a compiler must continually be adapted to changes of the Java language specification. The advantage of compiler-based systems is their increased performance because of compiler optimizations and direct access to system resources.

Hyperion offers an infrastructure for heterogeneous clusters providing the illusion of a single JVM. The original Java threads are mapped onto native system threads which are spread across the processing nodes to provide load balancing. The Java memory model is implemented by a DSM protocol, so the original semantics of the Java language is kept unchanged. To achieve portability, the Hyperion platform has been built on top of a portable runtime environment which supports various networks and communication interfaces.

Jackal is a DSM system for Java which consists of an optimizing compiler and a runtime system. In combination with compiler optimizations, Jackal applies various runtime optimizations to increase locality and manage large data structures. The runtime system includes a distributed garbage collector and provides thread and object location transparency.

While most systems use standard JVMs, only a few of them preserve the standard Java programming paradigm. Examples of such systems are JavaSymphony [4] and ADAJ [5]. Using standard
JVMs has the advantage that such systems can use heterogeneous nodes which locally optimize their performance using a JIT compiler. The main disadvantage of such systems is their relatively slow access to system resources.

JavaSymphony is a programming environment for distributed and parallel computing that exploits heterogeneous resources. In order to use JavaSymphony efficiently, the programmer has to explicitly control data locality and load balancing. The structure of the computing resources has to be defined manually. Since all objects must be created, mapped, and freed explicitly, the handling of remote objects can be quite cumbersome. JavaSymphony does not offer assistance for those manual steps, so the semi-automatic distribution is likely to be error-prone.

ADAJ is an environment for the development and execution of distributed Java applications. ADAJ is designed on top of JavaParty and is therefore most closely related to our work. The ADAJ project deals with placement and migration of Java objects. It automatically deploys parallel Java applications on a cluster of workstations by monitoring the application behavior. ADAJ contains a load-balancing mechanism that considers changes in the evolution of the application. While the focus of ADAJ is to balance the load between the individual JVMs, we concentrate on optimizing the locality of the distributed application.

4. Design

4.1. Locality optimizations

Philippsen and Haumacher proposed locality optimizations in JavaParty by means of static type analysis [11]. They classify approaches to deal with locality in parallel object-oriented languages in three categories: (i) let the programmer specify placement and migration explicitly by means of annotations, (ii) static object distribution where the compiler tries to predict the best node for a new object, and (iii) dynamic object distribution based on a runtime system that keeps track of the call graph. JavaParty already provides mechanisms for manual object placement and migration, so we focus on static and dynamic object distribution in the following.

4.1.1. Static object distribution. Although a Java thread cannot migrate, the control flow (called activity in the following) can: when a method of a remote object is invoked, the activity conceptually leaves the JVM of the caller and is continued at the callee's JVM where it competes with other activities. Due to time-slicing and blocking, competing activities on one JVM decrease the total parallelism. Additional costs are introduced by the remote method invocation itself because of communication latency and bandwidth limitations. Thus, the general distribution strategy must be activity-centered: different activities should be placed onto different JVMs. Objects should be co-located to activities such that method invocation is local. Local method invocation avoids network communication and competing activities.

Haumacher proposes an iterative procedure [6] to assign objects to activities and then activities to virtual machines. Based on a static type analysis, estimates for two values are derived: work(t, a) describes the computing time that activity t spends on methods of object a, and cost(t, a) describes the communication time that would be necessary if t and a were not located in the same address space. Object a should be placed so as to maximize the computing time of the activity t in whose address space a is created. At the same time, the sum of the communication cost that is required for those activities t_i assigned to remote virtual machines should be minimized.

We assume an initial setting where all objects are located in a single address space with a single processor such that all method calls are local. In order to distribute objects to activities, we suppose that each activity is running in a different address space with its own processor. By placing object a in the address space of activity t, method calls of a by t can be executed in parallel to other activities. Thus, work(t, a) indicates the time that is gained by the placement of a within the address space of t. The communication cost that other activities t_i incur to access methods of a is outweighed if work(t, a) is greater than the sum of cost(t_i, a). So each object a can be mapped to the activity t in whose address space it should be placed:

$$\mathrm{activity}(a) = t \iff t \text{ maximizes } \Big( \mathrm{work}(t, a) - \sum_{t_i \neq t} \mathrm{cost}(t_i, a) \Big)$$

Since usually more activities are used than virtual machines are available, several activities must share a virtual machine. Thus, it is necessary to identify groups of activities that should be executed on a shared virtual machine. The parallelization win of each activity can be estimated by mapping each object to its optimal activity. The parallelization win is computed as the sum of work(t, a) for objects a which reside in the address space of activity t minus the sum of cost(t, b) for objects b that are placed remotely:

$$\mathrm{win}(t) = \sum_{\{a \,\mid\, \mathrm{activity}(a) = t\}} \mathrm{work}(t, a) \; - \sum_{\{b \,\mid\, \mathrm{activity}(b) \neq t\}} \mathrm{cost}(t, b)$$

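As an illustration of the two formulas for activity(a) and win(t), the following is a minimal Java sketch, not the JavaParty implementation. It assumes that the estimates are given as two matrices indexed by activity and object; the class name StaticPlacementSketch, the method names assignObjects and wins, and the sample numbers are hypothetical and chosen only for this example.

```java
import java.util.Arrays;

// Illustrative sketch only: names and data layout are assumptions, not JavaParty code.
class StaticPlacementSketch {

    /**
     * For each object a, picks the activity t that maximizes
     * work[t][a] minus the sum of cost[ti][a] over all other activities ti.
     */
    static int[] assignObjects(double[][] work, double[][] cost) {
        int activities = work.length;
        int objects = work[0].length;
        int[] activityOf = new int[objects];
        for (int a = 0; a < objects; a++) {
            double totalCost = 0.0;
            for (int t = 0; t < activities; t++) {
                totalCost += cost[t][a];
            }
            int best = 0;
            double bestGain = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < activities; t++) {
                // totalCost - cost[t][a] is the cost of all activities other than t
                double gain = work[t][a] - (totalCost - cost[t][a]);
                if (gain > bestGain) {
                    bestGain = gain;
                    best = t;
                }
            }
            activityOf[a] = best;
        }
        return activityOf;
    }

    /** win(t): work on objects assigned to t minus cost for objects assigned elsewhere. */
    static double[] wins(double[][] work, double[][] cost, int[] activityOf) {
        double[] win = new double[work.length];
        for (int t = 0; t < work.length; t++) {
            for (int a = 0; a < activityOf.length; a++) {
                win[t] += (activityOf[a] == t) ? work[t][a] : -cost[t][a];
            }
        }
        return win;
    }

    public static void main(String[] args) {
        // Hypothetical estimates: 2 activities (rows), 3 objects (columns).
        double[][] work = { {10, 1, 0}, {2, 8, 5} };
        double[][] cost = { {3, 1, 0}, {1, 4, 2} };
        int[] activityOf = assignObjects(work, cost);
        System.out.println(Arrays.toString(activityOf));                   // [0, 1, 1]
        System.out.println(Arrays.toString(wins(work, cost, activityOf))); // [9.0, 12.0]
    }
}
```

In this made-up example, object 0 is mapped to activity 0 because its gain there (10 − 1 = 9) exceeds its gain with activity 1 (2 − 3 = −1); ties are resolved in favor of the lower activity index, which is an arbitrary choice of the sketch.
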
The sum of work(t, a) represents the computing time that activity t spends in its own address space. This work is done in parallel to other activities if no synchronization mechanisms are used. The time that is spent for communication with other address spaces is represented by the sum of cost(t, b) for all objects b that are not assigned to activity t. Note that we charge the cost of a remote call to the activity that invoked the remote method, not to the activity that actually executes the method call.

Activities are assigned to the available virtual machines in decreasing order of their parallelization wins until a single activity has been scheduled to each virtual machine. For each remaining activity, a new parallelization win is computed that accounts for the potential co-location with other activities. The activity is assigned to that group of activities with the highest combined parallelization win. This process is repeated until all activities are scheduled to their optimal virtual machine.

The result of the distribution analysis is a mapping of each remote object to the virtual machine on which it should be placed.

4.1.2. Dynamic object distribution. While Philippsen and Haumacher focus on static object distribution through type analysis, we rely on dynamic object distribution to improve locality. This approach is reported to have two disadvantages: First, there is no knowledge about future call graphs as well as invocation frequencies. Second, the creation of objects that cannot migrate often results in a broad redistribution of other objects. The first problem is inherent to dynamic approaches, but can be mitigated by using heuristics to predict future behavior. The second problem is not really an issue in homogeneous cluster environments and can be handled by avoiding cyclic redistributions of remote objects.

Besides these problems, the dynamic approach has the essential advantage that instead of estimating the values of work and cost, they can be measured: we take work as the actual execution time of a method call and cost as the communication time of a remote method invocation. As detailed later, we have to estimate the cost of remote calls that are actually executed locally because the called object resides on the same node. We adapt Haumacher's approach and use an iterative procedure to distribute objects to activities and then assign activities to virtual machines. Objects are migrated to the virtual machine to which their optimal activity is assigned.

4.2. Time measurements

Having developed a placement methodology for remote objects, we now focus on how to measure the time values required for the distribution algorithm. Beginning with the Pentium processor, Intel allows the programmer to access a time-stamp counter [8]. This counter keeps an accurate count of every cycle that occurs on the processor since it is incremented every clock cycle, starting with zero. To access the counter, programmers can use the RDTSC (read time-stamp counter) instruction. We use the counter to get a time estimate for the duration of method invocations.

Note that the time-stamp counter measures cycles, not time. Thus, comparing cycle counts only makes sense on processors of the same speed, as in a homogeneous cluster environment. To compare processors of different speeds, the cycle counts should be converted into time units. While the unit of time returned by currentTimeMillis() is a millisecond, the granularity of the value depends on the underlying OS and may be larger. Thus, the time-stamp counter also allows much finer measurements.

To avoid measurement errors because of concurrency, we assume that the workstations of the cluster are used exclusively for JavaParty. In the presence of background jobs, cycle counting does not always reflect the real execution time of an application. But in the long run, the interrupts through background jobs are approximately the same for all workstations of a homogeneous cluster. Thus, we assume that those interrupts balance over time such that cycle counting actually reflects the average execution time.

4.3. Remote Method Invocation

RMI uses a standard mechanism for communicating with remote objects: stubs and skeletons. A stub for a remote object acts as a local representative or proxy for the remote object. The stub hides the serialization of parameters and the network communication whereas the skeleton is responsible for dispatching the call to the actual remote object implementation. We want to measure work(t, a) and cost(t, a) in order to apply the distribution algorithm. In the context of stubs and skeletons, work corresponds to the time that the actual method implementation takes and cost corresponds to the time that is required for carrying out the remote call, i.e. marshaling and transmitting parameters and result.

For remote object r, a stub is instantiated on each node while only one skeleton is instantiated on the node where the implementation of r resides. That is,
there are n stubs and one skeleton for each remote object. Basically, our approach is to measure the communication time of a remote call in the stub and the execution time of the implementation in the skeleton by using the RDTSC instruction. We store aggregated work and cost values in the skeleton.

5. Implementation

5.1. Time measurements

Our framework for performance measuring wraps the RDTSC instruction described in the previous chapter using the Java Native Interface [13]. As detailed in Table 1, accessing the system time is orders of magnitude more expensive than using the RDTSC instruction. Times were measured on a Pentium III 800 MHz system.

Table 1. Cost of System.currentTimeMillis()

Call                          Cycles   Time
RDTSC.readccounter()             613    0.77 µs
System.currentTimeMillis()     36941   46.18 µs

5.2. KaRMI

KaRMI [12] is a fast replacement for Java RMI. It is based on an efficient object serialization mechanism that replaces regular Java serialization. Since the remote method invocation protocol is different from Java RMI, the format of stubs and skeletons is different, too. The KaRMI compiler generates stub and skeleton classes from compiled remote classes. We modified the generation of stubs and skeletons to include code that measures the execution times of remote calls. The measured times are processed by the distribution task to compute an optimal object distribution.

More precisely, we modified the generation of stubs to measure the total execution time of remote calls. Once a remote call returns, the stub sends the total time to the skeleton which measured the execution time of the actual implementation (i.e. work). Using both values, we compute cost as the difference between the total time and work.

In order to transmit the total time from stub to skeleton, we added methods to send and receive the measured times to the client and server side of the connection. These methods are called after a remote method invocation has been completed and the result is marshaled back to the caller. Finally, the work and cost values are stored in the skeleton using a special data structure described later.

5.3. Estimation of cost

An important optimization carried out by JavaParty is that a call is only executed remotely if the called object actually resides on another node. Otherwise, the call is executed locally. Recall that cost(t, a) estimates the communication time that would be necessary if activity t were not located on the same node as object a. While we are able to measure the actual communication time of remote calls, we have to estimate the cost of local calls as if they were remote. Thus, we have to develop a model to estimate the communication cost based on the measured cost of a local call.

Whenever the client and server objects are in the same address space, arguments and result are cloned to preserve the copy semantics of a remote call. JavaParty produces a deep clone with all referenced objects also being cloned. In the generated stubs, the instrumented version of the local shortcut measures the cost of cloning arguments and return value.

The measuring can be divided into three parts: cloning of the arguments, local method invocation, and cloning of the result. Based on the measured local cost of cloning arguments and result, we estimate the communication cost if the call were remote. For this purpose we analyzed the results of a benchmark suite that measures the execution times of local and remote method calls for a representative set of parameter types.

Given the duration of a local call, we estimate how long a remote call takes. While the absolute values are likely to vary on different machines, the relation between local and remote calls should approximately be the same. For simplicity, we assume a linear model with offset a and gradient b:

remote cost = a + b ⋅ (local cost)

We applied a nonlinear least-squares algorithm to the results of the benchmark suite in order to fit the estimate function and determine the values of a and b.

5.4. Smoothing and storing time values

We use a hash map to store time values, mapping activities to work and cost values. JavaParty assigns a globally unique thread id to activities that are involved in remote calls. If a new measurement is to be stored, the given thread id is mapped to a pair of work and cost values. We store these values directly with the skeleton, so the addressed object is implicit. Since work and cost
indicate the computing and communication times an activity spends on all methods of an object, we have to aggregate the values of the individual methods in a reasonable way.

We use an exponential moving average which has the following advantages over simply adding up the time values: First, the weighting for each data point decreases exponentially, giving more importance to recent observations while still not discarding older observations. Second, the weighting makes our measurement more robust against outliers, e.g. delayed execution because of distributed garbage collection. Third, the exponential moving average is easy to compute and thus a relatively cheap operation.

5.6. Application monitoring

JavaParty offers an interface that allows plugging in additional classes that can be used for monitoring the distributed environment. In our case, the monitor interface is implemented as an invisible task that collects runtime data based on instrumentation. This data is used to analyze the distribution of remote objects over the virtual machines.

In JavaParty, references to remote objects are stored in a distributed fashion. Thus, we have to iterate over all virtual machines to obtain references to the remote objects. These references are used to collect the measured times.

The monitor also serves as front end for the distribution task which can either be scheduled for repeated fixed-delay execution or invoked manually via a library call. Basically, our distribution task fetches the measured times and runs the distribution algorithm discussed in section 4.1.

The distribution algorithm sorts the application threads according to their parallelization wins. Each activity is assigned to a group of activities which are optimally placed on the same virtual machine. Finally, each object is assigned to its optimal JVM and possibly migrated there. The migration succeeds only for objects that are not declared to be resident. If nothing was changed during the migration, the distribution task is canceled.

6. Evaluation

In order to evaluate the effectiveness and efficiency of our work, we examined two applications that have potential for locality optimizations. If a program was already distributed optimally at compile time and its locality did not change during run time, there would be nothing to optimize.

The first application is a numerical algorithm that has a static structure. We started with a sub-optimal distribution and optimized its locality during runtime. The second application is an n-body simulation with an inherently dynamic structure. We started with an optimal distribution and adapted the locality as the structure of the application changed.

All measurements in this chapter have been conducted on our Carla cluster, using the Java Server VM 1.4.2_13-b06. This cluster consists of 16 nodes equipped with two Pentium III 800 MHz processors and 1 GB RAM each.

6.2. Successive over-relaxation

Successive over-relaxation is a numerical algorithm for solving Laplace equations on a grid. The sequential implementation involves an outer loop for the iterations and two inner loops, each looping over the grid. During an iteration, the new value of each point of the grid is determined by calculating the average value of the four neighbor points. The algorithm terminates if no point of the grid has changed more than a certain threshold.

The parallel implementation [9] provided by Maassen is based on a red-black ordering mechanism. The grid is partitioned among the available processors, each processor receiving a number of adjacent rows. Before a processor starts to update the points of a certain color, it exchanges the border rows of the opposite color with its neighbors.

[Figure 1. Results of the SOR benchmark. Plot title: 1000x1000 grid, 300 iterations; y-axis: time [ms]; x-axis: # machines (2, 4, 8, 16); series: manual, optimized, random.]

The SOR benchmark performs 300 iterations of successive over-relaxation on a 1000x1000 grid of double values. The performance was measured on 2, 4, 8, and 16 nodes and is reported in milliseconds per iteration. In order to evaluate our approach, we created three versions of the benchmark: (i) a manual version that creates all remote objects at their optimal location, (ii) a random version where the location of the remote
Authorized licensed use limited to: National Kaohsiung University of Applied Sciences. Downloaded on May 08,2010 at 14:53:01 UTC from IEEE Xplore. Restrictions apply.objects is determined randomly, and (iii) an optimized The benchmark performs 10 iterations of n-body
version which invokes the locality optimizations after simulation with 1000 particles. The performance was
the first iteration based on the random object measured on 2, 4, 8, and 16 nodes and is reported in
distribution. seconds per iteration. Again, we created three versions
The results of the SOR benchmark are shown in of the benchmark: (i) a manual version with explicit
Figure 1. As expected, the manual version performs placement annotations, (ii) a random version where the
best with a constant speedup as the number of location of the remote objects is determined randomly,
machines increases. The random version performs and (iii) an optimized version which invokes the
worst and does not scale with additional machines. locality optimizations after the first iteration based on
Finally, the optimized version of the benchmark the random object distribution.
performs considerably better than the random version,
improving its performance towards the optimal Figure 2 shows the results of the n-body
version. If more iterations were performed, the benchmark. Because of the dynamic structure of the
optimized version would do even better since the cost of the locality optimizations would bear less weight.
Figure 1 might give the impression that the optimized version does not scale with additional machines. This is not exactly true since the cost of the locality optimizations is proportional to the number of nodes, too. Table 2 details the cost of the procedure for the SOR benchmark. Polling the remote objects clearly dominates the overall cost. In spite of its quadratic complexity, the cost of the distribution algorithm is relatively small. Again, if the number of iterations was increased or a benchmark with longer processing times was used, the relative cost would decrease.

Table 2. Cost of the locality optimizations

# machines   polling remote   computing locality   cost of migrating   overall
             objects [ms]     algorithm [ms]       objects [ms]        cost [ms]
2            929              43                   235                 1206.87
4            1799             137                  249                 2185.82
8            4044             332                  588                 4963.73
16           7000             652                  1068                8720.30

6.3. N-body simulation
The n-body simulation approximates the movement of n particles in a two-dimensional space based on mutual gravitation. The simulation is discretized into time steps where the gravity between each pair of the n particles must be computed for each time step. Afterwards, acceleration and change in velocity and location are determined for each particle. In order to avoid the quadratic complexity of computing forces, the present implementation uses an approximation proposed by Barnes and Hut. Through hierarchical grouping and generation of substitute masses for distant space regions, the computation complexity is reduced to O(n log(n)) operations per time step. We refer to [7] for a detailed description of the benchmark.
benchmark, an optimal distribution of the remote objects is hard to predict and depends on the spatial distribution of the particles. As the initial coordinates of the particles are determined randomly and thus are not known a priori, the manual version of the benchmark performs only slightly better than the random version. Since the locality of the application is adapted to the actual location of the particles, the optimized version of the benchmark performs best. The cost of the locality optimizations can easily be covered by the savings achieved during the following iterations.

Figure 2. Results of the n-body benchmark (1000 particles, 10 iterations): time [s] over # machines (2, 4, 8, 16) for the manual, optimized, and random versions.

The n-body benchmark is a good example of the effectiveness of our approach. In dynamic settings such as the n-body simulation, it is hard and sometimes impossible to determine a good initial distribution of the remote objects. Even if an optimal distribution can be determined, the performance of the initial distribution will decrease since the locality of the application changes. Only a dynamic approach that optimizes the locality at runtime can guarantee consistently high performance throughout the whole life cycle of the application.
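The idea of substitute masses mentioned in Section 6.3 can be illustrated with a small sketch. It is not the benchmark's code and does not build the hierarchical grouping of Barnes and Hut. It only compares the exact force exerted by a group of particles with the force of a single substitute mass placed at the group's center of mass. The class name SubstituteMassDemo, the method names, and the choice G = 1 are assumptions made for illustration.

```java
// Illustrative sketch only: names, units, and G = 1 are assumptions,
// not taken from the JavaParty n-body benchmark.
final class SubstituteMassDemo {
    static final double G = 1.0; // gravitational constant in arbitrary units

    /** Force on a unit-mass body at (px, py) caused by a point mass m at (x, y). */
    static double[] force(double px, double py, double x, double y, double m) {
        double dx = x - px;
        double dy = y - py;
        double r2 = dx * dx + dy * dy;
        double r = Math.sqrt(r2);
        double magnitude = G * m / r2; // inverse-square law
        return new double[] { magnitude * dx / r, magnitude * dy / r };
    }

    /** Exact force: one contribution per particle of the group. */
    static double[] exactForce(double px, double py,
                               double[] xs, double[] ys, double[] ms) {
        double fx = 0.0;
        double fy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double[] f = force(px, py, xs[i], ys[i], ms[i]);
            fx += f[0];
            fy += f[1];
        }
        return new double[] { fx, fy };
    }

    /** Approximation: the whole group is replaced by its total mass
     *  located at its center of mass. */
    static double[] substituteForce(double px, double py,
                                    double[] xs, double[] ys, double[] ms) {
        double total = 0.0;
        double cx = 0.0;
        double cy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            total += ms[i];
            cx += ms[i] * xs[i];
            cy += ms[i] * ys[i];
        }
        return force(px, py, cx / total, cy / total, total);
    }

    /** Length of the difference vector relative to the exact force. */
    static double relativeError(double[] exact, double[] approx) {
        double ex = approx[0] - exact[0];
        double ey = approx[1] - exact[1];
        return Math.hypot(ex, ey) / Math.hypot(exact[0], exact[1]);
    }

    public static void main(String[] args) {
        // Four unit masses grouped around the point (100, 0).
        double[] xs = { 99, 101, 99, 101 };
        double[] ys = { -1, -1, 1, 1 };
        double[] ms = { 1, 1, 1, 1 };
        double far = relativeError(exactForce(0, 0, xs, ys, ms),
                                   substituteForce(0, 0, xs, ys, ms));
        double near = relativeError(exactForce(98, 0, xs, ys, ms),
                                    substituteForce(98, 0, xs, ys, ms));
        System.out.println("relative error, distant body: " + far);
        System.out.println("relative error, nearby body:  " + near);
    }
}
```

For a body far away from the group the substitute mass reproduces the exact force closely, whereas for a body next to the group the error becomes significant. This is why the approximation is applied only to distant space regions, while nearby particles are still handled individually.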
7. Conclusion and future work
In this work, we presented runtime locality optimizations of distributed Java applications. Based on a static approach, we developed a dynamic methodology to automatically generate a distribution strategy for the objects of a distributed system. We instrumented stubs and skeletons to measure the execution time and communication cost of remote calls. The measured time values are stored locally to avoid communication overhead. The locality optimizations are implemented as a task that runs periodically or can be started on demand. This task collects the measured time values and computes an optimal distribution strategy. In order to realize the distribution strategy, objects are migrated between machines.
We evaluated the effectiveness and efficiency of our work by optimizing two benchmark applications. The first benchmark is a typical example of a numerical algorithm with a static structure, so we created a random initial distribution of the objects and optimized their locality at runtime. The second benchmark has a dynamic structure, so that the performance of the initial object distribution, even of an optimal one, will deteriorate at runtime. We have shown that our approach is particularly suitable for such dynamic settings.
In future work, we will focus on automatically adapting the period of the distribution task such that it reflects the processing time of the application. If the structure of the application does not change, we might even want to switch off the measurements completely. For large clusters with thousands of processors or applications with a great number of objects, an algorithm with quadratic complexity might be suboptimal. We could imagine a distributed algorithm that works with exact time values for only a few local nodes and extrapolates the values for remote nodes.

References
[1] G. Antoniu, L. Bouge, P. Hatcher, M. MacBeth, K. McGuigan, and R. Namyst, “The Hyperion system: Compiling multithreaded Java bytecode for distributed execution”, Parallel Computing, 2001.
[2] Y. Aridor, M. Factor, and A. Teperman, “cJVM: a single system image of a JVM on a cluster”, Parallel Processing, 1999, pp. 4-11.
[3] M. Factor, A. Schuster, and K. Shagin, “A distributed runtime for Java: yesterday and today”, Parallel and Distributed Processing Symposium, 2004.
[4] T. Fahringer, “JavaSymphony: a system for development of locality-oriented distributed and parallel Java applications”, Cluster Computing, 2000.
[5] V. Felea, R. Olejnik, and B. Toursel, “ADAJ: a Java Distributed Environment for Easy Programming Design and Efficient Execution”, Schedae Informaticae, UJ Press, Krakow, 2004, pp. 9-36.
[6] B. Haumacher, “Lokalitätsoptimierung durch statische Typanalyse in JavaParty”, Diploma thesis, Institute for Program Structures and Data Organization, University of Karlsruhe, January 1998.
[7] B. Haumacher, “Plattformunabhängige Umgebung für verteilt paralleles Rechnen mit Rechnerbündeln”, PhD thesis, Institute for Program Structures and Data Organization, University of Karlsruhe, October 2005.
[8] Intel Corp, “Using the RDTSC Instruction for Performance Monitoring”, 1997. http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM
[9] J. Maassen and R. V. Nieuwpoort, “Fast parallel Java”, Master's thesis, Dept. of Computer Science, Vrije Universiteit, Amsterdam, August 1998.
[10] M. Philippsen and M. Zenger, “JavaParty - Transparent Remote Objects in Java”, Concurrency: Practice and Experience, November 1997.
[11] M. Philippsen and B. Haumacher, “Locality optimization in JavaParty by means of static type analysis”, Proc. Workshop on Java for High Performance Network Computing at EuroPar '98, Southampton, September 1998.
[12] M. Philippsen, B. Haumacher, and C. Nester, “More Efficient Serialization and RMI for Java”, Concurrency: Practice and Experience, John Wiley & Sons, Chichester, West Sussex, May 2000, pp. 495-518.
[13] Sun Microsystems, “Java Native Interface”, 2003. http://java.sun.com/j2se/1.4.2/docs/guide/jni/
[14] Sun Microsystems, “Java Remote Method Invocation Specification”, 2003. http://java.sun.com/j2se/1.4.2/docs/guide/rmi/spec/rmiTOC.html
[15] R. Veldema, R. A. F. Bhoedjang, and H. E. Bal, “Jackal, a compiler based implementation of java for clusters of workstations”, Proceedings of PPoPP, 2001.
[16] W. Zhu, C.-L. Wang, and F. C. M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support”, IEEE Fourth International Conference on Cluster Computing, Chicago, USA, September 2002.