NUMA-aware algorithms: the case of data shuffling


Yinan Li∗ (University of Wisconsin–Madison, Madison, WI, USA; yinan@cs.wisc.edu)
Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman (IBM Almaden Research Center, San Jose, CA, USA; {ipandis,muellerr,ravijay,lohmang}@us.ibm.com)

ABSTRACT

In recent years, a new breed of non-uniform memory access (NUMA) systems has emerged: multi-socket servers of multi-cores. This paper makes the case that data management systems need to employ designs that take into consideration the characteristics of modern NUMA hardware. To prove our point, we focus on a primitive that is used as the building block of numerous data management operations: data shuffling. We perform a comparison of different data shuffling algorithms and show that a naïve data shuffling algorithm can be up to 3× slower than the highest performing, NUMA-aware one. To achieve the highest performance, we employ a combination of thread binding, NUMA-aware thread allocation, and relaxed global coordination among threads. The importance of such NUMA-aware algorithm designs will only increase, as future server systems are expected to feature ever larger numbers of sockets and increasingly complicated memory subsystems.

1. INTRODUCTION

The last decade has seen the end of clock-frequency scaling and the emergence of multi-core processors as the primary way to improve processor performance. In response, there have been concerted efforts to develop data management systems that exploit multi-core parallelism.

Another equally important trend is towards multi-socket systems with non-uniform memory access (NUMA) memory architectures. Each socket in a NUMA system has its own 'local' memory (DRAM) and is connected to the other sockets, and hence to their memory, via one or more links. Access latency and bandwidth therefore vary depending on whether a core in a socket is accessing 'local' or 'remote' memory. To get good performance on a NUMA system, one must understand the topology of the interconnect between sockets and memories and accordingly align the algorithm's data structures and communication patterns.

There is no standard NUMA system: the server market has a smörgåsbord of different configurations with varying numbers of sockets, cores per socket, memory DIMMs per socket, and interconnect topologies. For example, focusing only on Intel, server configurations range from one to eight sockets, each socket having four to ten cores and anywhere from a single 2 GB DIMM to sixty-four 32 GB DIMMs, and vastly varying inter-processor interconnect ("QPI", in Intel's terminology) topologies. Figure 1 shows the configuration of 4 different Intel servers we have in our lab: a 2-socket, two 4-socket systems, and an 8-socket system. The configurations also change quite a bit from one generation to the next.

Primitives: DBMS servers need to perform well across this hardware spectrum, but it would be impossible to maintain a distinct implementation of the core operations of a DBMS for each machine configuration, as they are complex and consist of millions of lines of code. So we argue that DBMS software should isolate performance-critical primitives: smaller, isolated modules that can be heavily tuned for a hardware configuration, and whose specification is simple enough that they can be provided as libraries by hardware developers (a good analogy here are the Basic Linear Algebra Subprograms, or BLAS, linear algebra primitives used heavily in scientific computing [17]).

In this paper, we study data shuffling as an exemplary data management primitive. We define data shuffling broadly as threads exchanging roughly equal-sized pieces of data among themselves. Specifically, we assume each of the N threads has initially partitioned its data, which reside in its local memory, into N pieces. The goal is to transmit the ith partition from each thread to the ith thread, for all i.

Shuffling is needed in many data management operations. The most famous example, of course, is Map-Reduce, in which mapper tasks shuffle data to reducer tasks [8]. Shuffling also appears naturally in parallel joins, aggregation, and sorts [3, 9, 11].

There has been much recent research on multi-core-friendly join, sort, etc., but comparatively less attention on the NUMA aspects of these algorithms. We find that NUMA awareness is particularly important for shuffling. In this paper, we analyze the behavior of shuffling on different NUMA machines, show that a straightforward implementation significantly underutilizes some parts of the inter-processor interconnect, and present a hardware-tuned shuffle implementation that is up to 3 times more efficient. We also quantify the potential of thread and state migration instead of data shuffling as another approach that achieves equivalent results (multiple threads access data that are provided by other threads).

∗ Work done while the author was at IBM.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2013. 6th Biennial Conference on Innovative Data Systems Research (CIDR'13), January 6-9, 2013, Asilomar, California, USA.
[Figure 1 panels, left to right: 2-socket, 4-socket (a), 4-socket (b), 8-socket]

Figure 1: Interconnect topologies for four Intel servers. Blue bars represent memory DIMMs, orange blocks
are processors, and pipes are data interconnects. See Section 2 for details.

The contributions of this paper are three-fold:

• We demonstrate the importance of revising database algorithms to take into consideration the non-uniform memory access (NUMA) characteristics of multi-socket systems having multiple cores.

• We focus on shuffling, a primitive that is used in a variety of important data management algorithms. We show that a naïve algorithm can be up to 3 times slower than a NUMA-aware one that exploits thread binding, NUMA-aware memory allocation, and thread coordination. This optimized data shuffling primitive can speed up a state-of-the-art join algorithm by 8%.

• We show the potential of thread migration instead of data shuffling for highly asymmetric hardware configurations. In particular, we show that if the working state of threads is not large, migrating a thread (moving the computation to the data) can be up to 2× faster than data shuffling (moving the data to the computation). We show that thread migration can speed up parallel aggregation algorithms by up to 25%.

The rest of the document is structured as follows. Section 2 discusses the new breed of NUMA systems. The problem of data shuffling in NUMA environments is described in Section 3. Section 4 presents a performance comparison of various data shuffling algorithms. We discuss related work in Section 5 and our conclusions in Section 6.

2. NUMA: HARDWARE & OS SUPPORT

Even the most cost-effective database servers today consist of more than one processor socket (e.g., two sockets for the Facebook Open Compute Platform [21]). Today's commercial mid-range and enterprise-class servers may contain up to eight processor sockets. Moreover, each socket accommodates up to ten CPU cores, along with several memory DIMM modules that are attached to the socket through different memory channels. An interconnect network among the sockets allows each core to access non-local memory and maintains the coherency of all caches and memories. Figure 1 shows different configurations for Intel-based servers. Though NUMA architectures have been around for more than two decades, today's multi-socket processors and point-to-point interconnect networks (as opposed to shared buses) make designing high-bandwidth, low-latency solutions more challenging than ever before. Threads running on different cores but within the same socket have (near-)uniform memory access, whereas access to off-socket memory is slower due to the latencies incurred by the interconnect network. Contention caused by other traffic on the interconnect can further increase the latencies of remote accesses.

[Figure 2: Data access pattern in a 4-socket configuration with 4 QPI links, showing the four measured data flows (1)-(4).]

Example: 4-socket system with 4 QPI links. Consider the 4-socket (b) configuration shown in Figure 1. This topology is used in a fully populated IBM X Series x3850 X5 system. It contains four 8-core Nehalem-EX X7560 processors that are fully connected through six bi-directional 3.2 GHz Intel QuickPath Interconnect (QPI) links.

For illustration purposes, we disconnect two QPI links by removing the two QPI wrapper cards, thus obtaining the 4-link interconnect depicted in Figure 2. Using socket-local and socket-remote memory read accesses, we create the four different data flows shown in the figure. Table 1 lists the measured bandwidth and latency for each flow. Our measurements show that we need at least 12 reader threads per socket or QPI link to achieve maximum bandwidth. Flow 1 corresponds to a read of the local memory of Socket 0. The maximum aggregate bandwidth on one socket for this flow was measured to be 24.7 GB/s using 12 threads and a sequential read pattern. The latency for data-dependent random access¹ is 340 CPU cycles (≈ 150 ns).²

¹ In a data-dependent access pattern, the memory location referenced next is determined by the content of the memory location that is currently being accessed.
² The latency numbers are given for data-dependent random accesses, where the first word of a cache line is used to determine the next read location. The latency increases by 5 cycles if the last word of a cache line is used instead.
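The data-dependent access pattern of footnotes 1 and 2 is essentially a pointer chase. The following is a minimal sketch of such a latency loop, written by us for illustration only (it is not the measurement code behind Table 1); the cacheline_t layout and the chase() helper are our own names.

    #include <stddef.h>
    #include <stdint.h>

    /* Data-dependent (pointer-chasing) reads: each 64-byte line stores the
     * index of the next line to visit, so a load cannot issue before the
     * previous one completes and the measured time reflects memory latency,
     * not bandwidth. */
    typedef struct { uint64_t next; char pad[56]; } cacheline_t;

    uint64_t chase(const cacheline_t *lines, size_t nreads) {
        uint64_t idx = 0;
        for (size_t i = 0; i < nreads; i++)
            idx = lines[idx].next;   /* next address depends on the loaded value */
        return idx;                  /* keep the result live so the loop is not optimized away */
    }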
Adding a hop over one single QPI link (flow 2) decreases the bandwidth to 10.9 GB/s, which corresponds to the uni-directional bandwidth of a single QPI link, and increases the random-access latency by 80 CPU cycles (≈ 35 ns). A second hop (flow 3) adds another 100 CPU cycles (≈ 45 ns) to the random-access latency. For data-independent accesses, these latencies can be hidden through pipelining. Hence, adding a second hop does not further affect the sequential bandwidth.

Table 1: Bandwidth and latency of a 4-socket Nehalem-EX system with 4 QPI links.

                                            Bandwidth, seq.      Latency, data-dependent
                                            memory access        random accesses
                                            (12 threads)         (1 thread)
  local memory access (flow 1)              24.7 GB/s            340 cycles/access (≈ 150 ns/access)
  remote memory, 1 hop (flow 2)             10.9 GB/s            420 cycles/access (≈ 185 ns/access)
  remote memory, 2 hops (flow 3)            10.9 GB/s            520 cycles/access (≈ 230 ns/access)
  remote memory, 2 hops with
    cross traffic (flows 3 + 4)              5.3 GB/s            530 cycles/access (≈ 235 ns/access)

Cross traffic can lead to contention on the interconnect links, which lowers the sequential bandwidth for a thread and increases the access latency. For example, when running flow 3 concurrently with flow 4, the aggregate bandwidth of the 12 reader threads on flow 3 drops to 5.3 GB/s and the median random-access latency increases by 10 CPU cycles (≈ 4.4 ns).

In a micro-benchmark in which 12 threads on each socket simultaneously access data on all three remote sockets (4 threads per remote socket, and hence per QPI link), we measure an aggregate bandwidth of 61 GB/s. The aggregate bandwidth would be 99 GB/s if all threads accessed socket-local memory. Hence, the measured interconnect bandwidth corresponds to 60% of the socket-local bandwidth. The Nehalem-EX architecture measured in this example implements a source-based snooping protocol [13, page 20] to maintain coherency. These snooping messages (and replies) are exchanged on the interconnect even when all threads access socket-local memory. Therefore, even socket-local memory access can affect the traffic on the interconnect.

This simple performance evaluation shows that just assigning threads to cores and exploiting socket locality in memory accesses is insufficient. If the data accesses are carefully orchestrated, performance can be improved by reducing contention on the interconnect. These NUMA effects have to be considered, particularly for latency-critical applications such as transaction processing, as shown in [24]. Systems such as DORA [22] avoid the uncoordinated data accesses common in OLTP systems by binding threads (cores) to data.

Operating systems are aware of NUMA architectures. Linux partitions memory into NUMA zones, one for each socket. For each NUMA zone, the kernel maintains separate management data structures. Unless explicitly bound to a specified socket (NUMA zone) through the mbind() system call, the Linux kernel allocates memory on the socket of the core that executes the current thread. More precisely, the allocation to the NUMA zone only happens after the first write access to the memory, which causes a page fault. The kernel tries to satisfy the memory allocation from the current zone. If there is not enough memory available in that zone, the kernel allocates memory from the next zone with memory available.

The operating system scheduler tries to minimize thread migration to other cores. Under certain conditions, such as under-utilization of other cores, the scheduler may decide to migrate a thread to a core in another zone. The kernel does not automatically migrate the associated memory pages along with the thread. Instead, page migration can be performed through the mbind() system call, either manually or automatically by a user-space daemon process.
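The two knobs discussed above, binding a thread to a core and placing its buffer in a specific NUMA zone, can be combined as in the following sketch. It is our own illustration, not the paper's code; it uses the libnuma wrapper numa_alloc_onnode() instead of calling mbind() directly, and it assumes the caller passes a core id that belongs to the target socket.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <pthread.h>
    #include <sched.h>
    #include <numa.h>          /* libnuma: numa_alloc_onnode(), numa_free() */

    /* Pin the calling thread to one core of `socket` and allocate its buffer
     * from that socket's NUMA zone, instead of relying on first-touch placement. */
    static void *bind_and_alloc(int socket, int core, size_t bytes) {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);                      /* `core` must reside on `socket` */
        pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

        return numa_alloc_onnode(bytes, socket);   /* pages come from the local zone */
    }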
3. THE CASE OF SHUFFLING IN NUMA

Shuffling data among data partitions is a ubiquitous building block of most distributed data management operations. For example, in relational systems, partitioning-based join and aggregation algorithms have a step in which each thread in the system partitions data based on a global partitioning function. Then either each producer thread needs to push portions of the data to its corresponding consumer thread, or each consumer needs to pull the data it needs on demand. Obviously, data shuffling is a fundamental operation in the MapReduce framework as well.

In this paper, we focus on efficient data shuffling algorithms for NUMA machines. In particular, the problem we consider is the efficient transmission of data between threads in a machine composed of multiple multi-core sockets that are statically connected by links having finite but possibly different bandwidths. Let S denote the number of sockets in the machine and si denote the ith socket (0 ≤ i < S). Each thread is bound to a core on a socket. Let si.pj denote the jth thread running on socket si. Suppose that we have P threads running on each socket, so that we have a total of N = SP threads. Thread si.pj produces some data, denoted si.dj, that needs to be consumed by the corresponding thread. For every pair of thread si1.pj1 and data si2.dj2 in a shuffle operation involving N threads (N = SP), a data transmission between si2 and si1 occurs when thread si1.pj1 accesses the data si2.dj2. Each transfer has unique source and destination buffers, which means that the threads do not need to synchronize, because they do not modify any shared data structure. The goal is to minimize the overall time of this procedure. Note that this data shuffling problem is very similar to the data consumption problem, in which all the cores of a machine need to consume a buffer of data produced by all N cores of the machine. As a matter of fact, we use this formulation of the problem for the rest of the discussion and experimentation.

3.1 Naive vs. coordinated shuffling

3.1.1 Naive shuffling

The naive shuffling method does not consider the NUMA characteristics of the machine. In the naive shuffling algorithm, each thread pulls the data generated by all other threads, starting from partition s0.d0 up to sS−1.dP−1. This leads to an implementation using two nested loops, as shown below:

 1: for i ← 0 . . . S − 1 do
 2:   for j ← 0 . . . P − 1 do
 3:     pull si.dj
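As a concrete (and purely illustrative) rendering of this pseudocode in C, every thread runs the identical pull loop below; buffer_t and consume() are hypothetical placeholders, not names from the paper. Because all N threads walk the partitions in the same global order, they all read from the same socket at the same time.

    typedef struct buffer buffer_t;       /* an opaque partition buffer (hypothetical) */
    void consume(const buffer_t *b);      /* reads one partition (hypothetical)        */

    /* Naive, uncoordinated shuffle: every thread executes this same loop. */
    void naive_shuffle(int S, int P, const buffer_t *partition /* s_i.d_j at i*P+j */) {
        for (int i = 0; i < S; i++)              /* sockets s_0 .. s_{S-1}      */
            for (int j = 0; j < P; j++)          /* partitions d_0 .. d_{P-1}   */
                consume(&partition[i * P + j]);  /* pull s_i.d_j (often remote) */
    }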
[Figure 3: The uncoordinated, naive shuffling method (first three steps). Panels: (a) Access pattern; (b) QPI activity.]

[Figure 4: The NUMA-aware, coordinated, ring shuffling method (first three steps). Panels: (a) Access pattern; (b) QPI activity.]

However, this naive, uncoordinated method may cause various bottlenecks. Either one of the QPI links or one of the local memory controllers may become over-subscribed, causing delays. It is difficult for a developer to predict such delays, because they depend on hardware characteristics such as the interconnect topology.

Figure 3 illustrates the access pattern and QPI activity of the naive shuffling method for a 4-socket, fully-connected NUMA machine. In this example, suppose that we run 16 threads (4 threads per socket) for the naive shuffling method. As can be seen in Figure 3(a), all threads pull data from the same partition in each step. The QPI activity of the naive shuffling method is shown in Figure 3(b). The data is pulled from the memory banks attached to Socket 0, and is then transmitted to the other three sockets through three QPI links. Consequently, the naive method underutilizes the available bandwidth: it uses only 1/4 of the available memory channels and three of the twelve QPI links.

3.1.2 Coordinated shuffling

The coordinated shuffling method coordinates the usage of the bandwidth of the QPI links across all threads in the system. The basic idea behind the coordinated shuffling method is illustrated in Figure 5. There are two rings in the figure: the members of the inner ring represent the partitions s0.d0, . . . , sS−1.dP−1, while the members of the outer ring identify the threads s0.p0, . . . , sS−1.pP−1. The data partitions are ordered in the inner ring first by thread number on a socket and then by socket number, i.e., s0.d0, s1.d0, s2.d0, . . . , sS−1.d0, s0.d1, s1.d1, . . . In contrast, in the outer ring, the threads are ordered first by socket number and then by thread number, i.e., s0.p0, s0.p1, s0.p2, . . . , s0.pP−1, s1.p0, s1.p1, . . .

[Figure 5: Ring shuffling, an example of a coordinated data shuffling policy. The inner ring holds the partitions s0.d0 . . . s3.d3; the outer ring holds the threads s0.p0 . . . s3.p3.]

During the coordinated shuffling, the inner ring remains fixed while the outer ring rotates clockwise. At each step, all threads access the partition they point to. For instance, as shown in Figure 5, thread s0.p0 accesses partition s0.d0, thread s0.p1 accesses partition s1.d0, and so forth. Once the step is complete, all members of the outer ring are rotated clockwise to the positions of their immediately next ring members. The rotation continues until the outer ring completes a full cycle.
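One possible indexing of this schedule in code is sketched below; it is our own formulation, not taken from the paper, and it assumes the outer ring advances by exactly one position per step. Thread si.pj occupies outer position i·P + j, and inner position m holds partition s(m mod S).d(m div S).

    /* Which partition does thread s_s.p_p read at step t of ring shuffling? */
    typedef struct { int socket; int part; } partition_id;

    partition_id ring_target(int s, int p, int t, int S, int P) {
        int N = S * P;
        int outer = s * P + p;          /* outer ring: ordered by socket, then thread  */
        int inner = (outer + t) % N;    /* outer ring rotated by t positions           */
        partition_id id = { inner % S,  /* inner ring: ordered by thread, then socket  */
                            inner / S };
        return id;                      /* at step t, read s_{id.socket}.d_{id.part}   */
    }

Within one step, the P threads of a socket then point at partitions spread across all sockets, which is what balances the load on the QPI links.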
We briefly explain why this coordinated method achieves high bandwidth utilization on the NUMA machine. First, it is straightforward that a given thread will have accessed all partitions once the outer ring completes a full cycle. Second, consider the segment of the outer ring that spans all threads of a given socket si, i.e., si.p0, . . . , si.pP−1. The partitions pointed to by this segment are always spread evenly over all sockets, because partitions of different sockets are interleaved in the inner ring. This means that the same number of data transfers occurs over all links from the other sockets to socket si, i.e., every link carries at most ⌈P/S⌉ data transfers.

During a shuffling operation, for a given QPI link between socket si and socket sj, there is a total of P × P = P² transmissions through that link (each of the P threads on sj transmits a data partition from each of the P partitions on si). Since the shuffling operation runs in N = PS steps, each QPI link has at least P²/(PS) = ⌈P/S⌉ transmissions in each step. If the machine has a fully-connected topology, i.e., there is a QPI link for each transmission path, the coordinated shuffling method matches the lower bound on the number of transmissions.

Figure 4 illustrates the access pattern and resulting QPI activity of the coordinated shuffling method. Unlike the naive method, each partition is accessed by only one thread at each step. The coordinated method uses all QPI links as well as all memory channels in each step, as shown in Figure 4(b). Consequently, the coordinated method increases the utilization of the QPI bandwidth and provides higher performance than the naive one. Note that for the coordinated shuffling policy to perform as expected, the developer needs to control on which socket each thread is running and where the buffers are allocated. Furthermore, the developer must know the interconnect topology of the machine; to achieve the highest transmission bandwidth, the interconnect network should be fully connected or, at least, symmetric.
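Restating the counting argument above as a display (no new material; P threads per socket, S sockets, N = PS steps):

    \[
      \underbrace{P \cdot P}_{\text{transfers between } s_i \text{ and } s_j}
      \;\big/\; \underbrace{PS}_{\text{steps}}
      \;=\; \frac{P}{S}
      \quad\Longrightarrow\quad
      \Big\lceil \frac{P}{S} \Big\rceil \ \text{transfers per link per step.}
    \]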
3.2 Shuffling vs. thread migration

An alternative to data shuffling is thread migration, in which the process and its working set move across the sockets, rather than the data. In each step of the shuffling operation, we can rebind each thread to a core on another socket, thereby migrating the thread. The working sets of the threads also need to be moved to the target socket through the QPI links. Essentially, thread migration can be viewed as a dual approach to data shuffling: in thread migration, data accesses go to local memory while thread-state accesses potentially go through the interconnect network, whereas in data shuffling, data is explicitly moved through the interconnect network while the thread-state references remain local.

When the data size is larger than the working set of the threads, thread migration is potentially more efficient than shuffling. The trade-off depends on the ratio between data size and working set, and is discussed in Section 4.3.

Note that the choice between shuffling and thread migration is orthogonal to the uncoordinated/coordinated methods described in Section 3.1, and hence both the naive (uncoordinated) and the coordinated methods can also be applied in thread migration for moving the working sets. With the coordinated method, thread migration achieves higher bandwidth utilization for transmitting the working sets.
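A single migration step can be expressed with the same affinity call used for the initial thread binding; the sketch below is illustrative only, and core_on_socket() is a hypothetical helper that maps a (socket, slot) pair to a Linux CPU id.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    int core_on_socket(int socket, int slot);   /* hypothetical topology lookup */

    /* Move the calling thread to a core of `next_socket`. After the call the
     * scheduler runs the thread there, so the data it is about to consume is
     * socket-local; only its (small) working set crosses the interconnect. */
    static void migrate_to(int next_socket, int slot) {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core_on_socket(next_socket, slot), &cpus);
        pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
    }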
4. EVALUATION

We use two IBM X Series x3850 servers for our experiments. The first is a quad-socket with four X7560 Nehalem-EX 8-core sockets and 512 GB of main memory (128 GB per socket), in a fully-connected QPI topology. The second connects two 4-socket x3850 chassis to form one single 8-socket system with eight 8-core X7560 Nehalem-EX sockets and 1 TB of main memory (128 GB per socket). The 4-socket server runs RedHat Enterprise Linux 6.1, whereas the 8-socket runs SuSE Enterprise Linux 11 SP1.

In this section, we first study in detail the performance of various shuffling methods in Section 4.1. Then we show the usefulness of data shuffling for implementing sort-merge joins in Section 4.2. Finally, in Section 4.3, we analyze the trade-offs of shuffling versus thread migration in a typical group-by database operation with aggregation.

4.1 Shuffling for NUMA

We start with a simple experiment, using the shuffling methods to shuffle 1.6 billion 64-bit integers. We have N threads, each of which holds 1/N of the data, partitioned into N pieces and stored in its local memory. The ith thread accesses the ith piece of every thread's partition and computes the mean of all integers in those pieces.

4.1.1 Comparison of various policies

Figure 6 compares the maximum bandwidth achieved for the shuffling task using various shuffling methods, on both the four-socket and the eight-socket machines. We repeat each experiment 10 times and plot the mean. The vertical error bars show the min/max of the 10 runs for each experiment.

In this experiment, we evaluate the four methods described earlier: naive shuffling (Section 3.1.1), ring shuffling (Section 3.1.2), and thread migration (Section 3.2), as well as coordinated random shuffling. In the latter method, each thread randomly chooses the partition to access next. In addition, we analyze two different implementation variants of the coordinated shuffling and thread migration methods. In the first, threads are synchronized and operate in lock step from one step of the shuffling process to the next; we call these implementations "tight". In the second variant, we remove this synchronization barrier between steps, such that threads move independently to the next step; we refer to this implementation as "loose". The idea of the synchronization is that it potentially avoids interference caused by diverging threads, i.e., threads that end up in different steps. The flip side of synchronization, however, is the extra cost of the synchronization itself.

We first discuss the left side of Figure 6, which shows the performance on the four-socket machine. The naive method performs poorly, achieving a bandwidth of only 22 GB/s. In this method, all threads pull data from the memory of one single socket (as shown in Figure 3(b)), and are therefore significantly limited by the local memory bandwidth of that single socket (24.7 GB/s).

The random shuffling method (labeled "Coord. random (tight)" in Figure 6) achieves better performance than the naive policy, but is still inferior to the other methods. In addition, random shuffling exhibits a higher variance in bandwidth than the other methods (as shown by the error bars).
[Figure 6 bar chart. Panels: 4-socket (left), 8-socket (right). y-axis: Bandwidth (GB/s), 0-120. x-axis policies: Naive, Coord. random (tight), Ring (loose), Ring (tight), Thread migration (loose), Thread migration (tight).]
Figure 6: Data shuffling bandwidth achieved by various shuffling policies. Policies that consider the underlying
QPI topology achieve up to 3× higher bandwidth than the naive approaches. Thread migration may achieve
even higher bandwidth.

The relatively low performance of random shuffling indicates that a simple random permutation of the producer-consumer pairs, without taking the hardware topology into consideration, is not sufficient.

The ring shuffling method tries to maximize the QPI utilization and achieves 3× higher bandwidth than the naive method on the four-socket machine. The synchronized implementation ("Ring (tight)" in Figure 6) is slightly faster than the unsynchronized version ("Ring (loose)"). The 70 GB/s bandwidth achieved by the ring shuffling method is still lower than the aggregate bandwidth of all memory channels (4 × 24.7 GB/s = 98.8 GB/s) or of all QPI links (12 × 10.9 GB/s = 130.8 GB/s). The gap is mainly due to the transmission of snooping messages (and replies) that are exchanged on the interconnect for the system's coherency protocol.

Thread migration achieves even higher bandwidth than the ring shuffling. The synchronized implementation of the migration method achieves a bandwidth of 94 GB/s, which is close to the theoretical maximum bandwidth of that machine (4 × 24.7 GB/s = 98.8 GB/s). Here, all threads access data through the memory channels to their local memory, so only snooping messages are exchanged on the interconnect, and the number and size of these messages are small enough that they do not saturate the interconnect. The unsynchronized version of the migration method on a system with a fully populated interconnection network exhibits lower performance and significantly higher variability. The reason is that, at any point in time, threads may end up on the same core, forming convoys. Note that each thread in this experiment has almost no state information. This represents the best use case for the thread migration method. We further evaluate the potential of thread migration in Section 4.3.

The right side of Figure 6 shows the bandwidths of these methods on our eight-socket machine. This system achieves bandwidths similar to the four-socket machine. Surprisingly, the bandwidths of the ring shuffling and migration methods do not increase on the eight-socket machine, despite its doubled local memory bandwidth. Instead, thread migration achieves similar performance on both machines, whereas the ring shuffling method performs even worse on eight sockets. The reasons for this are the non-uniform QPI links and the coherency traffic. It turns out that four of the twelve QPI links in the eight-socket system have a lower bandwidth (7.8 GB/s vs. 10.9 GB/s) and become the bottleneck for the overall performance. Furthermore, more snooping messages (and replies) are exchanged on the interconnect in the case of the eight-socket configuration. Those messages not only consume more bandwidth on the interconnect but also increase the latencies of the memory requests.

The results indicate that the source-based snooping protocol used for cache coherency in our Nehalem-EX systems does not seem to scale well to eight sockets. In fact, Intel notes [18, p. 72] that "[the source-snoop protocol] works well for small systems with up to four or so processors that are likely to be fully connected by Intel QPI links. However, a better choice exists for systems with a large number of processors . . . The home snoop behavior is better suited for these systems where link bandwidth is at a premium and the goal is to carefully manage system traffic for the best overall performance." Unfortunately, we have not found this second protocol implemented in any system available to us.

[Figure 7 plot: series Thread migration, Ring shuffling, Naive; x-axis: # Threads (1-32); y-axis: Bandwidth (GB/s).]

Figure 7: Bandwidth achieved when shuffling uniformly distributed data, as the number of threads increases, in a four-socket system.
30                                                  In this experiment, we analyze the resulting performance
                                          Total (Ring)
                                                                              gain when using the ring shuffling primitive in the P-MPSM
  Scalability (vs. 1thead)
                             25           Total (Naive)
                                                                              join. We join 1.6B tuples with 16M tuples, where each
                                          Merge Phase (Ring)                  tuple contains a 64-bit key and a 64-bit value. Figure 8
                             20
                                          Merge Phase (Naive)                 shows the scalability of this approach as the speed-up over a
                             15                                               single-threaded implementation. At first, the disappointing
                                                                              observation is that the use of the primitive can only slightly
                             10                                               increase the overall scalability of the R-MPSM join from
                                                                              25.0× to 26.2×. The reason is that the execution time is
                             5                                                not dominated by the merge phase (4).
                                                                                 The time breakdown for the P-MPSM join in Figure 9
                             0
                                                                              shows that sorting the fact table chunks in phase (1) over-
                                     1          2     4         8   16   32   whelmingly dominates the execution time. Since the shuf-
                                                     # Threads                fling primitive is only used in the merge phase (4), we plot
                                                                              the scalability of this phase separately in Figure 8 . For the
                                                                              merge phase alone, our more efficient implementation of the
Figure 8: Scalability of the total time (solid lines)                         ring shuffling is able to increase the scalability from 9.4× to
and of the merge phase (dashed lines) of a state-                             16.3×, and hence, almost doubles the scalability.
of-the-art sort-merge-based join implementation [1] ,                            The left-hand side of Figure 9 depicts the time breakdown
when employing the naive and the NUMA-aware                                   for the naive shuffling implementation, and the right-hand
ring data shuffling methods. The merge phase,                                 side for the synchronized ring shuffling. It can be seen that
where the shuffling takes place, has superior scal-                           the already small fraction of the merge phase is further re-
ability with ring shuffling.                                                  duced by the ring shuffling. The difference in the merge
                                                                              phase between the two implementations becomes noticeable
                                                                              when using more than 8 threads, i.e., when the number of
4.1.2 Scalability

Figure 7 shows the bandwidth achieved (in GB/sec) as the number of threads increases when shuffling data among the sockets of a four-socket system. As can be seen in Figure 7, all three shuffling methods achieve nearly linear speed-ups before their bandwidths approach the respective bottlenecks. The three methods have similar bandwidths when using fewer than 4 threads. The naive method, however, stops scaling at about 12 threads and 22 GB/s, because the memory bandwidth of a single socket becomes the bottleneck. The ring shuffling method achieves a linear speed-up for up to 20 threads, at which point the interconnect network becomes saturated by both the data and the snooping messages. Among the three methods, thread migration scales best as the number of threads increases: it shows close-to-optimal linear speedup, except at 32 threads, where the capacity limit of the memory channels is reached.
4.2 Shuffling as a primitive

An efficient, platform-optimized implementation of the shuffling primitive can be used as a building block in the construction of database systems. As an example, we integrate the synchronized ("tight") implementation of the ring shuffling method into the sort-merge-based join algorithm described by Albutiu et al. [1]. Their Range-partitioned Massively Parallel Sort-Merge (MPSM) join, called P-MPSM, is optimized for NUMA machines by trying to reduce the amount of data transferred among sockets.

The P-MPSM join has four phases: (1) The fact table is split into chunks and sorted locally. This implicitly partitions each chunk into a set of fact runs. (2) The dimension table is similarly split into dimension chunks, which are then range partitioned. (3) Each thread collects all associated partitions from all dimension chunks and sorts them locally into dimension runs. (4) Finally, each worker thread merges a dimension run with all corresponding fact runs. We integrated the ring shuffling algorithm into the fourth phase to read the dimension runs from remote sockets; a minimal sketch of the resulting access schedule is given below, after Figure 9.

A first observation is that the use of the primitive can only slightly increase the overall scalability of the P-MPSM join, from 25.0x to 26.2x. The reason is that the execution time is not dominated by the merge phase (4). The time breakdown for the P-MPSM join in Figure 9 shows that sorting the fact table chunks in phase (1) overwhelmingly dominates the execution time. Since the shuffling primitive is only used in the merge phase (4), we plot the scalability of this phase separately in Figure 8. For the merge phase alone, our more efficient implementation of the ring shuffling is able to increase the scalability from 9.4x to 16.3x, and hence almost doubles the scalability.

The left-hand side of Figure 9 depicts the time breakdown for the naive shuffling implementation, and the right-hand side for the synchronized ring shuffling. It can be seen that the already small fraction of the merge phase is further reduced by the ring shuffling. The difference in the merge phase between the two implementations becomes noticeable when using more than 8 threads, i.e., when the number of threads in a single socket is exceeded, so inter-socket communication is required.

Although sorting the fact table is currently the dominating phase in this case, the merge phase is likely to dominate as the number of threads increases. Furthermore, since the time spent on sorting is likely to be reduced by hardware accelerators or by exploiting more cache-efficient algorithms, and future systems are likely to increase their number of cores significantly, it is quite probable that the performance of the join will increasingly depend on the efficiency of the data shuffling.
[Figure 9: stacked-bar time breakdown, y-axis 0 to 100% of total time, x-axis # Threads (1 to 32); panels: Naive shuffling (left) and Ring shuffling (right); bars split into Partition, Sort Dim., Sort Fact, and Merge.]

Figure 9: Breakdown of the time spent in each phase of the sort-merge-based join implementation when data shuffling uses the naive approach (left) and when it uses the coordinated ring shuffling (right). The merge phase, in which the data shuffling takes place, accounts for half as large a fraction of the time when done with the NUMA-aware method.
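To make the integration concrete, the sketch below illustrates the access schedule that the synchronized ring shuffling imposes on the merge phase. It is a minimal illustration under assumptions, not the actual implementation: the names (dim_runs, merge_with_fact_runs, round_barrier) are hypothetical placeholders, and the only point being shown is the coordination pattern, namely that in round r every thread whose home socket is s reads from socket (s + r) mod S, so no memory node is read by more than one socket at a time and all threads advance around the ring together.

#include <pthread.h>
#include <cstdint>
#include <vector>

struct Row { uint64_t key; uint64_t payload; };

// Hypothetical globals for the sketch: one sorted dimension run per socket
// (assumed to be allocated in that socket's memory) and a barrier shared by
// all worker threads, initialized elsewhere with the total thread count.
static std::vector<std::vector<Row>> dim_runs;   // dim_runs[s] resides on socket s
static pthread_barrier_t round_barrier;

// Placeholder for the per-thread merge work of phase (4); the real join logic
// is out of scope for this sketch.
static void merge_with_fact_runs(int tid, const std::vector<Row>& dim_run) {
    (void)tid; (void)dim_run;
}

// Ring schedule: in round r, every thread whose home socket is s reads only the
// run on socket (s + r) % S, so each memory node serves one socket at a time.
void ring_merge_phase(int tid, int home_socket, int num_sockets) {
    for (int r = 0; r < num_sockets; ++r) {
        int target = (home_socket + r) % num_sockets;
        merge_with_fact_runs(tid, dim_runs[target]);
        // Synchronized ("tight") variant: all threads advance to the next ring
        // position together; omitting this barrier yields a relaxed variant.
        pthread_barrier_wait(&round_barrier);
    }
}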
4.3 Data shuffling vs. thread migration

Data shuffling is not a panacea and, as we saw earlier in Section 4.1.1, there is potential for moving the computation to the data rather than shuffling the data around. Migrating threads makes sense, especially when the accompanying working set of the migrating thread is relatively small. In this section, we evaluate the trade-off between data shuffling (moving the data) and thread migration (moving the computation) further. To do that, we devised a simple benchmark that models the following group-by operation traditionally found in database systems:

SELECT key, SUM(value) FROM table GROUP BY key

We measure the bandwidth at which a large number of key-value pairs can be grouped by their key and the values with the same key summed up. The implementation partitions the input by key. The size of both the key and the value is 8 bytes. A minimal sketch of this micro-benchmark is given below.
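The following sketch captures the logic the benchmark models: hash-partition the 8-byte key/value pairs into one partition per core, then aggregate each partition with a per-partition hash table. It is a simplified, sequential illustration; the function names and the use of std::unordered_map are assumptions of the sketch (the tuned implementations place partitions and threads explicitly on sockets). The per-partition hash table is exactly the per-thread aggregation state whose size drives the shuffling-versus-migration trade-off discussed next.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Pair { uint64_t key; uint64_t value; };   // 8-byte key, 8-byte value

// Step 1: hash-partition the input by key, one partition per processing core.
std::vector<std::vector<Pair>> partition_by_key(const std::vector<Pair>& input,
                                                unsigned num_partitions) {
    std::vector<std::vector<Pair>> parts(num_partitions);
    for (const Pair& p : input)
        parts[p.key % num_partitions].push_back(p);
    return parts;
}

// Step 2: SELECT key, SUM(value) ... GROUP BY key, restricted to one partition.
// The returned hash table is the thread's aggregation state.
std::unordered_map<uint64_t, uint64_t> aggregate_partition(const std::vector<Pair>& part) {
    std::unordered_map<uint64_t, uint64_t> sums;
    for (const Pair& p : part)
        sums[p.key] += p.value;
    return sums;
}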

[Figure 10: aggregation bandwidth (GB/s, 0 to 100) versus # distinct keys per partition (1K to 8M); panels: 4-socket and 8-socket systems; lines: Naive, Ring, Migration, Migration & copy.]

Figure 10: Effective aggregation bandwidth of two shuffling-based and two migration-based implementations of partitioned aggregation, as the number of distinct groups per partition increases.

Figure 10 shows the aggregation bandwidth as the number of distinct keys per partition increases. The number of partitions is equal to the number of processing cores each machine has; that is, there are 32 partitions for the four-socket system and 64 partitions for the eight-socket system. We compare the performance of four implementations. The first two are based on shuffling: one uses the naive method, the other ring shuffling. We also evaluate two thread migration-based implementations. In the first one, the threads do not copy their state (the aggregated data) along when they migrate to the socket containing the partition they need to aggregate next. This implies that the first memory request to each thread's aggregation state will cause a cache miss and a resulting remote memory access over the interconnect. We label this implementation Migration in Figure 10. The last implementation avoids these remote accesses during the processing of the aggregation by explicitly copying each thread's aggregation state to its new socket before migrating the thread itself, i.e., each migrating thread copies its aggregation state from its local memory to the memory location of the soon-to-be-consumed partition. We call this implementation Migration & copy in Figure 10 (a minimal sketch of this copy-then-migrate step is given below, before Figure 11).

Figure 10 shows that the performance of migration is significantly higher than that of shuffling if the number of distinct groups per partition (i.e., the state of each migrating thread) is less than 256k keys. In particular, if the partitions contain fewer than 32k keys, so that each thread's aggregation state fits into the L2 cache, the difference between the performance of the shuffling and the migrating methods corresponds to the bandwidth difference we observed in Figure 6. As the thread state grows, i.e., as the number of distinct keys per partition increases, we observe a crossover point at which the optimized shuffling implementation becomes more efficient. The crossover point is at about 512k keys per partition. That means that when the total number of distinct groups in an aggregation query is larger than 16M or 32M keys for the four-socket machine (and 32M or 64M keys for the eight-socket machine), the shuffling-based implementation should be preferred over the migration-based one. A query optimizer capable of predicting the expected number of distinct keys can then choose the appropriate aggregation implementation.

In order to analyze the impact of the individual approaches on the overall bandwidth, a time breakdown is given in Figure 11 for the four-socket system. We classify the operations performed during the aggregation into three categories and report the total time spent in each category. The first contains accesses to the unaggregated data (labeled Read partitions in Figure 11). The second includes the operations that perform the actual aggregation, i.e., accesses and updates to the aggregation state with the newly added data (labeled Upd. hash table). The third is only applicable to Migration & copy: it represents the cost of transferring the intermediate aggregation result from local memory into the memory that contains the next partition to be aggregated (labeled Copy state in Figure 11).
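The sketch below shows one way the Migration & copy step could be realized with libnuma, and it is only an illustration under assumptions: the libnuma calls (numa_alloc_onnode, numa_run_on_node, numa_free) are the real API, but the flat state buffer and the function itself are hypothetical stand-ins for the thread's hash-table state, and we assume the previous copy of the state was also allocated through libnuma. The plain Migration variant corresponds to calling only numa_run_on_node, without the copy.

#include <numa.h>      // libnuma; link with -lnuma
#include <cstring>
#include <cstddef>

// One possible realization of the "Migration & copy" step. `state` stands for
// the thread's aggregation state, stored in a flat buffer of `state_bytes`
// bytes that was itself allocated with numa_alloc_onnode (sketch assumptions).
void* migrate_with_state(void* state, std::size_t state_bytes, int target_node) {
    // Copy the aggregation state into memory on the destination NUMA node,
    // paying the interconnect cost once, up front.
    void* local_copy = numa_alloc_onnode(state_bytes, target_node);
    if (local_copy == nullptr)
        return state;                 // fall back: keep the (remote) old copy
    std::memcpy(local_copy, state, state_bytes);

    // Move the computation: restrict this thread to the CPUs of the target node.
    numa_run_on_node(target_node);

    // The old copy on the previous node is no longer needed.
    numa_free(state, state_bytes);
    return local_copy;                // continue aggregating with node-local state
}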
[Figure 11: time (sec, 0 to 12) per implementation (Naive, Ring, Migration, Migration & copy) as the number of distinct groups per partition increases; bars stacked into Read partitions, Upd. hash table, and Copy state.]
Figure 11: Time breakdown for the four aggregation implementations, each with a different data-movement approach, as the number of distinct groups per partition increases, on the four-socket machine. The time is broken down into the time to scan the table, to update the grouping hash table, and, in the case of the Migration & copy implementation, to move the state.

Several observations can be made from Figure 11. First, the naive method is bottlenecked by the accesses to the unaggregated data (the scans): these accesses take as much as 3 times longer than in the other implementations, which is consistent with the results presented in Section 4.1.1. Second, the thread migration policies access the unaggregated data 25% faster than the ring shuffling method, but as the size of the local aggregation result (and thus the thread state) increases, the execution time becomes dominated by the time it takes to access the local aggregation result; the crossover point is between 1M and 2M distinct groups. Finally, the method that migrates the thread and copies over the local aggregation result is much slower for large result sizes, because of the overhead of copying the local aggregation result.
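To wrap up this trade-off, the following toy decision rule distills the crossover just discussed into the kind of check a query optimizer could perform. It is a hedged sketch: the 512k-keys-per-partition threshold is the crossover observed on our machines, and a real optimizer would have to calibrate it per platform and combine it with its cardinality estimates.

#include <cstdint>

enum class AggStrategy { ThreadMigration, RingShuffling };

// Below the measured crossover, migrating the threads wins; above it,
// shuffling the data does. The default threshold is hardware-dependent.
AggStrategy choose_aggregation(std::uint64_t estimated_distinct_keys,
                               std::uint64_t num_partitions,
                               std::uint64_t crossover_per_partition = 512 * 1024) {
    std::uint64_t keys_per_partition = estimated_distinct_keys / num_partitions;
    return keys_per_partition < crossover_per_partition
               ? AggStrategy::ThreadMigration
               : AggStrategy::RingShuffling;
}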
5. RELATED WORK

Exchanging partitioned data among parallel processes, particularly to do joins, has a long history in the database literature (e.g., [5, 10, 11]). Most of this work simply tried to minimize the aggregate amount of data transmitted, rather than concern itself with differing limits on the capacity of the individual links between nodes, and did not try to orchestrate the timing or routing of these transmissions between nodes. More recent papers have focused on multi-core efficiency (e.g., [7, 12, 14]), but assumed that any core could access the memory of another core in about the same time as its own. The architecture community has also studied many aspects of DBMS performance from the perspective of efficiency on parallel hardware (e.g., [16, 25]). Much of this work has focused on symmetric multiprocessors (SMPs) or clusters; multi-socket multi-core systems are still fairly new, and relatively less studied.

The memory performance of NUMA machines has been evaluated, e.g., in [19, 20]. These studies are complementary to the NUMA memory bandwidths reported in Section 2.

Recently, Albutiu et al. presented a NUMA-aware distributed sort-merge join implementation [1]. Their work did not focus on the data shuffling step of the algorithm, but, as shown in the previous section, our NUMA-aware data shuffling algorithm was able to speed up their join by up to 8% compared to a naive shuffle.

On the other hand, on-line transaction processing (OLTP) is primarily a latency-bound application, since synchronization is latency-bound. Thus, the NUMA effect is even more pronounced in OLTP. Porobic et al. [24] describe how the performance of OLTP systems is sensitive to the latency variations arising in NUMA systems, proposing that in many cases the NUMA hardware should be treated as a cluster of independent nodes. [15] presents a NUMA-aware variation of an efficient log-buffer insert protocol.

In this paper we present evidence for the potential of thread migration instead of data shuffling. Moving the computation to where the data resides seems to be especially beneficial when the working set of the thread is small. Analogous execution models have been proposed both for business intelligence (e.g., the DataPath system [2]) and for transaction processing (e.g., the data-oriented transaction execution model of [22, 23]).

Data shuffling has been studied extensively in the MapReduce framework. Orchestra [6] was proposed as a global control architecture to manage transfer activities, such as broadcast and shuffle, in cloud environments. Similar to our work, the scheduling in Orchestra is topology-aware. For a shuffle operation, each node simultaneously carries out multiple transmissions to various destination nodes, and Orchestra coordinates the speeds of these transmissions to achieve globally optimal performance. In contrast, in our method each thread accesses only one target at a time, and the NUMA-aware shuffling method coordinates the order of the transmissions to the various destinations.

Recently, the applicability of the MapReduce model was evaluated on NUMA machines [27]. Phoenix++ [26] is a revision of the Phoenix framework that applies the MapReduce model on NUMA machines. Our work provides an opportunity to optimize the shuffling operations of the MapReduce model on NUMA machines.

The NUMA architecture also impacts the design of operating systems. A new OS architecture, the multikernel, was proposed to match the networked nature of NUMA machines [4]. The multikernel treats a NUMA machine as a small distributed system of processes that communicate via message passing. In contrast, our work proposes to provide an efficient shuffling primitive on NUMA machines that is used for building data management systems.
6. CONCLUSIONS

The days when software performance automatically improved with each new hardware generation are long gone. Transistor counts continue to grow, but one now has to carefully design software to eke out every bit of performance from the hardware. However, over-engineering software for each supported hardware configuration further complicates already complex DBMSs. So we argue instead that it is crucial to identify core, performance-critical DBMS primitives that can be tuned to specific hardware and provided as hardware-specific libraries (as done, for example, in BLAS, http://www.netlib.org/blas/), while keeping the remaining DBMS code fairly generic.

In this paper, we have used the common data management primitive of data shuffling to illustrate the significant benefit of optimizing algorithms for NUMA hardware. Going forward, we expect many other such primitives to be similarly investigated, and to be hardware-optimized or even co-designed with new hardware configurations and technologies, to realize major performance gains.

Acknowledgments

We cordially thank Volker Fischer from IBM's Systems Optimization Competence Center (SOCC) for setting up and providing help with the 8-socket machine.
References

[1] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10), 2012.
[2] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The DataPath system: a data-centric analytic processing engine for large data warehouses. In SIGMOD, 2010.
[3] C. Baru and G. Fecteau. An overview of DB2 parallel edition. In SIGMOD, 1995.
[4] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP, 2009.
[5] H. Boral, W. Alexander, L. Clay, G. P. Copeland, S. Danforth, M. J. Franklin, B. E. Hart, M. G. Smith, and P. Valduriez. Prototyping Bubba, a highly parallel database system. IEEE TKDE, 2, 1990.
[6] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, 2011.
[7] J. Cieslewicz and K. A. Ross. Adaptive aggregation on chip multiprocessors. In VLDB, 2007.
[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
[9] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE TKDE, 2(1), 1990.
[10] D. J. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. CACM, 35, 1992.
[11] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990.
[12] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, and B. Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR, 2007.
[13] Intel. Intel Xeon Processor 7500 Series Architecture: Datasheet, Volume 2, 2010. Available at http://www.intel.com/Assets/PDF/datasheet/323341.pdf.
[14] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: a scalable storage manager for the multicore era. In EDBT, 2009.
[15] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 20, 2011.
[16] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker. Performance characterization of a quad Pentium Pro SMP using OLTP workloads. In ISCA, 1998.
[17] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5(3), 1979.
[18] R. A. Maddox, G. Singh, and R. J. Safranek. Weaving High Performance Multiprocessor Fabric. Intel Press, 2009.
[19] Z. Majo and T. R. Gross. Memory system performance in a NUMA multicore multiprocessor. In SYSTOR, 2011.
[20] C. McCurdy and J. S. Vetter. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In ISPASS, 2010.
[21] A. Michael, H. Li, and P. Sart. Facebook Open Compute Project. In HotChips, 2011.
[22] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. PVLDB, 3(1), 2010.
[23] I. Pandis, P. Tözün, R. Johnson, and A. Ailamaki. PLP: page latch-free shared-everything OLTP. PVLDB, 4(10), 2011.
[24] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki. OLTP on hardware islands. PVLDB, 5(11), 2012.
[25] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In ASPLOS-VIII, 1998.
[26] J. Talbot, R. M. Yoo, and C. Kozyrakis. Phoenix++: modular MapReduce for shared-memory systems. In MapReduce, 2011.
[27] R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In IISWC, 2009.