Accelerating Spark with RDMA for Big Data Processing: Early Experiences
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Digital Data Explosion in the Society
[Figure: projected growth of digital data. Source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode]
Big Data Technology - Hadoop
• Apache Hadoop is one of the most popular Big Data technologies
– Provides frameworks for large-scale, distributed data storage and
processing
– MapReduce, HDFS, YARN, RPC, etc.
Hadoop 1.x:
  MapReduce (Cluster Resource Management & Data Processing)
  Hadoop Distributed File System (HDFS)
  Hadoop Common/Core (RPC, ..)

Hadoop 2.x:
  MapReduce (Data Processing) | Other Models (Data Processing)
  YARN (Cluster Resource Management & Job Scheduling)
  Hadoop Distributed File System (HDFS)
  Hadoop Common/Core (RPC, ..)
Big Data Technology - Spark
• An emerging in-memory processing framework
  – Iterative machine learning
  – Interactive data analytics
  – Scala-based implementation
  – Master-Slave; HDFS, Zookeeper
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Same as Hadoop, sockets-based communication

[Figure: Spark cluster architecture (Driver/SparkContext, Master, Workers, Zookeeper, HDFS); narrow vs. wide RDD dependencies]
Common Protocols using Open Fabrics

[Figure: protocol stacks from application to adapter/switch — sockets over Ethernet (1/10/40 GigE, 10/40 GigE-TOE), sockets over InfiniBand (IPoIB, RSockets, SDP), iWARP, RoCE, and native IB Verbs; kernel-space TCP/IP paths vs. user-space RDMA with hardware offload]

Previous Studies
• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
– Hadoop Acceleration with RDMA
  • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
  • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
  • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
  • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
  • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
– HBase Acceleration with RDMA
  • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
– Memcached Acceleration with RDMA
  • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the
verbs-level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs
http://hibd.cse.ohio-state.edu
• RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
[Figure: job execution time with IPoIB (QDR) vs. OSU-IB (QDR) for RandomWriter and Sort at 50 GB/16 nodes, 100 GB/32 nodes, 200 GB/48 nodes, and 300 GB/64 nodes; reductions of 20% and 36% highlighted]

• RandomWriter in a cluster with a single SSD per node
  – 20% improvement over IPoIB for 50GB in a cluster of 16 nodes
  – 36% improvement over IPoIB for 300GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
  – 16% improvement over IPoIB for 50GB in a cluster of 16 nodes
  – 20% improvement over IPoIB for 300GB in a cluster of 64 nodes

Can RDMA benefit Apache Spark on High-Performance Networks also?
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Problem Statement
• Is it worth it?
– Is the performance improvement potential high enough, if we can
successfully adapt RDMA to Spark?
– A few percentage points or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
– Can RDMA be adapted to suit the communication needs of
Spark?
– Is it feasible to rewrite portions of the Spark code to use
RDMA?
• Can Spark applications benefit from an RDMA-enhanced
design?
– What are the performance benefits that can be achieved by using
RDMA for Spark applications on modern HPC clusters?
– Can RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
• Assessment Methodology
– Evaluation on primitive-level micro-benchmarks
• Latency, Bandwidth
• 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
• Java/Scala-based environment
– Evaluation on typical workloads
• GroupByTest (see the sketch below)
• 1GigE, 10GigE, IPoIB (32Gbps)
• Spark 0.9.1
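For reference, here is a minimal sketch of the kind of GroupBy workload used in this assessment, modeled on the GroupByTest example that ships with Spark. The master URL, object name, and mapper/reducer/size parameters are illustrative assumptions, not the exact benchmark configuration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // PairRDDFunctions (groupByKey) in Spark 0.9.x
import scala.util.Random

// Minimal GroupBy-style shuffle workload: generate random key/value pairs,
// then repartition them with groupByKey, which triggers the shuffle phase
// that the RDMA design targets. All numbers below are illustrative only.
object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "GroupBySketch")
    val numMappers  = 64      // map tasks generating data
    val numKVPairs  = 10000   // key/value pairs per mapper
    val valSize     = 1000    // bytes per value
    val numReducers = 64      // reduce-side parallelism

    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count()                                   // materialize the map output

    // Every value crosses the network during this shuffle.
    println(pairs.groupByKey(numReducers).count())
    sc.stop()
  }
}
```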
Evaluation on Primitive-level Micro-benchmarks

[Figure: latency for small and large messages (1 byte–4 MB) and peak bandwidth, measured in a Java/Scala-based environment over 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]

• Obviously, IPoIB and 10GigE are much better than 1GigE.
• Compared to other interconnects/protocols, RDMA can significantly
  – reduce the latencies for all the message sizes
  – improve the peak bandwidth

Can these benefits of High-Performance Networks be achieved in Apache Spark?

(A sketch of a socket-level latency test of this kind is shown below.)
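A minimal sketch of a socket-level ping-pong latency test of the kind that can run unchanged over 1GigE, 10GigE, or IPoIB in a Java/Scala environment. The iteration count and buffer sizes are arbitrary assumptions, and the RDMA data point would instead require a verbs-level benchmark through JNI, which is not shown here.

```scala
import java.net.{ServerSocket, Socket}

// Run server(port) on one node and client(host, port, msgSize) on another.
// The client reports the average one-way latency in microseconds.
object PingPongSketch {
  val Iters = 10000

  def server(port: Int): Unit = {
    val srv = new ServerSocket(port)
    val s = srv.accept(); s.setTcpNoDelay(true)
    val in = s.getInputStream; val out = s.getOutputStream
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n > 0) {                      // echo every byte back until the client disconnects
      out.write(buf, 0, n); out.flush()
      n = in.read(buf)
    }
  }

  def client(host: String, port: Int, msgSize: Int): Double = {
    val s = new Socket(host, port); s.setTcpNoDelay(true)
    val in = s.getInputStream; val out = s.getOutputStream
    val msg = new Array[Byte](msgSize)
    val start = System.nanoTime()
    var i = 0
    while (i < Iters) {
      out.write(msg); out.flush()
      var got = 0
      while (got < msgSize) got += in.read(msg, got, msgSize - got)
      i += 1
    }
    (System.nanoTime() - start) / 1000.0 / Iters / 2   // one-way latency in microseconds
  }
}
```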
Evaluation on Typical Workloads

[Figure: GroupBy execution time for 4–10 GB data sizes with 1GigE, 10GigE, and IPoIB (32Gbps), with improvements of up to 65% and 22% highlighted; network throughput over time (send and receive) for the three interconnects]

• For High-Performance Networks,
  – The execution time of GroupBy is significantly improved
  – The network throughputs are much higher for both send and receive
• Spark can benefit from high-performance interconnects to a greater extent!

Can RDMA further benefit Spark performance compared with IPoIB and 10GigE?
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Architecture Overview

[Figure: Spark applications (Scala/Java/Python) run tasks on Spark (Scala/Java); the BlockManager's Shuffle Server and the BlockFetcherIterator's Shuffle Fetcher each have Java NIO (default), Netty (optional), and RDMA (plug-in) implementations; the RDMA plug-ins sit on an RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR), while the default paths use Java sockets over 1/10 Gig Ethernet or IPoIB (QDR/FDR)]

• Design Goals
  – High performance
  – Keeping the existing Spark architecture and interface intact
  – Minimal code changes
• Approaches (see the plug-in sketch below)
  – Plug-in based approach to extend the shuffle framework in Spark
    • RDMA Shuffle Server + RDMA Shuffle Fetcher
    • ~100 lines of code changes inside the original Spark files
  – RDMA-based Shuffle Engine
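To illustrate the plug-in idea (not Spark's actual internal interfaces), the sketch below shows how an RDMA shuffle-fetcher implementation could be selected alongside the default NIO path by a configuration flag. The trait and class names are hypothetical.

```scala
import java.nio.ByteBuffer

// Hypothetical shuffle-fetcher abstraction: an RDMA implementation can be
// swapped in next to the default NIO path (and the optional Netty path)
// without touching the rest of the block-management logic.
trait ShuffleFetcherPlugin {
  def fetch(blockIds: Seq[String]): Iterator[(String, ByteBuffer)]
}

class NioShuffleFetcher extends ShuffleFetcherPlugin {
  // Placeholder body: the real default path reads blocks over Java sockets.
  def fetch(blockIds: Seq[String]) = blockIds.iterator.map(id => (id, ByteBuffer.allocate(0)))
}

class RdmaShuffleFetcher extends ShuffleFetcherPlugin {
  // Placeholder body: the real plug-in delegates to the JNI-based RDMA Shuffle Engine.
  def fetch(blockIds: Seq[String]) = blockIds.iterator.map(id => (id, ByteBuffer.allocate(0)))
}

object ShuffleFetcherPlugin {
  // Selected by a configuration flag, so the default code path stays intact.
  def create(mode: String): ShuffleFetcherPlugin = mode match {
    case "rdma" => new RdmaShuffleFetcher
    case _      => new NioShuffleFetcher
  }
}
```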
SEDA-based Data Shuffle Plug-ins
• SEDA - Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on the corresponding queue
  – Listener, Receiver, Handlers, Responders
• Performing admission control on these event queues
• High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark (see the sketch below)

[Figure: server-side pipeline — the Listener accepts connections from the connection pool, the Receiver enqueues requests, Handlers dequeue requests and read block data through the RDMA-based Shuffle Engine, and Responders dequeue and send responses]
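A minimal sketch of the SEDA pattern as described above: stages connected by bounded queues, each driven by its own thread pool, so request handling and response sending overlap. The stage and message names mirror the slide, but the code is an illustrative assumption, not the actual plug-in implementation.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

case class ShuffleRequest(blockId: String)
case class ShuffleResponse(blockId: String, data: Array[Byte])

// A generic SEDA stage: a bounded queue (admission control) plus a dedicated
// thread pool; processed events are forwarded to the next stage's queue.
class Stage[A, B](name: String, threads: Int, capacity: Int,
                  process: A => Option[B], next: Option[Stage[B, _]]) {
  private val queue = new LinkedBlockingQueue[A](capacity) // bounded => back-pressure
  private val pool  = Executors.newFixedThreadPool(threads)

  def enqueue(event: A): Unit = queue.put(event)           // blocks when the stage is saturated

  def start(): Unit = (1 to threads).foreach { _ =>
    pool.submit(new Runnable {
      def run(): Unit = while (!Thread.currentThread().isInterrupted) {
        val event = queue.take()
        process(event).foreach(out => next.foreach(_.enqueue(out)))
      }
    })
  }
}

object SedaSketch {
  def main(args: Array[String]): Unit = {
    // Responder stage: sends completed responses back to the fetcher side.
    val responder = new Stage[ShuffleResponse, Unit]("Responder", 4, 1024,
      resp => { /* write resp.data to the connection */ None }, None)
    // Handler stage: reads the requested block and hands it to the Responder.
    val handler = new Stage[ShuffleRequest, ShuffleResponse]("Handler", 4, 1024,
      req => Some(ShuffleResponse(req.blockId, new Array[Byte](0))), Some(responder))
    responder.start(); handler.start()
    handler.enqueue(ShuffleRequest("shuffle_0_1_2"))       // e.g., forwarded from the Receiver stage
  }
}
```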
RDMA-based Shuffle Engine
• Connection Management
  – Alternative designs
    • Pre-connection
      – Hides the overhead in the initialization stage
      – Before the actual communication, pre-connects processes to each other
      – Sacrifices more resources to keep all of these connections alive
    • Dynamic Connection Establishment and Sharing
      – A connection is established if and only if an actual data exchange is going to take place
      – A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
      – Advanced dynamic connection scheme that reduces the number of connection establishments
      – Spark uses a multi-threading approach to support multi-task execution in a single JVM → good chance to share connections! (see the sketch below)
      – How long should a connection be kept alive for possible re-use?
      – Timeout mechanism for connection teardown
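A minimal sketch of dynamic connection establishment with sharing and an idle timeout, assuming a simple per-destination pool; the real engine manages verbs-level connections through JNI, which is omitted here.

```scala
import scala.collection.mutable

// Illustrative per-destination connection pool: a connection is created only
// when a task actually needs to fetch from that node, is shared by all tasks
// (threads) in the JVM targeting the same node, and is torn down after it has
// been idle longer than the timeout.
class RdmaConnection(val host: String) {
  @volatile var lastUsed: Long = System.currentTimeMillis()
  def close(): Unit = { /* release verbs-level resources via JNI in the real engine */ }
}

class ConnectionPool(idleTimeoutMs: Long) {
  private val pool = mutable.HashMap[String, RdmaConnection]()

  // Establish on demand, then share: all tasks fetching from `host` reuse it.
  def get(host: String): RdmaConnection = synchronized {
    val conn = pool.getOrElseUpdate(host, new RdmaConnection(host))
    conn.lastUsed = System.currentTimeMillis()
    conn
  }

  // Called periodically by a reaper thread: destroy connections that have
  // been idle longer than the timeout so resources are not held forever.
  def reapIdle(): Unit = synchronized {
    val now = System.currentTimeMillis()
    val idle = pool.filter { case (_, c) => now - c.lastUsed > idleTimeoutMs }
    idle.foreach { case (host, c) => pool.remove(host); c.close() }
  }
}
```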
RDMA-based Shuffle Engine
• Data Transfer
  – Each connection is used by multiple tasks (threads) to transfer data concurrently
  – Packets over the same communication lane go to different entities on both the server and fetcher sides
  – Alternative designs
    • Perform sequential transfers of complete blocks over a communication lane → keeps the order
      – Causes long wait times for tasks that are ready to transfer data over the same connection
    • Non-blocking and out-of-order data transfer (see the sketch below)
      – Chunking data blocks
      – Non-blocking sending over shared connections
      – Out-of-order packet communication
      – Guarantees both performance and ordering
      – Works efficiently with the dynamic connection management and sharing mechanism
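A minimal sketch of the chunking and reassembly idea: each packet carries a request ID, a sequence number, and a last-chunk flag, so packets from different tasks can interleave on a shared connection and still be reassembled in order on the fetcher side. The packet layout and class names are illustrative assumptions.

```scala
import scala.collection.mutable

// Illustrative chunking for out-of-order transfer over a shared connection.
case class Packet(requestId: Long, seqNo: Int, last: Boolean, payload: Array[Byte])

object Chunker {
  def split(requestId: Long, block: Array[Byte], chunkSize: Int): Iterator[Packet] = {
    val chunks = block.grouped(chunkSize).toArray
    chunks.iterator.zipWithIndex.map { case (c, i) =>
      Packet(requestId, i, i == chunks.length - 1, c)
    }
  }
}

// Fetcher-side reassembly: packets may arrive out of order and interleaved
// across requests; a block is complete once its last chunk and all earlier
// sequence numbers have been seen.
class Reassembler {
  private val pending = mutable.HashMap[Long, mutable.HashMap[Int, Array[Byte]]]()
  private val lastSeq = mutable.HashMap[Long, Int]()

  def onPacket(p: Packet): Option[Array[Byte]] = {
    val chunks = pending.getOrElseUpdate(p.requestId, mutable.HashMap[Int, Array[Byte]]())
    chunks(p.seqNo) = p.payload
    if (p.last) lastSeq(p.requestId) = p.seqNo
    lastSeq.get(p.requestId) match {
      case Some(n) if chunks.size == n + 1 =>            // all chunks present
        pending.remove(p.requestId); lastSeq.remove(p.requestId)
        Some((0 to n).toArray.flatMap(chunks(_)))        // reassemble in order
      case _ => None                                     // block not complete yet
    }
  }
}
```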
RDMA-based Shuffle Engine
• Buffer Management
  – On-JVM-Heap vs. Off-JVM-Heap Buffer Management
  – Off-JVM-Heap
    • High performance through native I/O
    • Shadow buffers in Java/Scala
    • Registered for RDMA communication
  – Flexibility for upper-layer design choices (see the sketch below)
    • Support the connection sharing mechanism → Request ID
    • Support in-order packet processing → Sequence number
    • Support non-blocking send → Buffer flag + callback
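A minimal sketch of the off-JVM-heap idea: direct buffers are allocated once, (conceptually) registered with the native RDMA engine through JNI, and tagged with the metadata the upper layer needs (request ID, sequence number, in-use flag, send-completion callback). The commented-out JNI call is a placeholder, not the actual engine API.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

// Illustrative off-heap buffer pool. Direct ByteBuffers live outside the JVM
// heap, so the native side can register them for RDMA and access them without
// copies; the Scala objects below are only "shadow" handles carrying metadata.
class RdmaBuffer(val buf: ByteBuffer) {
  var requestId: Long = 0L                    // which shared-connection request owns this buffer
  var seqNo: Int = 0                          // position of this chunk within the request
  @volatile var inUse: Boolean = false
  var onSendComplete: () => Unit = () => ()   // callback fired when a non-blocking send finishes
}

class BufferPool(count: Int, size: Int) {
  private val free = new ConcurrentLinkedQueue[RdmaBuffer]()
  (1 to count).foreach { _ =>
    val b = new RdmaBuffer(ByteBuffer.allocateDirect(size))
    // registerWithRdmaEngine(b.buf)          // hypothetical JNI call: pin + register memory
    free.add(b)
  }

  def acquire(requestId: Long, seqNo: Int): Option[RdmaBuffer] =
    Option(free.poll()).map { b =>
      b.requestId = requestId; b.seqNo = seqNo; b.inUse = true; b.buf.clear(); b
    }

  def release(b: RdmaBuffer): Unit = { b.inUse = false; free.add(b) }
}
```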
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Setup
• Hardware
  – Intel Westmere Cluster (A)
    • Up to 9 nodes
    • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 24 GB main memory
    • Mellanox QDR HCAs (32Gbps) + 10GigE
  – TACC Stampede Cluster (B)
    • Up to 17 nodes
    • Intel Sandy Bridge (E5-2680) dual octa-core processors, running at 2.70GHz, 32 GB main memory
    • Mellanox FDR HCAs (56Gbps)
• Software
  – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  – GroupBy Test
Performance Evaluation on Cluster A

[Figure: GroupBy time for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps) — left: 4 Worker Nodes, 32 Cores (32M 32R), 4–10 GB; right: 8 Worker Nodes, 64 Cores (64M 64R), 8–20 GB]

• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and up to 34% over 10GigE
HotI 2014Performance Evaluation on Cluster B
35
50 IPoIB (56Gbps) 83% IPoIB (56Gbps) 79%
RDMA (56Gbps) 30 RDMA (56Gbps)
GroupBy Time (sec)
GroupBy Time (sec)
40 25
30 20
15
20
10
10
5
0 0
8 12 16 20 16 24 32 40
Data Size (GB) Data Size (GB)
8
Worker
Nodes,
128
Cores,
(128M
128R)
16
Worker
Nodes,
256
Cores,
(256M
256R)
• For 128 cores, up to 83% over IPoIB (56Gbps)
• For 256 cores, up to 79% over IPoIB (56Gbps)
25
Performance Analysis on Cluster B
• Java-based micro-benchmark comparison

                   IPoIB (56Gbps)   RDMA (56Gbps)
  Peak Bandwidth   1741.46 MBps     5612.55 MBps

• Benefit of the RDMA Connection Sharing Design

[Figure: GroupBy time for RDMA without vs. with connection sharing (56Gbps), 16–40 GB on 16 nodes]

  – By enabling connection sharing, we achieve a 55% performance benefit for the 40GB data size on 16 nodes
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Conclusion and Future Work
• Three major conclusions
  – RDMA and high-performance interconnects can benefit Spark.
  – The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
  – Spark applications can benefit from an RDMA-enhanced design.
• Future Work
  – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  – Will make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/