Accelerating Spark with RDMA for Big Data Processing: Early Experiences
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
HotI 2014
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Digital Data Explosion in the Society
[Figure source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode]
Big Data Technology - Hadoop
• Apache Hadoop is one of the most popular Big Data technologies
  – Provides frameworks for large-scale, distributed data storage and processing
  – MapReduce, HDFS, YARN, RPC, etc.
• Hadoop 1.x stack: MapReduce (cluster resource management & data processing) on top of the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ...)
• Hadoop 2.x stack: MapReduce and other models (data processing) on top of YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ...)
Big Data Technology - Spark
• An emerging in-memory processing framework
  – Iterative machine learning
  – Interactive data analytics
  – Scala-based implementation
  – Master-slave architecture; uses HDFS and ZooKeeper (Driver with SparkContext, Master, Workers)
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs), in contrast to narrow dependencies
  – MapReduce-like shuffle operations to repartition RDDs
  – As in Hadoop, communication is sockets-based
Common Protocols using OpenFabrics
[Protocol stack figure: applications access the network through either the Sockets or the Verbs interface; the underlying options are kernel-space TCP/IP (1/10/40 GigE over Ethernet adapters/switches, IPoIB over InfiniBand adapters/switches), hardware-offloaded TCP/IP (10/40 GigE-TOE), and user-space protocols (RSockets, SDP, iWARP, RoCE), with native IB Verbs providing RDMA over InfiniBand adapters and switches]
Previous Studies
• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
  – Hadoop acceleration with RDMA
    • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14
    • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC '12
    • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS '14
    • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC '13
    • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP '13
  – HBase acceleration with RDMA
    • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
  – Memcached acceleration with RDMA
    • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop
• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE, and traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
  – Based on Apache Hadoop 1.2.1
  – Compliant with Apache Hadoop 1.2.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs
• RDMA for Apache Hadoop 2.x 0.9.1 has also been released in HiBD!
• http://hibd.cse.ohio-state.edu
Performance Benefits - RandomWriter & Sort on SDSC Gordon
[Charts: job execution time (sec) for IPoIB (QDR) vs. OSU-IB (QDR) at 50 GB on a 16-node cluster, 100 GB on 32 nodes, 200 GB on 48 nodes, and 300 GB on 64 nodes]
• RandomWriter in a cluster with a single SSD per node
  – 20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 36% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
  – 16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Can RDMA also benefit Apache Spark on high-performance networks?
Problem Statement
• Is it worth it?
  – Is the performance improvement potential high enough if we can successfully adapt RDMA to Spark?
  – A few percentage points, or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
  – Can RDMA be adapted to suit the communication needs of Spark?
  – Is it viable to rewrite portions of the Spark code with RDMA?
• Can Spark applications benefit from an RDMA-enhanced design?
  – What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  – Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
• Assessment methodology
  – Evaluation on primitive-level micro-benchmarks (a sketch follows below)
    • Latency, bandwidth
    • 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
    • Java/Scala-based environment
  – Evaluation on typical workloads
    • GroupByTest
    • 1GigE, 10GigE, IPoIB (32Gbps)
    • Spark 0.9.1

Evaluation on Primitive-level Micro-benchmarks
[Charts: latency vs. message size for small (1 B to 4 KB, in us) and large (8 KB to 4 MB, in ms) messages, and peak bandwidth (MBps), for 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
• IPoIB and 10GigE are obviously much better than 1GigE
• Compared to the other interconnects/protocols, RDMA can significantly
  – reduce the latencies for all message sizes
  – improve the peak bandwidth
• Can these benefits of high-performance networks be achieved in Apache Spark?
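The slides report only the measurement results, not the benchmark code. As a rough illustration of what a primitive-level, Java/Scala-based latency micro-benchmark over sockets could look like (it can be pointed at a 1GigE, 10GigE, or IPoIB interface by choosing the server address), here is a minimal sketch; the host, port, message size, and iteration count are illustrative values, not the authors' configuration.

```scala
// Minimal sketch (not the authors' benchmark): a sockets-based ping-pong
// latency test. The client sends a fixed-size message and waits for the echo;
// half of the average round-trip time approximates the one-way latency.
import java.net.{ServerSocket, Socket}

object PingPongLatency {
  val Port = 9999          // illustrative
  val MsgSize = 4096       // illustrative message size in bytes
  val Iters = 10000        // illustrative iteration count

  def server(): Unit = {
    val listener = new ServerSocket(Port)
    val sock = listener.accept()
    val (in, out) = (sock.getInputStream, sock.getOutputStream)
    val buf = new Array[Byte](MsgSize)
    for (_ <- 1 to Iters) {                       // echo every message back
      var read = 0
      while (read < MsgSize) read += in.read(buf, read, MsgSize - read)
      out.write(buf); out.flush()
    }
    sock.close(); listener.close()
  }

  def client(host: String): Unit = {
    val sock = new Socket(host, Port)
    val (in, out) = (sock.getInputStream, sock.getOutputStream)
    val buf = new Array[Byte](MsgSize)
    val start = System.nanoTime()
    for (_ <- 1 to Iters) {
      out.write(buf); out.flush()                 // ping
      var read = 0
      while (read < MsgSize) read += in.read(buf, read, MsgSize - read) // pong
    }
    val avgUs = (System.nanoTime() - start) / 1000.0 / Iters / 2
    println(f"Average one-way latency: $avgUs%.2f us for $MsgSize-byte messages")
    sock.close()
  }

  def main(args: Array[String]): Unit =
    if (args.nonEmpty && args(0) == "server") server()
    else client(if (args.length > 1) args(1) else "localhost")
}
```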
Evaluation on Typical Workloads
[Charts: GroupBy time (sec) vs. data size (4 to 10 GB) for 1GigE, 10GigE, and IPoIB (32Gbps), with annotated improvements of 65% and 22%; network throughput (MB/s) over sample time for receive and send]
• For high-performance networks,
  – the execution time of GroupBy is significantly improved (a sketch of such a GroupBy workload follows below)
  – the network throughputs are much higher for both send and receive
• Spark can benefit from high-performance interconnects to a greater extent!
• Can RDMA further benefit Spark performance compared with IPoIB and 10GigE?
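The slides name GroupByTest as the workload but do not show its code. The sketch below is written in the spirit of a GroupBy-style Spark job: each mapper generates random key/value pairs and groupByKey then forces a full shuffle whose communication cost dominates the job. The mapper/reducer counts, pair counts, and value sizes are illustrative only, not the configuration used in the paper.

```scala
// Sketch of a GroupBy-style shuffle-heavy workload (illustrative parameters).
import java.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // PairRDDFunctions (groupByKey) in Spark 0.9.x

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupBySketch"))
    val numMappers  = 32          // roughly one map task per core (illustrative)
    val numReducers = 32
    val numPairs    = 1000        // key/value pairs generated per mapper
    val valSize     = 100 * 1024  // 100 KB values -> shuffle-heavy job

    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      (0 until numPairs).map { _ =>
        val value = new Array[Byte](valSize)
        rand.nextBytes(value)
        (rand.nextInt(numPairs), value)  // random key, random payload
      }
    }.cache()
    pairs.count()                        // materialize the input before timing

    val start = System.currentTimeMillis()
    pairs.groupByKey(numReducers).count()   // wide dependency: the shuffle happens here
    println(s"GroupBy time: ${(System.currentTimeMillis() - start) / 1000.0} sec")

    sc.stop()
  }
}
```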
Architecture Overview
[Architecture figure: Spark applications (Scala/Java/Python) run tasks on Spark (Scala/Java); at the BlockManager/BlockFetcherIterator layer, the Shuffle Server and Shuffle Fetcher each have Java NIO (default), Netty (optional), and RDMA (plug-in) implementations; the sockets-based paths run over Java sockets on 1/10 GigE or IPoIB (QDR/FDR), while the RDMA plug-ins use the RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR)]
• Design goals
  – High performance
  – Keep the existing Spark architecture and interface intact
  – Minimal code changes
• Approaches (a sketch of the plug-in idea follows below)
  – Plug-in based approach to extend the shuffle framework in Spark
    • RDMA Shuffle Server + RDMA Shuffle Fetcher
    • ~100 lines of code changes inside the original Spark files
  – RDMA-based Shuffle Engine
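The slides quantify the change as roughly 100 lines inside Spark's shuffle path but do not show code. The following sketch only illustrates the plug-in idea: a fetcher implementation is selected from a configuration property so an RDMA plug-in can sit next to the default Java NIO and optional Netty paths. All class and property names here (ShuffleFetcher, RdmaShuffleFetcher, spark.shuffle.fetcher.impl) are hypothetical and are not Spark's actual internal API.

```scala
// Illustrative plug-in selection (hypothetical names, not Spark internals).
trait ShuffleFetcher {
  def fetch(blockIds: Seq[String], remoteHost: String, remotePort: Int): Iterator[Array[Byte]]
}

class NioShuffleFetcher extends ShuffleFetcher {
  def fetch(blockIds: Seq[String], host: String, port: Int) = Iterator.empty // default sockets path
}

class RdmaShuffleFetcher extends ShuffleFetcher {
  def fetch(blockIds: Seq[String], host: String, port: Int) = Iterator.empty // would delegate to the JNI engine
}

object ShuffleFetcherFactory {
  // e.g. -Dspark.shuffle.fetcher.impl=RdmaShuffleFetcher (hypothetical property name)
  def create(): ShuffleFetcher =
    sys.props.getOrElse("spark.shuffle.fetcher.impl", "NioShuffleFetcher") match {
      case "RdmaShuffleFetcher" => new RdmaShuffleFetcher
      case _                    => new NioShuffleFetcher
    }
}
```

The point of this structure is that callers only see the ShuffleFetcher interface, which is why the existing Spark architecture and interfaces can stay intact.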
SEDA-based Data Shuffle Plug-ins
[Figure: a Listener accepts connections, a Receiver receives requests and enqueues them on a request queue, Handlers dequeue requests and read block data through the RDMA-based Shuffle Engine, and Responders dequeue from a response queue and send responses; a connection pool is shared across the stages]
• SEDA: Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on the corresponding queue
  – Listener, Receiver, Handlers, Responders
• Admission control is performed on these event queues
• High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark (a sketch of one stage follows below)
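A minimal sketch of how one SEDA stage could be expressed, assuming a bounded event queue for admission control and a fixed thread pool per stage; the Stage, BlockRequest, and BlockResponse names are hypothetical and not taken from the paper.

```scala
// One SEDA stage (illustrative): a bounded queue plus a dedicated thread pool,
// so consecutive stages (Receiver -> Handler -> Responder) overlap in time.
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

class Stage[E](name: String, queueCapacity: Int, numThreads: Int)(handle: E => Unit) {
  private val queue = new ArrayBlockingQueue[E](queueCapacity)  // bounded => admission control
  private val pool  = Executors.newFixedThreadPool(numThreads)
  @volatile private var running = true

  // The previous stage enqueues events; offer() can reject when the queue is full.
  def submit(event: E): Boolean = queue.offer(event, 100, TimeUnit.MILLISECONDS)

  // Each worker thread repeatedly dequeues and processes events.
  for (_ <- 1 to numThreads) pool.execute(new Runnable {
    def run(): Unit = while (running) {
      val e = queue.poll(100, TimeUnit.MILLISECONDS)
      if (e != null) handle(e)
    }
  })

  def shutdown(): Unit = { running = false; pool.shutdown() }
}

// Example wiring (hypothetical event types): a Handler stage that reads block
// data and forwards a response event to a Responder stage.
case class BlockRequest(blockId: String)
case class BlockResponse(blockId: String, data: Array[Byte])

object SedaExample {
  val responder = new Stage[BlockResponse]("responder", 1024, 4)(r => println(s"send ${r.blockId}"))
  val handler   = new Stage[BlockRequest]("handler", 1024, 4)(req =>
    responder.submit(BlockResponse(req.blockId, new Array[Byte](0))))
}
```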
RDMA-based Shuffle Engine
• Connection management
  – Alternative designs
    • Pre-connection
      – Hides the overhead in the initialization stage
      – Processes are pre-connected to each other before the actual communication
      – Sacrifices more resources to keep all of these connections alive
    • Dynamic connection establishment and sharing
      – A connection is established if and only if an actual data exchange is going to take place
      – A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
      – An advanced dynamic connection scheme reduces the number of connection establishments
      – Spark uses a multi-threading approach to support multi-task execution in a single JVM → a good chance to share connections!
      – How long should a connection be kept alive for possible re-use? A timeout mechanism destroys idle connections (see the sketch below)
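A sketch of the dynamic connection establishment, sharing, and timeout idea, assuming a JVM-wide pool keyed by remote endpoint. The RdmaConnection class and the 30-second idle timeout are illustrative stand-ins for the native (JNI) endpoint handle and for the paper's unspecified timeout value.

```scala
// Illustrative connection pool: connections are created lazily per remote
// endpoint, shared by all tasks (threads) in the JVM, and destroyed after a
// configurable idle period.
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

class RdmaConnection(val host: String, val port: Int) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = ()                    // would release native resources via JNI
}

class ConnectionPool(idleTimeoutMs: Long = 30000) {
  private val pool = new ConcurrentHashMap[(String, Int), RdmaConnection]()

  // Establish a connection only when a data exchange is about to happen;
  // concurrent tasks targeting the same endpoint share one connection.
  def get(host: String, port: Int): RdmaConnection = {
    val key = (host, port)
    var conn = pool.get(key)
    if (conn == null) {
      val fresh = new RdmaConnection(host, port)
      val prev = pool.putIfAbsent(key, fresh)
      conn = if (prev != null) { fresh.close(); prev } else fresh
    }
    conn.lastUsedMs = System.currentTimeMillis()
    conn
  }

  // Periodically called (e.g. by a timer thread) to destroy idle connections.
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    for ((key, conn) <- pool.asScala if now - conn.lastUsedMs > idleTimeoutMs) {
      pool.remove(key, conn)
      conn.close()
    }
  }
}
```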
RDMA-based Shuffle Engine
• Data transfer
  – Each connection is used by multiple tasks (threads) to transfer data concurrently
  – Packets over the same communication lane go to different entities on both the server and fetcher sides
  – Alternative designs
    • Sequential transfers of complete blocks over a communication lane → keeps the order
      – Causes long wait times for tasks that are ready to transfer data over the same connection
    • Non-blocking and out-of-order data transfer (see the sketch below)
      – Chunking of data blocks
      – Non-blocking sending over shared connections
      – Out-of-order packet communication
      – Guarantees both performance and ordering
      – Works efficiently with the dynamic connection management and sharing mechanism
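A sketch of the chunking and out-of-order reassembly idea, assuming each chunk carries a request ID and a sequence number as described on the next slide. The Chunk and Reassembler types and the 64 KB chunk size are illustrative, not the engine's actual wire format.

```scala
// Illustrative chunking/reordering: a block is split into tagged chunks that
// can be interleaved with other requests on a shared connection, and the
// receiver reassembles them correctly even if they arrive out of order.
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object Chunking {
  val ChunkSize = 64 * 1024   // illustrative chunk size

  // Sender side: split one shuffle block into tagged chunks.
  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = math.max(1, (block.length + ChunkSize - 1) / ChunkSize)
    (0 until total).map { i =>
      val from  = i * ChunkSize
      val until = math.min(block.length, from + ChunkSize)
      Chunk(requestId, i, total, block.slice(from, until))
    }
  }
}

// Receiver side: per-request reassembly buffer; chunks may arrive in any order.
class Reassembler {
  private val pending = scala.collection.mutable.Map[Long, Array[Array[Byte]]]()

  // Returns Some(block) once the last missing chunk of that request arrives.
  def offer(c: Chunk): Option[Array[Byte]] = synchronized {
    val slots = pending.getOrElseUpdate(c.requestId, new Array[Array[Byte]](c.totalChunks))
    slots(c.seqNo) = c.payload
    if (slots.forall(_ != null)) {          // all chunks present -> rebuild the block
      pending.remove(c.requestId)
      Some(slots.flatten)
    } else None
  }
}
```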
RDMA-based Shuffle Engine
• Buffer management
  – On-JVM-heap vs. off-JVM-heap buffer management
  – Off-JVM-heap
    • High performance through native I/O
    • Shadow buffers in Java/Scala
    • Registered for RDMA communication
  – Flexibility for upper-layer design choices (see the sketch below)
    • Supports the connection sharing mechanism → request ID
    • Supports in-order packet processing → sequence number
    • Supports non-blocking send → buffer flag + callback
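A sketch of the off-JVM-heap idea, assuming direct ByteBuffers act as the Java/Scala "shadow" of memory that the native engine registers for RDMA. The header layout, the ShuffleBuffer class, and the nativeSend JNI call mentioned in the comments are hypothetical, not the paper's implementation.

```scala
// Illustrative off-heap buffer with request ID / sequence number header and a
// buffer flag + callback for non-blocking send completion.
import java.nio.ByteBuffer
import java.util.concurrent.atomic.AtomicBoolean

class ShuffleBuffer(capacity: Int) {
  // Off-heap allocation; the native engine would register this memory for RDMA.
  val buf: ByteBuffer = ByteBuffer.allocateDirect(capacity)
  val inFlight = new AtomicBoolean(false)            // buffer flag for non-blocking send
  @volatile private var onComplete: () => Unit = () => ()

  // Pack header (request ID for connection sharing, sequence number for
  // ordering) followed by the chunk payload.
  def fill(requestId: Long, seqNo: Int, payload: Array[Byte]): Unit = {
    buf.clear()
    buf.putLong(requestId).putInt(seqNo).putInt(payload.length)
    buf.put(payload)
    buf.flip()
  }

  // Non-blocking send: mark the buffer busy, hand it to the native engine, and
  // let the completion handler release it and invoke the callback.
  def sendAsync(callback: () => Unit): Boolean = {
    if (!inFlight.compareAndSet(false, true)) return false  // still in use
    onComplete = callback
    // A nativeSend(buf) JNI call into the RDMA-based Shuffle Engine would go here.
    true
  }

  // Called by the engine when the RDMA send completes.
  def completeSend(): Unit = {
    inFlight.set(false)
    onComplete()
  }
}
```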
Experimental Setup
• Hardware
  – Intel Westmere cluster (A)
    • Up to 9 nodes
    • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs and 24 GB main memory
    • Mellanox QDR HCAs (32Gbps) + 10GigE
  – TACC Stampede cluster (B)
    • Up to 17 nodes
    • Intel Sandy Bridge (E5-2680) dual octa-core processors running at 2.70 GHz, 32 GB main memory
    • Mellanox FDR HCAs (56Gbps)
• Software
  – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  – GroupBy test
Performance Evaluation on Cluster A
[Charts: GroupBy time (sec) vs. data size for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); left: 4 worker nodes, 32 cores (32M 32R), 4 to 10 GB; right: 8 worker nodes, 64 cores (64M 64R), 8 to 20 GB]
• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and 34% over 10GigE
Performance Evaluation on Cluster B
[Charts: GroupBy time (sec) vs. data size for IPoIB (56Gbps) and RDMA (56Gbps); left: 8 worker nodes, 128 cores (128M 128R), 8 to 20 GB; right: 16 worker nodes, 256 cores (256M 256R), 16 to 40 GB]
• For 128 cores, up to 83% improvement over IPoIB (56Gbps)
• For 256 cores, up to 79% improvement over IPoIB (56Gbps)
Performance Analysis on Cluster B
• Java-based micro-benchmark comparison of peak bandwidth
  – IPoIB (56Gbps): 1741.46 MBps
  – RDMA (56Gbps): 5612.55 MBps
• Benefit of the RDMA connection sharing design
  [Chart: GroupBy time (sec) vs. data size (16 to 40 GB) for RDMA without and with connection sharing (56Gbps)]
  – By enabling connection sharing, a 55% performance benefit is achieved for the 40 GB data size on 16 nodes
Conclusion and Future Work
• Three major conclusions
  – RDMA and high-performance interconnects can benefit Spark.
  – The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
  – Spark applications can benefit from an RDMA-enhanced design.
• Future work
  – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  – Make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/