Trusted IPU exaPOD by Graphcore and Atos.
Why AI matters

The datafication and connection of all things, alongside a massive rise in computing power, is making AI a reality now: turning data into valuable insights and creating tremendous opportunities for business. Because data is so critical, it must be protected, regulated and efficiently processed. As the European leader in Cloud, Big Data and High Performance Computing, Atos brings the necessary expertise and technology solutions to make AI a reality for your business.

Why now? 3 key factors
• Everything is a connected device producing data
• Structured and unstructured data are unified as knowledge
• Machine learning is affordable with today's computing power

Supercomputing powers the AI revolution.
Projected AI Spending by Industry, 2021 (Source: atos.net)
• Banking, Financial Services & Insurance: $12 bn
• Manufacturing: $9.5 bn
• Retail: $9.3 bn
• Public Sector: $8.9 bn
• Healthcare: $5.3 bn

40% of new enterprise applications implemented by service providers will include smart machine technologies by 2021. (Source: Gartner)

The global artificial intelligence market size was valued at USD 39.9 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 42.2% from 2020 to 2027. (Source: Grand View Research)
Architecture. Software Stack.

Proposed architecture (diagram summary):
• Service nodes: login/dev nodes and management nodes, with per-tenant FastML application service VMs (Tenant 1 … Tenant n) serving AI users over the customer network
• Networks: high-speed 100GbE IPU fabric, 100GbE storage/HA network, 10Gb/s Ethernet management network, 1Gb BMC network
• AI compute: 32x IPU-POD64 (512x IPU-M2000, 2048x GC200 IPUs)
• Compute/Poplar: 128x BullSequana X440-A5 dual-socket servers (4096 AMD cores)
• Parallel storage: 8x DDN AI400X with 192 NVMe SSDs, 2.5 PB raw storage
Compute
• 128x Poplar Atos Servers
• 512x IPU Machines

100GbE High-Speed IPU-Fabric
• Maximizes bandwidth across IPU-Links

High-Bandwidth Parallel Storage
• 2.5 PB NVMe
• 384GB/s read, 240GB/s write

Service & Management Nodes
• Multi-tenant service VM (Atos Codex AI suite)
• Cluster Manager (monitoring/provisioning)
• Slurm Job Scheduler (Poplar server & IPU allocation)

Poplar Software Stack
Technology. Management Software

Smart Management Center xScale

Aim for no downtime: keep production running, whatever happens!
• Market-leading components with Atos & Red Hat expertise
• Native health-checking design (including some multi-failure support)
• Transparent update/upgrade/extension, as well as rollback (HW & SW)
• Native load balancing
• Modification tracking and rollback
• Disaster recovery procedures for key components

Modifiable & extensible open environment
• API-based design so components can be replaced or extended
• Market-standard components with large communities

Scale to Exascale: make administrators' day-to-day operations easier
• A single management system regardless of cluster size, e.g. a 10k-node or 100k-node system
• A fully centralized management system makes it easy to upgrade towards Exascale, eliminating the complexity of system extension
Fast ML Engine
• HPC compatible: deploy AI workloads on top of an HPC scheduler
• Graphcore enhanced: leverage the best AI frameworks and Poplar® software
• MLOps: improve security

Fast ML Engine features: training management, model management, framework management, dataset management, hyperparameter optimization, distributed training, monitoring management, notebooks management, security, containers support.

Graphcore Technology: IPU Advantage

Massive performance leap
• Up to 10–50x faster on training and inference
• Model held inside the processor
• 100x memory bandwidth

Much more flexible
• Every network type supported efficiently
• Small batch sizes for fewer epochs
• Latency reduced by over 10x

Easy to use
• ML framework support
• Poplar® software stack

[Figure: AlexNet network visualization from Poplar®]
Reference Design. IPU-POD Switched Ethernet Approach

IPU-POD switched implementation:
• IPU-POD64 supports 64K IPUs in a direct rack-to-rack architecture
• Alternatively, 16K GC200 IPUs through a switched architecture
• 4.1 ExaFLOPs FP16.16 (256x 16 PFLOPs)
• 1.8 PetaBytes of IPU-M2000 memory (256x 7TB)
• 16 switch planes, non-redundant
• Additional 16 planes for high availability
• 32x 256-port 100GbE switches connecting IPU-POD64 #1 through IPU-POD64 #256
Graphcore IPU Scaleout

• GC200 processor: 1 IPU
• IPU-M2000 shelf (x4 GC200): 4 IPUs, 1 PetaFlops
• IPU-POD64 rack (x16 IPU-M2000): 64 IPUs, 16 PetaFlops
• IPU-POD64k (x1024 IPU-POD64, 1024 racks): 64k IPUs, 16 ExaFlops

This deployment uses 32x IPU-POD64, plus storage and utility racks.
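The scale-out ladder above is multiplicative, so the headline figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not Graphcore tooling; the tier multipliers (4 GC200s per M2000, 16 M2000s per POD64, 1024 PODs per POD64k, 1 PetaFlops per M2000) are taken directly from the table.

```python
# Sketch: rebuild the IPU scale-out ladder from the brochure's multipliers.
GC200_PFLOPS = 1.0 / 4  # one IPU-M2000 (4x GC200) delivers 1 PetaFlops

tiers = {"IPU-M2000": {"ipus": 4, "pflops": 4 * GC200_PFLOPS}}
tiers["IPU-POD64"] = {"ipus": 16 * tiers["IPU-M2000"]["ipus"],
                      "pflops": 16 * tiers["IPU-M2000"]["pflops"]}
tiers["IPU-POD64k"] = {"ipus": 1024 * tiers["IPU-POD64"]["ipus"],
                       "pflops": 1024 * tiers["IPU-POD64"]["pflops"]}

for name, t in tiers.items():
    print(f"{name}: {t['ipus']} IPUs, {t['pflops']:.0f} PetaFlops")
# IPU-POD64k comes out at 65,536 IPUs and 16,384 PetaFlops, i.e. 16 ExaFlops.
```

Running this reproduces the 4 / 64 / 64k IPU counts and the 1 PetaFlops to 16 ExaFlops progression in the table.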
Industries

Data Center & Internet | Research | Higher Education | Automotive | Healthcare | Finance
Performance. Overall Solution: 512 AI PFLOPs FP16.16 System

One full AI data management solution
• 32 IPU-POD64, each based on 16 Graphcore IPU-M2000
• 8x DDN AI400X
• 2.5 PB NVMe
• 384GB/s read, 240GB/s write

One high-speed interconnect
• Arista Ethernet 100GbE
• Fat-tree non-blocking topology
• One IPU compute fabric & one storage/management fabric

Management software stack
• Atos software management stack
• Atos AI software for multi-tenant user service
• Slurm Job Scheduler

BullSequana X Series | Graphcore IPU-POD
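The system totals quoted above follow from the per-rack figures. As a back-of-the-envelope sketch (all inputs are the brochure's own numbers, not measurements), the 512 PFLOPs headline and the per-appliance storage bandwidth can be derived like this:

```python
# Sketch: derive the exaPOD headline figures from per-component specs.
pods = 32
pflops_per_pod64 = 16       # 16x IPU-M2000 at 1 PetaFlops each
m2000_per_pod = 16
ipus_per_m2000 = 4

total_pflops = pods * pflops_per_pod64      # 512 AI PFLOPs FP16.16
total_m2000 = pods * m2000_per_pod          # 512 IPU-M2000
total_ipus = total_m2000 * ipus_per_m2000   # 2048 GC200 IPUs

# Storage: 8x DDN AI400X jointly deliver the quoted aggregate bandwidth.
appliances = 8
read_gbs, write_gbs = 384, 240
read_per_appliance = read_gbs / appliances   # 48 GB/s read per AI400X

print(total_pflops, total_m2000, total_ipus, read_per_appliance)
```

The result matches the architecture section: 512 PFLOPs from 512 IPU-M2000 (2048 GC200 IPUs), with each AI400X contributing roughly 48 GB/s of the 384 GB/s aggregate read bandwidth.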
Graphcore Benchmarks

State-of-the-art performance with today's large, complex models.

Natural Language Processing - BERT-Large: Training

The IPU delivers over 80% faster time-to-train with the BERT language model, training BERT-Base in 14.5 hours on an IPU-POD64 system, with an estimated up to ~2x TCO advantage vs the leading competitive solution.

[Chart: BERT-Large time-to-train in hours, comparing IPU-POD64 vs 3x DGX A100, IPU-POD128 vs 6x DGX A100, and IPU-POD512 vs a DGX A100 SuperPOD SU.]

NOTES: BERT-Large using Wikipedia dataset + SQuAD 1.1. POD64 (16x IPU-M2000 server) using PopART (SDK 1.3) | mixed precision, Ph1 SL=128, Ph2 SL=384. DGX A100 (8x A100 GPU) results using PyTorch | mixed precision, published at https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_pytorch/performance

Computer Vision - EfficientNet-B0: Inference

The Graphcore IPU-M2000 achieves 80x higher throughput and 20x lower latency compared to a leading alternative processor. High throughput at the lowest possible latency is key in many of today's important use cases, such as visual search engines and medical imaging.

[Chart: EfficientNet-B0 inference, latency (ms) vs throughput (images/sec), showing the IPU-M2000 at >80x higher throughput and >20x lower latency.]

NOTES: EfficientNet-B0 | synthetic data | headline comparisons using lowest latency. 1x IPU-M2000 using TensorFlow | FP16.16 | (SDK 1.3.0+272), Applications Bundle 'm2000-applications-01' | batch size 4 through 144 | assumes linear scaling. GPU: 1x V100 (FP32) using TensorFlow & published Google reference, batch size 1-32, measured using the public Google repo (https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/). A100 results unavailable.
AI Supercomputing. Sharing our knowledge with you

• Building the AI ecosystem: increase your productivity
• Speeding up AI production: explore next-gen computing
• Enabling new breakthroughs in AI: increase your knowledge
The World's Best AI Supercomputer

Trusted IPU exaPOD by Graphcore and Atos

Scalable Servers | Scalable AI | Scalable Storage
Atos Information Technology(S) Pte, Korea
#904, 94 Bldg., 94, Sambong-ro, Jongno-gu, Seoul, Korea 03158
Tel: 02-2160-8341 | steve.park@atos.net | www.atos.net

Graphcore Korea
Level 22, Two IFC, 10 Gukjegeumyung-ro, Yeongdeungpo-gu, Seoul, Korea 07326
Tel: 02-6138-3471 | minwook@graphcore.ai | www.graphcore.ai

Megazone Cloud Corp.
Megazone Bldg., 46, Nonhyeon-ro 85-gil, Gangnam-gu, Seoul, Korea 06235
Tel: 1644-2243 | matildalab@megazone.com | www.megazone.com