Super Computing 18, MC04 Building your own mini-CORAL : Power Accelerated Computing Platform - IBM LSF & HPC User Group @ SC18 - IBM ...

Page created by Ruby Morris
 
IBM LSF & HPC User Group @ SC18

          Super Computing 18, MC04 Building your own mini-CORAL:
          Power Accelerated Computing Platform

          IBM Systems Lab Services/ SC18 / November, 2018 / © 2018 IBM Corporation

         Agenda

                   • IBM Power Accelerated Computing Platform requirements
                   • Structure of Power Accelerated Computing Platform
                   • Lessons learned deploying large CORAL HPC Clusters
                   • How to get started with Power Accelerated Computing Platform
                   • Discussion


        IBM Power
        Accelerated Computing Platform
          IBM Power ACP gives clients their own AI
          installation based upon the world’s most
          powerful and smartest scientific
          supercomputer

          Supports
          • High Performance Computing (HPC)
          • Artificial Intelligence (AI)
          • Machine Learning / Deep Learning

          Based upon IBM CORAL!

          Natural markets: Research Labs, Universities,
          Government Labs, Military Research, Industry

         Questions?
          • How Do I Deploy AI at my Company?
          • I want to run Workloads and Experiments on Summit!
          • I want to explore Quantum Computing

         Complete Solutions for AI and Modern HPC
          – CORAL Servers (POWER9 – IBM Power System AC922)
          – Management Servers/Head Nodes
          – Networking: Ethernet and IB
          – Elastic Storage Server
          – Linux and Software Development tools
          – Pre-Sales/Install expert review by IBM Systems Lab Services
          – Hardware Configuration assembly in IBM facility
          – Software Installation and Configuration by IBM before delivery
          – Installation and connectivity support with IBM Systems Lab Services
          – Software Flexibility: HPC and/or PowerAI base or PowerAI Enterprise, and/or H2O

               Power AI Reference Architecture:
               https://ibm.ent.box.com/s/8w75cdh6s4smgix7ckoh4yisn06h93iw

         CORAL and Summit & Sierra

         CORAL = Collaboration of Oak Ridge, Argonne & Lawrence Livermore
         National Labs

         Summit, Ascent and Peak are cluster names at Oak Ridge

         Sierra, Lassen, Ansel and Butte are cluster names at Lawrence Livermore


          IBM POWER SYSTEM

         AC922
                   An Acceleration Superhighway
                   Unleash state of the art IO and
                   accelerated computing potential in
                   the post “CPU-only” era

                   Designed for the AI Era
                   Architected for the modern analytics
                   and AI workloads that fuel insights

                   Delivering Enterprise-Class AI
                   Flatten the time to AI value curve
                   by accelerating the journey to build,
                   train, and infer deep neural networks

         The POWER9 processor

          • 1st chip with PCIe4
          • ~1TB/s BW into chip
          • 7TB/s on-chip BW
          • 4GHz peak frequency
          • 8 billion transistors
          • >15 miles of wire
          • 17 levels of metal
          • >24B vias
          • 2x core performance vs x86
          • 1.5x performance vs POWER8
          • 2x more memory vs POWER8
          • 1.4x more memory bandwidth vs x86


         Watching Processors
         Evolve!

           HPC analyst Addison Snell (CEO of Intersect360
           Research) ….commented by email.

           “One, Power9 has excellent memory bandwidth
           and performance.

           Two, it is a great platform for attaching accelerators
           or co-processors. It’s an odd statement of
           direction, but maybe a visionary one,
           essentially saying a processor isn’t
           about computation per se, but rather
           it’s about feeding data to other
           computational elements.”


            IBM Power System AC922 - POWER9 with increased GPU and IO bandwidth for differentiation

                Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
              • 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server
              • PCIe Gen4 bus has double I/O Bandwidth vs. PCIe Gen3
              • CPU (Turbo) and GPU (Boost) modes are enabled to improve data center efficiency and to
                sustain high performance (3.3 GHz air-cooled / 3.45 GHz water-cooled).

                 High level System Overview
                  2-Socket, 2U Packaging
                  32, 40 (air) or 36,44 (water) P9 Processor cores
                  4 NVIDIA Volta V100 NVLink2 GPUs
                  2 TB Memory (16x - 128GB DIMMs)
                  4 PCIe Gen4 Slots
                  2x SFF (HDD/SSD), SATA, Up to 7.7 TB storage
                  Supports 1.6, 3.2 and 6.4TB NVMe Adapters
                  Redundant Hot Swap Power Supplies and Fans
                  Default 3 year 9x5 warranty, 100% CRU
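
As a quick sanity check on two of the figures above, the sketch below recomputes the memory capacity from the DIMM count and compares PCIe Gen3 with Gen4 link throughput. The transfer rates (8 GT/s and 16 GT/s) and the 128b/130b line encoding are taken from the PCI Express specifications rather than from this slide, and the helper names are illustrative only.

```python
# Illustrative check of the AC922 overview figures; helper names are hypothetical.

DIMM_COUNT = 16      # from the slide: 16x DIMMs
DIMM_SIZE_GB = 128   # from the slide: 128 GB per DIMM

# Assumption (PCI Express specifications, not the slide):
# Gen3 runs at 8 GT/s and Gen4 at 16 GT/s, both with 128b/130b encoding.
PCIE_GT_PER_S = {"gen3": 8.0, "gen4": 16.0}
ENCODING_EFFICIENCY = 128 / 130


def total_memory_tb(dimms: int = DIMM_COUNT, size_gb: int = DIMM_SIZE_GB) -> float:
    """Total memory in TB, using 1 TB = 1024 GB."""
    return dimms * size_gb / 1024


def pcie_gb_per_s(gen: str, lanes: int = 16) -> float:
    """Approximate one-direction bandwidth in GB/s for a PCIe link."""
    return PCIE_GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8


if __name__ == "__main__":
    print(f"Memory: {total_memory_tb():.0f} TB")
    for gen in ("gen3", "gen4"):
        print(f"PCIe {gen} x16: {pcie_gb_per_s(gen):.2f} GB/s per direction")
```

The Gen4 to Gen3 ratio comes out at exactly 2 because both generations share the same encoding and only the transfer rate changes.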

         IBM Spectrum LSF Suites
         Powerful Workload Management

         The suite delivers:
         •      Enhanced utilization of assets through effective
                scheduling and sharing policies
         •      Enhanced user productivity through ease of
                use, accessibility and simplification
         •      Operational efficiency through insight into how the
                HPC environment is being used
         Comprehensive GPU, Container and Hybrid Cloud
         Support
         The LSF Suite for HPC is available at no charge via
         the IBM Academic Initiative
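
To make the GPU support concrete, here is a minimal sketch that assembles an LSF `bsub` command line requesting GPUs. The `-gpu`, `-n` and `-q` options are standard `bsub` options, but whether `-gpu` is accepted depends on the LSF version and cluster configuration. The function name, the script path and the queue name are hypothetical, and the sketch only builds the command without submitting anything.

```python
import shlex


def build_bsub_command(script, gpus=4, slots=1, queue=None):
    """Return a `bsub` argument list that requests `gpus` GPUs for `script`.

    Hypothetical helper: it only assembles arguments and never calls LSF.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    cmd = ["bsub", "-n", str(slots), "-gpu", f"num={gpus}:mode=exclusive_process"]
    if queue:
        cmd += ["-q", queue]  # queue names are site-specific
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # "./train.sh" and "normal" are placeholders, not values from the slides.
    print(shlex.join(build_bsub_command("./train.sh", gpus=4, queue="normal")))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting.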


          AI Changes Everything for Data

        Diversity of Data
        – Local, HDFS, NFS, POSIX, Cloud
        Amount of Data
        – A Petabyte is just a starting point
        Delivery of Data
        – Gigabytes/Sec/Server to feed GPU


            IBM Spectrum Scale with Elastic Storage Server Family

        The IBM ESS Family
        The Storage Built for AI!

                • Over 1000 ESS Installed
                • Over 300 ESS customers
                • Over 5,000 Spectrum Scale clients
                • Five 9’s Reliability!

        IBM is the World Leader in Software Defined Storage Environments

               ESS Installation at ORNL
           77 ESS Systems delivering:
           •    Single Namespace up to 250 Petabytes
           •    2.5 TB/s large block sequential IO performance
           •    2.6M file creates/sec for 32KB files in unique
                directories
           •    50K file creates/sec to single shared directory
           •    Spectrum Scale RAID with declustered erasure
                coding
           •    16 GB/s of Data I/O to a Single Server
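
The aggregate figures above translate into rough per-system averages. The sketch below does that arithmetic, assuming capacity and bandwidth are spread evenly across the 77 ESS systems and using decimal units (1 TB = 1000 GB). Both are assumptions made for illustration, not statements from the slide.

```python
# Rough per-ESS averages for the ORNL installation (illustrative only).

ESS_SYSTEMS = 77          # from the slide
NAMESPACE_PB = 250        # from the slide: single namespace "up to" 250 PB
SEQ_IO_TB_PER_S = 2.5     # from the slide: large block sequential IO


def per_system_averages(systems: int = ESS_SYSTEMS):
    """Return (capacity in PB, sequential throughput in GB/s) per ESS system.

    Assumes an even spread across systems and decimal units.
    """
    capacity_pb = NAMESPACE_PB / systems
    throughput_gb_s = SEQ_IO_TB_PER_S * 1000 / systems
    return capacity_pb, throughput_gb_s


if __name__ == "__main__":
    capacity, throughput = per_system_averages()
    print(f"~{capacity:.2f} PB and ~{throughput:.1f} GB/s per ESS system")
```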

          IBM Elastic Storage Server (ESS) Family

          Family     Model    Enclosures                             Drives                Throughput
          Speed      GS1S     (not listed)                           24 SSD                14 GB/s
          Speed      GS2S     (not listed)                           48 SSD                26 GB/s
          Speed      GS4S     (not listed)                           96 SSD                40 GB/s
          Hybrid     GH14S    1 2U24 (SSD) + 4 5U84 (HDD)            334 NL-SAS, 24 SSD    38 GB/s
          Hybrid     GH24S    2 2U24 (SSD) + 4 5U84 (HDD)            334 NL-SAS, 48 SSD    40 GB/s
          Capacity   GL1S     1 5U84 enclosure, 9U                   82 NL-SAS, 2 SSD      6 GB/s
          Capacity   GL2S     2 5U84 enclosures, 12U                 166 NL-SAS, 2 SSD     12 GB/s
          Capacity   GL4S     4 5U84 enclosures, 20U                 334 NL-SAS, 2 SSD     24 GB/s
          Capacity   GL6S     6 5U84 enclosures, 28U                 502 NL-SAS, 2 SSD     36 GB/s

           New ESS C-Series
           Maximum Density with Room to Upgrade and Grow!

           Model         4U106 enclosures   Rack space   Drives               Disk capacity
           New! GL1C     1                  8U           104 NL-SAS, 2 SSD    1.0 PB
           New! GL2C     2                  12U          210 NL-SAS, 2 SSD    2.0 PB
           New! GL4C     4                  16U          432 NL-SAS, 2 SSD    4.2 PB
           New! GL6C     6                  28U          634 NL-SAS, 2 SSD    6.3 PB


          Power Accelerated Computing Platform – Sample Building Block View

          1-4 S42 Racks containing:
          • Mellanox Switches
          • Compute: AC922, 2 or 4 GPUs
          • Management: L922 or AC922
          • Elastic Storage Server (5147 & 5148)

          AC922: The World’s Premier AI Servers
          • Featured in ORNL and LLNL CORAL Installs
          • ExaOps of demonstrated AI Performance
          • Able to Process more than 20 GB/s of Data
          • Add Servers as Workloads Grow!

          IBM Elastic Storage Server for AI Workloads
          • Density meets Performance
          • High Density Petabytes in Minimum Space
          • Featured in ORNL and LLNL Installs
          • Grow Performance by Scaling Up or Out!
          • Supports IB and Ethernet!

          PowerAI: Open-Source Based Enterprise AI Platform
          • Integrated & Supported AI Platform
          • 3-4x Speedup for AI Training
          • Ease of Use Tools for Data Scientists

          Platform layers (top to bottom):
          • Developer Ease-of-Use Tools
          • Open Source Frameworks: Supported Distribution (Caffe, SnapML)
          • Faster Training Times via HW & SW Performance Optimizations
          • GPU-Accelerated Power Servers and Storage

          5x Faster Data Communication with Unique
          CPU-GPU NVLink High-Speed Connection

          [Diagram: IBM Power System AC922 Deep Learning Server (4-GPU Config)]
          • Store Large Models in System Memory: 1 TB memory per POWER9 CPU, 170 GB/s
          • Fast Transfer via NVIDIA NVLink: 150 GB/s between POWER9 CPU and V100 GPUs
          • Operate on One Layer at a Time: two V100 GPUs per POWER9 CPU
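
A back-of-the-envelope sketch of what the 150 GB/s NVLink figure means when layers are moved between system memory and a GPU. The 32 GB/s baseline (a PCIe Gen3 x16 link counted in both directions) is an assumption introduced here to illustrate where a roughly 5x ratio could come from. It is not stated on the slide, and the layer size and function names are hypothetical.

```python
# Illustrative transfer-time estimate; only the 150 GB/s figure is from the slide.

NVLINK_GB_PER_S = 150.0          # from the slide
ASSUMED_BASELINE_GB_PER_S = 32.0  # assumption: PCIe Gen3 x16, both directions


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move `size_gb` at the given bandwidth, ignoring latency."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def speedup(baseline_gb_per_s: float = ASSUMED_BASELINE_GB_PER_S,
            nvlink_gb_per_s: float = NVLINK_GB_PER_S) -> float:
    """Ratio of NVLink bandwidth to the assumed baseline link."""
    return nvlink_gb_per_s / baseline_gb_per_s


if __name__ == "__main__":
    layer_gb = 3.0  # hypothetical layer size
    print(f"NVLink:   {transfer_seconds(layer_gb, NVLINK_GB_PER_S) * 1000:.0f} ms")
    print(f"Baseline: {transfer_seconds(layer_gb, ASSUMED_BASELINE_GB_PER_S) * 1000:.0f} ms")
    print(f"Ratio:    {speedup():.1f}x")
```

These are ideal numbers; real transfers also depend on latency, transfer size and how well copies overlap with compute.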


           [Diagram: PowerAI software stack, top to bottom]
           • PowerAI Vision: Auto-ML for Images & Video
           • PowerAI: Open Source ML Frameworks, SnapML, Large Model Support (LMS)
           • PowerAI Enterprise: Distributed Deep Learning (DDL), Auto-Hyper Parameter Tuning,
             Deep Learning Impact (DLI) Module (Data & Model Management, ETL, Visualize, Advise)
           • Accelerated Infrastructure: Accelerated Servers, Storage

          • Faster Time to Results
          • Increased Resource Utilization
          • Simplified Management
          • Enterprise Solution

         Power AI Enterprise Project Examples

          Industry          Scenario
          Banking           Credit Scoring
                            Face Masking Detection
                            Stock Index Futures Prediction
          Securities        Research Exploration
                            OCR recognition correction
                            Company Logo and name auto matching
                            AI on cloud
          Insurance         Handwriting recognition
                            Work order auto clustering/handling
          Telecom           Network cabling detection
                            Service halt handling
          Manufacturing     LED Panel defect inspection
                            Steel quality classification
                            Wafer Flaw detection
          Energy            Power transmission line safety detection
          Healthcare        Pathologic analysis
          Retail            Retail market analysis via image recognition
          Public            Satellite photo fault reorganization
          Transportation    Train & subway defect inspection


               PowerAI Vision: “Point-and-Click” AI for Images & Video
               1. Label Image or Video Data
               2. Auto-Train AI Model
               3. Package & Deploy AI Model

          PowerAI Vision Project Examples

          Defect Identification
          • Wafer Fab Inspection – Electronics
          • Cam Shaft Inspection – Automotive
          • Seat Inspection – Automotive
          • PCBA Inspection – Electronics
          • Utility disk Inspection – Energy/Utilities
          • Mainframe assembly inspection – Electronics
          • Ceramic capacitor – Electronics
          • Defective Components – Oil/Gas

          Facial / Object Recognition
          • Safety/Security – Transit, Banking, Gaming
          • Building Infrastructure – Building/Construction
          • Service – Retail, Food
          • Traffic – Municipal


         Power Accelerated Computing Platform – Building Blocks

          Hardware Building Blocks (1-4 S42 Racks)

          Building block                          Quantity / placement                         Role in rack
          Mellanox Switches                       IB and Ethernet switches (shared w/ESS)      IB TOR switch, Enet TOR switch
          AC922 8335-GTG, 2 or 4 GPUs             4 – 15 Compute Servers*                      Compute Nodes
          9008-22L or 8335-GTG                    1–3 Management / Login Servers (1st rack)    xCAT / Manager / Login node;
                                                                                               ESS mgmt. node or protocol nodes
          Elastic Storage Server (5147 & 5148)    0-1 ESS per cluster (optional, 1st rack)     ESS

          * 7 max in 1st rack, 15 max in 2nd - 4th
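
The rack limits in the footnote determine how large a cluster can grow. The sketch below works through that arithmetic, assuming every rack is filled to its stated maximum. The function names are illustrative and not part of any IBM tool.

```python
# Upper bounds implied by the building-block limits (illustrative only).

FIRST_RACK_MAX = 7    # from the slide: 7 compute servers max in the 1st rack
OTHER_RACK_MAX = 15   # from the slide: 15 max in each of the 2nd - 4th racks
MAX_RACKS = 4         # from the slide: 1-4 S42 racks


def max_compute_servers(racks: int) -> int:
    """Largest number of AC922 compute servers that fits in `racks` racks."""
    if not 1 <= racks <= MAX_RACKS:
        raise ValueError(f"racks must be between 1 and {MAX_RACKS}")
    return FIRST_RACK_MAX + OTHER_RACK_MAX * (racks - 1)


def max_gpus(racks: int, gpus_per_server: int = 4) -> int:
    """Largest GPU count, with 2 or 4 GPUs per compute server."""
    if gpus_per_server not in (2, 4):
        raise ValueError("compute servers are offered with 2 or 4 GPUs")
    return max_compute_servers(racks) * gpus_per_server


if __name__ == "__main__":
    for racks in range(1, MAX_RACKS + 1):
        print(racks, "rack(s):", max_compute_servers(racks), "servers,",
              max_gpus(racks), "GPUs")
```

The first rack holds fewer compute servers because it also houses the management servers and the optional ESS.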


         Power Accelerated Computing Platform
               Configurable HW to simplify creation of “CORAL Like” scale out clusters

               Component     Hardware                                    Notes
               Storage       Elastic Storage Server                      Optional
               Compute       AC922 (2 or 4 GPU), 8335-GTG                Air Cooled Only; Same Processors as in CORAL Servers
               Management    L922+ and/or AC922 (0,2,4 GPU), 8335-GTG
               Switches      Mellanox                                    100Gb InfiniBand; 40Gb, 10Gb and 1Gb Ethernet
               Rack          One to four 42U Racks (S42)                 If you really need more, let us know!

               -     Configurable to support HPC, Power AI, and in the future, Quantum Simulator stacks
               -     Simplifies configuring complex scale out infrastructure
               -     Software customization & full rack integration in IBM manufacturing
                        - Determined in IBM Systems Lab Services Implementation Design Workshop
                        - Optional on-site network integration and knowledge transfer available
               -     Option to assemble in the Rochester, MN pre-build lab if the customer wants to use their own
                     switches or racks, or wants water-cooled AC922 compute processors

         Software that can be customized at IBM Manufacturing *
           Base
               • Red Hat OS 7.5 (5639-RLE)
               • IBM Spectrum Scale Client
               • Mellanox OFED driver (Mellanox)
               • NVIDIA CUDA Software (Nvidia)

           AI
               • PowerAI Base (5765-PAI)
               • PowerAI Enterprise (5765-AIE)
                   - Spectrum Conductor
                   - DL Impact
                   - PowerAI
               • PowerAI Vision (5737-H10)
               • H2O Driverless AI (5639-AIH)

           HPC
               • IBM Spectrum LSF Suite (5737-F30)
               • IBM Compilers – XLC/C++/Fortran, gcc
               • ESSL (5765-L61)
               • IBM Spectrum MPI (5725-G83)
               • Performance Toolkit (5765-PD2)
               • xCAT support (5771-CAT)

           Optional Open Source for P9
           Optional frameworks/levels as identified in the Implementation Design Workshop:
               • Anaconda
               • Caffe
               • IBM Advanced Toolchain
               • Jupyter Notebook
               • Keras
               • Python
               • PyTorch
               • TensorFlow
               • xCAT
               • XGBOOST (latest git code)

           * Assuming customer has required licenses (design workshop)


                How do I get started?

                                   What use cases in my company will have payback?
                                   Who can help my company customize the software?
                                   Who can provide knowledge transfer to my personnel?


          Cognitive Discovery Workshop: Helping you identify the right cognitive use cases

           Objective: To provide an overview of Cognitive technologies, explore potential use cases, and show how they can
                      be deployed to provide business value. The key focus is to identify potential use cases for a Proof
                      of Concept project.

           How's it delivered? A 4-6 hour face-to-face workshop at the customer location, delivered by an IBM
                      Cognitive Workshop team.

           What's the output? Potential use cases and an action plan to help the team select an appropriate Cognitive project.

           Who should attend? Key IT resources, Data Scientist/Customer Data Architect, LOB (Business Sponsor), and
                      any others the customer team feels are important to the discussion.

           Detailed abstract: This session typically includes:
                      • Overview of industry and cross-industry use cases
                      • Discussion of open source Cognitive technologies such as TensorFlow, Caffe, Theano, and Torch
                      • Discussion of data layer technologies such as Hadoop, NoSQL, NewSQL, and relational DB technologies, and
                        the importance of the end-to-end process (governance and data management)
                      • Discussion of customer-specific use cases, including feasibility assessment
                      • Development of an action plan to help the customer identify and justify Cognitive use cases (ROI or ROI factors)
                          • Identify infrastructure actions necessary to support the Cognitive project

                            Email: cssc@us.ibm.com
                            Submit Online Request: https://ibm.biz/BdFfcV

                     Discovery Workshop
          Time                  Topic                                                                  Speaker      Audience

          Executive Session
          9:00 – 9:15 am        Introductions and Review Workshop Objectives                           All          Execs, LOB, IT Liaisons
          9:15 – 10:45 am       Executive Session                                                      IBM          Execs, LOB, IT Liaisons
                                - What is AI
                                - Art of the Possible
                                - Short Demo – H2O
          10:45 – 11:00 am      Break

          Use Case Discovery
          11:00 – 11:45 am      Introduce Use Case Workshop                                                         LOB, IT Liaisons
                                - Answering lingering Q&A
                                - Each LOB department mission overview & focus areas
          11:45 am – 12:30 pm   Industry Examples of Applied AI                                        IBM/Client   LOB, IT Liaisons
                                - Group Discussion on applicability to Customer
          12:30 – 1:00 pm       Lunch
          1:00 – 2:30 pm        Discussion and Identification of Use Cases by LOB                      IBM/Client   LOB, IT Liaisons
                                - Feasibility and Impact of Use Cases
                                - Identify High Interest and Highest Value Use Cases for Customer
          2:30 – 2:45 pm        Break

          Business Case Development
          2:45 – 4:00 pm        Develop Action Plan for Creation of Exec Proposal for High Value       IBM/Client   LOB, IT Liaison for
                                Use Cases                                                                           identified use cases
                                - Use Case Pay Back, Cognitive Work Flow, Timeline, Data Strategy
                                - Cognitive Skill Set, Data Strategy, POC/Trial Implementation steps

         Power ACP – IBM Systems Lab Services
          Implementation Design Workshop
          -       Develops information to enable the majority of system implementation
                  and tailoring to occur in IBM Manufacturing
          -       Done on customer site
          Note: This step is mandatory for enabling manufacturing SW preload

          Manufacturing: Hardware Racking, Software Customization in Manufacturing
          -       Install, Configure & Verify software

          Install: Network Integration & Knowledge Transfer on site
          -       Optional network integration
          -       Done on customer site
          -       Billable to customer
          -       Knowledge Transfer on solution configuration

          Contact us today: fdrobin@us.ibm.com
          On the Web: www.ibm.com/it-infrastructure/services/lab-services
          PartnerWorld: www.ibm.com/partnerworld/systems/services/lab-services
          Email us: ibmsls@us.ibm.com

         IBM Systems Lab Services Implementation Design Workshop
         Onsite customer workshop to enable a fast time-to-benefit implementation
         - Develops information to enable majority of system implementation and tailoring to occur in
            IBM Manufacturing
             - Documents software and infrastructure required to enable customer use cases
             - Includes:
                  - Data Center personnel to ensure client data center is ready for the Power
                     Accelerated Computing Platform implementation
                  - Customer personnel to determine customization of software like PowerAI
                     Enterprise, PowerAI Vision, or H2O
                  - Client networking team to document customization needed for networking (IPs,
                     VLANs, uplinks, etc.)
             - Creation of the implementation documentation that will be used for customization at
                 IBM Manufacturing and for solution knowledge transfer


         End Result at the Data Center
          [Figure: side-by-side comparison labeled "This" and "Not This"]


               Lessons learned with Summit on deploying large
               HPC Clusters


         Deployment of Large HPC Clusters Lessons Learned

          Architecture for scale is important. In our case, the network architecture was quite successful, and service
          nodes were used to distribute provisioning workload across many nodes.

          Most of the effort in deploying a large cluster is in the infrastructure racks

          Switch-level discovery becomes critical for large-scale rapid deployment of racks. Cabling verification and
          double-checking node positions became important.

          It's important to establish a good, complete set of node-level diagnostics to run on every node in the cluster,
          and to run this set of diagnostics on a continuous basis

          Establish a process and mechanism to deploy updates continuously to the cluster, for both software and
          firmware. This includes both stateful and stateless nodes.

          Expect issues at scale with most tools
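
          The cabling verification step above can be illustrated with a small sketch. It is not the tooling used on the CORAL systems; it only assumes that a planned (switch, port) to node map exists and that switch-level discovery yields an observed map of the same shape. The switch and node names are hypothetical.

```python
from typing import Dict, List, Tuple

# (switch name, port number) -> node name
PortMap = Dict[Tuple[str, int], str]


def verify_cabling(expected: PortMap, discovered: PortMap) -> List[str]:
    """Return readable mismatches between planned and observed cabling."""
    problems = []
    # Every planned connection must be seen, and seen on the right port.
    for port, node in sorted(expected.items()):
        seen = discovered.get(port)
        if seen is None:
            problems.append(
                f"{port[0]} port {port[1]}: expected {node}, nothing discovered")
        elif seen != node:
            problems.append(
                f"{port[0]} port {port[1]}: expected {node}, found {seen}")
    # Anything discovered on a port that the plan does not mention.
    for port in sorted(set(discovered) - set(expected)):
        problems.append(
            f"{port[0]} port {port[1]}: unexpected node {discovered[port]}")
    return problems


# Hypothetical plan and discovery results for one rack switch.
expected = {("sw01", 1): "node001", ("sw01", 2): "node002", ("sw01", 3): "node003"}
discovered = {("sw01", 1): "node001", ("sw01", 2): "node003", ("sw01", 4): "node004"}

problems = verify_cabling(expected, discovered)
for line in problems:
    print(line)
```

          Running a check like this per rack, before provisioning, is one way to catch swapped cables and misplaced nodes while the rack is still easy to reach.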


         Performance Testing as you go

          One of the final objectives for the cluster deployment was a submission to the TOP500 list

          For Sierra, HPL (Linpack) became an extraordinarily valuable tool for exercising a cluster, and finding and
          diagnosing performance issues

          We started small at the node level, and worked up to the rack level, row level and cluster level. In this way, we
          could identify performance issues at the micro level, rather than the macro level. When tuned well, node level
          and rack level performance was remarkably similar.

          Node level HPL identifies CPU, GPU and memory performance issues

          Rack-level HPL identifies InfiniBand performance issues both at individual nodes and at the rack-level IB
          switches

          Row-level HPL identifies performance issues in some core IB switches. For example, we saw performance
          issues in the eastern end of one row in Sierra

          Cluster-level HPL identifies issues at very large scale, and provides opportunities for novel approaches to HPL
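
          The bottom-up approach above can be sketched as a simple outlier check applied one level at a time. This is an illustration only: the 5% tolerance, the result format, and all names and scores are assumptions, and the HPL results are taken as already collected (higher is better, arbitrary units).

```python
from statistics import median
from typing import Dict, List, Optional, Sequence, Tuple


def slow_outliers(scores: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Names whose score is more than `tolerance` below the median score."""
    if not scores:
        return []
    cutoff = median(scores.values()) * (1.0 - tolerance)
    return sorted(name for name, score in scores.items() if score < cutoff)


def first_failing_level(
    results_by_level: Sequence[Tuple[str, Dict[str, float]]],
    tolerance: float = 0.05,
) -> Optional[Tuple[str, List[str]]]:
    """Walk levels from smallest to largest; stop at the first with outliers."""
    for level, scores in results_by_level:
        slow = slow_outliers(scores, tolerance)
        if slow:
            return level, slow
    return None


# Hypothetical node-level HPL scores: n03 is well below its peers.
node_scores = {"n01": 10.0, "n02": 10.1, "n03": 8.9, "n04": 10.05}
slow_nodes = slow_outliers(node_scores)

# Hypothetical case where the nodes are healthy but one rack is slow,
# which would point at the rack-level IB fabric rather than a node.
healthy_nodes = {"n01": 10.0, "n02": 10.1}
rack_scores = {"r1": 100.0, "r2": 99.0, "r3": 80.0}
finding = first_failing_level([("node", healthy_nodes), ("rack", rack_scores)])
```

          Stopping at the first level that shows outliers keeps the investigation at the smallest scope where the problem is visible, which is the point of working up from node to rack to row to cluster.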

         Power Accelerated Computing Platform

              Getting Started
              • IBM Cognitive Systems Solution Center (CSSC)
                Optional Discovery Workshop to identify use cases
                       •     Email: cssc@us.ibm.com
                       •     Submit Online Request: https://ibm.biz/BdFfcV

              • IBM Systems Lab Services three-stage approach
                              i. Implementation Design Workshop
                              ii. Manufacturing Customization
                              iii. Data Center Integration
                       •     Email: ibmsls@us.ibm.com or
                       •     Fred Robinson fdrobin@us.ibm.com

              • Configurator: eConfig -> Power -> Solutions -> Power ACP


             IBM Systems Lab Services
             Proven expertise to help leaders plan, design, and implement the essential IT infrastructure for what comes next

             Our team of 1,000+ consultants engages worldwide in pre- and post-sales opportunities in:

              • Power Systems
              • Storage and Software Defined Infrastructure
              • IBM Z and LinuxONE
              • HPC & Deep Learning
              • Systems Consulting
              • Migration Factory
              • Technical Training and Events

             ibmsls@us.ibm.com
             www.ibm.com/it-infrastructure/services/lab-services
             Fred Robinson fdrobin@us.ibm.com

        IBM Power
        Accelerated Computing Platform
          IBM Power ACP gives clients their own AI
          installation based upon the world’s most
          powerful and smartest scientific
          supercomputer

          Includes everything required for success!
              • Networking
              • Servers
              • Storage
              • Software
              • Services
              • Support

          Leverage CORAL success TODAY!