Enable AI at Scale with NVIDIA and Equinix
Enterprise Strategy Group | Getting to the bigger truth.™

ESG WHITE PAPER

Enable AI at Scale with NVIDIA and Equinix
Embracing a Hybrid Cloud AI Future

By Mike Leone, ESG Senior Analyst
January 2022

This ESG White Paper was commissioned by NVIDIA and Equinix and is distributed under license from ESG.

© 2022 by The Enterprise Strategy Group, Inc. All Rights Reserved.
Contents

Introduction
AI Infrastructure Challenges
Rethinking a Cloud-first AI Approach
    The Impact of Data Gravity on Training and Inference
Hybrid Cloud AI to the Rescue
    Rise of Colocation AI Services
Follow the AI Leaders
NVIDIA LaunchPad with Equinix
The Bigger Truth
    Understand AI infrastructure requirements across all business units involved in AI development.
    Understand AI infrastructure requirements based on what is needed to support the entire lifecycle of AI—from prototyping and experimentation to production training at scale and inference at the edge.
    Understand the potential benefits of embracing data gravity by moving compute to where data resides.
    Understand the role the public cloud plays in the long-term success of AI.
Introduction

The impact AI is having on organizations is profound. Whether they are turning to AI to provide more predictive insights into future scenarios and outcomes or developing AI-based products and services to capture new revenue opportunities, businesses continue to emphasize AI adoption as a game changer for the modern business. ESG research shows that 45% of organizations currently have AI projects in production using specialized infrastructure, with another 39% at the piloting/POC stage, as organizations look for smarter and faster ways to gain value from data.1

While the usage of AI offers potentially eye-opening benefits, challenges continue to arise that cause roadblocks, delays, and outright failures in achieving AI success. Between infrastructure shortcomings throughout the AI lifecycle, an inability to cost-effectively scale AI, and the increasing force of data gravity, even organizations that have seen early success by starting their AI journeys in the public cloud are rethinking their cloud-first/cloud-only strategies. As organizations plan for the increased pervasiveness of AI throughout the business, they are realizing that a hybrid cloud approach to AI will be a requirement to ensure that they can achieve true AI success.

AI Infrastructure Challenges

One of the greatest challenges organizations face today in their adoption of AI comes from the infrastructure stack. Simply put, many CIOs don’t have the right IT platform and infrastructure in place to satisfy AI workload requirements. Between inadequate processing power, storage capacity, and networking capabilities, and an inability to properly manage resource allocation, infrastructure readiness is proving to be a significant issue in keeping up with the performance and concurrency demands of diverse AI workloads.
Those workloads include data analysis and experimentation, feature engineering, model training, model serving, and inference within a deployed application, and each has different infrastructure requirements. Organizations are spending millions of dollars to stitch together AI components in a DIY fashion or turning to public cloud-based as-a-service offerings, but in both cases the foundational infrastructure is rooted in general-purpose components. This is a major reason that 98% of recently surveyed AI adopters identified or anticipated a weak component somewhere in their AI infrastructure stack (see Figure 1). More specifically, 86% identified at least one of the following areas as a weak link: GPU processing, CPU processing, data storage, networking, resource sharing, or integrated development environments.

Figure 1. Top 8 Weakest Links in the AI Infrastructure Stack

Which parts of the infrastructure stack do you believe are or will be the weakest links in your organization’s ability to deliver an effective AI environment? (Percent of respondents, N=325, three responses accepted)

• Resource sharing: 26%
• Integrated development environment (IDE): 25%
• GPU processing: 25%
• CPU processing: 25%
• Data storage: 22%
• Databases: 22%
• Multi-tenancy: 21%
• Networking: 20%

Source: Enterprise Strategy Group

1 Source: ESG Master Survey Results, Supporting AI/ML Initiatives with a Modern Infrastructure Stack, May 2021. All ESG research references and charts in this white paper have been taken from this master survey results set unless otherwise noted.
Another interesting component of the infrastructure stack challenge is the diversity of personas that may require access to the system, from data-centric personas like data scientists and data engineers to application developers and IT staff responsible for resource allocation or maintenance. The availability of not only the system but also the tools, technologies, and underlying data creates several bottlenecks, all of which impact time to value. Organizations will need to embrace new infrastructure purpose-built for diverse AI workloads and will need to carefully consider how to onboard such platforms, especially if their data center is not optimized for an accelerated computing infrastructure or if they’ve moved away from data centers altogether. One example of overlooked requirements is power and cooling: AI infrastructure consists of dense HPC hardware that requires reliable power and proper cooling that simply cannot be provided in every data center.

Rethinking a Cloud-first AI Approach

The public cloud creates a low barrier to entry for AI by providing an as-a-service model to satisfy short-term AI needs. End users gain access to the right tools, technologies, and resources to get started with AI faster and more cost-effectively than anywhere else. And while it’s appealing to have a controlled environment in which to experiment and learn the best ways to leverage data for an AI use case, challenges remain. As organizations experiment on their AI models in the cloud, model complexity, increased compute and storage requirements, and exponential data growth introduce rapidly escalating costs for tight-budgeted organizations. So, while it may be easy to scale cloud AI deployments quickly, cost becomes a deterrent, forcing organizations to make tradeoffs in how they deploy AI and deliver AI-specific resources to a growing number of stakeholders.
According to ESG research, this is driving the repatriation of AI workloads, with organizations citing areas such as an inability to meet scalability/elasticity expectations, poor or unpredictable performance, and high cost as drivers for repatriation. In fact, 57% of IT organizations have repatriated workloads (inclusive of AI workloads) from the public cloud back to on-premises environments.2

The Impact of Data Gravity on Training and Inference

Driving the repatriation of AI workloads is the idea of data gravity: the ability of a large data set to attract applications, processing power, services, and other data. The force of gravity, in this context, can be thought of as the way these other entities are drawn to data relative to its mass. Data gravity is particularly challenging in AI training efforts, which often require high-performance and therefore higher-cost compute, storage, and networking. Because AI training is fueled by massive volumes of data, workflows in which data must be moved from one location to another, such as from a private deployment to a public cloud environment, require a significant amount of time and effort. And this is before factoring in the tangible cost of processing the data using general-purpose compute. As organizations look to offset the inefficiencies, time delays, and increased costs of moving TBs of data into a public cloud environment, the idea of moving compute to where data is generated or stored, also known as training where the data lands, is on the rise. In fact, when it comes to training, 67% of organizations have embraced training models on-premises, whether in a data center or at an edge location.

2 Source: ESG Master Survey Results, 2021 Data Infrastructure Trends, September 2021.
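The time and cost of moving training data described above can be roughed out with simple arithmetic. The sketch below is illustrative only: the 50 TB dataset size, 10 Gbps link speed, and $0.05/GB egress rate are assumed example figures, not numbers from this paper; substitute your own provider's values.

```python
# Back-of-envelope estimate of moving a training dataset to or from the cloud.
# All input figures below are hypothetical assumptions for illustration.

def transfer_time_hours(dataset_tb: float, link_gbps: float) -> float:
    """Best-case hours to move dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 8e12            # 1 TB = 8 * 10^12 bits (decimal units)
    seconds = bits / (link_gbps * 1e9)  # assumes the link is fully utilized
    return seconds / 3600

def egress_cost_usd(dataset_tb: float, usd_per_gb: float) -> float:
    """Cloud egress charge for pulling the dataset back out of the cloud."""
    return dataset_tb * 1000 * usd_per_gb

# Example: a 50 TB training set over a dedicated 10 Gbps link,
# with a hypothetical $0.05/GB egress rate.
print(round(transfer_time_hours(50, 10), 1))  # ~11.1 hours, best case
print(round(egress_cost_usd(50, 0.05), 2))    # $2,500.00 per outbound pass
```

Even under these optimistic assumptions, each round trip of the dataset costs roughly half a day of transfer time plus a recurring egress fee, which is the arithmetic behind moving compute to where the data lands.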
While training is focused on developing algorithms and processing large volumes of data to ensure high accuracy, enterprises must then deploy those algorithms to production environments for inferencing. Inferencing has a different set of requirements than training: it does not require heavy processing resources, but for many use cases it does require fast execution of data analysis, returning a result in as close to real time as possible. This is a major reason why organizations continue to search for ways to implement inferencing in line with the incoming data flow. As organizations struggle to deploy developed algorithms to production environments, the fact that these environments increasingly fall outside of the public cloud adds to the complexity. This means organizations will be looking to deploy models directly on an edge device or at an edge aggregation point. And while aggregating data to a central location may introduce unnecessary latency, several use cases require multiple data sources for inferencing, forcing organizations to architect a well-networked, high-performance edge aggregation point.

Hybrid Cloud AI to the Rescue

The public cloud will continue to enable organizations to ramp up AI initiatives quickly, especially early adopters that do not have immediate access to critical AI infrastructure components. However, when the AI cloud tipping point is reached because model complexity, scale, or data gravity has dictated a need for a private AI footprint, the availability of private AI infrastructure resources will be essential. Organizations will require a fixed-cost infrastructure with right-sized resources that supports the diversity of AI workloads, from experimentation and rapid model training at scale to multi-environment deployment and management. This is driving the need for a hybrid AI architecture to set organizations up for AI success.
In fact, when ESG asked organizations about the most important considerations for infrastructure solutions used to support their AI initiatives throughout the AI lifecycle, the top response was hybrid cloud capabilities (see Figure 2).

Figure 2. Most Important AI Lifecycle Infrastructure Considerations

For all aspects of the AI lifecycle, which of the following are–or likely will be–the most important in your organization’s consideration of infrastructure solution(s) used to support its AI initiatives? (Percent of respondents, N=325, multiple responses accepted)

• Hybrid/multi-cloud capability: 18%
• Data security/governance: 17%
• Maximizing hardware/infrastructure utilization: 16%
• Integrated development environment (IDE): 15%
• Model management and monitoring: 15%
• Integration with GPU: 14%
• Data durability/high availability: 14%
• Speed of deployment/provisioning: 14%
• Management simplicity: 14%
• Lowest possible latency: 13%
• Data movement: 13%
• Data traceability: 13%

Source: Enterprise Strategy Group
Organizations are recognizing that hybrid cloud AI enables them to overcome potential failures in accelerating AI development and effectively operationalizing AI. Because organizations do not want to be mired in a proof-of-concept stage across a mix of siloed projects, all with escalating cloud costs, hybrid AI is helping address the amount of “model debt” companies incur by offsetting a widening gap between developed models and deployed models. Hybrid cloud AI enables organizations to embrace an end-to-end platform on which enterprises can efficiently traverse the AI lifecycle, from development to deployment of AI applications, ensuring reasonable ROI as the pervasiveness of AI scales throughout the business.

Rise of Colocation AI Services

As organizations look for new solutions that enable them to embrace data gravity, bring the compute and software stack to where massive data sets reside, and eliminate the barrier to the deterministic performance of on-premises AI systems, interest in colocation services is on the rise, with the goal of enabling organizations to achieve hybrid cloud AI success. Colocation service offerings can help businesses overcome a lack of AI-optimized facilities and infrastructure by providing the right AI infrastructure resources to support all the workloads throughout the AI lifecycle. Low-latency connectivity to all major cloud providers, improved data locality that lets organizations know that data is either inside or near colo facilities, and the availability of right-sized resources based on AI workload demand enable colo service offerings to improve time to insight while enabling effortless mobility of AI development workloads. For use cases where latency is critical, deploying inference infrastructure at macro-edge locations can also serve as an elegant solution to ensure AI SLAs are consistently met.
For those AI use cases where data aggregation is required, an interconnected colocation facility can offer faster access to multiple data sources and therefore enrich AI development.

Follow the AI Leaders

Forward-leaning enterprises are achieving AI success by overcoming the impact of data gravity and eliminating escalating I/O costs by moving compute to where the data lives, enabling an affordable compute cost model that lowers the barrier to AI entry. AI leaders are looking to colocation facilities that provide access to modern AI infrastructure and a fully optimized platform that supports the end-to-end AI lifecycle, from development to deployment, and all AI workloads, including analytics, training, and inference. Powerful compute nodes, scalable compute clusters, high-performance storage, and right-sized resource availability are being leveraged to deliver deterministic performance. To offset the AI operational burdens and shadow AI silos commonly experienced by IT staff tasked with AI resource delivery, leaders are looking to new operating models that will empower IT to consolidate operational AI silos, simplify capacity planning, and ensure resources are optimally delivered based on AI workload requirements. Leaders are embracing hybrid AI architectures optimized for cost-effective AI development, yielding higher levels of efficiency as well as faster experimentation and traversal of the iterative AI lifecycle.

NVIDIA LaunchPad with Equinix

NVIDIA and Equinix recognize the power of leveraging the best AI hardware and software infrastructure in a seamless, easy, and cost-effective way. Together they are building an AI ecosystem of technology providers, ISVs, tool developers, data brokers, and network providers, all with the goal of democratizing AI. To deliver a complete development-to-deployment AI infrastructure solution, NVIDIA and Equinix have partnered to deliver the NVIDIA LaunchPad solution on Platform Equinix.
NVIDIA LaunchPad is a free service for enterprise customers to try NVIDIA AI. With NVIDIA LaunchPad, enterprises can get immediate, short-term access to NVIDIA AI running on private accelerated compute infrastructure to power critical AI initiatives. As organizations gain experience and see success, they can move to a consumption-based subscription model by deploying and scaling their AI infrastructure within an Equinix data center.
With NVIDIA and Equinix, organizations gain access to an end-to-end solution that provides both core AI infrastructure for model training with NVIDIA Base Command™ and inference and edge AI infrastructure with mainstream NVIDIA-Certified Systems enabled by the Equinix Metal service. Foundational to the Base Command offering is NVIDIA DGX Foundry, a high-performance AI training infrastructure based on NVIDIA DGX SuperPOD™, comprising NVIDIA DGX™ A100 systems, NetApp storage, and NVIDIA networking. Each DGX A100 integrates eight NVIDIA A100 Tensor Core GPUs and two 2nd Gen AMD EPYC™ processors, powered by a full-stack AI-optimized architecture purpose-built for the unique demands of AI workloads, from analytics and experimentation to training and inference. NVIDIA DGX systems are optimized at every layer for delivering the fastest time-to-solution on the most complex AI workloads. AI researchers and innovators don’t have to waste time integrating, troubleshooting, and supporting hardware and software, and data scientists can confidently utilize resources across their end-to-end workflows, from development to training at scale.

Equinix Fabric provides high-speed, secure connectivity between these distributed training and inference locations. Its software-defined interconnection services provide fast and secure data transfer from distributed data sources to the NVIDIA AI model training stack, and the same private interconnection solution enables the transfer of newly developed AI models to the NVIDIA AI edge infrastructure at Equinix. Enterprises can deploy their AI training and edge infrastructure on Platform Equinix in more than 64 metro markets across more than 26 countries on five continents. All of these distributed Equinix sites are interconnected via high-speed, low-latency, secure Equinix Fabric virtual connections.
And most metros have been verified by NVIDIA to meet the power and cooling requirements of next-generation AI hardware. In addition to providing the AI compute, network, and storage infrastructure, NVIDIA LaunchPad provides the necessary software-based orchestration services to move data and AI models between the distributed sites in a seamless manner using cloud technologies. Customers can manage their AI development workflow with NVIDIA Base Command™, NVIDIA Fleet Command™, and the NVIDIA AI Enterprise suite, which provides easy, secure management and deployment of AI at the edge. The Equinix infrastructure deploys in minutes, providing enterprises with immediate access to an entire spectrum of NVIDIA resources that support virtually every aspect of AI, from data center training and inference to full-scale deployment at the edge.
The Bigger Truth

As organizations look to embrace AI throughout the business, they are becoming increasingly aware of the challenges preventing greater success. Distributed data sets, data gravity, operational silos, and the need for right-sized access to powerful, yet cost-effective infrastructure are forcing organizations to make tradeoffs in how they best leverage AI infrastructure solutions and capabilities across hybrid environments. Together, NVIDIA and Equinix are looking to help organizations embrace AI at scale using a colocation model that removes ongoing data center capital burden, eliminates data center redesign, minimizes AI operational silos, and enables teams to benefit from the best of both worlds: the simplicity of the cloud with the deterministic performance needed to support production AI workloads at scale.

Organizations looking to transform the business by scaling the use of AI can set themselves up for success by considering the following areas and questions that should be asked of key stakeholders.

Understand AI infrastructure requirements across all business units involved in AI development.
• What initiatives are driving the deployment of AI development environments?
• Where are AI development deployments spawning?
• Who is running them?

Understand AI infrastructure requirements based on what is needed to support the entire lifecycle of AI—from prototyping and experimentation to production training at scale and inference at the edge.
• Where does the underlying data that is leveraged to support AI initiatives reside?
• How is the AI data pipeline constructed?
• How much of the data workflow is dependent on data movement?
• Is inline inference required, for latency or data volume reasons?
• What role does/should the edge play in the AI lifecycle?

Understand the potential benefits of embracing data gravity by moving compute to where data resides.
• Where is training done today?
• How rapidly are the datasets supporting model development growing?
• How does data movement impact the timeliness of training results?
• How much cost goes into the support of dataset I/O, data movement, data hosting, and compute/storage resources?
• Is data aggregation optimized for inference and training?

Understand the role the public cloud plays in the long-term success of AI.
• Where are you leveraging the public cloud today in support of AI initiatives?
• What are your current cloud costs associated with AI workloads?
• Has the public cloud become the single hammer for every nail in your enterprise?
All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.

Enterprise Strategy Group is an IT analyst, research, validation, and strategy firm that provides market intelligence and actionable insight to the global IT community.
www.esg-global.com | contact@esg-global.com | 508.482.0188