Ready Solutions for Data Analytics
Big Data as a Service (Ready Solutions for Big Data)
Architecture Guide

February 2019
H17286.1
Contents

List of figures
List of tables
Trademarks
Notes, cautions, and warnings
Chapter 1: Solution overview
    Overview
Chapter 2: Solution architecture
    Architecture overview
    Solution components
    Deployment architecture
Chapter 3: Software architecture
    Software overview
    Elastic Plane cluster management
        App Store
        App Workbench
    Multi-tenancy and role-based security
        Tenants
        Role-based security
    Resource management
        Node flavors
        Resource allocation
        Quotas
    Storage access and management
        DataTaps
        Tenant storage
        Node storage
Chapter 4: Cluster architecture
    Cluster architecture
    Node roles definitions
    Sizing summary
    Rack layout
Chapter 5: Hardware architecture
    Dell EMC PowerEdge rack servers
        Dell EMC PowerEdge R640 server
        Dell EMC PowerEdge R740xd server
    Server hardware configurations
        Administration Node
        Gateway Nodes
        Worker Nodes - high density
        Worker Nodes - GPU accelerated
Chapter 6: Network architecture
    Physical network architecture
    Physical network definitions
    Physical network components
        Server node connections
        25 GbE pod switches
        25 GbE Layer 2 cluster aggregation
        iDRAC management network
        Network equipment summary - 25 GbE
    Logical network architecture
    Logical network definitions
    Core network integration
Chapter 7: Solution monitoring
    Cluster monitoring
    Hardware monitoring
Appendix A: References
    About BlueData
    About Cloudera
    About Red Hat
    About Dell EMC Customer Solution Centers
    To learn more
Glossary
Index
List of figures

Figure 1: Solution components
Figure 2: Solution deployment architecture
Figure 3: Solution Cluster architecture
Figure 4: Solution rack layout
Figure 5: Dell EMC PowerEdge R640 server 10 x 2.5" chassis
Figure 6: Dell EMC PowerEdge R740xd server 3.5" chassis
Figure 7: Physical network architecture
Figure 8: Dell EMC PowerEdge R640 network ports
Figure 9: Dell EMC PowerEdge R740xd network ports
Figure 10: 25 GbE single pod networking equipment
Figure 11: Dell EMC Networking S5048F-ON multiple pod networking equipment
Figure 12: Network fabric architecture
Figure 13: OME health monitoring
List of tables

Table 1: Cluster node roles
Table 2: Recommended cluster size - 25 GbE
Table 3: Alternative cluster sizes - 25 GbE
Table 4: Rack and pod density scenarios
Table 5: Hardware configurations – Dell EMC PowerEdge R640 Administration Node
Table 6: Hardware configurations – Dell EMC PowerEdge R640 Gateway Node
Table 7: Hardware configurations – Dell EMC PowerEdge R740xd Worker Nodes - high density
Table 8: Hardware configurations – Dell EMC PowerEdge R740xd Worker Nodes - GPU accelerated
Table 9: Solution network definitions
Table 10: Network / Interface Cross Reference
Table 11: Per rack network equipment - 25 GbE
Table 12: Per pod network equipment - 25 GbE
Table 13: Per cluster aggregation network switches for multiple pods - 25 GbE
Table 14: Per node network cables required – 25 GbE configurations
Table 15: Solution logical network definitions
Trademarks

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2018-2019 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
Notes, cautions, and warnings

Note: A Note indicates important information that helps you make better use of your system.

CAUTION: A Caution indicates potential damage to hardware or loss of data if instructions are not followed.

Warning: A Warning indicates a potential for property damage, personal injury, or death.

This document is for informational purposes only and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.
Chapter 1: Solution overview

Topics:
• Overview

This guide describes the Big Data as a Service solution, a Dell EMC Ready Solution for Data Analytics. It covers the overall solution architecture, the software architecture, the design of the nodes and clusters, the hardware components and architecture, the network design, and the operational monitoring of the solution.
Overview

In today's highly competitive business climate, organizations require insight into business operations as they happen, so they can respond to quickly changing market conditions. Data analytics, or Big Data, is reshaping industries by enabling rapid data-based decision making. Big Data has become an essential component of digital transformation across marketing, operations, finance — really all aspects of the modern business enterprise.

Yet deploying Big Data environments can be complex and time-consuming. The numerous tasks may include:
• Acquiring and deploying the compute nodes with storage
• Performing network configurations
• Installing operating systems
• Deploying Hadoop clusters
• Installing other analytic applications
• Testing and validating
• Administering the users
• Securing all of the elements
• Separately monitoring and managing all of the components

This complexity can also introduce risk and delay, particularly when there are multiple requests and varying needs coming from different functions and departments within the organization.

This solution is designed to simplify and accelerate Big Data deployments. Multi-tenant Big Data deployments that may have taken months can now be completed within a couple of days. Once the platform is deployed, data scientists and analysts can create their own virtual data analytic clusters on-demand within minutes — while accessing centralized data and reducing duplication.

This solution is part of Dell EMC's Ready Solutions for Data Analytics portfolio and includes the following elements:
• A complete enterprise-grade hardware infrastructure stack from Dell EMC, including scalable and high-performance compute, storage, and networking elements.
• The BlueData Elastic Private Instant Clusters (EPIC) software, a platform that enables Big Data as a Service by deploying a wide range of pre-packaged containerized data analytic applications.
• Automated lifecycle management operations and end-to-end infrastructure monitoring with Dell EMC OpenManage Enterprise.
• An extensive and validated ecosystem of containerized data analytic services, accessible via the BlueData App Store.
• An available jumpstart services package, including deployment, on-site integration, and initial consulting services.
• The Big Data Automated Deployment Tool Kit (ADTK) from Dell EMC, included along with the jumpstart services to ensure rapid, reliable, and risk-free deployments.

This wide range of capabilities makes this a complete turn-key solution for Big Data as a Service that can be deployed quickly and efficiently as a platform, and can then offer rapid on-demand analytic services to end users with efficient utilization of resources for the organization as a whole. The benefits of such a complete Big Data as a Service solution are numerous and allow the organization to:
• Simplify on-premises deployments with a turnkey BDaaS solution.
• Increase business agility by empowering data scientists and analysts to create Big Data clusters in a matter of minutes, with just a few mouse clicks.
• Minimize the need to move data by independently managing and scaling compute and storage.
• Maintain security and control in a multi-tenant environment, integrated with your enterprise security model (e.g. LDAP, AD, or Kerberos).
• Achieve cost savings of up to 75% compared to traditional deployments by improving utilization, controlling usage, eliminating cluster sprawl, and minimizing data duplication.
• Deliver faster time-to-insights with pre-integrated images for common data science, analytics, visualization, and business intelligence tools – including Cloudera Hadoop, Hortonworks Hadoop, Spark, TensorFlow, Cassandra, Kafka, and others.
Chapter 2: Solution architecture

Topics:
• Architecture overview
• Solution components
• Deployment architecture

The overall architecture of the solution addresses all aspects of implementing this solution in production, including the software layers, the physical server hardware, the network fabric, scalability, performance, and ongoing management.

This chapter summarizes the main aspects of the solution architecture.
Architecture overview

As Big Data deployments expand to meet the needs of multiple organizations and applications, supporting diverse data analytics workloads and user groups requires increased agility and streamlined operations. Implementing a Big Data as a Service environment can provide a solution for these needs. A Big Data as a Service environment has the following key requirements:
• Streamlined operations — Big Data as a Service must provide streamlined operations through self-service with secure multi-tenancy, while simplifying resource management and providing high availability and performance.
• Compute abstraction layer — Applications and clusters on demand must be supported without concern for physical compute infrastructure allocation. Resource management must provide capacity management and scalability. Applications should be templated to hide the details of physical compute requirements.
• Storage abstraction layer — Local, remote, and shared storage must be supported, including security and multi-tenant isolation.
• Hardware infrastructure layer — The hardware infrastructure must provide high-performance compute, network, and storage, with management capabilities. The infrastructure must be scalable and support independent allocation of compute, network, and storage resources.

The architecture of this solution embodies all the hardware, software, resources, and services needed to meet these requirements in a production environment. Based on BlueData EPIC, this integrated solution means that you can be in production within a shorter time than is typically possible with homegrown solutions.

Solution components

This solution addresses the requirements of Big Data as a Service by integrating multiple hardware and software components that provide the necessary functions. Figure 1 illustrates the primary functional components in this solution.

Figure 1: Solution components
• Containers provide the core runtime abstraction for the user applications. These containers provide isolation between user applications and the rest of the infrastructure. The containers are based on Docker.
• The Resource management and orchestration layer is the core operational component in the system, and is provided by EPIC. This layer is responsible for allocating resources to applications, and for creating and monitoring container instances to execute those applications. In EPIC, container instances are referred to as virtual nodes. Elastic Plane provides the operational interface to this layer.
• Tenants are an abstraction that provides multi-tenancy capabilities by grouping container instances. Containers associated with a tenant are isolated from other tenants at the network, compute, and storage levels.
• The App Store is a repository of application images, allowing fully automated self-service deployment. Images in the App Store are preconfigured and ready to run, including complete cluster support. Images for Hadoop and other Big Data platforms are provided with the base installation. The App Workbench enables users to quickly add images for any other Big Data application or data processing platform.
• The Compute infrastructure provides the memory, processor, hardware accelerator, and I/O resources to support container execution. This infrastructure is provided by Dell EMC PowerEdge servers.
• IOBoost is an EPIC component that ensures performance comparable to bare metal in the containerized environment.
• The Virtual network layer is responsible for dynamically assigning network addresses to container instances, supporting tenant isolation at the network level, and managing connectivity between container instances and external networks. This layer is provided as part of EPIC.
• Node storage provides local storage for a container instance while it is running. This storage is ephemeral, and is removed when a container instance completes.
• DataTaps provide access to remote storage for containers. DataTaps are associated with a tenant, so multiple applications and containers can share a DataTap while the DataTap is isolated from other tenants.
• Tenant storage is a DataTap that provides persistent shared storage accessible by all nodes within a given tenant. The underlying filesystem is HDFS, and the physical storage is allocated from the storage infrastructure.
• NFS access to remote storage is available through NFS DataTaps.
• Isilon HDFS access to remote storage is available through HDFS DataTaps.
• Storage infrastructure is provided by Dell EMC PowerEdge servers.
• Network infrastructure is provided by Dell EMC Networking switches.
• Operations and security capabilities are integrated through the entire stack by EPIC and OpenManage Enterprise.

Deployment architecture

Cluster deployment and hardware infrastructure management capabilities are provided through a dedicated Administration Node. Figure 2 illustrates the functional components of the deployment architecture.
Figure 2: Solution deployment architecture

The deployment process for nodes in the cluster is driven from a web interface to the Big Data Automated Deployment Tool Kit. Deployment of a node includes all the configuration required for the node to function, including:
• Configure appropriate BIOS settings
• Configure RAID sets
• Install the target OS
• Configure file system layouts
• Install appropriate OS packages
• Configure network interfaces
• Configure host names
• Configure SSH keys

The primary components of the deployment architecture are:
• Big Data Automated Deployment Tool Kit — provides the core deployment capabilities for the cluster, including discovering, configuring, and deploying nodes in the cluster. Operators drive the cluster deployment from the Big Data Automated Deployment Tool Kit web interface.
• RackHD — provides a platform-agnostic management and workflow orchestration engine. A web interface to RackHD is available but is not required for cluster deployment (a sketch of querying its API appears after this list).
• Ansible — is used to automate the installation and configuration of software on the destination nodes.
• Docker — is used to containerize the functionality of the Big Data Automated Deployment Tool Kit.
• OpenManage Enterprise — is used to monitor the hardware in the cluster. It runs as a virtual machine under KVM.
• Software images — provide master copies of software necessary for installation, including RHEL, CentOS, RancherOS, and firmware.
• Configuration data — is stored on the Administration Node, including system configuration settings, kickstart files, and playbooks used by Ansible.
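The RackHD component also exposes a northbound REST API. The following sketch shows how discovered nodes might be listed through it; the host and port are assumptions, and the endpoint path follows RackHD's published 2.0 API, so treat this as illustrative rather than part of the documented deployment workflow.

```python
# Illustrative sketch: listing nodes discovered by RackHD through its REST
# API. The RackHD host and port below are hypothetical; in this solution,
# cluster deployment is normally driven from the Big Data Automated
# Deployment Tool Kit web interface instead.
import requests

RACKHD = "http://rackhd.example.com:8080"  # hypothetical endpoint

nodes = requests.get(f"{RACKHD}/api/2.0/nodes").json()
for node in nodes:
    print(node.get("id"), node.get("type"), node.get("name"))
```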
Chapter 3: Software architecture

Topics:
• Software overview
• Elastic Plane cluster management
• Multi-tenancy and role-based security
• Resource management
• Storage access and management

This solution is based upon BlueData EPIC. EPIC is an enterprise-grade software platform that forms a layer between the underlying infrastructure and Big Data applications, transforming that infrastructure into an agile and flexible platform for virtual clusters running on Docker containers.
Software overview

The EPIC platform provides a simple, on-premises platform for delivering Big Data as a Service to an enterprise. EPIC seamlessly delivers a single shared platform for multiple distributions and versions of Hadoop, Spark, and other BI or analytics tools. Whether the need is to support separate business units' disparate Hadoop distribution requirements (e.g., Cloudera versus Hortonworks) or to support multiple versions of Hadoop for multiple BI toolchains, the BlueData EPIC software platform can pool all these resources on the same bare-metal hardware stack.

The EPIC platform consists of the EPIC services that are installed on each host in the cluster. EPIC handles all of the back-end virtual cluster management for you, thereby eliminating the need for complex, time-consuming IT support. Platform and Tenant Administrator users can perform all of these tasks in moments using the EPIC web portal.

EPIC consists of three key capabilities:
• ElasticPlane — A self-service web portal interface that spins up virtual Hadoop or Spark clusters on demand in a secure, multi-tenant environment.
• IOBoost — Provides application-aware data caching and tiering to ensure high performance for virtual clusters running Big Data workloads.
• DataTap — Accelerates time-to-value for Big Data by allowing in-place access to any storage environment, thereby eliminating time-consuming data movement.

Elastic Plane cluster management

Clusters spun up by Elastic Plane can be created to run a wide variety of Big Data applications, services, and jobs. Elastic Plane also provides a RESTful API for integration, as sketched below. EPIC abstracts common platform infrastructure resources by creating clusters using virtual nodes implemented as Docker containers. EPIC provides multi-tenancy, security, resource management, and storage access to the virtual clusters.
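Because Elastic Plane exposes a RESTful API, routine operations can be scripted. The sketch below assumes a session-based flow in the style of the EPIC API; the endpoint paths, the X-BDS-SESSION header, and the payload fields are assumptions for illustration, so consult the EPIC API reference for the actual interface.

```python
# Minimal sketch of scripting a session-based REST API such as the one
# Elastic Plane provides. All paths, headers, and fields here are assumed
# for illustration, not taken from this guide.
import requests

BASE = "https://epic-controller.example.com"  # hypothetical Controller Node

# Authenticate and obtain a session (assumed flow).
login = requests.post(
    f"{BASE}/api/v2/session",
    json={"name": "tenant-admin", "password": "secret"},
    verify=False,  # lab setting only; use proper TLS verification in production
)
login.raise_for_status()
session = login.headers["Location"]  # assumed: session URI returned in a header

# List the clusters visible to this session (assumed endpoint and schema).
resp = requests.get(f"{BASE}/api/v2/cluster",
                    headers={"X-BDS-SESSION": session}, verify=False)
for cluster in resp.json().get("_embedded", {}).get("clusters", []):
    print(cluster.get("label", {}).get("name"))
```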
App Store

The EPIC software platform includes an App Store for common distributed computing frameworks, machine learning applications, and data science tools. Open source distributions for Hadoop, Spark, Kafka, and other frameworks – as well as representative machine learning and analytics applications – are provided as preconfigured Docker images in the App Store, and are available via one-click deployment.

App Workbench

Every organization's Big Data and/or AI deployment is likely to have its own unique use cases and requirements, as well as its own preferred frameworks, applications, and tools. Both open source and commercial applications in this space are continually evolving, with a constant stream of updates, upgrades, new versions, and new products. To accommodate these needs, EPIC allows customers to modify and/or augment their App Store to meet the specific (and highly dynamic) requirements of their data scientist and data analyst teams. The EPIC platform provides App Workbench functionality that enables this "bring your own app" model. BlueData also provides training and consulting services to assist customers with creating their own Docker images, and in becoming self-sufficient as they expand and update their own App Store.

Multi-tenancy and role-based security

EPIC implements a multi-tenancy platform with role-based security. Tenants allow you to restrict access as needed, such as by department. Each tenant has its own unique sets of authorized users, DataTaps, applications, and virtual clusters that are never shared with any other tenant. User accounts must be assigned a Tenant Administrator or Member role in a tenant to access that tenant.
Tenants

Tenants are created by the Platform Administrator. The infrastructure resources (e.g., CPU, RAM, GPU, storage) available on the EPIC platform are allocated among the tenants on the platform. Each tenant is allocated a set of resources, and only users who are members of that tenant can access those resources. A Tenant Administrator manages the resources assigned to that tenant. Each tenant must have at least one user with the Tenant Administrator role. Users with access to one tenant cannot access or modify any aspect of another tenant unless they have been assigned a Tenant Administrator or Member role on that tenant.

Tenants can be created to best suit your organizational needs, such as by:
• Office location — If your organization has multiple office locations, you could choose to create one or more tenants per location. For example, you could create a tenant for the San Francisco office and one for the New York office. EPIC does not take location into account; this is just an example of how you could use a tenant.
• Department — You could choose to create one or more tenants for each department. For example, you could create one tenant each for the Manufacturing, Marketing, Research & Development, and Sales departments.
• Use cases, application lifecycle, or tools — Different use cases for Big Data analytics and data science may have different image and resource requirements.
• Combination — You could choose to create one tenant by department for each location. For example, you could create a tenant for the Marketing department in San Francisco and another tenant for the Marketing department in New York.

Some of the factors to consider when planning how to create tenants include:
• Structure of your organization — This may include such considerations as the departments, teams, and/or functions that need to be able to run jobs.
• Location of data — If the data to be accessed by the tenant resides in Amazon S3 storage on AWS, then the tenant should be configured to use Amazon EC2 compute resources. If the data to be accessed by the tenant resides on-premises, then the tenant can be configured to use either on-premises or Amazon EC2 compute resources.
• Use case and tool requirements — Different use cases for Big Data analytics and data science may have different image and resource requirements.
• Seasonal needs — Some parts of your organization may have varying needs depending on the time of year. For example, your Accounting department may need to run jobs between January 1 and April 15 each year but have few to no needs at other times of the year.
• Number and location of hosts — The number and location(s) of the hosts that you will use to deploy an EPIC platform may also be a factor. If your hosts are physically distant from the users who need to run jobs, then network bandwidth may become an important factor as well.
• Personnel who need EPIC access — The locations, titles, and job functions of the people who will need to be able to access EPIC at any level (Platform Administrator, Tenant Administrator, or Member) may influence how you plan and create tenants.
• IT policies — Your organization's IT policies may play a role in determining how you create tenants, and who may access them.
• Regulatory needs — If your organization deals with regulated products or services (such as pharmaceuticals or financial products), then you may need to create additional tenants to safeguard regulated data and keep it separate from non-regulated data.
These are just a few of the possible criteria you must evaluate when planning how to create tenants. EPIC has the power and flexibility to support the tenants you create regardless of the schema you use. You may create, edit, and delete tenants at any time. However, careful planning for how you will use your EPIC platform, including the specific tenants your organization will need now and in the future, will help you better plan your entire EPIC installation, from the number and type of hosts to the tenants you create once EPIC is installed on those nodes.
Role-based security

EPIC implements a user-level role-based security model. Each user has a unique username and password that they must provide in order to log in to EPIC. Authentication is the process by which EPIC matches the user-supplied username and password against the list of authorized users and determines:
• Whether to grant access
• What exact access to allow, in terms of the specific role(s) granted to that user

EPIC can authenticate users using any of the following methods:
• Internal user database
• An existing LDAP or AD server

Role assignments are stored on the EPIC Controller Node. EPIC includes three roles that allow you to control who can see certain data and perform specific functions. The roles are:
• Platform Administrator
• Tenant Administrator
• Member

Roles are granted on a per-tenant basis, so users can be restricted to a single tenant or granted access to multiple tenants. Each user can have a maximum of one role per tenant. A user with more than one role may be a Member of some tenants, and a Tenant Administrator of other tenants.

Some of the user-related items you must consider when planning and maintaining your EPIC installation include:
• Tenants — The number of tenants and the function(s) each tenant performs will determine how many Tenant Administrator users you will need and, by extension, the number of Member users you will need for each tenant. The reverse is also true, because the number and functions of users needing to run jobs can influence how you create tenants. For example, different levels of confidentiality might require separate tenants.
• Job functions — The specific work performed by each user will directly impact the EPIC role they receive. For example, a small organization may designate a single user as the Tenant Administrator for multiple tenants, while a large organization may designate multiple Tenant Administrators per tenant.
• Security clearances — You may need to restrict access to information based upon each user's security clearance. This can impact both the tenant(s) a user has access to, and the role that user has within the tenant(s).

Resource management

EPIC manages the pool of physical resources available in the cluster, and allocates those resources to virtual nodes on a first-come, first-served basis. Each tenant may be assigned a quota that limits the total resources available for use by the nodes within that tenant. A tenant's ability to utilize its entire quota of resources is limited by the availability of physical resources. QoS can be controlled at the tenant level.

Each cluster requires CPU, RAM, and storage resources in order to run, based upon the number and flavor of its component nodes, and any quotas assigned to the tenant. If available, GPU resources can also be allocated. Cluster creation can only proceed if the total resources assigned to that cluster will not cause the sum of all resources used by all of the clusters in that tenant to exceed the tenant quota, and if the needed resources are currently available. A simplified sketch of this admission rule follows.
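The admission rule just described reduces to two checks per resource: the tenant quota and current physical availability. The following is a simplified sketch of that logic, with invented data structures; it is not EPIC's actual implementation.

```python
# Simplified sketch of the cluster-admission rule described above. Each
# argument maps a resource name (cpu, ram_gb, gpu) to an amount; the
# structures and names are invented for illustration.
def can_create_cluster(requested, tenant_usage, tenant_quota, free_in_system):
    for resource, amount in requested.items():
        quota = tenant_quota.get(resource)  # None models an undefined (unlimited) quota
        if quota is not None and tenant_usage.get(resource, 0) + amount > quota:
            return False  # the cluster would push the tenant past its quota
        if amount > free_in_system.get(resource, 0):
            return False  # the resources are not physically available right now
    return True

# A tenant with a 100-vCPU quota and 60 vCPUs in use asks for 48 more:
print(can_create_cluster({"cpu": 48}, {"cpu": 60}, {"cpu": 100}, {"cpu": 200}))
# -> False, because 60 + 48 exceeds the 100-vCPU quota
```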
Node flavors

EPIC uses virtual node flavors to define the processor, RAM, and root disk storage used by each virtual node. For example, if the flavor small specifies a single vCPU core, 3 GB of RAM, 30 GB of disk, and two GPUs, then all virtual nodes created with the small flavor will have those specifications. EPIC creates a default set of flavors (such as Small, Medium, and Large) during installation.

The Tenant Administrator should create flavors with virtual hardware specifications appropriate to the clusters that tenant members will create. Application characteristics will guide these choices, particularly the minimum virtual hardware requirements per node. Using nodes with excessively large specifications will waste resources and count toward a tenant's quota. It is therefore important to define a range of flavor choices that closely match user requirements.

The Tenant Administrator may freely edit or delete these flavors. When editing or deleting a flavor:
• If you edit or delete an existing flavor, then all virtual nodes using that flavor will continue using the flavor as specified before the change or deletion. EPIC displays the flavor definition being used by clusters.
• You may delete all of the flavors defined within your EPIC installation; however, if you do this, then you will be unable to create any clusters until you create at least one new flavor.
• You may specify an alternative root disk size when creating or editing a flavor. This size overrides the default size specified by the image in the App Store. Specifying a root disk size that is smaller than the minimum size indicated by a given image will prevent you from instantiating that image on a cluster that uses that flavor. Creating a larger root disk size will slow down cluster creation, but may be necessary when the cluster runs an application that uses a local file system.

Resource allocation

EPIC models vCPU cores as follows (a worked example follows this list):
• The number of available vCPU cores is the number of physical CPU cores multiplied by the CPU allocation ratio specified by the Platform Administrator. For example, if the hosts have 40 physical CPU cores and the Platform Administrator specifies a CPU allocation ratio of 3, then EPIC will display a total of 120 available cores. EPIC allows an unlimited number of vCPU cores to be allocated to each tenant. The collective core usage for all nodes within a tenant will be constrained by either the tenant's assigned quota or the available cores in the system, whichever limit is reached first. The tenant quotas and the CPU allocation ratio act together to prevent tenant members from overloading the system's CPU resources.
• When two nodes are assigned to the same host and contend for the same physical CPU cores, EPIC allocates resources to those nodes in a ratio determined by their vCPU core counts. For example, a node with 8 cores will receive twice as much CPU time as a node with 4 cores.
• The Platform Administrator can also specify a QoS multiplier for each tenant. In the case of CPU resource contention, the node core count is multiplied by the tenant QoS multiplier when determining the CPU time it will be granted. For example, a node with 8 cores in a tenant with a QoS multiplier of 1 will receive the same CPU time as a node with 4 cores in a tenant with a QoS multiplier of 2. The QoS multiplier describes relative tenant priorities when CPU resource contention occurs; it does not affect the overall cap on CPU load established by the CPU allocation ratio and tenant quotas.
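The vCPU arithmetic above is simple enough to verify directly. This sketch reproduces the numbers quoted in the text; the function name is illustrative only.

```python
# Worked version of the vCPU availability and contention arithmetic above.
physical_cores = 40
cpu_allocation_ratio = 3
available_vcpus = physical_cores * cpu_allocation_ratio
print(available_vcpus)  # 120 available cores, as in the example above

# Under contention, CPU time is granted in proportion to
# (node vCPU count) x (tenant QoS multiplier).
def cpu_weight(node_vcpus, qos_multiplier):
    return node_vcpus * qos_multiplier

# An 8-core node at QoS 1 and a 4-core node at QoS 2 get equal CPU time:
print(cpu_weight(8, 1) == cpu_weight(4, 2))  # True
```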
EPIC models RAM as follows:
• The total amount of available RAM is equal to the amount of unreserved RAM in the EPIC platform. Unreserved RAM is the amount of RAM remaining after reserving some memory in each host for EPIC services. For example, if your EPIC platform consists of four hosts that each have 128 GB of physical RAM with 110 GB of unreserved RAM, then the total amount of RAM available to share among EPIC tenants will be 440 GB.
• EPIC allows an unlimited amount of RAM to be allocated to each tenant. The collective RAM usage for all nodes within a tenant will be constrained by either the tenant's assigned quota or the available RAM in the system, whichever limit is reached first.

Root disk storage space is allocated from the disk(s) on each Worker Node that are assigned as Node Storage disks. Each virtual node consumes node storage space equivalent to its root disk size on the Worker Node where that virtual node is placed.
If the EPIC platform includes compatible GPU devices, then EPIC models those GPU devices as follows:
• The total number of available GPU resources is equal to the number of physical GPU devices in the EPIC platform. For example, if your EPIC platform consists of four hosts that each have 8 physical GPU devices, then the EPIC platform will have a total of 32 GPU devices available to share among EPIC tenants.
• EPIC allows an unlimited amount of GPU resources to be allocated to each tenant. The collective GPU resource usage for all virtual nodes within a tenant will be constrained by either the tenant's assigned quota or the available GPU devices in the system, whichever limit is reached first.

GPU devices are expensive resources, so EPIC handles virtual node/container placement as follows (a sketch of this preference follows the list):
• If a virtual node does not require GPU devices, then EPIC attempts to place that node on a host that does not have any GPU devices installed.
• If a virtual node does require GPU resources, then EPIC attempts to place that container in such a way as to maximize GPU resource utilization on each host, to reduce or eliminate wasted resources.
• In either case, EPIC attempts to place a virtual node on a host with available resources, and placement will fail if resources are unavailable.
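The placement preference in the list above can be expressed as a short host-selection routine. This is a deliberately simplified sketch of the described behavior, with invented host records; EPIC's real scheduler is more sophisticated.

```python
# Simplified sketch of the GPU-aware placement preference described above.
# Hosts are modeled as dicts purely for illustration.
def choose_host(hosts, gpus_needed):
    candidates = [h for h in hosts if h["free_gpus"] >= gpus_needed]
    if not candidates:
        return None  # placement fails: no host has the needed resources
    if gpus_needed == 0:
        # Prefer hosts without GPUs so accelerators stay free for GPU work.
        non_gpu = [h for h in candidates if h["total_gpus"] == 0]
        return (non_gpu or candidates)[0]
    # Pack GPU work onto the host with the fewest free GPUs that still fits,
    # which maximizes utilization and reduces stranded accelerators.
    return min(candidates, key=lambda h: h["free_gpus"])

hosts = [{"name": "w1", "total_gpus": 0, "free_gpus": 0},
         {"name": "w2", "total_gpus": 8, "free_gpus": 3}]
print(choose_host(hosts, 2)["name"])  # w2: the only host with 2 free GPUs
print(choose_host(hosts, 0)["name"])  # w1: keeps w2's GPUs free
```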
Quotas

Assigning a quota of resources to a tenant does not reserve those resources for that tenant when that tenant is idle (not running one or more clusters). This means that a tenant may not actually be able to acquire system resources up to the limit of its configured quota. You may assign a quota for any amount of resources to any tenant(s) regardless of the actual number of available system resources. A configuration where the total allowed tenant resources exceed the current amount of system resources is called over-provisioning. Over-provisioning occurs when one or more of the following conditions are met:
• You only have one tenant, which has quotas that either exceed the system resources or are undefined. This tenant will only be able to use the resources that are actually available to the EPIC platform. This arrangement is just a convenience to make sure that the one tenant is always able to fully utilize the platform, even if you add more hosts in the future.
• You have multiple tenants where none have overly large or undefined quotas, but where the sum of their quotas exceeds the resources available to the EPIC platform. In this case, you are not expecting all tenants to attempt to use all their allocated resources simultaneously. Still, you have given each tenant the ability to claim more than its "fair share" of the EPIC platform's resources when these extra resources are available. In this case, you must balance the need for occasional bursts of usage against the need to restrict how much a "greedy" tenant can consume. A larger quota gives more freedom for burst consumption of unused resources while also expanding the potential for one tenant to prevent other tenants from fully utilizing their quotas.
• You have multiple tenants where one or more has overly large and/or undefined quotas. Such tenants are trusted or prioritized to be able to claim any free resources. However, they cannot consume resources being used by other tenants.

Storage access and management

EPIC supports multiple forms of storage management and access for local and remote data. Data sources include DataTaps for remote storage, per-tenant shared storage, and per-node storage.

DataTaps

DataTaps expand access to shared data by specifying a named path to a specified storage resource. Big Data jobs within EPIC virtual clusters can then access paths within that resource using that name. This allows you to run jobs using your existing data systems without the need to make copies of your data.
Tenant Administrator users can quickly and easily build, edit, and remove DataTaps. Tenant Member users can use DataTaps by name. DataTaps can be used to access remote NFS servers, HDFS, or HDFS with Kerberos. The type of remote storage is completely transparent to the user job or process using the DataTap.

Each DataTap includes the following properties:
• Name — Unique name for each DataTap.
• Description — Brief description of the DataTap, such as the type of data or the purpose of the DataTap.
• Type — Type of file system used by the shared storage resource associated with the DataTap (HDFS or NFS).
• Connection details — Hostname and other protocol-specific connection details, including authentication.

The storage pointed to by a BlueData DataTap can be accessed by a MapReduce job (or by any other Hadoop- or Spark-based activity in an EPIC virtual node) by using a URI that includes the name of the DataTap, as in the sketch below.
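For example, a Spark job running in an EPIC virtual node could read from a DataTap with a dtap:// URI. In this sketch the DataTap name ("sales") and path are invented, and the code assumes it runs inside a virtual cluster where the DataTap filesystem driver is available.

```python
# Hypothetical example of addressing a DataTap from PySpark inside an EPIC
# virtual node. The DataTap name and path are invented for illustration;
# the dtap:// scheme only resolves where BlueData's DataTap driver is
# installed, i.e., inside an EPIC virtual cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datatap-demo").getOrCreate()

# Read a CSV file through a DataTap named "sales" and summarize it. The
# same URI form works for MapReduce and other Hadoop-compatible tools.
df = spark.read.csv("dtap://sales/2019/q1/orders.csv", header=True)
df.groupBy("region").count().show()
```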
DataTaps can be used to access Dell EMC Isilon clusters. Most Big Data applications will probably use the HDFS interface to Isilon, but NFS is also available.

DataTaps exist on a per-tenant basis. This means that a DataTap created for Tenant A cannot be used by Tenant B. You may, however, create a DataTap for Tenant B with the exact same properties as its counterpart for Tenant A, thus allowing both tenants to use the same shared network resource. This allows jobs in different tenants to access the same storage simultaneously. Further, multiple jobs within a tenant may use a given DataTap simultaneously. While such sharing can be useful, be aware that the same cautions and restrictions apply to these use cases as for other types of shared storage: multiple jobs modifying files at the same location may lead to file access errors and/or unexpected job results.

Users who have a Tenant Administrator role may view and modify detailed DataTap information. Members may only view general DataTap information and are unable to create, edit, or remove a DataTap.

Tenant storage

EPIC supports an optional storage location that is shared by all nodes within a given tenant, called Tenant Storage. The Platform Administrator configures tenant storage while installing EPIC and can change it at any time thereafter. Tenant storage can be configured to use either a local HDFS installation or a remote HDFS or NFS system. Alternatively, you can create a tenant without dedicated storage.

When a new tenant is created, that tenant automatically receives a DataTap called TenantStorage that points at a unique directory within the Tenant Storage space. This DataTap can be used in the same manner as other DataTaps, but it cannot be edited or deleted. The TenantStorage DataTap points at the top-level directory that a tenant can access within the tenant storage service. The Tenant Administrator can create or edit additional DataTaps that point at or below that directory; however, one cannot create or edit a DataTap that points outside the tenant storage on that particular storage service.

If the tenant storage is based on a local HDFS, then the Platform Administrator can specify a storage quota for each tenant. EPIC uses the HDFS back-end to enforce this quota, meaning that the quota applies to storage operations that originate from either the EPIC DataTap browser or the nodes within that tenant.

Node storage

EPIC supports node storage that can be used for applications that require local disk storage. Node storage is allocated from each host in the EPIC platform and is used for the volumes that back the local storage for each virtual node. A tenant can optionally be assigned a quota for how much storage the nodes in that tenant can consume.
Chapter 4: Cluster architecture

Topics:
• Cluster architecture
• Node roles definitions
• Sizing summary
• Rack layout

Several node types, each with specific functions, are included in this solution. This chapter provides detailed definitions of those node types.
Cluster architecture

Figure 3 illustrates the roles for the nodes in a basic cluster.

Figure 3: Solution Cluster architecture

The cluster environment consists of multiple software services running on multiple physical server nodes. The implementation divides the server nodes into several roles, and each node has a configuration optimized for its role in the cluster. The physical server configurations are divided into three broad classes:
• Worker Nodes handle the execution of the tenant containers and provide storage.
• Controller Nodes support services needed for the cluster operation.
• Gateway Nodes provide an interface between the cluster and the existing network.

A high-performance network fabric connects the cluster nodes together, and isolates the core cluster network from external and management functions. The minimum configuration supported is thirteen cluster nodes. The nodes have the following roles:

Table 1: Cluster node roles

Physical node | Hardware configuration
Administration Node | Administration
Gateway Node 1 | Gateway
Gateway Node 2 | Gateway
Controller Node 1 | High density worker
Controller Node 2 | High density worker
Controller Node 3 | High density worker
Worker Node 1 | Worker - High density or GPU accelerated
Worker Node 2 | Worker - High density or GPU accelerated
Worker Node 3 | Worker - High density or GPU accelerated
Worker Node 4 | Worker - High density or GPU accelerated
Worker Node 5 | Worker - High density or GPU accelerated
Worker Node 6 | Worker - High density or GPU accelerated
Worker Node 7 | Worker - High density or GPU accelerated

Node roles definitions

• Administration Node — Provides cluster deployment and management capabilities. This node hosts the deployment software and an instance of OpenManage Enterprise.
• Gateway Node 1, Gateway Node 2 — Provide an interface for control traffic between existing network infrastructure and service end points on virtual clusters. These nodes are exposed on the main network, and proxy incoming IP network traffic between the primary LAN IP addresses and the private cluster network addresses. The Gateway Nodes act as a high availability pair with round-robin DNS entries for their network IP addresses.
• Controller Node 1 — Provides management and control of all the hosts in the cluster, through the EPIC Controller service. The EPIC web interface runs on this host.
• Controller Node 2 — Provides a backup instance of the Controller service, called the Shadow Controller, for High Availability. If Controller Node 1 fails, then EPIC will fail over to this node.
• Controller Node 3 — Provides an arbiter service to facilitate controller High Availability.
• Worker Nodes — Provide the primary compute and storage resources for the cluster environment.

Note: Controller Nodes 1, 2, and 3 also act as Worker Nodes, and their resources are also available for use by EPIC. In larger deployments, Controller Nodes 1 and 2 can be dedicated to the controller function.

Sizing summary

The minimum configuration supported is thirteen nodes:
• One (1) Administration Node
• Three (3) Controller Nodes
• Seven (7) Worker Nodes
• Two (2) Gateway Nodes

Table 2 shows the recommended number of Worker Nodes or Controller Nodes per pod, and pods per cluster, for 25 GbE clusters using the S5048F-ON switch model. Table 3 shows some alternatives for cluster sizing with different bandwidth oversubscription ratios. When determining actual rack space requirements, the Administration Node and Gateway Nodes should also be included.

Table 2: Recommended cluster size - 25 GbE

Nodes per rack | Nodes per pod | Pods per cluster | Nodes per cluster | Bandwidth oversubscription
12 | 36 | 8 | 288 | 2.25 : 1
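The oversubscription column is the ratio of total node bandwidth entering the pod switches to the bandwidth leaving them. The sketch below reproduces Table 2's 2.25 : 1 figure under assumed link counts (dual 25 GbE ports per node and 800 Gbps of pod uplink capacity); the network architecture chapter defines the actual fabric.

```python
# Rough sketch of the bandwidth oversubscription arithmetic behind Table 2.
# The per-node link count and pod uplink capacity are assumptions for
# illustration, chosen to reproduce the published 2.25 : 1 ratio.
nodes_per_pod = 36
links_per_node = 2        # assumed dual 25 GbE ports per node
link_gbps = 25
pod_uplink_gbps = 800     # assumed aggregate uplink capacity per pod

downlink_gbps = nodes_per_pod * links_per_node * link_gbps  # 1800 Gbps
ratio = downlink_gbps / pod_uplink_gbps                     # 2.25
print(f"Oversubscription: {ratio:.2f} : 1")
```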
Table 3: Alternative cluster sizes - 25 GbE

Nodes per rack | Nodes per pod | Pods per cluster | Nodes per cluster | Bandwidth oversubscription
12 | 48 | 8 | 384 | 3 : 1
12 | 36 | 10 | 360 | 3 : 1
12 | 24 | 16 | 384 | 3 : 1

Power and cooling will typically be the primary constraints on rack density. However, a rack is a potential fault zone, and rack density will affect overall cluster reliability, especially for smaller clusters. Table 4 shows some possible scenarios based on typical data center constraints.

Table 4: Rack and pod density scenarios

Server platform | Nodes per rack | Racks per pod | Comments
Dell EMC PowerEdge R740xd | 12 | 3 | Typical configuration, requiring less than 10 kW of power per rack. Good rack-level fault zone isolation.
Dell EMC PowerEdge R740xd | 10 | 2 | Smaller rack and pod fault zones, with slightly higher bandwidth oversubscription of 2.5 : 1.

Rack layout

Figure 4 illustrates a typical single rack installation.