Usher: An Extensible Framework for Managing Clusters of Virtual Machines
Marvin McNett, Diwaker Gupta, Amin Vahdat, and Geoffrey M. Voelker – University of California, San Diego

ABSTRACT

Usher is a virtual machine management system designed to impose few constraints upon the computing environment under its management. Usher enables administrators to choose how their virtual machine environment will be configured and the policies under which they will be managed. The modular design of Usher allows for alternate implementations for authentication, authorization, infrastructure handling, logging, and virtual machine scheduling. The design philosophy of Usher is to provide an interface whereby users and administrators can request virtual machine operations while delegating administrative tasks for these operations to modular plugins. Usher's implementation allows for arbitrary action to be taken for nearly any event in the system. Since July 2006, Usher has been used to manage virtual clusters at two locations under very different settings, demonstrating the flexibility of Usher to meet different virtual machine management requirements.

Introduction

Usher is a cluster management system designed to substantially reduce the administrative burden of managing cluster resources while simultaneously improving the ability of users to request, control, and customize their resources and computing environment. System administrators of cluster computing environments face a number of imposing challenges. Different users within a cluster can have a wide range of computing demands, spanning general best-effort computing needs, batch scheduling systems, and complete control of dedicated resources. These resource demands vary substantially over time in response to changes in workload, user base, and failures. Furthermore, users often need to customize their operating system and application environments, substantially increasing configuration and maintenance tasks. Finally, clusters rarely operate in isolated administrative environments, and must be integrated into existing authentication, storage, network, and host address and name service infrastructure.

Usher balances these imposing requirements using a combination of abstraction and architecture. Usher provides a simple abstraction of a logical cluster of virtual machines, or virtual cluster. Usher users can create any number of virtual clusters (VCs) of arbitrary size, while Usher multiplexes individual virtual machines (VMs) on available physical machine hardware. By decoupling logical machine resources from physical machines, users can create and use machines according to their needs rather than according to assigned physical resources.

Architecturally, Usher is designed to impose few constraints upon the computing environment under its management. No two sites have identical hardware and software configurations, user and application requirements, or service infrastructures. To facilitate its use in a wide range of environments, Usher combines a core set of interfaces that implement basic mechanisms, clients for using these mechanisms, and a framework for expressing and customizing administrative policies in extensible modules, or plugins.

The Usher core implements basic virtual cluster and machine management mechanisms, such as creating, destroying, and migrating VMs. Usher clients use this core to manipulate virtual clusters. These clients serve as interfaces to the system for users as well as for use by higher-level cluster software. For example, an Usher client called ush provides an interactive command shell for users to interact with the system. We have also implemented an adapter for a high-level execution management system [6], which operates as an Usher client, that creates and manipulates virtual clusters on its own behalf.

Usher supports customizable modules for two important purposes. First, these modules enable Usher to interact with broader site infrastructure, such as authentication, storage, and host address and naming services. Usher implements default behavior for common situations, e.g., newly created VMs in Usher can use a site's DHCP service to obtain addresses and domain names. Additionally, sites can customize Usher to implement more specialized policies; at UCSD, an Usher VM identity module allocates IP address ranges to VMs within the same virtual cluster.

Second, pluggable modules enable system administrators to express site-specific policies for the placement, scheduling, and use of VMs. As a result, Usher allows administrators to decide how to configure their virtual machine environments and determine the appropriate management policies. For instance, to support a general-purpose computing environment, administrators can install an available Usher scheduling and placement plugin that performs round-robin placement of VMs across physical machines and simple rebalancing in response to the addition or removal of virtual and physical machines. With this plugin, users can dynamically add or remove VMs from VCs at any time without having to specify service level agreements (SLAs) [9, 17, 22], write configuration files [10], or obtain leases on resources [12, 14]. With live migration of VMs, Usher can dynamically and transparently adjust the mapping of virtual to physical machines to adapt to changes in load among active VMs or the working set of active VMs, exploit affinities among VMs (e.g., to enhance physical page sharing [20]), or add and remove hardware with little or no interruption.
Usher enables other powerful policies to be expressed, such as power management (reduce the number of active physical machines hosting virtual clusters), distribution (constrain virtual machines within a virtual cluster to run on separate nodes), and resource guarantees. Another installation of Usher uses its cluster to support scientific batch jobs running within virtual clusters, guarantees resources to those jobs when they run, and implements a load-balancing policy that migrates VMs in response to load spikes [13].

Usher is a fully functional system. It has been installed in cluster computing environments at UCSD and the Russian Research Center in Kurchatov, Russia. At UCSD, Usher has been in production use since January 2007. It has managed up to 142 virtual machines in 26 virtual clusters across 25 physical machines. The Usher implementation is sufficiently reliable that we are now migrating the remainder of our user base from dedicated physical machines to virtual clusters, and Usher will soon manage all 130 physical nodes in our cluster. In the rest of this paper we describe the design and implementation of Usher, as well as our experiences using it.

Related Work

Since the emergence of widespread cluster computing over a decade ago [8, 16], many cluster configuration and management systems have been developed to achieve a range of goals. These goals naturally influence individual approaches to cluster management. Early configuration and management systems, such as Galaxy [19], focus on expressive and scalable mechanisms for defining clusters for specific types of service, and physically partition cluster nodes among those types.

More recent systems target specific domains, such as Internet services, computational grids, and experimental testbeds, that have strict workload or resource allocation requirements. These systems support services that express explicit resource requirements, typically in some form of service level agreement (SLA). Services provide their requirements as input to the system, and the system allocates its resources among services while satisfying the constraints of the SLA requirements.

For example, Océano provides a computing utility for e-commerce [9]. Services formally state their workload performance requirements (e.g., response time), and Océano dynamically allocates physical servers in response to changing workload conditions to satisfy such requirements. Rocks and Rolls provide scalable and customizable configuration for computational grids [11, 18], and Cluster-on-Demand (COD) performs resource allocation for computing utilities and computational grid services [12]. COD implements a virtual cluster abstraction, where a virtual cluster is a disjoint set of physical servers specifically configured to the requirements of a particular service, such as a local site component of a larger wide-area computational grid. Services specify and request resources to a site manager and COD leases those resources to them. Finally, Emulab provides a shared network testbed in which users specify experiments [21]. An experiment specifies network topologies and characteristics as well as node software configurations, and Emulab dedicates, isolates, and configures testbed resources for the duration of the experiment.

The recent rise in virtual machine monitor (VMM) popularity has naturally led to systems for configuring and managing virtual machines. For computational grid systems, for example, Shirako extends Cluster-on-Demand by incorporating virtual machines to further improve system resource multiplexing while satisfying explicit service requirements [14], and VIOLIN supports both intra- and inter-domain migration to satisfy specified resource utilization limits [17]. Sandpiper develops policies for detecting and reacting to hotspots in virtual cluster systems while satisfying application SLAs [22], including determining when and where to migrate virtual machines, although again under the constraints of meeting the stringent SLA requirements of a data center.

On the other hand, Usher provides a framework that allows system administrators to express site-specific policies depending upon their needs and goals. By default, the Usher core provides, in essence, a general-purpose, best-effort computing environment. It imposes no restrictions on the number and kind of virtual clusters and machines, and performs simple load balancing across physical machines. We believe this usage model is important because it is widely applicable and natural to use. Requiring users to explicitly specify their resource requirements for their needs, for example, can be awkward and challenging since users often do not know when or for how long they will need resources. Further, allocating and reserving resources can limit resource utilization; guaranteed resources that go idle cannot be used for other purposes. However, sites can specify more elaborate policies in Usher for controlling the placement, scheduling, and migration of VMs if desired. Such policies can range from batch schedulers to allocation of dedicated physical resources.
In terms of configuration, Usher shares many of the motivations that inspired the Manage Large Networks (MLN) tool [10]. The goal of MLN is to enable administrators and users to take advantage of virtualization while easing administrator burden. Administrators can use MLN to configure and manage virtual machines and clusters (distributed projects), and it supports multiple virtualization platforms (Xen and User-Mode Linux). MLN, however, requires administrators to express a number of static configuration decisions through configuration files (e.g., physical host binding, number of virtual hosts), and supports only coarse granularity dynamic reallocation (manually by the administrator). Usher configuration is interactive and dynamic, enables users to create and manage their virtual clusters without administrative intervention, and enables a site to globally manage all VMs according to cluster-wide policies.

XenEnterprise [5] from XenSource and VirtualCenter [4] from VMware are commercial products for managing virtual machines on cluster hardware from the respective companies. XenEnterprise provides a graphical administration console, Virtual Data Center, for creating, managing, and monitoring Xen virtual machines. VirtualCenter monitors and manages VMware virtual machines on a cluster as a data center, supporting VM restart when nodes fail and dynamic load balancing through live VM migration. Both list interfaces for external control, although it is not clear whether administrators can implement arbitrary plugins and policies for integrating the systems into existing infrastructure, or controlling VMs in response to arbitrary events in the system. In this regard, VMWare's Infrastructure Management SDK provides functionality similar to that provided by the Usher client API. However, this SDK does not provide the tight integration with VMWare's centralized management system that plugins do for the Usher system. Also, of course, these are all tied to managing a single VM product, whereas Usher is designed to interface with any virtualization platform that exports a standard administrative interface.

System Architecture

This section describes the architecture of Usher. We start by briefly summarizing the goals guiding our design, and then present a high-level overview of the system. We then describe the purpose and operation of each of the various system components, and how they interact with each other to accomplish their tasks. We end with a discussion of how the Usher system accommodates software and hardware failures.

Design Goals

As mentioned, no two sites have identical hardware and software configurations, user and application requirements, or service infrastructures. As a result, we designed Usher as a flexible platform for constructing virtual machine management installations customized to the needs of a particular site.

To accomplish this goal, we had two design objectives for Usher. First, Usher maintains a clean separation between policy and mechanism. The Usher core provides a minimal set of mechanisms essential for virtual machine management. For instance, the Usher core has mechanisms for placing and migrating virtual machines, while administrators can install site-specific policy modules that govern where and when VMs are placed.

Second, Usher is designed for extensibility. The Usher core provides three ways to extend functionality, as illustrated in Figure 1. First, Usher provides a set of hooks to integrate with existing infrastructure. For instance, while Usher provides a reference implementation for use with the Xen VMM, it is straightforward to write stubs for other virtualization platforms. Second, developers can use a Plugin API to enhance Usher functionality. For example, plugins can provide database functionality for persistently storing system state using a file-backed database, or provide authentication backed by local UNIX passwords. Third, Usher provides a Client API for integrating with user interfaces and third-party tools, such as the Usher command-line shell and the Plush execution management system (discussed in the Applications subsection of the Implementation section).

[Figure 1: Usher interfaces.]

Usher Overview

A running Usher system consists of three main components: local node managers (LNMs), a centralized controller, and clients. A client consists of an application that utilizes the Usher client library to send virtual machine management requests to the controller. We have written a few applications that import the Usher client library for managing virtual machines (a shell and XML-RPC server) with more under development (a web frontend and command line suite).

Figure 2 depicts the core components of an Usher installation. One LNM runs on each physical node and interacts directly with the VMM to perform management operations such as creating, deleting, and migrating VMs on behalf of the controller. The local node managers also collect resource usage data from the VMMs and monitor local events. LNMs report resource usage updates and events back to the controller for use by plugins and clients.
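Usher's components are implemented in Python on top of the Twisted framework, as described later in the Implementation section. As a rough, illustrative sketch (not Usher's actual code), periodic resource reporting from an LNM to the controller might be structured along the following lines; the controller reference, the remote method name, and the usage record layout are assumptions made for this example.

    # Illustrative sketch of an LNM pushing periodic usage reports to the
    # controller with Twisted's LoopingCall. The controller reference, the
    # remote method name, and the usage record layout are assumptions made
    # for this example; they are not Usher's actual interfaces.
    from twisted.internet import task


    class NodeReporter(object):
        def __init__(self, controller, interval=60):
            self.controller = controller   # remote reference to the controller
            self.interval = interval       # seconds between reports

        def collect_usage(self):
            # A real LNM would query the local VMM for CPU, network, disk,
            # and memory figures; this placeholder just returns zeros.
            return {'cpu': 0.0, 'net_rx': 0, 'net_tx': 0, 'disk_io': 0, 'mem': 0}

        def start(self):
            # Schedule _report on a fixed interval without blocking the
            # event loop that services other LNM duties.
            self._loop = task.LoopingCall(self._report)
            self._loop.start(self.interval)

        def _report(self):
            d = self.controller.callRemote('report_usage', self.collect_usage())
            d.addErrback(lambda failure: failure.printTraceback())

The same non-blocking, periodic pattern also fits the scanning for missing or unexpected VMs described below.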
The controller is the central component of the Usher system. It receives authenticated requests from clients and issues authorized commands to the LNMs. It also communicates with the LNMs to collect usage data and manage virtual machines running on each physical node. The controller provides event notification to clients and plugins registered to receive notification for a particular event (e.g., a VM has started, been destroyed, or changed state). Plugin modules can perform a wide range of tasks, such as maintaining persistent system-wide state information, performing DDNS updates, or doing external environment preparation and cleanup.

The client library provides an API for applications to communicate with the Usher controller. Essentially, clients submit requests to the controller when they need to manipulate their VMs or request additional VMs. The controller can grant or deny these requests as its operational policy dictates. One purpose of clients is to serve as the user interface to the system, and users use clients to manage their VMs and monitor system state. More generally, arbitrary applications can use the client library to register callbacks for events of interest in the Usher system.

Typically, a few services also support a running Usher system. Depending upon the functionality desired and the infrastructure provided by a particular site, these services might include a combination of the following: a database server for maintaining state information or logging, a NAS server to serve VM filesystems, an authentication server to provide authentication for Usher and VMs created by Usher, a DHCP server to manage IP addresses, and a DNS server for name resolution of all Usher created VMs. Note that an administrator may configure Usher to use any set of support services desired, not necessarily the preceding list.

Usher Components

As noted earlier, Usher consists of three main components, local node managers on each node, a central controller, and Usher clients.

[Figure 2: Usher components.]

Local Node Managers

The local node managers (LNMs) operate closest to the hardware. As shown in Figure 2, LNMs run as servers on each physical node in the Usher system. The LNMs have three major duties: i) to provide a remote API to the controller for managing local VMs, ii) to collect and periodically upload local resource usage data to the controller, and iii) to report local events to the controller.

Each LNM presents a remote API to the controller for manipulating VMs on its node. Upon invoking an API method, the LNM translates the operation into the equivalent operation of the VM management API exposed by the VMM running on the node. Note that all LNM API methods are asynchronous so that the controller does not block waiting for the VMM operation to complete. We emphasize that this architecture abstracts VMM-specific implementations – the controller is oblivious to the specific VMMs running on the physical nodes as long as the LNM provides the remote API implementation. As a result, although our implementation currently uses the Xen VMM, Usher can target other virtualization platforms. Further, Usher is capable of managing VMs running any operating system supported by the VMMs under its management.

As the Usher system runs, VM and VMM resource usage fluctuates considerably. The local node manager on each node monitors these fluctuations and reports them back to the controller. It reports resource usage of CPU utilization, network receive and transmit loads, disk I/O activity, and memory usage in 1, 5, and 15-minute averages.

In addition to changes in resource usage, VM state changes sometimes occur unexpectedly. VMs can crash or even unexpectedly appear or disappear from the system. Detecting these and other related events requires both careful monitoring by the local node managers as well as VMM support for internal event notification. Administrators can set a tunable parameter for how often the LNM scans for missing VMs or unexpected VMs. The LNM will register callbacks with the VMM platform for other events, such as VM crashes; if the VMM does not support such callbacks, the LNM will periodically scan to detect these events.
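To make the VMM-neutral structure concrete, the translation layer can be pictured as a small backend interface like the sketch below. The class and method names are illustrative assumptions, not Usher's actual VMM interface; an installation targeting a hypervisor other than Xen would supply an equivalent binding without changes to the controller.

    # Sketch of a VMM-neutral backend an LNM might drive; names are
    # illustrative, not Usher's actual interface.
    class VMMBackend(object):
        """Operations an LNM needs from whatever hypervisor runs locally."""

        def create(self, vm_config):
            raise NotImplementedError

        def destroy(self, vm_name):
            raise NotImplementedError

        def migrate(self, vm_name, destination_host):
            raise NotImplementedError

        def list_vms(self):
            raise NotImplementedError

        def resource_usage(self, vm_name):
            raise NotImplementedError


    class XenBackend(VMMBackend):
        """Example binding that would translate each call into the
        corresponding Xen management operation."""

        def create(self, vm_config):
            # e.g., build a Xen domain definition and start the domain
            pass

        def destroy(self, vm_name):
            # e.g., hard-stop the named Xen domain
            pass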
Usher Controller

The controller is the center of the Usher system. It can either be bootstrapped into a VM running in the system, or run on a separate server. The controller provides the following:

• User authentication
• VM operation request API
• Global state maintenance
• Consolidation of LNM monitoring data
• Event notification

User authentication: Usher uses SSL-encrypted user authentication. All users of the Usher system must authenticate before making requests of the system. Administrators are free to use any of the included authentication modules for use with various authentication backends (e.g., LDAP), or implement their own. An administrator can register multiple authentication modules, and Usher will query each in turn. This support is useful, for instance, to provide local password authentication if LDAP or NIS authentication fails. After receiving a user's credentials, the controller checks them against the active authentication module chain. If one succeeds before reaching the end of the chain, the user is authenticated. Otherwise, authentication fails and the user must retry.

VM operation request API: A key component of the controller is the remote API for Usher clients. This API is the gateway into the system for VM management requests (via RPC) from connecting clients. Typically, the controller invokes an authorization plugin to verify that the authenticated user can perform the operation before proceeding. The controller may also invoke other plugins to do preprocessing such as checking resource availability and making placement decisions at this point. Usher calls any plugin modules registered to receive notifications for a particular request once the controller receives such a request.

Usher delegates authorization to plugin modules so that administrators are free to implement any policy or policies they wish and stack and swap modules as the system runs. In addition, an administrator can configure the monitoring infrastructure to automatically swap or add policies as the system runs based upon current system load, time of day, etc. In its simplest form, an authorization policy module openly allows users to create and manipulate their VMs as they desire or view the global state of the system. More restrictive policies may limit the number of VMs a user can start, prohibit or limit migration, or restrict what information the system returns upon user query.

Once a request has successfully traversed the authorization and preprocessing steps, the controller executes it by invoking asynchronous RPCs to each LNM involved. As described above, it is up to any plugin policy modules to authorize and check resource availability prior to this point. Depending upon the running policy, the authorization and preprocessing steps may alter a user request before the controller executes it. For example, the policy may be to simply "do the best I can" to honor a request when it arrives. If a user requests more VMs than allowed, this policy will simply start as many VMs as are allowed for this user, and report back to the client what action was taken. Finally, if insufficient resources are available to satisfy an authorized and preprocessed request, the controller will attempt to fulfill the request until resources are exhausted.

Global state maintenance: The controller maintains a few lists which constitute the global state of the system. These lists link objects encapsulating state information for running VMs, running VMMs, and instantiated virtual clusters (VCs). A virtual cluster in Usher can contain an arbitrary set of VMs, and administrators are free to define VCs in any way suitable to their computing environment.

In addition to the above lists, the controller maintains three other lists of VMs: lost, missing, and unmanaged VMs. The subtle distinction between lost and missing is that lost VMs are a result of an LNM or VMM failure (the controller is unable to make this distinction), whereas a missing VM is a result of an unexpected VM disappearance as reported by the LNM where the VM was last seen running. A missing VM can be the result of an unexpected VMM error (e.g., we have encountered this case upon a VMM error on migration). Unmanaged VMs are typically a result of an administrator manually starting a VM on a VMM being managed by Usher; Usher is aware of the VM, but is not itself managing it. The list of unmanaged VMs aids resource usage reporting so that Usher has a complete picture of all VMs running on its nodes.

Having the controller maintain system state removes the need for it to query all LNMs in the system for every VM management operation and state query. However, the controller does have to synchronize with the rest of the system, and we discuss synchronization further in the Component Interaction subsection below.

Consolidation of LNM monitoring data: Proper state maintenance relies upon system monitoring. The controller is responsible for consolidating monitoring data sent by the local node managers into a format accessible by the rest of the system. Clients use this data to describe the state of the system to users, and plugins use this data to make policy decisions. For example, plugin modules may use this data to restrict user resource requests based on the current system load, or make VM scheduling decisions to determine where VMs should run.

Event notification: Usher often needs to alert clients and plugin modules when various events in the system occur. Events typically fall into one of three categories:

• VM operation requests
• VM state changes
• Errors and unexpected events

Clients automatically receive notices of state changes of their virtual machines. Clients are free to take any action desired upon notification, and can safely ignore them. Plugin modules, however, must explicitly register with the controller to receive event notifications. Plugins can register for any type of event in the system. For example, a plugin may wish to receive notice of VM operation requests for preprocessing, or error and VM state change events for reporting and cleanup.
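As an illustration of the chained-module behavior described under User authentication above, the controller-side logic amounts to something like the following sketch. The module interface (an authenticate method returning a boolean) is assumed for the example and may differ from Usher's real plugin interface.

    # Sketch of querying a chain of authentication modules in order.
    # The module interface (an authenticate() method returning a boolean)
    # is an assumption for illustration.
    def authenticate_user(username, credentials, auth_modules):
        """Return True as soon as any registered module accepts the
        credentials; otherwise authentication fails."""
        for module in auth_modules:
            try:
                if module.authenticate(username, credentials):
                    return True
            except Exception:
                # An unreachable backend (e.g., an LDAP outage) simply falls
                # through to the next module, such as local passwords.
                continue
        return False

This is the behavior that lets local password authentication act as a fallback when an LDAP or NIS backend is unavailable.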
Clients and the Client API

Applications use the Usher client API to interact with the Usher controller. This API provides methods for requesting or manipulating VMs and performing state queries. We refer to any application importing this API as a client.

The client API provides the mechanism for clients to securely authenticate and connect to the Usher controller. Once connected, an application may call any of the methods provided by the API. All methods are asynchronous, event-based calls to the controller (see the Implementation section below). As mentioned above, connected clients also receive notifications from the controller for state changes to any of their VMs. Client applications can have their own callbacks invoked for these notifications.

Component Interaction

Having described each of the Usher components individually, we now describe how they interact in more detail. We first discuss how the controller and LNMs interact, and then describe how the controller and clients interact. Note that clients never directly communicate with LNMs; in effect, the controller "proxies" all interactions between clients and LNMs.

Controller and LNM Interaction

When an LNM starts or receives a controller recovery notice, it connects to the controller specified in its configuration file. The controller authenticates all connections from LNMs, and encrypts the connection for privacy. Upon connection to the controller, the LNM passes a capability to the controller for access to its VM management API.

Using the capability returned by the LNM, the controller first requests information about the hardware configuration and a list of currently running virtual machines on the new node. The controller adds this information to its lists of running VMs and VMMs in the system. It then uses the capability to assume management of the VMs running on the LNM's node.

The controller also returns a capability back to the LNM. The LNM uses this capability for both event notification and periodic reporting of resource usage back to the controller.

When the controller discovers that a new node already has running VMs (e.g., because the node's LNM failed and restarted), it first determines if it should assume management of any of these newly discovered VMs. The controller makes this determination based solely upon the name of the VM. If the VM name ends with the domain name specified in the controller's configuration file, then the controller assumes it should manage this VM. Any VMs which it should not manage are placed on the unmanaged list discussed above. For any VMs which the controller should manage, the controller creates a VM object instance and places this object on its running VMs list. These instances are sent to the LNMs where the VMs are running and cached there. Whenever an LNM sees that a cached VM object is inconsistent with the corresponding VM running there (e.g., the state of the VM changed), it alerts the controller of this event. The controller then updates the cached object on the LNM. In this way, the update serves as an acknowledgment and the LNM knows that the controller received notice of the event.

Similarly, the controller sends VM object instances for newly created VMs to an LNM before the VM is actually started there. Upon successful return from a start command, the controller updates the VM's cached object state on the LNM. Subsequently, the LNM assumes the responsibility for monitoring and reporting any unexpected state changes back to the controller.

Controller and Client Interaction

Clients to the Usher system communicate with the controller. Before a client can make any requests, it must authenticate with the controller. If authentication succeeds, the controller returns a capability to the client for invoking its remote API methods. Clients use this API to manipulate VMs.

Similar to the local node managers, clients receive cached object instances corresponding to their VMs from the controller upon connection. If desired, clients can filter this list of VMs based upon virtual cluster grouping to limit network traffic. The purpose of the cached objects at the client is twofold. First, they provide a convenient mechanism by which clients can receive notification of events affecting their VMs, since the controller sends updates to each cached VM object when the actual VM is modified. Second, the cached VM objects provide state information to the clients when they request VM operations. With this organization, clients do not have to query the controller about the global state of the system before actually submitting a valid request. For example, a client should not request migration of a non-existent VM, or try to destroy a VM which it does not own. The client library is designed to check for these kinds of conditions before submitting a request. Note that the controller is capable of handling errant requests; this scheme simply offloads request filtering to the client.

The controller is the authority on the global state of the system. When the controller performs an action, it does so based on what it believes is the current global state. The cached state at the client reflects the controller's global view. For this reason, even if the controller is in error, its state is typically used by clients for making resource requests. The controller must be capable of recovering from errors due to inconsistencies between its own view of the global state of the system and the actual global state. These inconsistencies are typically transient (e.g., a late event notification from an LNM), in which case the controller may log an error and return an error message to the client.
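The client-side filtering described above, rejecting obviously invalid requests against the cached view before they ever reach the controller, can be sketched as follows; the cache layout and the exception type are assumptions for the example.

    # Sketch of pre-validating a request against the client's cached VM
    # objects. The controller still re-checks, since the cache may be stale.
    class RequestError(Exception):
        pass


    def validate_destroy(vm_names, cached_vms, user):
        """cached_vms maps VM name -> cached object with an owner field."""
        for name in vm_names:
            vm = cached_vms.get(name)
            if vm is None:
                raise RequestError("unknown VM: %s" % name)
            if vm.owner != user:
                raise RequestError("not the owner of VM: %s" % name)
        # Only now would the client library issue the asynchronous RPC.
        return True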
Failures

As the Usher system runs, it is possible for the controller or any of the local node managers to become unavailable. This situation could be the result of hardware failure, operating system failure, or the server itself failing. Usher has been designed to handle these failures gracefully.

In the event of a controller failure, the LNMs will simply start a listening server for a recovery announcement sent by the controller. When the controller restarts, it sends a recovery message to all previously known LNMs. When the LNMs receive this announcement, they reconnect to the controller. As mentioned in the Controller and LNM Interaction section above, when an LNM connects to the controller, it passes information about its physical parameters and locally running VMs. With this information from all connecting LNMs, the controller recreates the global state of the system. With this design, Usher only requires persistent storage of the list of previously known LNMs rather than the entire state of the system to restore system state upon controller crash or failure.

Since the controller does not keep persistent information about which clients were known to be connected before a failure, it cannot notify clients when it restarts. Instead, clients connected to a controller which fails will attempt to reconnect with timeouts following an exponential backoff. Once reconnected, clients flush their list of cached VMs and receive a new list from the controller.

The controller detects local node manager failures upon disconnect or TCP timeout. When this situation occurs, the controller changes the state of all VMs known to be running on the node with the failed LNM to lost. It makes no out of band attempts to determine if lost VMs are still running or if VMMs on which LNMs have failed are still running. The controller simply logs an error, and relies upon the Usher administrator or a recovery plugin to investigate the cause of the error.

Implementation

In this section we describe the implementation of Usher, including the interfaces that each component supports and the plugins and applications currently implemented for use with the system.

The main Usher components are written in Python [2]. In addition, Usher makes use of the Twisted network programming framework [3]. Twisted provides convenient mechanisms for implementing event based servers, asynchronous remote procedure calls, and remote object synchronization. Table 1 shows source code line counts for the main Usher components, for a total of 3993 lines of code. Also included is the line count for the ush application (over 400 of which is online documentation).

    Component            LoC
    LNM (w/ Xen hooks)    907
    Controller           1703
    Client API            750
    Utilities             633
    Ush                  1099

Table 1: Code size of individual components.

Local Node Managers

Local Node Managers export the remote API shown in Table 2 to the controller. This API is made available to the controller via a capability passed to the controller when an LNM connects to it.

    Method Name                 Description
    get_details(vm name)        get VM state information
    get_status(vm name)         get VM resource usage statistics
    receive(vm instance)        receive new cached VM object
    start(vm name)              start cached VM
    op_on(operation, vm name)   operate on existing VM
    migrate(vm name, lnm name)  migrate VM to LNM
    get_node_info()             get node physical characteristics
    get_node_status()           get node dynamic and resource usage info

Table 2: Local node manager remote API.

This API includes methods to query for VM state information and VM resource usage details using the get_details and get_status methods, respectively. State information includes run state, memory allocation, IP and MAC addresses, the node on which the VM is running, VM owner, etc. Resource usage includes 1, 5, and 15-minute utilizations of the various hardware resources.

The receive method creates a cached copy of a VM object on an LNM. An LNM receives the cached copy when it connects to the controller. It compares the state of the VM object with the actual state of the virtual machine. If the states differ, the LNM notifies the controller, which updates the LNM's cached copy of the VM as an acknowledgment that it received the state change notice.

In addition, the cached copy of a VM at its LNM contains methods for manipulating the VM it represents. When a VM manipulation method exposed by the LNM's API is invoked (one of start, op_on, or migrate), the method calls the corresponding method of the cached VM object to perform the operation. This structure provides a convenient way to organize VM operations. To manipulate a VM, a developer simply calls the appropriate method of the cached VM object. Note that the controller must still update the state of its VM object as an acknowledgment that the controller knows the operation was successful.
Most operations on an existing VM are encapsulated in the op_on function, and have similar signatures. Table 3 shows the list of valid operations to the op_on method.

    Operation   Description
    pause       pause VM execution, keeping memory image resident
    resume      resume execution of a paused VM
    shutdown    nicely halt a VM
    reboot      shutdown and restart VM
    hibernate   save VM's memory image to persistent storage
    restore     restore hibernated VM to run state
    destroy     hard shutdown a VM
    cycle       destroy and restart a VM

Table 3: Operations supported by the op_on method.

All VM operations invoke a corresponding operation in the VMM's administration API. Though Usher currently only manages Xen VMs, it is designed to be VMM-agnostic. An installation must provide an implementation of Usher's VMM interface to support new virtual machine managers.

The LNM's remote API exposes a few methods that do not operate on VMs. The get_node_info method returns hardware characteristics of the physical machine. The controller calls this method when an LNM connects. The get_node_status method is similar to the get_status method. Additionally, it reports the number of VMs running on the VMM and the amount of free memory on the node.

Usher Controller

The remote API exported by the controller to connecting clients closely resembles the interface exported by LNMs to the controller. Table 4 lists the methods exported by the controller to Usher clients. This API is made available to clients via a capability passed upon successful authentication with the controller.

    Method Name                  Description
    list(vm list, status)        list state and resource usage information for VMs
    list_lnms(lnm list, status)  list LNMs and resource usage information for VMMs
    start(vm list, lnm list)     start list of VMs on LNMs
    op_on(operation, vm list)    operate on existing VMs
    migrate(vm list, lnm list)   migrate VMs to LNMs

Table 4: Controller remote API for use by Usher clients.

Note that most of these methods operate on lists of VMs, rather than single VMs expected by the LNM API methods. Since Usher was designed to manage clusters, the common case is to invoke these methods on lists of VMs rather than on a single VM at a time. This convention saves significant call overhead when dealing with large lists of VMs.

The start and migrate methods both take a list of LNMs. For start, the list specifies the LNMs on which the VMs should be started. An empty list indicates that the VMs can be started anywhere. Recall that this parameter is simply a suggestion to the controller. Policies installed in the controller dictate whether or not the controller will honor the suggestion. Likewise, the LNM list passed to the migrate method is simply a suggestion to the controller as to where to migrate the VMs. The controller can choose to ignore this suggestion or ignore the migrate request altogether based upon the policies installed.

The operations supported by the op_on method in the controller API are the same as those to the op_on method of the remote LNM API (Table 3).

Client API

The client API closely mirrors that of the controller. An important difference between these two APIs, though, is that the client API signatures contain many additional parameters to aid in working with large sets of VMs. These additional parameters allow users to operate on arbitrary sets of VMs and virtual clusters in a single method call. The API supports specifying VM sets as regular expressions, explicit lists, or ranges (when VM names contain numbers). The client API also allows users to specify source and destination LNMs using regular expressions or explicit lists.

Another difference between the client and controller APIs is that the client API expands the op_on method into methods for each type of operation. Explicitly enumerating the operations as individual methods avoids confusing application writers unfamiliar with the op_on method. These methods simply wrap the call to the op_on method, which is still available for those wishing to call it directly.

Finally, the client API contains connect and reconnect methods. These methods contact and authenticate with the controller via SSL. They also start the client's event loop to handle cached object updates and results from asynchronous remote method calls. The reconnect method is merely a convenience method to avoid having to pass credentials to the API if a reconnect is required after having been successfully connected. This method can be used by a reconnecting application upon an unexpected disconnect.
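The wrapper layer described above can be pictured with a short sketch. The class name and the transport call are stand-ins for the client library's actual SSL-authenticated, asynchronous machinery, not its real code.

    # Sketch of per-operation convenience methods wrapping op_on, as the
    # client API does. The transport object is a stand-in for the client
    # library's SSL-authenticated, asynchronous connection.
    class UsherClientSketch(object):
        def __init__(self, transport):
            self._transport = transport

        def op_on(self, operation, vms):
            """Generic entry point; vms may be an explicit list, a range,
            or a regular expression, per the client API conventions."""
            return self._transport.call('op_on', operation, vms)

        # One wrapper per operation, each simply delegating to op_on.
        def pause(self, vms):
            return self.op_on('pause', vms)

        def resume(self, vms):
            return self.op_on('resume', vms)

        def shutdown(self, vms):
            return self.op_on('shutdown', vms)

        def destroy(self, vms):
            return self.op_on('destroy', vms)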
Configuration Files

All Usher components are capable of reading configuration data from text files. All valid configuration parameters, their type, and default values are specified in the code for each component. When each component starts, it first parses its configuration files (if found). The configuration system tries to read in values from the following locations (in order): a default location in the host filesystem, a default location in the user's home directory, and finally a file indicated by an environment variable. This search ordering enables users to override default values easily. Values read in later configuration files replace values specified in a previously read file.

Plugins

Plugins are separate add-on modules which can be registered to receive notification of nearly any event in the system. Plugins live in a special directory (aptly named "plugins") of the Usher source tree. Usher also looks in a configurable location for third-party/user plugins. Any plugins found are automatically sourced and added to a list of available plugins. To register a plugin, the controller provides an additional API call, register_plugin(plugin name, event, configuration file). Each plugin is required to provide a method named entry_point to be called when an event fires for which it is registered. It is possible to add a single plugin to multiple event handler chains. Note that the register_plugin method can be called from anywhere in the Usher code.

By default, plugins for each event are simply called in the order in which they are registered. Therefore careful consideration must be given to ordering while registering plugins. A plugin's configuration object can optionally take an order parameter that governs the order in which plugins are called on the event's callback list. The plugin API also provides a converse unregister_plugin call to change event handling at runtime.

Plugins can be as simple or complex as necessary. Since the controller invokes plugin callback chains asynchronously, complex plugins should not interfere with the responsiveness of the Usher system (i.e., the main controller event loop will not block waiting for a plugin to finish its task).

Policies in an Usher installation are implemented as plugins. As an example, an administrator may have strict policies regarding startup and migration of virtual machines. To enforce these policies, a plugin (or plugins) is written to authorize start and migrate requests. This plugin gets registered for the start_request and migrate_request events, either manually using the controller's register_plugin command, or by specifying these registrations in the controller's configuration file. Once registered, subsequent start and migrate requests are passed to the plugin (in the form of a Request object) for authorization. At this point, the plugin can approve, approve with modification, or simply reject the request. Once this is done, the request is passed on to any other plugins registered on the start_request or migrate_request event lists with a higher order attribute.

Besides authorization policies, one can imagine policies for VM operation and placement, for example, initial VM placement, VM scheduling (i.e., dynamic migration based on load or optimizing a utility function), or reservations. A policy plugin for initial placement would be registered for the start_request event (probably with a higher order attribute than the startup authorization policy discussed above so that it is called later in the plugin callback chain). Some simple policies such a plugin might support are round-robin and least-loaded. Scheduling and reservation plugins could be registered with a timer to be fired periodically to evaluate the state of the system and make decisions about where VMs should be migrated and which VMs might have an expired reservation, respectively.

As a concrete example of plugin usage in Usher, we now discuss plugins implemented for use by the UCSD Usher installation, and outline the sequence of events for a scenario of starting a set of VMs. Detailed discussion about these plugins is deferred to the UCSD SysNet subsection of the Usher Deployments section.

The UCSD installation uses the following plugins: an SQL database plugin for logging, mirroring global system state, and IP address management; an LDAP plugin for user authentication for both Usher and VMs created by Usher; a filesystem plugin for preparing writable VM filesystems; a DNS plugin for modifying DNS entries for VMs managed by Usher; and a default placement plugin to determine where VMs should be started. We are developing additional modules for VM scheduling as part of ongoing research.

All plugins for the UCSD installation are written in Python. Table 5 contains line counts for these plugins. Overall, the UCSD plugins total 1406 lines of code.

    Plugin       LoC
    Database     260
    LDAP         870
    Filesystem    54
    DNS           90
    Placement    132

Table 5: Code size of UCSD plugins.

When a request to start a list of VMs arrives, the controller calls the modules registered for the "start request" event. The placement and database modules are in the callback list for this event. The placement module first determines which VMs to start based on available resources and user limits, then determines where each of the allowed VMs will start. The database module receives the modified request list, logs the request, then reserves IP addresses for each of the new VMs.
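A minimal sketch of such an initial placement plugin, using the round-robin policy mentioned earlier, might look like the following. The Request object's attributes and the entry_point signature are assumptions for illustration rather than Usher's actual plugin interface.

    # Sketch of a round-robin placement plugin registered for the
    # start_request event. Request/VM attributes are illustrative.
    import itertools


    class RoundRobinPlacement(object):
        def __init__(self, lnm_names):
            # Cycle endlessly over the known local node managers.
            self._next_lnm = itertools.cycle(lnm_names)

        def entry_point(self, request):
            """Called when a start request fires: assign a target LNM to
            every VM in the request that does not already have one, then
            pass the (possibly modified) request down the callback chain."""
            for vm in request.vms:
                if getattr(vm, 'target_lnm', None) is None:
                    vm.target_lnm = next(self._next_lnm)
            return request

A plugin like this would be registered for the start_request event via register_plugin, typically with an order attribute placing it after any authorization plugin so that it only places requests that have already been approved.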
The controller generates a separate VM start command for each VM in the start list. Prior to invoking the start command, the controller triggers a "register VM" event. The database, DNS, and filesystem plugin modules are registered for this event. The database module adds the VM to a "VMs" table to mirror the fact that this is now a VM included in the controller's view of the global state. The DNS plugin simply sends a DDNS update to add an entry for this VM in our DNS server. The filesystem module prepares mount points on an NFS server for use by each VM.

Finally, upon return from each start command, a "start complete" event fires. The database module registers to receive this event. The database module checks the result of the command, logs this to the database, then either marks the corresponding IP address as used (upon success) or available (upon failure). Note that the database module does not change the state of the VM in the VMs table until receiving a "state changed" event for this VM (which originates at the LNM).

Sites can install or customize plugins as necessary. The Usher system supports taking arbitrary input and passing it to plugins on request events. For example, a request to start a VM may include information about which OS to boot, which filesystem to mount, etc. Plugin authors can use this mechanism to completely customize VMs at startup.
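For illustration only, the "start complete" bookkeeping described above might reduce to a plugin along the following lines. The sqlite3 backend, the table layout, and the event attributes are assumptions for this sketch; the UCSD installation uses its own SQL database plugin.

    # Sketch of logging a start result and marking the reserved IP address
    # used or available. Schema and event fields are assumed for the example.
    import sqlite3


    class StartCompleteLogger(object):
        def __init__(self, db_path):
            self._db = sqlite3.connect(db_path)
            self._db.execute("CREATE TABLE IF NOT EXISTS ip_pool "
                             "(ip TEXT PRIMARY KEY, state TEXT)")
            self._db.execute("CREATE TABLE IF NOT EXISTS log "
                             "(vm TEXT, event TEXT, ok INTEGER)")

        def entry_point(self, event):
            # event is assumed to carry the VM name, its reserved IP
            # address, and whether the start command succeeded.
            state = 'used' if event.success else 'available'
            with self._db:
                self._db.execute("INSERT INTO log VALUES (?, ?, ?)",
                                 (event.vm_name, 'start_complete',
                                  1 if event.success else 0))
                self._db.execute("UPDATE ip_pool SET state = ? WHERE ip = ?",
                                 (state, event.ip_address))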
Applications

We have written two applications using the client API, a shell named Ush and an XML-RPC server named plusher, and are developing two other applications, a Web interface and a command-line suite. The Web interface will provide a convenient point and click interface accessible from any computer with a web browser. The command line suite will facilitate writing shell scripts to manage virtual machines.

Ush Client

The Usher shell Ush provides an interactive command-line interface to the API exported by the Usher controller. Ush provides persistent command line history and comes with extensive online help for each command. If provided, Ush can read connection details and other startup parameters from a configuration file. Ush is currently the most mature and preferred interface for interacting with the Usher system.

As an example of using the Usher system, we describe a sample Ush session from the UCSD Usher installation, along with a step-by-step description of actions taken by the core components to perform each request. In this example, user "mmcnett" requests ten VMs. Figure 3 contains a snapshot of Ush upon completion of the start command.

[Figure 3: Ush.]

First a user connects to the Usher controller by running the "connect" command. In connecting, the controller receives the user's credentials and checks them against the LDAP database. Once authentication succeeds, the controller returns a capability for its remote API and all of user mmcnett's VMs. The somewhat unusual output reflects the fact that all client calls to the controller are asynchronous. When "connect" returns, Ush responds with the "Command 0 result:" message followed by the actual result "Connected".

Upon connecting, Ush saves the capability and cached VM instances sent by the controller. Once connected, the user runs the "list" command to view his currently running VMs. Since the client already has cached instances of user mmcnett's VMs, the list command does not invoke any remote procedures. Consequently, Ush responds immediately, indicating that user mmcnett already has two VMs running.

The user then requests the start of ten VMs in the "sneetch" cluster. In this case, the -n argument specifies the name of a cluster, and the -c argument specifies how many VMs to start in this cluster. When the controller receives this request, it first calls on the authorization and database modules to authorize the request and reserve IP addresses for the VMs to be started. Next, the controller calls the initial placement plugin to map where the authorized VMs should be started. The controller calls the start method of the remote LNM API at each new VM's LNM. The LNMs call the corresponding method of the VMM administration API to start each VM. Upon successful return of all of these remote method calls, the controller responds to the client that the ten VMs were started in two seconds and provides information about where each VM was started. After completing their boot sequence, user mmcnett can ssh into any of his new VMs by name.

Plusher

Plush [7] is an extensible execution management system for large-scale distributed systems, and plusher is an XML-RPC server that integrates Plush with Usher. Plush users describe batch experiments or computations in a domain-specific language. Plush uses this input to map resource requirements to physical resources, bind a set of matching physical resources to the experiment, set up the execution environment, and finally execute, monitor and control the experiment.

Since Usher is essentially a service provider for the virtual machine "resource", it was natural to integrate it with Plush. This integration enables users to request virtual machines (instead of physical machines) for running their experiments using a familiar interface.

Developing plusher was straightforward. Plush already exports a simple control interface through XML-RPC to integrate with resource providers. Plush requires providers to implement a small number of up-calls and down-calls. Up-calls allow resource providers to notify Plush of asynchronous events. For example, using down-calls Plush requests resources asynchronously so that it does not have to wait for resource allocation to complete before continuing. When the provider finishes allocating resources, it notifies Plush using an up-call.

To integrate Plush and Usher in plusher, we only needed to implement stubs for this XML-RPC interface in Usher. The XML-RPC stub uses the Client API to talk to the Usher controller. The XML-RPC stub acts as a proxy for authentication – it relays the authentication information (provided by users to Plush) to the controller before proceeding. When the requested virtual machines have been created, plusher returns a list of IP addresses to Plush. If the request fails, it returns an appropriate error message.

Usher Deployments

Next we describe two deployments of Usher that are in production use at different sites. The first deployment is for the UCSD CSE Systems and Networking research group, and the second deployment is at the Russian Research Center, Kurchatov Institute (RRC-KI). The two sites have very different usage models and computing environments. In describing these deployments, our goal is to illustrate the flexibility of Usher to meet different virtual machine management requirements and to concretely demonstrate how sites can extend Usher to achieve complex management goals. Usher does not force one to set up or manage their infrastructure as done by either of these two installations.

UCSD SysNet

The UCSD CSE Systems and Networking (SysNet) research group has been using Usher experimentally since June 2006 and for production since January 2007. The group consists of nine faculty, 50 graduate students, and a handful of research staff and undergraduate student researchers. The group has a strong focus on experimental networking and distributed systems research, and most projects require large numbers of machines in their research. As a result, the demand for machines far exceeds the supply of physical machines, and juggling physical machine allocations never satisfies all parties. However, for most of their lifetimes, virtual machines can satisfy the needs of nearly all projects: resource utilization is bursty with very low averages (1% or less), an ideal situation for multiplexing; virtualization overhead is an acceptable trade-off to the benefits Usher provides; and users have complete control over their clusters of virtual machines, and can fully customize their machine environments. Usher can also isolate machines, or even remove them from virtualization use, for particular circumstances (e.g., obtaining final experimental results for a paper deadline) and simply place them back under Usher management when the deadline passes.

At the time of this writing, we have staged 25 physical machines from our hardware cluster into Usher. On those machines, Usher has multiplexed up to 142 virtual machines in 26 virtual clusters, with an average of 63 VMs active at any given time. Our Usher controller runs on a Dell PowerEdge 1750 with a 2.8 GHz processor and 2 GB of physical memory. This system easily handles our workload. Although load is mostly dictated by plugin complexity, using the plugins discussed below, the Usher controller consumes less than 1 percent CPU on average (managing 75 virtual machines) with a memory footprint of approximately 20 MB. The Usher implementation is sufficiently reliable that we are now migrating the remainder of our user base from dedicated physical machines to virtual clusters, and Usher will soon manage all 130 physical nodes in our cluster.

Usher Usage

The straightforward ability to both easily create arbitrary numbers of virtual machines as well as destroy them has proved to be very useful, and the SysNet group has used this capability in a variety of ways. As expected, this ability has greatly eased demand for physical machines within the research group. Projects simply create VMs as necessary. Usher has also been used to create clusters of virtual machines for students in a networking course; each student can create a cluster on demand to experiment with a distributed protocol implementation. The group also previously reserved a set of physical machines for general login access (as opposed to reserved use by a specific research project). With Usher, a virtual cluster of convenience VMs now serves this purpose, and an alias with round-robin DNS provides a logical machine name for reference while distributing users among the VMs upon login. Even mundane tasks, such as experimenting with software installations or configurations, can benefit as well because the cost of creating a new machine is negligible. Rather than having to undo mistakes, a user can simply destroy a VM with an aborted configuration and start from scratch with a new one.

The SysNet group currently uses a simple policy module in Usher to determine the scheduling and