RuMoR: Monitoring and Recovery of BPEL Applications
Jocelyn Simmonds, Shoham Ben-David, Marsha Chechik
Department of Computer Science, University of Toronto
Toronto, ON M5S 3G4, Canada
{jsimmond, shoham, chechik}@cs.toronto.edu

ABSTRACT
Web service applications are distributed processes that are composed of dynamically bound services. Since the overall system may only be available at runtime, static analysis is difficult to perform in this setting. Instead, these systems are often checked dynamically, by monitoring their behavior at runtime. Our tool monitors web service applications and, when violations are discovered, automatically proposes and ranks recovery plans which users can then select for execution. Properties, specified using property patterns, are transformed into finite-state automata. Finite execution traces of web services described in BPEL are checked for conformance at runtime. For some property violations, recovery plans essentially involve “going back”, i.e., compensating the executed actions until an alternative behaviour of the application is possible. For other violations, recovery plans include both “going back” and “re-planning”, i.e., guiding the application towards a desired behaviour. These plans are generated using techniques adapted from AI planning.

MC: Do not like the last sentence. Maybe: just remove “AI”?

Keywords
Web services, behavioural properties, runtime monitoring, recovery, planning, SAT solving.

THIS IS AN OLD VERSION: paper2.tex is the correct version

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION
The Service-Oriented Architecture (SOA) framework is a popular guideline for building web-based applications. A SOA-based application is an orchestration of services offered by (possibly third-party) components. These components, which can be written in a traditional compiled language such as Java or in an XML-centric language such as BPEL [5], communicate through published interfaces. Since an application is a composition of several distributed business processes, its correctness depends on the correctness of its partners and their interactions.

Web service applications are distributed systems in which partners are dynamically discovered and go on- and off-line as the application runs. Their failures can be caused by bugs in the service orchestration (e.g., faulty logic and bad data manipulation), by problems with hardware, network or system software, or by incorrect invocations of services. With runtime failures of web services inevitable, infrastructures for running them typically include the ability to define faults and compensatory actions for dealing with exceptional situations. Specifically, the compensation mechanism is the application-specific way of reversing completed activities. For example, the compensation for booking a car would be to cancel the booking.

Existing infrastructures for web services, e.g., the BPEL engine, include mechanisms for fault definition, for specification of compensation actions, and for dealing with termination. When an error is detected at runtime, they typically try to compensate all completed activities for which compensations are defined, with the default compensation being the reversal of the most recently completed action. This approach presents several major problems: (1) The application is often allowed to continue running until the fault is discovered, thus executing and then compensating for a lot of unnecessary and potentially expensive activities. (2) It is hard to determine, a priori, the state of the application after executing compensation mechanisms. (3) There might be multiple compensations available, based on global information, but the automatic application of compensations does not allow the user to choose between them.

This paper describes RuMoR, a user-guided recovery tool for web services. This tool takes as input the target BPEL application, a set of safety and liveness properties of the application (specified using the Specification Pattern System [1]), and the maximum length and number of recovery plans. Our approach has three phases: Preprocessing, Monitoring and Recovery. Properties are translated into monitors during the Preprocessing phase. During the Monitoring phase, RuMoR runs these monitors in parallel with the application, stopping when one of the monitors is about to be violated. This triggers the Recovery phase, where various techniques are used to generate recovery plans.

For violations of safety properties, recovery plans use compensation actions to allow the application to “go back” to an earlier state at which an alternative path that potentially avoids the fault is available. We call such states “change states”; these include user choices and certain partner calls.
Failure of a liveness monitor during execution means that some required actions have not been seen before the application tried to terminate, and the recovery plan should attempt to perform these actions. We generate such plans by adapting techniques from the field of AI planning. The application itself defines a planning domain, i.e., which events can be executed and when. A recovery plan is a sequence of events that, starting at the current error state, leads to an application state from which the current liveness property can be satisfied. We call these states “goal states”, and compute them through static analysis during the Preprocessing phase. When there are multiple recovery plans available, RuMoR automatically ranks them based on user preferences (e.g., the shortest, the cheapest, the one that involves the minimal compensation, etc.) and enables the application user to choose among them.

2. DESIGN AND IMPLEMENTATION
We have implemented RuMoR using a series of publicly available tools and several short (200-300 lines) new Python or Java scripts. The Preprocessing and Monitoring phases of our tool are the same for both safety and liveness properties, but different components are required for generating plans from the two types of properties. We show the architecture of our tool in Fig. 1a (rectangles and ovals are components and artifacts, respectively).

Figure 1: (a) Tool architecture; (b) Recovery plan generation for liveness properties.

Developers create properties for their web services using property patterns and system events. During the preprocessing phase, the Property Translator (PT) component receives the specified properties and turns them into monitors. The LTS Extractor (LE) component extracts an LTS model of the BPEL program (using the WS-Engineer extension for LTSA [3]) and creates a second LTS model with compensation. The LTS Analyzer (LA) computes goal links and change states using the techniques described in [6].

During the execution of the application, the Event Interceptor (EI) component intercepts application events and sends them to the Monitor Manager (MM) for analysis. MM updates the state of each active monitor until an error has been found (which activates the recovery phase) or all partners terminate. MM also stores the intercepted events.

During the recovery phase, artifacts from both the preprocessing and the runtime monitoring phases are used to generate recovery plans. The Safety Plan Generator generates recovery plans that can only compensate executed activities. For liveness properties, plans can compensate executed activities and execute new activities. In this case, the Liveness Plan Generator first generates the corresponding planning problem using the model of the application, the error trace and the violated monitor. The planning problem is expressed in STRIPS [2], an input language of the planner Blackbox [4], which we use to convert it into a SAT instance. The maximum plan length is used to limit the size of the planning graph generated by Blackbox, effectively limiting the size of the plans that can be produced. As shown in Fig. 1b, we modify the initial SAT instance in order to produce alternative plans. Plans are extracted from the satisfying assignments produced by the SAT solver SAT4J and converted into BPEL for display and execution. SAT4J is an incremental SAT solver, i.e., it saves results from one search and uses them for the next. This is particularly useful for our method of generating multiple plans, where each SAT instance is more restricted than the previous one, and leads to efficient analysis.

All computed plans are presented to the application user through the Violation Reporter (VR), and the chosen plan is executed by the Plan Executor (PE). VR generates a web page snippet with violation information, as well as a form for selecting a recovery plan. Developers must include this snippet in the default error page, so that the computed recovery plans can be shown when an error is detected. If no monitor is violated during the execution of the chosen plan (MM updates the states of the active monitors during the plan execution), the framework switches back to runtime monitoring.

3. CONCLUSION
We presented RuMoR, a recovery tool that uses monitoring and planning to detect and fix runtime errors in BPEL applications. Recovery plans are generated by a combined static and dynamic analysis of the target application and user-specified properties of the application. Our experience [6] shows that we can effectively generate user-expected plans, and experiments so far suggest scalability with respect to runtime complexity.

4. REFERENCES
[1] M. Dwyer, G. Avrunin, and J. Corbett. Patterns in Property Specifications for Finite-State Verification. In ICSE’99, pages 411–420, May 1999.
[2] R. Fikes and N. J. Nilsson. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artif. Intell., 2(3/4):189–208, 1971.
[3] H. Foster, S. Uchitel, J. Magee, and J. Kramer. LTSA-WS: a Tool for Model-Based Verification of Web Service Compositions and Choreography. In Proc. of ICSE’06, pages 771–774, 2006.
[4] H. A. Kautz and B. Selman. Unifying SAT-based and Graph-based Planning. In IJCAI’99, pages 318–325, 1999.
[5] OASIS. Web Services Business Process Execution Language Version 2.0. http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html, Accessed January 2009.
[6] J. Simmonds, S. Ben-David, and M. Chechik. Guided Recovery for Web Service Applications. In Proc. of FSE’10, 2010. To appear.

APPENDIX

A. DEMO
Demonstration plan:
1. Explain example application
2. Preprocessing:
   (a) Convert properties into monitors
   (b) Generate the application model
3. Monitoring and Recovery:
   (a) Execute scenario S1 (S2), which violates P1 (P2)
   (b) Generate planning domain
   (c) Generate multiple recovery plans
   (d) Execute a plan chosen by the audience

A.1 Example - the Trip Advisor System (TAS)
Fig. 2a shows the BPEL-expressed workflow of the Trip Advisor System. In a typical scenario of this system, a customer either chooses to arrive at her destination via a rental car (and thus books it), or via an air/ground transportation combination, combining the flight with either a rental car from the airport or a limo. The requirement of the system is to make sure the customer has the transportation needed to get to her destination (this is a desired behavior which we refer to as P1) while keeping the costs down, i.e., she is not allowed by her company to reserve an expensive flight and a limo (this is a forbidden behavior which we refer to as P2).

TAS interacts with four external services: 1) book a rental car (bc), 2) book a limo (bl), 3) book a flight (bf), and 4) check price of the flight (cf). The result of cf is then passed to local services to determine whether it is expensive (expF) or cheap (cheapF). Service interactions are preceded by a symbol in the workflow diagram.

The workflow begins by receiving input (ri), followed by a choice between the car rental (onMessage onlyCar) and the air/ground transportation combination (onMessage carAndFlight). The latter choice is modeled using a scope (enclosed in bold, blue lines), since air (getFlight) and ground transportation (getCar) can be arranged independently, so they are executed in isolation. The air branch sequentially books a flight, checks if it is expensive and updates the state of the system accordingly. The ground branch chooses between booking a rental car and a limo. The end of the workflow is marked by an activity reporting that the destination has been reached (rd).

A.2 Preprocessing
Fig. 2b shows the TAS LTS model (with compensation). To increase legibility, each transition represents an action and its compensation, labeled in the form a/ā, where a is the application activity and ā is its compensation. Monitor A1 in Fig. 3a represents P1: if the application terminates before rd appears, the monitor moves to the (error) state 3. State 1 is a good state since the monitor enters it once the booked transportation reaches the destination (rd). Monitor A2 in Fig. 3b represents P2. It enters its error state (4) when either a limo was booked and later an expensive flight, or an expensive flight was booked first and then a limo.

A.3 Scenario S1
Consider the execution of TAS in which the customer chooses the air/ground option (carAndFlight), and then tries to book the flight before the car. In this example, there is a communication problem with the flight system partner, and the invocation of the cf service times out. This scenario corresponds to the trace t1, depicted by dotted transitions in Fig. 2b.

The application server detects that the cf invocation timed out, and sends a TER event (not shown in Fig. 2b) to the application. Our framework intercepts this TER event and determines that executing it turns t1 into a failing trace, because the monitor A1 would enter its error (red) state 3. In response, our framework does not deliver the TER event to the application, and instead initiates recovery.

If we set the maximum plan length to be 10, the Liveness Plan Generator computes the plans shown in Fig. 4a. Plan p0 is the shortest: if unable to obtain a price for the flight, cancel the flight and reserve the car instead. Plans p1 and p2 also cancel the flight (since 8 is not a change state whereas 7 is) and then proceed to re-book it and then book the car, regardless of the flight’s cost. Increasing the plan length, we also get the option of taking the getCar transition out of state 6, booking the car and then the flight.

Fig. 5a shows the snippet generated by the Violation Reporter. Fig. 5b shows the (simplified) source code of such an error reporting page, where the bolded line has the instruction to include the snippet, and Fig. 5c shows a screenshot of error.jsp after recovery plans for P1 have been computed. We will pick one of these plans and execute it.

A.4 Scenario S2
In another scenario, the customer attempts to arrive at her destination via a limo (bl) and an expensive flight (expF). This corresponds to the trace t2, depicted by dashed transitions in Fig. 2b. As the monitor A2 has a transition on expF to an error state, our framework delays the execution of this event from application state 21. In this example, executing expF will make A2 enter its error state 4, so t2 is also a failing trace. The expF event is not delivered, and the recovery phase is activated.

The Safety Plan Generator returns the recovery plans shown in Fig. 4b. We add state names between transitions for clarity and write r_s to mean “recovery to state s”. A given plan can also become a prefix for the follow-on one. This is indicated by using the former’s name as part of the definition of the latter. For example, recovery to state 16 starts with recovery to state 18 and then includes two more backward transitions, the last one with a non-empty compensation. Plan r18 can avoid the error if, after its application, the user chooses a cheap flight instead of an expensive one. Executing plan r15 gives the user the option of changing the limousine to a rental car, and plan r2 gives the option of changing from an air/ground combination to just renting a car. Neither of these behaviours violates A2.

B. TOOL MATURITY AND AVAILABILITY
The following components of our tool are still under development: Property Translator, Event Interceptor, and Plan Executor. The tool is not available online, as it includes IBM intellectual property.

Figure 2: (a) Workflow of TAS; (b) corresponding LTS model.

Figure 3: Monitors: (a) A1, (b) A2. Red states are shaded horizontally, green states are shaded vertically, and yellow states are shaded diagonally.

Figure 4: Recovery plans for TAS: (a) plans of length ≤ 10 for Scenario S1; (b) plans for Scenario S2.

(a)
$p_0 = 9 \xrightarrow{\tau} 8 \xrightarrow{\mathit{cancelF}} 7 \xrightarrow{\tau} 6 \xrightarrow{\tau} 2 \xrightarrow{\mathit{onlyCar}} 3 \xrightarrow{\mathit{bc}} 4$
$p_1 = 9 \xrightarrow{\tau} 8 \xrightarrow{\mathit{cancelF}} 7 \xrightarrow{\mathit{bf}} 8 \xrightarrow{\mathit{cf}} 9 \xrightarrow{\mathit{exp\ true}} 10 \xrightarrow{\mathit{expF}} 11 \xrightarrow{\mathit{getCar}} 12 \xrightarrow{\mathit{car}} 13 \xrightarrow{\mathit{bc}} 4$
$p_2 = 9 \xrightarrow{\tau} 8 \xrightarrow{\mathit{cancelF}} 7 \xrightarrow{\mathit{bf}} 8 \xrightarrow{\mathit{cf}} 9 \xrightarrow{\mathit{exp\ false}} 14 \xrightarrow{\mathit{cheapF}} 11 \xrightarrow{\mathit{getCar}} 12 \xrightarrow{\mathit{car}} 13 \xrightarrow{\mathit{bc}} 4$

(b)
$r_{18} = 4 \xrightarrow{\tau} 21 \xrightarrow{\tau} 20 \xrightarrow{\tau} 19 \xrightarrow{\mathit{cancelF}} 18$
$r_{16} = r_{18} \xrightarrow{\tau} 17 \xrightarrow{\mathit{cancelL}} 16$
$r_{15} = r_{16} \xrightarrow{\tau} 15$
$r_{6} = r_{15} \xrightarrow{\tau} 6$
$r_{2} = r_{6} \xrightarrow{\tau} 2$
$r_{1} = r_{2} \xrightarrow{\tau} 1$
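The nested structure of the safety plans in Fig. 4b (each plan is a prefix of the next) follows from walking the executed trace backwards and emitting one plan every time a change state is reached. The sketch below illustrates that reading. It is not the Safety Plan Generator itself. The function names are hypothetical, the backward steps are copied from Fig. 4b as transcribed, and the set of change states is an assumption inferred from the plan names r18, r16, r15, r6, r2 and r1.

```python
TAU = "τ"  # empty compensation

# Backward steps (from_state, compensation, to_state), as transcribed
# from Fig. 4b.
BACKWARD_STEPS = [
    (4, TAU, 21), (21, TAU, 20), (20, TAU, 19), (19, "cancelF", 18),
    (18, TAU, 17), (17, "cancelL", 16), (16, TAU, 15), (15, TAU, 6),
    (6, TAU, 2), (2, TAU, 1),
]

# Assumption: inferred from the plan names in Fig. 4b, not stated explicitly.
CHANGE_STATES = {18, 16, 15, 6, 2, 1}


def safety_plans(steps, change_states, max_length=None):
    """Return {plan name: list of backward steps}; one plan per change state
    reached while undoing the trace, optionally bounded by max_length."""
    plans = {}
    prefix = []
    for src, comp, dst in steps:
        prefix.append((src, comp, dst))
        if max_length is not None and len(prefix) > max_length:
            break
        if dst in change_states:
            plans[f"r{dst}"] = list(prefix)  # copy: later plans extend it
    return plans


def compensations(plan):
    """Non-empty compensation actions a plan would actually execute."""
    return [comp for _, comp, _ in plan if comp != TAU]


plans = safety_plans(BACKWARD_STEPS, CHANGE_STATES)
for name, plan in plans.items():
    print(name, len(plan), compensations(plan))
```

With these inputs, r16 extends r18 by two steps and adds `cancelL` to the compensations, which matches the description of "recovery to state 16" in A.4. Passing `max_length` mimics the user-supplied bound on plan length.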
Figure 5: Violation reporting: (a) snippet.jsp, automatically generated snippet that contains recovery plans; (b) error.jsp, the application error handling page; (c) error.jsp displayed on a browser.
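Before plans can be listed for selection in a snippet like the one in Fig. 5a, they need an order. The Introduction mentions ranking by user preferences such as "the shortest" or "minimal compensation". The sketch below shows one way such a ranking could be expressed over the liveness plans of Fig. 4a. It is an illustration only: the `Plan` type, the `rank` helper and the rule that labels starting with `cancel` count as compensations are assumptions, not part of RuMoR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    actions: tuple  # transition labels, in execution order


# Plans of Fig. 4a, deliberately listed out of order.
PLANS = [
    Plan("p1", ("τ", "cancelF", "bf", "cf", "exp true", "expF",
                "getCar", "car", "bc")),
    Plan("p0", ("τ", "cancelF", "τ", "τ", "onlyCar", "bc")),
    Plan("p2", ("τ", "cancelF", "bf", "cf", "exp false", "cheapF",
                "getCar", "car", "bc")),
]


def is_compensation(action):
    # Assumed naming convention: compensations are the cancel* actions.
    return action.startswith("cancel")


def rank(plans, key):
    """Order plans by a user-chosen preference (smaller key first).
    sorted() is stable, so ties keep their input order."""
    return sorted(plans, key=key)


shortest_first = rank(PLANS, key=lambda p: len(p.actions))
least_compensation = rank(
    PLANS,
    key=lambda p: (sum(is_compensation(a) for a in p.actions),
                   len(p.actions)),
)
print([p.name for p in shortest_first])
print([p.name for p in least_compensation])
```

Keeping the preference as a key function means a "cheapest" ranking could be added by supplying per-action costs, without changing how the plans themselves are generated.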