The requirements of recording and using provenance in e-Science experiments
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
The requirements of recording and using provenance in e-Science experiments Simon Miles, Paul Groth, Miguel Branco and Luc Moreau School of Electronics and Computer Science University of Southampton Southampton, SO17 1BJ, UK Tel: +44 23 8059 8309 Email: sm@ecs.soton.ac.uk Abstract— In e-Science experiments, it is vital to record A provenance architecture is the software archi- the experimental process for later use such as in interpret- tecture for a system that provides necessary func- ing results, verifying that the correct process took place tionality to record, store and use provenance data or tracing where data came from. The documentation of in a wide variety of applications. In the PASOA a process that led to some data is called the provenance of that data, and a provenance architecture is the software project (www.pasoa.org), we aim to develop a architecture for a system that will provide the necessary provenance architecture and, therefore, we must be functionality to record, store and use provenance data. aware of the range of uses to which the provenance However, there has been little principled analysis of what data will be put. For this reason, we have surveyed is actually required of a provenance architecture, so it a range of application areas and determined the use is impossible to determine the functionality they would cases that each has for provenance data. This paper ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biol- focuses on e-Science applications and presents the ogy, chemistry, physics and computer science, and analyse results of our requirements capture and analysis pro- the use cases to determine the technical requirements of a cess and discusses its implications for a provenance generic, application-independent architecture. We propose architecture. an architecture that meets these requirements and evaluate In this paper, we present the use cases indepen- a preliminary implementation by attempting to realise one dently of their analysis, so that others can draw of the use cases. different implications from them. Our presentation is not intended to be a detailed use case specifica- I. I NTRODUCTION tion; instead, the aim of our requirements capture is In business and e-Science, electronic services to draw out the generic, re-usable aspects of each allow an increasing volume of analysis to take place. application area so that a provenance architecture The large amount of processing brings its own can be designed and built. problems, however. Questions that can be answered Our specific contributions in this paper are as relatively easily about a low number of experiments, follows. such as when the experiment took place or whether • A range of use cases regarding the recording, two experiments were performed on the same initial querying and use of information regarding sci- material, become near impossible to resolve with entific, and particularly e-Science, experiments. large numbers of experiments. We use the term • An analysis of the technical requirements provenance data to describe the records of exper- needed to be fulfilled to achieve these use iments used to answer such questions (we discuss cases. the meaning of provenance fully later). Rather than • A proposed architectural design to address relying on scientists to remember experiment details these technical requirements. or write paper notes, there is a need to automatically • A preliminary evaluation of the architecture record provenance data into reliable and accessible through an implementation to achieve one of storage so that it can later be used. the use cases.
II. BACKGROUND to build a dependency map and, using this map, A. Service-Oriented Architectures capture a trace of a program’s execution. The sub- pushdown algorithm [24] is used to document the Service oriented architectures (SOA) are the un- process of array operations in the Array Manipu- derpinning of the common distributed system tech- lation Language. A more comprehensive system is nology in e-Business and e-Science. A service- the audit facilities designed for the S language [11], oriented architecture (SOA) consists of loosely- used for statistical analysis, where the result of users coupled services communicating via a common command are automatically recorded in an audit file. transport. A service, in turn, is defined as a well- These systems work on a single local system with defined, self-contained, entity that performs tasks a single administrator, and so have limited appli- which provide coherent functionality. Typically, a cation in capturing documentation of distributed e- service is only available through an interface, iden- Science processes. tifying all possible interactions with the service Much of the research into provenance recording and represented in some standard format. A client has come in the context of domain specific appli- is an entity that interacts with a service through cations. Some of the first research in provenance its interface, requesting that the service perform was in the area of geographic information sys- an operation by sending a message containing all tems (GIS)[22]. Lanter developed two systems for the required data. SOA technologies include Web tracking provenance in a GIS, a meta-database for Services [7], Grids [17], Common Object Request tracking the process of workflows and a system for Broker Architecture (CORBA) [27] and Jini [34]. tracking Arc/Info GIS operations from a graphical SOAs provide several benefits. First, they hide user interface with a command line [21], [23]. An- implementation behind an interface allowing imple- other GIS system that includes provenance tracking mentation details to change without affecting the is Geo-Opera, an extension of GOOSE, which uses user of the service. Secondly, the loosely-coupled data attributes to point to the latest inputs/outputs nature of services allows for their reuse in multiple of a data transformation, implemented as programs applications. Because of these properties, SOAs are or scripts [9]. In chemistry, the CMCS project has particularly good for building large scale distributed developed a system for managing metadata in a systems. multi-scale chemistry collaboration [25], based on Typically, multiple services are used in conjunc- the Scientific Application Middleware project [26]. tion to provide more extensive functionality than Another domain where provenance tools are being each provides individually. For re-usability, the way developed is bioinformatics. The my Grid project has in which services are combined to perform a func- implemented a system for recording provenance tion can be encoded as workflow [1], [8]. In e- in the context of in-silico experiments represented Science, workflows are used to define experimental as workflows aggregating Web Services [19]. In processes in enactable form. my Grid, provenance is gathered about workflow ex- ecution and stored in the user’s personal repository B. Provenance along with any other metadata that might be of The idea of provenance is fundamental to prove- interest to the scientist [37]. The focus of my Grid nance architectures. Prior research has referred to is personalising the way provenance is presented to this concept using several other terms including the user. audit trail, lineage [22], dataset dependence [10], By their nature, domain-specific provenance ar- and execution trace [31]. We define the provenance chitectures must be re-developed for each new do- of a piece of data as the documentation of the main. Recording provenance is a problem common process that produced that data. In this section, to many, if not all, domains and a generic system we review a number of systems and domains that would allow for greater re-use. respectively provide and manage provenance-related Provenance in database systems has focused on functionality. the data lineage problem [15]. This problem can The Transparent Result Caching (TREC) pro- be summarised as given a data item, determine totype [33] uses the Solaris UNIX proc system the source data used to produce that item. [35] to intercept various UNIX system calls in order look at solving this problem through the use of
the technique of weak inversion, and later used [18]. The schema is divided into three parts: a to improve database visualization [36]. The data transformation, a derivation and a data object. A lineage problem has been formalised and algorithms transformation represents an executable, a derivation for generating lineage data in relational databases represents the execution of a particular executable, are presented in [15]. AutoMed [16] tracks data and a data object is the input or output of a lineage in a data warehouse by recording schema derivation. The virtual data language provided by transformations. In [13], Buneman et al. redefine Chimera is used to both describe schema elements the data lineage problem as “why-provenance” and and query the data catalogue. Using the virtual defines a new type of provenance for databases, data language, a user can query the catalogue to namely, “where-provenance”. “Why-provenance” is retrieve the transformations that led to a result. The the collection of data sets (tuples) contributed to benefit of using a common description language is a data item, whereas, “where-provenance” is the that relationships between entities can be extracted location of a data element in the source data. Based without understanding the underlying data. on this terminology a formal model of provenance In [30], the authors argue for infrastructure sup- was developed applying to both relational and XML port for recording provenance in Grids and pre- databases. In [12], the authors argue for a time- sented a trial implementation of a system that offers stamped based archiving mechanism for change several mechanisms for handling provenance data tracking in contrast to diff-based mechanisms. These after it had been recorded. Their system is based mechanisms may not capture the complete prove- around a workflow enactment engine submitting nance of a database because there may be multiple data to a provenance service. The data submitted changes between each archive of the database. is information about the invocation of various web Database-oriented systems focus on the changing services specified by the executing workflow script. locations of data rather than the processes they None of the existing technologies provide a prin- have been through. Due to the many terms used cipled, application-independent way of recording, in this set of literature, e.g. data lineage, where- storing and using provenance data. We attempt to provenance, data provenance, we instead use the achieve this with our provenance architecture. term input provenance defined as follows: given a piece of data X, the input provenance of X is all III. A PPLICATIONS data that contributed to X being as it is. In this section, we briefly introduce the exper- There have been several systems developed to iments, i.e. scientific projects to check hypotheses provide middleware provenance support to appli- or investigate material properties, from which we cations. These systems aim to provide a general derived our use cases. They have been classified by mechanism for recording and querying provenance their scientific domain. for use with multiple applications across domains and beyond the confines of a local machine. According to [29], each user is required to have A. Biology an individual e-notebook which can record data and 1) Intron Complexity Experiment: The bioinfor- transformations either through connections directly matics domain already involves the analysis of a to instruments or via direct input from the user. Data massive amount of complex data, and, as exper- stored in an e-notebook can be shared with other e- iments become faster and automated to a larger notebooks via a peer-to-peer mechanism. degree, the experimental records are becoming un- Scientific Application Middleware (SAM) [26], manageable. The Intron Complexity Experiment built on the WebDav standard, provides facilities (ICE) is a bioinformatics experiment to identify the for storing and managing records, metadata and relative Kolmogrov complexity of introns and exons, semantic relationships. Support for provenance is and the relation between the complexities of the provided through adding metadata to files stored in two. Exons are subsequences of chromosomes that a SAM repository. encode for proteins, introns are the sub-sequences The Chimera Virtual Data System contains a that separate exons on a chromosome. This exper- virtual data catalogue, which is defined by a virtual iment uses a number of services, some externally data schema and accessed via a query language provided, some written by the biologist, that analyse
data drawn from publicly accessible databases such sion of particles at high energies. Experimental pro- as GenBank [3]. When a potentially interesting cesses in a Particle Detection Experiment (PDE) are result is found, the biologist re-runs parts of the complex, with the data provider, CERN, providing workflow with different configuration parameters to some processing of the raw data, followed by further try and determine why that result was produced. analysis localised around the world. The group of 2) Candidate Gene Experiment: The my Grid [5] PWGs that manage the data as a whole, along with project attempts to provide a working environment everyone that provides the resources to do so, is for bioinformaticians, particularly providing portals called the Collaboration for this experiment. and middleware that can be used by many parties. Experimental processes are automated or partially C. Chemistry automated by encoding them as workflows and ex- ecuting them within a workflow enactment engine. Second Harmonic Generation Experiment: The my Second Harmonic Generation Experiment (SHGE) Grid has been concentrating on a few bioinformat- ics experiments that fit into a class called Candidate analyses properties of liquids by bouncing lasers off Gene Experiments (CGE). These experiments aim them and measuring the changes that have occurred to discover as much information as possible about in the polarisation of the laser beam [14]. a gene (the candidate gene) from existing data sources, to determine whether it is involved in D. Computer Science causing a genetic disorder. 3) Protein Identification Experiment: Proteomics 1) Service Reliability Experiment: The e- is the study of proteomes, which are defined as Demand [2] project attempts to make service- all the proteins produced by a single organism. oriented Grids more reliable and better tailored The Protein Identification Experiment (PIE) is per- to those using them by examining the relative formed to identify proteins from a given sample, reliability and quality of services. In the Service e.g. to determine what proteins are present only Reliability Experiment (SRE), several services in someone with a certain disease. To this end, implement the same function using different the characteristics of protein fragments can provide algorithms. The results returned by the services are evidence for the identification of the protein. This compared in order to increase the assurance that requires first breaking the protein at well-identified the results are valid. points, i.e. at given amino acids, resulting in a set of 2) Security Testing Experiment: The Semantic peptides. The peptides are examined using a mass Firewall project aims to deal with the security spectrometer to determine their mass-to-charge ra- implications of supporting complex, dynamics re- tio. To obtain more accurate results, the peptides lationships between service providers and clients are then further fragmented, at random points, by that operate from within different domains, where bombarding the peptides with a charged gas, and different security policies may hold and different se- these fragments are again fed to the spectrometer. curity capabilities exist [28]. In the Security Testing Databases of previously analysed results are used to Experiment (STE), a client wishes to delegate their match peptide characteristics to possible proteins, access to data to another service, and so a complex as well as to provide further information on the interaction between the services is necessary to proteins such as the functional group to which they ensure security requirements are met. A semantic belong. firewall will reason about the multiple security poli- cies and allow different operations to take place on the basis of that reasoning. The reasoning can B. Physics be dependent on the entities interacting and other Particle Detection Experiment: In High Energy contextual information provided to and from the Physics (HEP) experiments, vast amounts of data existing security infrastructures. The semantic fire- are collected from detectors and stored ready to be wall can be seen as guiding the interacting parties analysed in different ways by groups of specialised through a series of interaction protocol states on the physicists, Physics Working Groups (PWG), in order basis of reasoning, ensuring that interactions follow to identify traces of particles produced by the colli- the security policies of individual domains.
IV. U SE C ASE A NALYSIS B. Functional Requirements In this section, we present those use cases pro- The above experiments provided us with a selec- viding functional requirements on the provenance tion of use cases involving the capture and use of architecture. Each use case in this section is defined provenance data. In this section, we present each in terms of the relevant actors and the actions of the issues raised by the use cases, introducing they perform. The final sentence of each use case each use case where it is most illustrative. The is a provenance question: an action that can be issues identified are expressed as general technical realised by processing recorded provenance data. requirements so that design decisions can be made The provenance questions place explicit demands regarding a suitable provenance architecture. In each on the provenance architecture and so imply general case, we have given the technical requirement in technical requirements. For ease of identification, the form of a statement “PASOA should provide the provenance question in each use case is ital- for...” with reference to a particular behaviour of icised. All experiments produce some data, so the the system, where PASOA refers to the provenance record of an experiment is the provenance of one or architecture we wish to design. Each statement more pieces of data. Where a question is asked of makes no implications about how the architecture the information recorded by the provenance archi- achieves the requirement, so that others can use tecture, we mean that it is asked of the provenance them to develop alternatives to PASOA. of one piece of data produced by the experiment. 1) Types of Provenance: The term ‘provenance’ was understood to have different, though strongly related, meanings to the users and it is helpful to A. Methodology distinguish and describe these types by the use of a Given the project aims, we followed the method- few particular use cases. Use Case 1: (ICE) A bioinformatician, B, down- ology below for gathering use cases from each user. loads sequence data of a human chromosome from • We provided a broad description of our goals, GenBank and performs an experiment. B later per- making it clear that we intended to design forms the same experiment on data of the same an architecture to aid recording what occurred chromosome, again downloaded from GenBank. B during experiments. We did not provide a defi- compares the two experiment results and notices a nition of ‘provenance’ or any comparable term, difference. B determines whether the difference was as this is one of the pieces of information caused by the experimental process or configuration we wish to derive from the use cases. Since having been changed, or by the chromosome data we aim to uncover tasks that the user cannot being different (or both). 2 currently perform, we presented some of the First, this use case requires a record of the use cases gathered from previous users to each execution of the experiment, i.e. the interaction subsequent user as inspiration. between services that took place including the data • We catalogued the provenance-related use that was passed between them. We call this type of cases that the user has already considered and provenance interaction provenance. thoughts regarding possible other benefits that The same use case provides an example of ac- may be obtained from having provenance data tor provenance, i.e. extra information from either available, i.e. functional requirements. Also, service participating in the experiment at the time we asked the user about the non-functional that the experiment was run. Each service typically requirements of any software we may provide. relies on an algorithm, which may be modified • We extracted the concrete functional and non- over time, and it is likely that only the service functional use cases from the interviews, iden- running the algorithm will have access to it. If B tifying the actors involved and the actions they can determine whether the algorithm has changed perform, and wrote them in a consistent form. between experiment runs, B can also determine • We presented the written use cases to the user whether the results are due to that change. for confirmation that they were correct, and for Use Case 2: (CGE) A bioinformatician, B, en- them to correct where not. acts an experimental workflow using a workflow
enactment engine, W. W processes source data to the experiments should be re-run based on the new produce intermediate data, and then processes the data set. 2 intermediate data to produce result data. B retrieves Technical Requirement 2: PASOA should pro- the result data. B then examines the source and vide for association of identifiers with data, so that intermediate data used to produce the result data. it can be referred to in queries and by data sources 2 linking experiments together. Use Case 2 demonstrates the desire for input Technical Requirement 3: PASOA should pro- provenance, which is the record of the set of data vide for referencing of individual data elements con- used to produce another piece of data. We can tained in message bodies recorded in the provenance summarise the types of provenance as follows. data. • Interaction Provenance: A record of the inter- 3) Metadata and Context: The questions that action between services that took place, includ- users wish to ask often draw together provenance ing the data that was passed between them. data regarding particular experiments with other • Actor Provenance: Extra information from ei- information. For example, in the Candidate Gene ther service participating in the experiment at Experiment, information such as the semantic type the time that the experiment was run. of each piece of data in an ontology, such as the • Input provenance: Given a piece of data, X, Gene Ontology [4], may be used by the bioinfor- input provenance refers to the set of data used matician to provide further reason to believe the in the creation of X. candidate gene is involved in the genetic disease. Technical Requirement 1: PASOA should pro- Similarly, the lab and project on which the producer vide for the recording and querying of execution, of a given piece of data worked may be used to help actor and input provenance. determine its likelihood of being accurate. 2) Structure and Identity of Data: Services ex- Use Case 5: (SHGE) In order to conform to change data in the form of messages. Messages health and safety requirements, a chemist, C, plans specify the operation that the client wishes to per- an experiment prior to performing it. The plan is at form as well as a set of structured data to be a high-level, e.g. including the steps of mixing and analysed and/or to be used to configure the analysis. analysing materials but excluding implied steps like Use Case 3: (ICE) A bioinformatician, B, per- measuring out materials. C performs the experiment. forms an experiment on a set of chromosome data, Later, another chemist, R, determines whether the from which the exon and intron sequences have experiment carried out conformed to the plan. 2 been extracted. As a result of that experiment, B In Use Case 5, the pre-defined plan of the exper- identifies a highly compressable intron sequence. B iment does not necessarily match the actual steps identifies which chromosome the intron originally performed. As shown in Figure 1, a single planned came from. 2 activity may map to one or more actual activities. In Use Case 3, data elements within the messages As described in the use case, the plan is produced exhanged between services need to be consistently before any provenance data is recorded, but is used identified. We cannot guarantee that the content of in comparison with the provenance data. It is an the data itself provides unique identification, so an example of provenance metadata: data independent identitifier may have to be associated with the data. from but used in conjunction with provenance data. To satisfy the questions regarding a data element, Given that provenance metadata is of an arbitrary its identifier should be usable in queries about the wide scope, any framework for supporting the use provenance data. Finally, to associate an identifier of provenance must take into account stores of meta- with an element of a message recorded in the data that will be queried along with the provenance provenance data, there must be a way to reference data. that element. The context of an experiment is anything that Use Case 4: (PDE) A physicist, P, extracts a was true when the experiment was performed. Some subset of data from a large data set, owned by contextual information is relevant to the provenance the Collaboration, and performs experiments on that questions. In Use Case 6, the experiment configu- subset over time. The Collaboration later updates ration, the spectrometer voltage, is relevant to the the data set with new data. P determines whether question asked later.
Planned Activity Planned Activity Planned Activity this means that we need to delimit one set of service interactions from another. We define a session as a group of service interactions (experiment activities). Use Case 8: (SRE) A computer scientist, C, calls Actual Actual Activity Activity service X which calculates the mean average of two numbers as (a/2)+(b/2). C then calls service Actual Activity Actual Activity Actual Activity Actual Activity Y with the same two numbers, where Y calculates the average as (a+b)/2. C does not know if X or Actual Activity Y are reliable, so by getting results from both, C can compare them and, if they are the same, be more sure having the correct result (because the Fig. 1. Plans in CombeChem: planned activities do not map exactly same value is produced by two different services). to performed activities However, X and Y may use a common third service, Z, behind the scenes, e.g. to perform division oper- ations. If Z is faulty then the results from X and Y Use Case 6: (PIE) A biologist, B, sets the volt- may be consistent but wrong. For extra assurance, age of a mass spectrometer before performing an C determines whether X and Y did in fact use a experiment to determine the mass-to-charge ratio common third service. 2 of peptides. Later another biologist, R, judges the experiment results and considers them to be partic- C ularly accurate. R determines the voltage used in the experiment so that it can be set the same for Uses Uses measuring peptides of the same protein in future experiments. 2 Service X Service Y A particular type of metadata is semantic infor- Uses Uses mation about the entities involved in an experiment. Service Z For instance, the following use case requires se- mantic metadata about the data exchanged between services in the experiments. C Service X Service Z Use Case 7: (ICE) A bioinformatician, B, per- forms an experiment on a FASTA sequence en- Session 1 coding a nucleotide sequence. A reviewer, R, later C Service Y Service Z determines whether or not the sequence was in fact processed by a service that meaningfully processes Session 2 protein sequences only. 2 Use Case 7 requires not only that an ontology Fig. 2. Sessions using the same common service in e-Demand: the client is unaware that two services, X and Y performing the same of biological data types is provided, but also that function using different algorithms, rely on a common service Z provenance data can be annotated with semantic types. This does not require, however, that the semantic annotation be stored in the same place as In Use Case 8, two sessions must be distinguished the data. in order to answer the provenance question. The first Technical Requirement 4: PASOA should pro- session is the execution of X and all its dependen- vide for provenance data and associated metadata cies, the second is the execution of Y and all its in different stores to being integrated in providing dependencies. The scenario is depicted in Figure 2. the answer to a query. The provenance question can then be expressed as: 4) Sessions: We have found that many use cases was the same service used in both sessions? Sim- compare the run of one experiment to that of ilarly, Bioinformatics Use Case 1 requires that we another, requiring that records regarding those ex- compare two experiments, recorded as two sessions, periments include a delimitation of one experiment and show the differences. from another. In service-oriented architecture terms, Technical Requirement 5: PASOA should pro-
vide a mechanism by which to group recorded experiments. provenance data into a session, and should allow Technical Requirement 6: PASOA should pro- comparison between sessions. vide for the provenance data to be returned in the 5) Query: The actor asking a provenance ques- groups specified at the time of recording or searched tion does not always know in advance which specific through on the basis of contextual criteria. experiments or data their question addresses. For 6) Processing and Visualisation: In most use example, in Use Case 9, we do not know which cases, the full provenance data of an experiment experiments we are looking for in advance, only is not presented to the user in order to answer the which source material was used as input to them, provenance question. It must first be analysed and and perhaps contextual information such as the then presented in a form that makes the answer to experimenter. the provenance question clear. Use Case 9: (SHGE) A chemist, C, performs an Use Case 11: (SHGE) A chemist, C, performs experiment but then examines the results and finds an experiment to determine the characteristics of a them doubtful. C determines the source material liquid by bouncing laser light off of it and exam- used in the experiment and then which other recent ining the changes to the polarisation of the light. experiments used material from the same batch. As this method is fairly new, it is not established C examines the results of those experiments to how to then process the results. C analyses the determine whether the batch may have been con- results through a plan, i.e. a succession of processes, taminated and so should be discarded. 2 that seem appropriate at the time and ends with Given that we expect a large volume of prove- potentially interesting results. At a later date, C nance data to be recorded over the course of many determines the high-level plan that they followed experiments, a search mechanism is required to and re-performs the experiment with different liquid answer the provenance question of Use Case 9. Data and configuration. 2 from one experiment may be used to improve the Use Case 12: (STE) A service, X, is accessed quality of future results by filtering intermediary by by an intruder, I, that should not have rights data, as follows. to do so. Later, an administrator becomes aware Use Case 10: (PIE) A biologist, B, performs of the intrusion and determines the time and the many experiments over time to discover the char- credentials used by the intruder to gain access. 2 acteristics of peptide fragments. The fragments are In Use Case 11, the provenance data provides used as evidence that a peptide is in the analysed the full information of what has occurred, but to material. Usually the discovery of several fragments answer the question, C requires a high-level plan. is required to confidently identify a peptide, but The provenance data therefore needs to be processed some fragments are unique enough to be adequate to answer the question. Again in Use Case 12, alone. B determines that a fragment with particular the provenance data must be processed in order to characteristics is produced most times a particular provide an answer to the provenance question. All peptide was analysed and rarely or never when that answers to provenance questions have to be made peptide was not present. 2 presentable to the user. For example, in Use Case To understand the range of queries required, we 13, the provenance data is presented in a report. can present those required to help achieve some of Use Case 13: (ICE) A bioinformatician, B, per- the use cases described above. To achieve Use Case forms an experiment. B publishes the results and 1, the user asks for the full contents of the records makes a record of the experiment details available of two experiments, so that a comparison can then for the interest of B’s peers. 2 be made. To achieve Use Case 2, the user asks for Technical Requirement 7: PASOA should pro- the interaction that has a given piece of data as vide a framework for introducing processing of its output. To achieve Use Case 8, the user asks provenance data of all three types discussed in for all services used in two given experiments. To Section IV-B.1 (interaction, actor and input prove- achieve Use Case 5, the user asks for all experiments nance), using various methods, then visualising the using a given piece of data as input. To achieve Use results of that processing. Case 10, the user asks for all peptides output as 7) Non-repudiation: In some cases, such as intermediary data in previous protein identification where the experimental results justify the efficacy
of a new drug for example, the provenance does periment to identify peptides in a sample. Iden- not just need to verify that the experiment was per- tifications are made by comparing characteristics formed as stated but prove it. To aid this, all parties of the peptides and their fragments with already in an experiment could record the provenance from known matches in a database. In the experiment, their own perspective, and these perspectives can some peptides are identified, others cannot be. Later, then be compared. Along with other measures to after other experiments have been conducted, the prevent collusion or tampering with the provenance database contains more information. The system au- data, the joint provenance data provides evidence of tomatically re-enacts the analysis of those peptides the experiment that cannot be denied, or repudiated. that were not identified. 2 One use case that requires multiple parties to In Use Case 16, the scientists can use prove- record provenance independently is where the in- nance data to re-enact the experiment. The re- tellectual property rights of the experimenter may enactment can even be automatic, since changes conflict with those of the services they use in in the databases can be matched to experiments experiments, as now described. that use those databases. In order to re-enact the Use Case 14: (ICE) A bioinformatician, B, per- experiment the following information is needed: the forms an experiment from which they develop a service called in at each stage of an experiment and new drug. B attempts to patent the drug. The patent the inputs given to each service. The provenance reviewer, R, checks that the experiment did not use data regarding previous experiments may be used a database that is free only for non-commercial use, in a less automated fashion to determine how future such as the Ecoli database. 2 experiments are to be run. As well as being able to prove particular services In fact, there are several different ways in which were used in an experiment, we may also need to be experimental process can be re-used. Re-enactment able to prove the time at which it was done, so that is performing the same experiment, but using con- researchers can (or cannot) claim they performed an temporary data and services, while repetition means experiment earlier than a published one. performing the same experiment with the same data Use Case 15: (SHGE) A chemist, C, performs and services as before, e.g. to test that the results an experiment finishing at a particular time. D later can be reproduced. Also, rather than performing the performs the same experiment and submits a patent whole experiment again, a scientist may wish to for the result and the process that led to it to patent perform it only up until the stage that intermediate officer R. C claims to R that they performed the results differ, to detect at what point the difference experiment before D. R determines whether C is lies. correct. 2 Technical Requirement 9: PASOA should pro- Technical Requirement 8: PASOA should pro- vide for the use of provenance data to re-enact an vide a mechanism for recording adequate prove- experiment using the same process but new inputs, nance data, in an unmodifiable way, to make results and to reproduce an experiment with the same non-repudiable. process and inputs. 8) Re-using Experimental Process: Provenance 9) Aggregated Service Information: The prove- data can be used in deciding what should happen in nance data provides information on services used the future. An experiment is performed to achieve in experiments as well as experiments themselves. some goal, such as verifying a hypothesis. The Combining the information of several traces allows provenance data can be used to identify the process the scientist to aggregate data about individual ser- and to repeat it. vices used in multiple experiments, as illustrated in Use Case 16: (CGE) A bioinformatician, B, per- the next use case. forms an experiment using as input data a specific Use Case 18: (CGE) Several bioinformaticians human chromosome from the most recent version perform experiments using service X. Another of a database. Later, another bioinformatician, D, bioinformatician, B, constructs a workflow that uses updates the chromosome data. B re-enacts the same X. B can estimate the duration that the experiment experiment with the most recent version of the might take on the basis of the average time X has chromosome data. 2 taken to complete its tasks before. 2 Use Case 17: (PIE) A biologist performs an ex- Technical Requirement 10: PASOA should pro-
vide for querying, over provenance data of multiple Collaboration. The Collaboration stores the results experiments, about the aggregate behaviour and and provenance data with security, fidelity and ac- properties of services. cessibility for a longer period of time that P or G are able to. 2 As services are distributed, provenance may be C. Non-functional Requirements stored in a distributed manner and must be linked Other use cases provide us with non-functional up in order to answer queries. It is clear that requirements, regarding how the architecture should provenance storage should be distributed but that operate. Since the use cases presented highlight queries should draw provenance data from all rele- demands on the way in which provenance data vant stores. should be recorded, stored and used, there is not a Technical Requirement 12: PASOA should pro- provenance question in every case, i.e. there is not vide for distribution in the storage of provenance always a new function realised by the provenance data and allow queries to draw data from multiple architecture. stores. 1) Storage: All provenance use cases require 3) Very Large Data Sets: Where data is relatively some reliable storage mechanism for the provenance small it can be stored easily for long periods. data; however, some require long-term storage of However, in some cases, it can be very large, such provenance to satisfy their needs, while others re- as in the Use Case 21. quire the data to be preserved and accessible only Use Case 21: (PDE) A physicist, P, performs an in the short-term. An example of the former type of experiment using detector data as input. The size of use case is the following. the detector data is in the order of petabytes. The Use Case 19: (SHGE) A chemist, C, performs provenance data of the experiment is recorded for an experiment. C then publishes their results on- later use without copying the data set. 2 line. Another chemist, R, discovers the published It is impractical to store or process data multiple results years later. R determines whether the results times for very large data sets, and provenance are valid by checking the experimental process that architectures must address this. was performed. 2 Technical Requirement 13: PASOA should pro- In order for provenance data to be accessible as a vide for recording and querying the provenance of part of a publication, it should persist as long as the very large data sets. publication, preferably forever. On the other hand, 4) Integration with Existing Software: In some for many use cases the provenance data may only domains, de-facto standards exist for recording retain its relevance for a matter of hours, months or some of the process information electronically, and years. in some cases there is also software support. For Technical Requirement 11: PASOA should pro- example, the provenance question in Use Case 22 vide for the management of the period of storage of can be answered using data from legacy software. provenance data to be managed, including preserva- Use Case 22: (PDE) An existing service, X, reg- tion of data for indefinite periods or deletion after ularly records the versions of libraries installed given periods. on computer node N. X records the version of 2) Distribution: Given that e-Science experi- library L at time T. A physicist, P, performs an ments can involve many services owned by many experiment using data produced by N. P examines parties, it is impractical to expect a single data store the experiment results and judges that they may be to be used to retain all of the provenance data. An incorrect. P queries the provenance data to discover example of this is given in Use Case 20. the library versions used by N when producing the Use Case 20: (PDE) A physicist, P, performs data. 2 a set of experiments. A selective subset of the Developers of a new provenance architecture have results, including the provenance data of the ex- to be aware of existing standards for recording and periments that produced them, are made available accessing provenance data and ensure that their soft- to the physicist’s Physics Working Group, G. The ware interoperates with that which already exists. administrators of G then make a subset of those Also, forthcoming standards that have the support of results, including their provenance, available to the the community should be acknowledged, and prove-
nance architectures should be able to interoperate sented use cases. Our analysis has led to a number with them. of architecural design decisions, which we outline Use Case 23: (PIE) A biologist, B, performs an in this section. We then describe our provenance experiment. B then queries the provenance data architecture. regarding that experiment by using software that follows the widely supported Proteomics Standards Initiative [6]. 2 A. Design Decisions Technical Requirement 14: PASOA should pro- The technical requirements of Section IV have vide for the integration of the architecture with informed a number of design decisions regarding existing standards and software. the PASOA architecture. We describe the most significant ones below. D. Summary 1) Separation of concerns: The breadth of use The types of use use case listed above can be cases shows the potentially unlimited scope of summarised as the following general tasks. functionality that a provenance architecture could • Checking whether results were due to interest- provide. We need to separate concerns so as to ing features of the material being experimented provide a framework which can be built upon to on or nuances of the experiment performed. satisfy not only use cases above, but also new ones • Determining the probable effectiveness of sim- as they appear. It should be noted that very few of ilar future experiments. the concerns expressed in the technical requirements • Accessing a historical record, or aide memoire, apply universally and uniformly to all applications; of work conducted. there is just a general need for recording, querying • Proving that the experiment claimed to have and processing provenance data. As querying re- been done was actually done. quires that data be recorded in a queryable form and • Proving that the experiment done conformed to processing requires that data can be queried using a required standard. a pre-defined mechanism, recording can be seen as • Checking that the experiment was performed a crucial part of this architecture. Also, recording correctly, and the services involved used cor- needs to be consistent across applications for open rectly. system querying and processing of the provenance • Tracing where data came from and the pro- data. cesses it had been through to reach its current Hence, we define a layered architecture with form. three layers, each building on the previous one: (i) • Tracing which source data was used to produce Fundamentals of recording and access, (ii) Query- given result data and vice-versa. ing, and (iii) Processing. Application specificity • Linking together data and experiments by their should be pushed up these three layers where provenance data, to provide extra context to possible, in order to separate out general from understanding those experiments. application-specific concerns. • Deriving the higher-level processes that have 2) Recording based on interaction provenance: been gone through to perform an experiment, As described in Section IV-B.1, we have determined so that they can be checked and re-used. there to be at least three types of provenance data: • Providing the process information required for interaction provenance, actor provenance and in- publishing an experiment’s results. put provenance. Our architecture, therefore, has to • Verifying that services used are working as they support the recording and use of all these types should be. of data, and, importantly, to maintain the links • Allowing experiments to be re-enacted to check that exist between them: execution involves the that services and/or data has not changed in a interaction of actors exchanging data. We argue that way which affects the results. this can best be done by viewing all provenance in relation to interaction provenance. Actor provenance V. P ROPOSED A RCHITECTURE is effectively metadata to interaction provenance, as In the PASOA project, we aim to provide a it describes the state of actors at the time when framework architecture capable of tackling the pre- an interaction took place, while input provenance
is derivable from sufficiently detailed interaction B. Proposed architecture provenance. Therefore, our architecture should be We have developed a protocol for recording based on the recording of the interaction between provenance according to the design decisions of services, interaction provenance and allow meta- Section V-A, which is detailed in [20] and not data regarding each interaction to be additionally expanded on further here. We can now design an recorded in association. architecture to address the use cases as a whole. Our 3) Interaction-specific or non-provenance meta- proposed architecture is shown in Figure 3, which data: Given the basis of interaction provenance, we embodies the design decisions of Section V-A, and can further separate concerns. Metadata specific to each entity depicted is explained below. an interaction, including the state of an actor or the Data data exchanged, must clearly be associated directly Visualisation Services with the interaction and so should be recognised in our recording provenance data procedures. Other Trace Visualiser / Trace Difference Trace Validity Presentation Browser Visualiser Visualiser metadata can be stored elsewhere and references Services Service Workflow made to the provenance data to make the association Publication Browser Quality Visualiser Construction Tool explicit. The metadata will then be used together when performing queries or processing. Service Trace Semantic 4) Reference of elements in the store: In order to Processing Auditing Service Quality Analysers Comparison Service Validity Analyser associate metadata with actors and data in interac- Services Trace tions, there must be a way to refer to those entities. User Re-enactment Service to Workflow Converter Publication Generator First, we can provide a way to reference recorded interactions and the messages passed in those inter- User Provenance Proxy actions. Then, while the structure of data used in Recording Tools Query API Provenance Recording experiments will vary widely, we can provide some Service uniformity in referring to elements of the data at the Interaction Interaction Submission API query level by using common abstractions over the CVS Provenance Services Provenance Service Query API data types. Provenance Annotater Policy Submission API Enforcer 5) Independent identification: In uniquely iden- Portlets tifying data elements and actors, we can again Submission Client-Side separate concerns. While we can and should pro- Workflow Library Non- Ontology Provenance vide unique identifiers for each interaction that is Enactment Engine Data Stores User Experiment recorded, we leave identification of data elements Services Metadata Store Domain- to be metadata provided by external services and Specific Services allow them to be used in querying. 6) Extensible architecture for querying: As the Fig. 3. Proposed PASOA Provenance Architecture data comes in many forms and structures, because we should attempt to fit in with existing standards and software in some cases, and because the ques- The Interaction Provenance Service stores inter- tions asked about past experiments vary consider- action provenance, annotated with actor provenance ably between applications, we cannot and should and input provenance where supplied (satisfying not provide a single query interface for them all. TR 1). The storage has a consistent structure for However, we can take a layered approach, whereby all provenance data so that recorded data can be re- we provide a few general search mechanisms over ferred to (satisfying TR 3) and identifiers associated the provenance data with the aim that it will ease the with the referenced data (satisfying TR 2). prove- development of application-specific query engines. nance data can be assigned group identifiers (satis- There should be no compulsion for these query fying TR 5) and multiple parties can record prove- mechanisms to be used if it is easier to search for nance on the same interaction (satisfying TR 8). results without them. Query APIs provide access to the provenance data
using different query languages (satisfying TR 6). processing services into human-interpretable form, Because the provenance data can be referred to and as per several of the use cases. In Figure 3, we the query languages are flexible, aggregated infor- show presentation services required for several of mation regarding services can be derived (satisfying the use cases: Trace Difference Visualiser for Use TR 10). Case 1 etc. Some data requires specific visualisation The Policy Enforcer verifies that both parties in and Data Visualisation Services transform them for an interaction agree on the events that make up human interpretation. an interaction and uses policies to determine how We believe this architecture addresses the func- the interaction provenance service should respond tional requirements of the presented use cases. In in case of disagreement. The Proxy Provenance future work, discussed in Section VII, we need Recording Service acts as a trusted intermediary to make the architecture robust enough to work recording provenance for services that cannot record as a production provenance system, in particular provenance themselves. Operation calls are passed addressing non-functional TRs 11, 12, 13 and 14. to and then forwarded by the proxy. The group of Interaction Provenance Services defines the whole VI. P RELIMINARY I MPLEMENTATION set of provenance services and proxies available to We have created a first, basic implementation interacting services. This should be scalable and of the architecture, PReServ, available to download secure in its entirety. The Submission Client-Side from www.pasoa.org, and are beginning to evaluate Library supports the provenance data submission, to its effectiveness in satisfying the use cases. We ease the task of services wishing to use the PASOA chose to attempt to achieve Use Case 8, which asks architecture. The Experiment Services are the set a simple question of potentially complex provenance of services using the PASOA architecture to record data. A far more detailed version of this evaluation provenance. This includes workflow enactment en- was conducted by the scientists themselves and is gines, which act as clients to other services, and discussed in [32]. domain-specific services, e.g. bioinformatics tools. We implemented three Web Services and a client provenance data stored in multiple distributed Inter- as stated in the use case. We wrote all code in action Provenance Services is combined to provide Java 1.4, used Axis 1.1 for all sending and parsing a full picture of an experiment (satisfying TR 12). all Web Service calls and deployed the services on User Provenance Recording Tools are client-side Tomcat 5.0. We used a single provenance store for tools used to allow users to behave as services in all provenance data. Axis allows handlers to easily the provenance data submission process. be introduced into the parsing of incoming and Processing Services are tools that add value to the outgoing handlers, by modifying the deployment provenance by processing it (satisfying TR 7). The descriptor and including a JAR archive on the class provenance data, including metadata, is extracted path. Our architecture implementation includes an from the provenance services using the Query APIs. Axis handler that automatically sends to a prove- Each processing service shown is taken from a nance store every SOAP message that is received specific use case, and includes services to re-enact or sent by the service. experiments (satisfying TR 9). The message passed between each client/service Non-Provenance Data Stores are stores of data in invocation or result is recorded in the provenance that do not relate to the provenance of a particu- service by both parties in each interaction (via the lar experiment execution, actors or data. The data Axis handler). To distinguish the calling of X and may exist before any auditable experiment is run. the calling of Y, we use two session identifiers, as Examples are ontologies, which are used to provide illustrated in Figure 2. The first session identifier is semantic terms for testing the semantic validity of recorded along with the interaction of C and X and experiments and user stored metadata that can be with the interaction of X and Z. The second session referred to by provenance metadata. Because, in identifier is recorded along with the interaction of our architecture, it can be processed along with the C and Y and with the interaction of Y and Z. The provenance data, this satisfies TR 4. session identifier is communicated between services Presentation Services are particular types of pro- in the SOAP message header, stripped out and used cessing service that transform the results of other by the Axis handler.
After X and Y have finished, C attempts to VIII. C ONCLUSIONS determine whether they used a common service. We have presented a broad range of use cases C queries the provenance service find the list of regarding the recording and use of the provenance interactions that were recorded with the first session data of scientific experiments. We have observed identifier, and from this data discovers which ser- that there is little that spans all use cases, but many vices were used. The same is then done for the other issues appear in a range of areas. Our proposed session identifier. Finally, C takes the intersection protocol and architecture attempts to separate the of the set of services used in the first session and general from the application specific concerns and those used in the second session, to produce the set provide a framework for building solid recording of services used in both, and outputs this set. The provenance data, querying and processing software. set consists of a single element, the identity of Z, It is clear that we can provide generic middle- so C knows this was used by both X and Y. ware that allows the provenance-related use cases The same process will work regardless of the to be more easily achieved. We have separated the complexity of the operation of X and Y. For exam- tasks supported by the architecture into recording, ple, X may call a long succession of other services querying and processing, with each depending on in order to achieve its results, one or more of which the former. As far as possible, we intend to push occur in Y’s operation also. The common set of application-specific solutions into the processing. services can still be discovered. While there are many issues still to be addressed, we believe our architecture provides the foundations VII. F UTURE W ORK of a full solution. While the architecture described is a framework for satisfying use cases, there are many details to IX. ACKNOWLEDGEMENT be resolved. We wish to thank all that have contributed the First, several non-functional requirements relating requirements used in this paper. In particular, this in- to storage of provenance data must be met, partic- cludes Klaus-Peter Zauner for the Intron Complex- ularly the management of storage duration (TR 11) ity Experiment, David O’Connor and Paul Skipp and storage of large quantities of data (TR 13). for the Protein Identification Experiment, Mark There are a number of compelling reasons for Greenwood, Chris Wroe and Nedim Alpdemir for distributing the storage of provenance data, as sug- gested in TR 12. First, our architecture should the Candidate Gene Experiment, Paul Townend for the Service Reliability Experiment, Hugo Mills for ensure there is not a single point of failure in provid- the Simple Harmonic Generation Experiment, and ing access to provenance data. Further, we should Ronald Ashri and Terry Payne for the Security allow service owners to keep data related to their Testing Experiment. This research is funded by the service within their own security domain. However, PASOA project (EPSRC GR/S67623/01). as pointed out in Use Case 20, the architecture should provide a way to view data from multiple provenance stores in a unified way. R EFERENCES The PASOA architecture should ensure that the [1] Business Process Execution Language for Web Services Version 1.1. http://www-128.ibm.com/developerworks/library/ws-bpel/, performance of the system does not significantly 2004. deteriorate as the number of provenance stores, [2] e-Demand. http://www.comp.leeds.ac.uk/edemand, 2004. provenance data, provenance data recorders or dis- [3] GenBank. http://www.ncbi.nlm.nih.gov/Genbank/, 2004. tribution of data increases. As indicated in TR 14, [4] Gene Ontology Consortium. http://www.geneontology.org/, 2004. adapters for storing and querying provenance data [5] myGrid. http://www.mygrid.org.uk, 2004. may have to be provided to integrate our provenance [6] PSI. http://psidev.sourceforge.net, 2004. architecture with other existing standards, software [7] Web Services Architecture. http://www.w3.org/TR/ws-arch/, 2004. and protocols. [8] Matthew Addis, Justin Ferris, Mark Greenwood, Darren Mar- Finally, the current architecture does not address vin, Peter Li, Tom Oinn, and Anil Wipat. Experiences with the needs of controlling access to the provenance escience workflow specification and enactment in bioinformat- ics. In Proc. of the UK OST e-Science second All Hands data, which is essential for any real world deploy- Meeting 2003 (AHM’03), pages 459–467, Nottingham, UK, ment. September 2003.
You can also read