DataGrid, Prototype of a Biomedical Grid
Methods MIMST 12 © 2003 Schattauer GmbH
Methods Inf Med 2/2003

V. Breton1, R. Medina2, J. Montagnat3
1 Laboratoire de Physique Corpusculaire, CNRS-IN2P3, Campus des Cézeaux, Aubière, France
2 Laboratoire d’Informatique, de Modélisation et d’Optimisation des Systèmes, Université Blaise Pascal, Campus des Cézeaux, Aubière, France
3 Creatis, CNRS UMR 5515, INSA, Bât. B. Pascal, Villeurbanne, France

Summary

Background: The availability of large amounts of data in heterogeneous formats and the rapid progress in fields such as computer based drug design, medical imaging and medical simulations have led to a growing demand for large computational power and easy accessibility to heterogeneous data sources.

Objectives: The goal is to address these needs by deploying computing grids. Grids provide both large scale, distributed storage facilities and increased computing power. Moreover, grids are a promising tool to foster the synergy between bio-informatics and computerised medical imaging.

Methods: A first biomedical grid is being deployed within the framework of the DataGrid IST project (http://www.edg.org). The goal of the project is to provide a novel environment to support globally distributed scientific exploration involving up to multi-Petabyte datasets.

Results and Conclusions: The first biomedical applications deployed inside the project demonstrate the relevance of the grid paradigm for genomics and medical image processing. They also highlight the specific requirements of the biomedical community.

Keywords
Genomics, medical image processing, computing grid

Methods Inf Med 2003; 42: ■–■

1. Introduction

Bio-informatics and automated medical image analysis are today identified as high priority research by funding agencies because progress in health care is clearly connected to the analysis of genomics data and the diffusion of information technology in medicine.

But how can multiple laboratories around Europe collect genomics and post-genomics data and analyse them in an up-to-date and competitive environment? Large biology or bio-informatics research laboratories have to maintain their own computing resources, but they are facing a challenging growth of the data they need to manage and process with recent algorithms such as data mining.

The medical image processing community is also facing a growing need for large computations to analyse 2D, 3D and 4D images, to simulate medical treatments or surgeries (radiotherapy, plastic surgery…), and to develop computer aided surgery. An increasing need for large computing resources is appearing in hospitals. Physicians should be able to download and process all their patients’ medical data from their office.

The grid paradigm (1, 12) offers CPU and data handling capabilities to the user. Indeed, grids are designed to share multiple computing and data storage resources, interconnected through high bandwidth networks, among a large user community. This differs from the Internet, where the user has to choose which machine to connect to and which information to retrieve among the tremendous amount of data available.

On a grid, the exchange of information between computers is hidden from the user. High level services (resource broker, distributed file system…) hide the underlying infrastructure required to respond to the user requests.

This transparency from the user point of view requires an extra layer of software, called middleware. Besides research projects dedicated to developing new middleware or to enhancing the performance of the existing ones, grids should be deployed to address the needs of the biomedical community using the state of the art in middleware technology.

The DataGrid European project biomedical work package gathers biologists, computer scientists, physicians and physicists around the common goal of deploying a first biomedical grid.

In this paper, we briefly present the DataGrid project, explain the relevance of the grid concept for genomics and medical imaging and describe the first applications being deployed on DataGrid as a proof of concept of a biomedical grid.

2. The DataGrid Project

The goal of the European DataGrid Project (2) is the development of a novel environment to support globally distributed scientific exploration involving multi-Petabyte datasets. The project designs and develops middleware solutions and testbeds capable of scaling to handle Petabytes of distributed data, tens of thousands of resources (processors, disks, etc.), and thousands of simultaneous users. The scale of
the problem and the distribution of the resources and user community preclude straightforward replication of the data on several sites, while the aim of providing a general purpose application environment precludes distributing the data using static policies. This environment is built by combining and extending newly emerging “Grid” technologies to manage large distributed datasets in addition to computational elements. A consequence of this project will be the emergence of fundamental new modes of scientific exploration, since access to fundamental scientific data is no longer restricted to the producer of that data. While the project focuses on scientific applications such as High Energy Physics, Earth and Biomedical Sciences, issues of sharing data are common to many applications and thus the project has a potential impact on future industrial and commercial activities.

3. The Grid, a New Tool to Face the Challenges of Biomedical Sciences

Biomedical sciences are facing a growth of the amount of data, as well as a growing need for processing larger data sets in order to tackle emerging challenges (comparative genomics, image guided epidemiology…). These data are produced in many laboratories and hospitals that are generally not equipped to archive or to analyse them. The format of these data is highly dependent on the device used to produce them, whether an imaging or a sequencing device. These data are generally confidential and should not be accessed without careful identification.

3.1 Using the Grid for Bio-informatics

Biologists are facing an exponential growth of their databases. Every time a new genome is sequenced and annotated, the whole database is processed again to find new homologies. Additional data coming from post-genomics (micro arrays, protein structure…) must also be added. This information comes in multiple formats, from many laboratories around the world. A laboratory actively involved in genomics or post-genomics faces three basic needs related to bio-informatics:
● The need to acquire and store the data produced with its own experimental resources (mass spectrometer, sequencer, etc.).
● The need to access the web servers (EBI, NCBI, InfoBiogen, etc.) where it can compare its sequences to the public data banks and run the available algorithms.
● The need to store private databases as a result of previous data acquisition and analysis.

Once these basic needs are met, some research teams may want to develop their own algorithms to analyze their data. Some others are eager to make their data available to the rest of the community. As a result, the databases made available by the bio-informatics computing centers are updated weekly.

A grid offers the opportunity to provide CPU and storage resources distributed in the laboratories, rather than concentrated in larger and larger computing centers. Its architecture could be a flat grid made of many “small” clusters (10 to 100 CPUs, 1 to 10 TB of disk) where the public databases would be mirrored weekly. Such mirroring can take full advantage of high-bandwidth networks. The biology laboratories would access these resources through web portals providing grid-enabled algorithms running on distributed databases.

3.2 Using the Grid for Medical Imaging

Medical images are distributed over their production sites (radiology departments, hospitals…). Although there is no widely established standard for sharing data between sites today, there is an increasing need for remote medical data access and processing. Medical image databases are huge (several TB of data produced per year in each hospital) due to the amount of data composing 3D or 4D images.

Automatic processing of these databases is increasingly needed in clinical practice. Indeed, the recent availability of multiple digital acquisition devices in hospitals (X-ray, CT, MRI, US, PET scanners…) is responsible for the increasing amount of digital images. The need for large scale management of medical images led to defining distributed health care information systems (3). However, physicians do not yet have the tools they need to easily access medical databases and make use of automated image processing algorithms that could help with diagnosis.

The grid architecture will be extremely valuable for distributing computational resources over a large community of medical users and for easing data access between different centres. Image production centres do not have today the computation resources needed to process their data. A grid architecture would allow medical centres to share computation resources and make image processing algorithms accessible to physicians in all centres. The grid would be responsible for optimising access to the computation resources available. Moreover, the grid architecture would facilitate the development of telemedicine (4, 10).

The grid is also expected to bring solutions to current problems that cannot be handled by commonly available resources in medical centres. Some medical applications have huge memory and computation requirements and can be parallelized. The grid is expected to provide a parallel architecture on which these applications could be run. Indeed, parallel and distributed architectures have been successfully used to solve challenging problems related to medical image visualisation (5, 6) and processing (7). Other medical studies involve very large databases of images that are not necessarily available on a single site.
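As a rough illustration of the flat grid outlined in Section 3.1, the sketch below models a few small clusters that each hold a mirror of a public database, and picks for a given job the eligible cluster with the most free CPUs. It is a minimal sketch under stated assumptions, not DataGrid middleware: the `Cluster` fields, the `pick_cluster` function, the cluster names and the 7-day freshness threshold (standing in for the weekly mirroring) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of one "small" cluster in a flat grid."""
    name: str
    free_cpus: int
    mirror_date: date  # date of the last database mirror on this cluster


# Assumption: a mirror older than one week is considered out of date.
MAX_MIRROR_AGE = timedelta(days=7)


def pick_cluster(clusters: List[Cluster], cpus_needed: int,
                 today: date) -> Optional[Cluster]:
    """Return the cluster with the most free CPUs among those that have
    enough CPUs and a sufficiently recent mirror, or None if none qualifies."""
    eligible = [
        c for c in clusters
        if c.free_cpus >= cpus_needed
        and today - c.mirror_date <= MAX_MIRROR_AGE
    ]
    return max(eligible, key=lambda c: c.free_cpus, default=None)


# Illustrative data only; names and numbers are invented.
today = date(2003, 2, 1)
clusters = [
    Cluster("lab-a", free_cpus=40, mirror_date=date(2003, 1, 30)),
    Cluster("lab-b", free_cpus=90, mirror_date=date(2003, 1, 10)),  # stale mirror
    Cluster("lab-c", free_cpus=12, mirror_date=date(2003, 1, 31)),
]
chosen = pick_cluster(clusters, cpus_needed=10, today=today)
```

In this toy setting the largest cluster is skipped because its mirror is stale, which is the kind of decision a real resource broker would hide from the user.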
4. Grid-Blast, First Use Case in Genomics Comparative Analysis

The first application deployed on the DataGrid biomedical testbed dealt with genomics comparative analysis. BLAST (Basic Local Alignment Search Tool) (8) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. BLAST is typically used by biologists when they need to compare sequences of nucleic acids or amino acids coming from their own research to the ones stored in public databases. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.

Many web portals dedicated to genomics comparative analysis offer biologists the possibility to compare their sequences to databases with BLAST. These portals have to restrict the length and the number of sequences to compare in order to avoid saturating their computing resources. A straightforward benefit of executing BLAST comparisons on a distant grid node is to reduce the work load on the local computers dedicated to the portal. Moreover, the input file of sequences provided by the biologist can be split into smaller sets of sequences that can be compared to the selected database in parallel on several distant grid nodes. This requires an up-to-date copy of the database to be available on these grid nodes.

The impact of executing BLAST on the grid was demonstrated by measuring the time needed to compare the Swissprot database to itself on one Linux Pentium III processor and on a DataGrid testbed. Computing time was reduced by a factor of 80 on the grid.

Based on the experience with the Visual DataGridBlast (9), several bio-informatics algorithms deployed on DataGrid will be made available from the Protein Sequence Analysis portal of the Institut de Biologie et Chimie des Protéines (http://npsa-pbil.ibcp.fr).

5. Design of a Biomedical Grid

Considering the common requirements of bio-informatics and medical imaging, we proposed an architecture for a biomedical grid. In this section, we describe how we perceive the different levels of software between the local operating system and the user and how the community of biomedical users of the grid could organize its work.

As we stressed earlier, running a grid requires an extra layer of software, called middleware. This middleware makes a set of services available to the grid users. Low level services are useful to grid developers and high level services to end users. These services are made available to biomedical users, who can work on the grid provided they are identified as authorized users. Users belonging to a given community are authenticated through a so-called virtual organization.

5.1 The Different Layers of a Biomedical Grid

The DataGrid project is developing a middleware based on the Globus toolkit. Led by C. Kesselman and I. Foster of Argonne National Laboratory (ANL) and the University of Chicago, the Globus project (www.globus.org) (11) is developing fundamental technologies needed to build computational grids. It provides basic services on top of which scientists can develop application programs. The most fundamental layer consists of a set of core services, including resource management, security, remote execution, file transfer, and communications, that enable the linking and interoperation of distributed computer systems. On top of these core services, the DataGrid middleware work packages have been developing an additional layer of services dealing with workload scheduling and management, data management, grid monitoring services, local fabric management and mass storage management. These are general grid services, and an extra layer of so-called biomedical services is needed. Among the services specifically relevant to the needs of the biomedical community are distributed data management, automatic mirroring and updating of databases, visualization and interaction with remote processes…

Fig. 1 Structure of the DataGrid biomedical testbed

These biomedical services are available for different families of applications deployed by different groups of users. Three user groups are experimenting with the DataGrid biomedical testbed today:
● Computer scientists are taking advantage of the grid architecture and services to design new distributed and/or parallel algorithms for bio-medical analysis. Grid-aware algorithms are distributed algorithms that benefit from the grid architecture to optimize and parallelize computations. These algorithms rely on an efficient communication interface for message exchanges between parallel processes. They also usually rely on an efficient data management service to access large amounts of data. Grid-aware algorithm development is an emerging research area and mainly involves the definition of new algorithms.
● Bioinformaticians are creating Grid Services Portals. These are actual service providers wishing to take advantage of the grid’s computational power and data storage capacity. Grid Portals may be used to run the presently existing algorithms as well as new grid-aware ones. Many biomedical service providers rely on web-based technologies to offer access to their databases and computational resources. The applications described in this section intend to take advantage of the grid computational
power and the data storage capacity. Existing or new portals should therefore interface to the grid job submission and data management services.
● Biologists and researchers in image guided diagnosis and therapy use the grid as a cooperative framework: their aim is to take advantage of the grid in order to organize their work in a cooperative manner. A computational grid can help the biomedical community by offering a cooperative framework with shared resources as well as shared databases and data formats. It will allow assembling distributed databases, opening new opportunities for large scale studies such as epidemiological studies. The required components are the data management interface for sharing, replicating, updating, and exchanging data, and for offering access to large CPU resources, etc.

These three families of users/applications have different requirements. The users also do not have the same level of computing awareness. For instance, biologists wishing to use the grid as a collaborative framework are not necessarily as skilled in the use of computers as computer scientists developing distributed algorithms.

Fig. 1 gives a schematic representation of the different layers of the DataGrid biomedical testbed.

5.2 The Biomedical Virtual Organization

A grid is, by definition, shared by different groups of users with different goals. These communities of users use common resources but they do not share their data. Virtual organizations are simply a way to organize the different communities and their access to data and resources. Each grid user has to be registered in one virtual organization, where his roles (access rights, authorizations…) are recorded. The DataGrid testbed is used for applications in High Energy Physics, Earth and Biomedical Sciences. Each research field has its own virtual organization: the one for biomedical sciences is shown in Fig. 2. We divided it into two subgroups: one for genomics and one for medical imaging. The users involved in the different applications are recorded according to their application and their home institute.

Fig. 2 Biomedical virtual organization

Seven applications are presently being deployed on the DataGrid biomedical testbed: 5 in genomics and 2 in medical imaging.

We are going to present three of them briefly:
● Data mining on the grid: Knowledge Discovery in Databases (KDD) stands for the non-trivial process of extracting implicit, previously unknown and potentially useful information contained in stored data. The aim of the Université Blaise Pascal is to prepare the future and to evaluate what would be the benefits and the limits of using the grid to mine very large databases.
● Parallel magnetic resonance image simulator: with the increased interest in computer-aided MRI image analysis methods (segmentation, data fusion, quantization, etc.), there is a greater need for objective methods of algorithm evaluation. In this context, an MRI simulator provides an interesting assessment tool since it generates realistic 3D images (volumes) from virtual medical objects. In order to take into account the MRI artifacts, a 3D simulator is under development at CREATIS. The data grid will be specifically of interest for the parallelization of the isochromats and the MR sequences.
● The Bioinformatics initiative in Padova concentrates on the study of indexing techniques to create and to query large databases of 3D structures (13). Indexing techniques, initially proposed within the area of computer vision, are used in different contexts and differ in the type of invariant properties (either local or global), in the transformation class (rigid body or affine transformations), and in the method used to formulate and verify hypotheses of associations of the query object. Within the proposed scheme, the structural data are stored in separate tables, spread across the grid.

6. Conclusion

Biomedical sciences are facing an exponential growth of the volume of data they need to process and analyze. On one hand, biologists are sequencing more and more genomes and proteins and want to analyze them with more and more sophisticated algorithms. On the other hand, imaging devices are spreading widely in hospitals, generating terabytes of data that need to be stored and made available to physicians.

The computing needs of bio-informatics and medical informatics are basically of the same nature:
● handle large volumes of data produced in many centers,
● define common standards for their interoperability,
● provide to a large community of users (biologists, physicians) a secured and efficient access to their content.

We have tried to demonstrate that the grid paradigm is a good response to these needs. The DataGrid biomedical work package is the first attempt to develop the specific grid services that will allow the biomedical community to successfully address its challenges.

Acknowledgment
The authors acknowledge the contributions of all the participants to the biomedical work package of DataGrid. Special thanks are due to Christophe Blanchet, Emmanuel Cornillot, Nicolas Jacq, and Christian Michau.

References
1. Foster I, Kesselman C. The Grid, blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco, 1999.
2. Segal B. Grid computing: the European data project. IEEE Nuclear Science Symposium and Medical Imaging Conference, Lyon, 15-20 October 2000.
3. Thomson M, Johnson W, Goujun J, Lee J, Tierney B, Terdiman JF. Distributed health care imaging information systems. PACS Design and Evaluation: Engineering and Clinical Issues, volume 3035, SPIE Medical Imaging, 1997.
4. Graves S, Tullio J, Downs JH, Kassel N. Telepresence in neurosurgery: the integrated remote neurosurgical system. Medicine meets virtual reality 5, 1997.
5. von Laszewski G, Su MH, Insley JA, Foster I, Bresnahan J, Kesselman C, Thiebaux M, Rivers ML, Wang S, Tieman B, McNulty I. Real-time analysis, visualization, and Steering of Microtomography experiments at photon sources. 9th SIAM Conference on Parallel Processing for Scientific Computing, April 1999.
6. Li JJ, Miguet S. Parallel volume rendering of medical images. EWPC’92: From theory to sound Practice, pp 332-343, Barcelone, 1992.
7. Miguet S, Nicod JM. An optimal parallel isosurface extraction algorithm. 4th International Workshop on Parallel Image Analysis, pp 65-78, Lyon, France, December 1995.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403-10.
9. Legré Y, Météry R, Fougas AS, Joubert M. Visual DataGridBlast. Private communication.
10. Montagnat J, Davila E, Magnin IE. 3D objects visualization for remote interactive medical applications. 3D Data Visualization, Processing, and Transmission, Padova, Italy, June 2002.
11. Foster I, Kesselman C. Globus: A Metacomputing Infrastructure Toolkit. International J Supercomputer Applications 1997; 11 (2): 115-28.
12. Foster I, Kesselman C, Tuecke S. The anatomy of the Grid: enabling scalable virtual organizations. International J Supercomputer Applications 2001; 15 (3).
13. Guerra C, Lonardi S, Zanotti G. Analysis of secondary structures of proteins using indexing techniques. IEEE Proc. First Int. Symposium on 3D Data Processing Visualization and Transmission, 2002.

Correspondence to:
Vincent Breton
Laboratoire de Physique Corpusculaire
Campus des Cezeaux
63177 Aubière Cedex, France
E-mail: breton@clermont.in2p3.fr