DataGrid, Prototype of a Biomedical Grid


Methods MIMST 12, © 2003 Schattauer GmbH

V. Breton¹, R. Medina², J. Montagnat³
¹ Laboratoire de Physique Corpusculaire, CNRS-IN2P3, Campus des Cézeaux, Aubière, France
² Laboratoire d’Informatique, de Modélisation et d’Optimisation des Systèmes, Université Blaise Pascal, Campus des Cézeaux, Aubière, France
³ Creatis, CNRS UMR 5515, INSA, Bât. B. Pascal, Villeurbanne, France

Summary

Background: The availability of large amounts of data in heterogeneous formats and the rapid progress in fields such as computer-based drug design, medical imaging and medical simulations have led to a growing demand for large computational power and easy access to heterogeneous data sources.
Objectives: The goal is to address these needs by deploying computing grids. Grids provide both large-scale distributed storage facilities and increased computing power. Moreover, grids are a promising tool to foster the synergy between bio-informatics and computerised medical imaging.
Methods: A first biomedical grid is being deployed within the framework of the DataGrid IST project (http://www.edg.org). The goal of the project is to provide a novel environment to support globally distributed scientific exploration involving up to multi-Petabyte datasets.
Results and Conclusions: The first biomedical applications deployed inside the project demonstrate the relevance of the grid paradigm for genomics and medical image processing. They also highlight the specific requirements of the biomedical community.

Keywords
Genomics, medical image processing, computing grid

Methods Inf Med 2003; 42: ■–■
Methods Inf Med 2/2003

1. Introduction

Bio-informatics and automated medical image analysis are today identified as high-priority research by funding agencies because progress in health care is clearly connected to the analysis of genomics data and the diffusion of information technology in medicine.

But how can multiple laboratories around Europe collect genomics and post-genomics data and analyse them in an up-to-date and competitive environment? Large biology or bio-informatics research laboratories have to maintain their own computing resources, but they are facing a challenging growth of the data they need to manage and process with recent algorithms such as data mining.

The medical image processing community is also facing a growing need for large computations to analyse 2D, 3D and 4D images, to simulate medical treatments or surgeries (radiotherapy, plastic surgery…), and to develop computer-aided surgery. An increasing need for large computing resources is appearing in hospitals. Physicians should be able to download and process all their patients’ medical data from their office.

The grid paradigm (1, 12) offers CPU and data handling capabilities to the user. Indeed, grids are designed to share multiple computing and data storage resources, interconnected through high-bandwidth networks, among a large user community. This differs from the Internet, where the user has to choose which machine to connect to and which information to retrieve from the tremendous amount of data available.

On a grid, the exchange of information between computers is hidden from the user. High-level services (resource broker, distributed file system…) hide the underlying infrastructure required to respond to the user requests.

This transparency from the user's point of view requires an extra layer of software, called middleware. Besides research projects dedicated to developing new middleware or to enhancing the performance of existing ones, grids should be deployed to address the needs of the biomedical community using state-of-the-art middleware technology.

The biomedical work package of the European DataGrid project gathers biologists, computer scientists, physicians and physicists around the common goal of deploying a first biomedical grid.

In this paper, we briefly present the DataGrid project, explain the relevance of the grid concept for genomics and medical imaging, and describe the first applications being deployed on DataGrid as a proof of concept of a biomedical grid.

2. The DataGrid Project

The goal of the European DataGrid Project (2) is the development of a novel environment to support globally distributed scientific exploration involving multi-Petabyte datasets. The project designs and develops middleware solutions and testbeds capable of scaling to handle Petabytes of distributed data, tens of thousands of resources (processors, disks, etc.), and thousands of simultaneous users. The scale of the problem and the distribution of the resources and user community preclude straightforward replication of the data on several sites, while the aim of providing a general-purpose application environment precludes distributing the data using static policies. This environment is built by combining and extending newly emerging “Grid” technologies to manage large distributed datasets in addition to computational elements. A consequence of this project will be the emergence of fundamentally new modes of scientific exploration, since access to fundamental scientific data is no longer restricted to the sole producer of that data. While the project focuses on scientific applications such as High Energy Physics, Earth and Biomedical Sciences, issues of sharing data are common to many applications and thus the project has a potential impact on future industrial and commercial activities.

3. The Grid, a New Tool to Face the Challenges of Biomedical Sciences

Biomedical sciences are facing a growth of the amount of data, as well as a growing need for processing larger data sets in order to tackle emerging challenges (comparative genomics, image-guided epidemiology…). These data are produced in many laboratories and hospitals that are generally not equipped to archive or to analyse them. The format of these data is highly dependent on the device used to produce them, whether an imaging or a sequencing device. These data are generally confidential and should not be accessed without careful identification.

3.1 Using the Grid for Bio-informatics

Biologists are facing an exponential growth of their databases. Every time a new genome is sequenced and annotated, the whole database is processed again to find new homologies. Additional data coming from post-genomics (micro arrays, protein structure…) must also be added. This information comes in multiple formats, from many laboratories around the world. A laboratory actively involved in genomics or post-genomics faces three basic needs related to bio-informatics:
● The need to acquire and store the data produced with their own experimental resources (mass spectrometer, sequencer, etc.).
● The need to access the web servers (EBI, NCBI, InfoBiogen, etc.) where they can compare their sequences to the public data banks and run the available algorithms.
● The need to store private databases as a result of previous data acquisition and analysis.

Once these basic needs are met, some research teams may want to develop their own algorithms to analyze their data. Others are eager to make their data available to the rest of the community. As a result, the databases made available by the bio-informatics computing centers are updated weekly.

A grid offers the opportunity to provide CPU and storage resources distributed in the laboratories, rather than concentrated in larger and larger computing centers. Its architecture could be a flat grid made of many “small” clusters (10 to 100 CPUs, 1 to 10 TB of disk) where the public databases would be mirrored weekly. Such mirroring can take full advantage of high-bandwidth networks. The biology laboratories would access these resources through web portals providing grid-enabled algorithms running on distributed databases.

3.2 Using the Grid for Medical Imaging

Medical images are distributed over their production sites (radiology departments, hospitals…). Although there is no widely established standard for sharing data between sites today, there is an increasing need for remote medical data access and processing. Medical image databases are huge (several TB of data produced per year in each hospital) due to the amount of data composing 3D or 4D images.

Automatic processing of these databases is increasingly needed in clinical practice. Indeed, the recent availability of multiple digital acquisition devices in hospitals (X-ray, CT, MRI, US, PET scanners…) is responsible for the increasing amount of digital images. The need for large-scale management of medical images led to the definition of distributed health care information systems (3). However, physicians today do not have the tools they need to easily access medical databases and make use of automated image processing algorithms that could help with diagnosis.

The grid architecture will be extremely valuable for distributing computational resources over a large community of medical users and for easing data access between different centres. Image production centres do not currently have the computation resources needed to process their data. A grid architecture would allow medical centres to share computation resources and make image processing algorithms accessible to physicians in all centres. The grid would be responsible for optimising access to the computation resources available. Moreover, the grid architecture would facilitate the development of telemedicine (4, 10).

The grid is also expected to bring solutions to current problems that cannot be handled by commonly available resources in medical centres. Some medical applications have huge memory and computation requirements and can be parallelized. The grid is expected to provide a parallel architecture in which these applications could be run. Indeed, parallel and distributed architectures have been reported to successfully solve challenging problems related to medical image visualisation (5, 6) and processing (7). Other medical studies involve very large databases of images that are not necessarily available on a single site.


4. Grid-Blast, First Use Case in Genomics Comparative Analysis

The first application deployed on the DataGrid biomedical testbed dealt with genomics comparative analysis. BLAST (Basic Local Alignment Search Tool) (8) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. BLAST is typically used by biologists when they need to compare sequences of nucleic acids or amino acids coming from their own research to the ones stored in public databases. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.

Many web portals in the world dedicated to genomics comparative analysis offer the biologist the possibility to compare his sequences to databases with BLAST. These portals have to restrict the length and the number of sequences to compare in order to avoid saturating their computing resources. A straightforward impact of executing BLAST comparisons on a distant node on a grid is to reduce the work load on the local computers dedicated to the portal. Moreover, the input file of sequences provided by the biologist can be split into smaller sets of sequences that can be compared to the selected database in parallel on several distant grid nodes. This requires an updated copy of the database to be available on these grid nodes.

The impact of executing BLAST on the grid was demonstrated by measuring the time needed to compare the Swissprot database to itself on one Linux Pentium III processor and on a DataGrid testbed. Computing time was reduced by a factor of 80 on the grid.

Based on the experience with the Visual DataGridBlast (9), several bio-informatics algorithms deployed on DataGrid will be made available from the Protein Sequence Analysis portal of the Institut de Biologie et Chimie des Protéines (http://npsa-pbil.ibcp.fr).

5. Design of a Biomedical Grid

Considering the common requirements of bio-informatics and medical imaging, we proposed an architecture for a biomedical grid. In this section, we describe how we perceive the different levels of software between the local operating system and the user, and how the community of biomedical users of the grid could organize its work.

As we stressed earlier, running a grid requires an extra layer of software, called middleware. This middleware makes a set of services available to the grid users. Low-level services are useful to grid developers and high-level services to the end users. These services are made available to biomedical users, who can work on the grid provided they are identified as authorized users. Users belonging to a given community are authenticated through a so-called virtual organization.

5.1 The Different Layers of a Biomedical Grid

The DataGrid project is developing a middleware based on the Globus toolkit. Led by C. Kesselman and I. Foster of Argonne National Laboratory (ANL) and the University of Chicago, the Globus (www.globus.org) project (11) is developing fundamental technologies needed to build computational grids. It provides basic services on top of which scientists can develop application programs. The most fundamental layer consists of a set of core services, including resource management, security, remote execution, file transfer, and communications, that enable the linking and interoperation of distributed computer systems. On top of these core services, the DataGrid middleware work packages have been developing an additional layer of services dealing with workload scheduling and management, data management, grid monitoring services, local fabric management and mass storage management. These are general grid services, and an extra layer of so-called biomedical services is needed. Among the services specifically relevant to the needs of the biomedical community are distributed data management, automatic mirroring and updating of databases, visualization and interaction with remote processes…

Fig. 1 Structure of the DataGrid biomedical testbed

These biomedical services are available for different families of applications deployed by different groups of users. Three user groups are experimenting with the DataGrid biomedical testbed today:
● Computer scientists are taking advantage of the grid architecture and services to design new distributed and/or parallel algorithms for biomedical analysis. Grid-aware algorithms are distributed algorithms that benefit from the grid architecture to optimize and parallelize computations. These algorithms rely on an efficient communication interface for message exchanges between parallel processes. They also usually rely on an efficient data management service to access large amounts of data. Grid-aware algorithm development is an emerging research area and mainly involves the definition of new algorithms.
● Bioinformaticians are creating Grid Services Portals. These are existing service providers wishing to take advantage of the grid’s computational power and data storage capacity. Grid Portals may be used to run the presently existing algorithms as well as new grid-aware ones. Many biomedical service providers rely on web-based technologies to offer access to their databases and computational resources. The applications described in this section intend to take advantage of the grid's computational power and data storage capacity. Existing or new portals should therefore interface to the grid job submission and data management services.
● Biologists and researchers in image-guided diagnosis and therapy use the grid as a cooperative framework: their aim is to take advantage of the grid in order to organize their work in a cooperative manner. A computational grid can help the biomedical community by offering a cooperative framework with shared resources as well as shared databases and data formats. It will allow assembling distributed databases, opening new opportunities for large-scale studies such as epidemiological studies. The required components are the data management interface for sharing, replicating, updating, and exchanging data, and for offering access to large CPU resources, etc.

These three families of users/applications have different requirements. The users also do not have the same level of computing awareness. For instance, biologists wishing to use the grid as a collaborative framework are not necessarily as skilled in the use of computers as computer scientists developing distributed algorithms.

Fig. 1 gives a schematic representation of the different layers of the DataGrid biomedical testbed.

5.2 The Biomedical Virtual Organization

A grid is, by definition, shared by different groups of users with different goals. These communities of users use common resources but they do not share their data. Virtual organizations are simply a way to organize the different communities and their access to data and resources. Each grid user has to be recorded in one virtual organization, where his roles (access rights, authorizations…) are recorded. The DataGrid testbed is used for applications in High Energy Physics, Earth and Biomedical Sciences. Each research field has its own virtual organization: the one for biomedical sciences is shown in Fig. 2. We divided it into two subgroups: one for genomics and one for medical imaging. The users involved in the different applications are recorded according to their application and their home institute.

Seven applications are presently being deployed on the DataGrid biomedical testbed: 5 in genomics and 2 in medical imaging.

We are going to present three of them briefly:
● Data mining on the grid: Knowledge Discovery in Databases (KDD) stands for the non-trivial process of extracting implicit information, previously unknown and potentially useful, contained in stored data. The aim of the Université Blaise Pascal is to prepare the future and to evaluate what would be the benefits and the limits of using the grid to mine very large databases.
● Parallel magnetic resonance image simulator: with the increased interest in computer-aided MRI image analysis methods (segmentation, data fusion, quantization, etc.), there is a greater need for objective methods of algorithm evaluation. In this context, an MRI simulator provides an interesting assessment tool since it generates realistic 3D images (volumes) from virtual medical objects. In order to take into account the MRI artifacts, a 3D simulator is under development at CREATIS. The data grid will be specifically of interest for the parallelization of the isochromats and the MR sequences.
● The Bioinformatics initiative in Padova concentrates on the study of indexing techniques to create and to query large databases of 3D structures (13). Indexing techniques, initially proposed within the area of computer vision, are used in different contexts and differ in the type of invariant properties (either local or global), in the transformation class (rigid body or affine transformations), and in the method used to formulate and verify hypotheses of associations of the query object. Within the proposed scheme, the structural data are stored in separate tables, spread across the grid.
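The input-splitting idea described in Section 4 (cutting the biologist's sequence file into smaller sets that are compared to the database in parallel) can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the paper does not show the Grid-Blast implementation, `compare_subset` is a hypothetical stand-in that only reports exact sequence matches (it is not BLAST), and local threads stand in for distant grid nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, n_parts):
    """Split records into at most n_parts contiguous, near-equal subsets."""
    if not records:
        return []
    n_parts = max(1, min(n_parts, len(records)))
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def compare_subset(subset, database):
    """Hypothetical stand-in for one BLAST job: exact-match lookup only."""
    return [(query_id, db_id)
            for query_id, query_seq in subset
            for db_id, db_seq in database
            if query_seq == db_seq]


def run_in_parallel(records, database, n_nodes):
    """Compare each subset against the same database copy, one worker per subset."""
    parts = split_records(records, n_nodes)
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        per_part = list(pool.map(compare_subset, parts, [database] * len(parts)))
    # Flatten the per-subset hit lists, preserving the input order.
    return [hit for hits in per_part for hit in hits]


if __name__ == "__main__":
    queries = read_fasta(">q1\nACGT\n>q2\nTTGA\nCC\n>q3\nGGGG\n")
    database = [("db1", "GGGG"), ("db2", "ACGT")]
    print(run_in_parallel(queries, database, n_nodes=2))
```

Each worker receives the full database, which mirrors the requirement stated in Section 4 that an updated copy of the database be available on every node taking part in the comparison.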

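The virtual organization described in Section 5.2 can be pictured as a small registry in which each user is recorded with a subgroup, an application, a home institute and a set of roles. The sketch below is only an illustration under that reading: the class names, field names, role strings and sample users are all hypothetical, and the actual DataGrid authorization mechanism is not described in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class GridUser:
    """One registered user; all field names are illustrative."""
    name: str
    institute: str
    application: str
    roles: FrozenSet[str]


@dataclass
class VirtualOrganization:
    """A community of users organized in subgroups (e.g. genomics, medical imaging)."""
    name: str
    subgroups: Dict[str, Dict[str, GridUser]] = field(default_factory=dict)

    def register(self, subgroup: str, user: GridUser) -> None:
        # Record the user under the given subgroup, creating it if needed.
        self.subgroups.setdefault(subgroup, {})[user.name] = user

    def find(self, user_name: str) -> Optional[GridUser]:
        for members in self.subgroups.values():
            if user_name in members:
                return members[user_name]
        return None

    def is_authorized(self, user_name: str, role: str) -> bool:
        # Unknown users are refused; known users need the requested role.
        user = self.find(user_name)
        return user is not None and role in user.roles


biomed = VirtualOrganization("biomedical")
biomed.register("genomics", GridUser(
    "alice", "Institute A", "grid-blast", frozenset({"submit_job"})))
biomed.register("medical imaging", GridUser(
    "bob", "Institute B", "mri-simulator", frozenset({"submit_job", "read_images"})))
print(biomed.is_authorized("alice", "read_images"))
```

Keeping roles on the user record, rather than on each resource, follows the paper's statement that a user's roles are recorded in the virtual organization itself.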
Fig. 2 Biomedical virtual organization

6. Conclusion

Biomedical sciences are facing an exponential growth of the volume of data they need to process and analyze. On the one hand, biologists are sequencing more and more genomes and proteins and want to analyze them with more and more sophisticated algorithms. On the other hand, imaging devices are spreading widely in hospitals, generating terabytes of data that need to be stored and made available to physicians.

The computing needs of bio-informatics and medical informatics are basically of the same nature:


● handle large volumes of data produced in many centers,
● define common standards for their interoperability,
● provide a large community of users (biologists, physicians) with secure and efficient access to their content.

We have tried to demonstrate that the grid paradigm is a good response to these needs. The DataGrid biomedical work package is the first attempt to develop the specific grid services that will allow the biomedical community to successfully address its challenges.

Acknowledgment
The authors acknowledge the contributions of all the participants in the biomedical work package of DataGrid. Special thanks are due to Christophe Blanchet, Emmanuel Cornillot, Nicolas Jacq, and Christian Michau.

References
1. Foster I, Kesselman C. The Grid, blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco, 1999.
2. Segal B. Grid computing: the European data project. IEEE Nuclear Science Symposium and Medical Imaging Conference, Lyon, 15-20 October 2000.
3. Thomson M, Johnson W, Goujun J, Lee J, Tierney B, Terdiman JF. Distributed health care imaging information systems. PACS Design and Evaluation: Engineering and Clinical Issues, volume 3035, SPIE Medical Imaging, 1997.
4. Graves S, Tullio J, Downs JH, Kassel N. Telepresence in neurosurgery: the integrated remote neurosurgical system. Medicine meets virtual reality 5, 1997.
5. von Laszewski G, Su MH, Insley JA, Foster I, Bresnahan J, Kesselman C, Thiebaux M, Rivers ML, Wang S, Tieman B, McNulty I. Real-time analysis, visualization, and steering of microtomography experiments at photon sources. 9th SIAM Conference on Parallel Processing for Scientific Computing, April 1999.
6. Li JJ, Miguet S. Parallel volume rendering of medical images. EWPC’92: From theory to sound Practice, pp 332-343, Barcelone, 1992.
7. Miguet S, Nicod JM. An optimal parallel isosurface extraction algorithm. 4th International Workshop on Parallel Image Analysis, pp 65-78, Lyon, France, December 1995.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403-10.
9. Legré Y, Météry R, Fougas AS, Joubert M. Visual DataGridBlast. Private communication.
10. Montagnat J, Davila E, Magnin IE. 3D objects visualization for remote interactive medical applications. 3D Data Visualization, Processing, and Transmission, Padova, Italy, June 2002.
11. Foster I, Kesselman C. Globus: A Metacomputing Infrastructure Toolkit. International J Supercomputer Applications 1997; 11 (2): 115-28.
12. Foster I, Kesselman C, Tuecke S. The anatomy of the Grid: enabling scalable virtual organizations. International J Supercomputer Applications 2001; 15 (3).
13. Guerra C, Lonardi S, Zanotti G. Analysis of secondary structures of proteins using indexing techniques. IEEE Proc. First Int. Symposium on 3D Data Processing Visualization and Transmission, 2002.

Correspondence to:
Vincent Breton
Laboratoire de Physique Corpusculaire
Campus des Cézeaux
63177 Aubière Cedex, France
E-mail: breton@clermont.in2p3.fr
