FlyNets and GIF-DB, two Internet databases for molecular interactions in Drosophila melanogaster
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
1998 Oxford University Press Nucleic Acids Research, 1998, Vol. 26, No. 1 89–93 FlyNets and GIF-DB, two Internet databases for molecular interactions in Drosophila melanogaster Elodie Mohr, Florence Horn+, Florence Janody, Catherine Sanchez, Violaine Pillet, Bernard Bellon, Laurence Röder and Bernard Jacq* Laboratoire de Génétique et Physiologie du Développement, IBDM, Parc Scientifique de Luminy, CNRS Case 907, 13288 Marseille Cedex 09, France and 1Atelier de Bio-Informatique, Case 13, Université de Provence, 3 Place Victor Hugo, 13331 Marseille Cedex 03, France Received September 30, 1997; Accepted October 3, 1997 ABSTRACT describing known specific molecular interactions between genes, RNA and proteins are very often underepresented in these GIF-DB and FlyNets are two WWW databases databases and difficult to query. If one considers protein–DNA describing molecular (protein–DNA, protein–RNA and interactions for instance, it is in principle possible to extract protein–protein) interactions occuring in the fly Droso- specific information about them from the major nucleic acid phila melanogaster (http://gifts.univ-mrs.fr/GIFTS_ databases: in the features table of GenBank, EMBL and DDBJ home_ page.html ). GIF-DB is a specialised database databases entries, the ‘protein_bind’ feature was designed to which focuses on molecular interactions involved in localise the regions of DNA or RNA sequences which specifi- the process of embryonic pattern formation, whereas cally interact with proteins and to identify these proteins. Using FlyNets is a new and more general database, the the SRS 5.05 retrieval system (7) on GenBank release 102 long-term goal of which is to report on any published (August 1997), we found only 18 Drosophila melanogaster molecular interaction occuring in the fly. The informa- sequences in which the ‘protein_bind’ feature was present and a tion content of both databases is distributed in specific total number of 59 corresponding DNA or RNA binding sites in lines arranged into an EMBL- (or GenBank-) like them (out of 25 315 sequences from this species). Although the format. These databases achieve a high level of number of Drosophila genes and RNAs for which specific integration with other databases such as FlyBase, interactions with proteins have been published is difficult to EMBL, GenBank and SWISS-PROT through numerous estimate, the above numbers are clearly an underepresentation of hyperlinks. In addition, we also describe SOS-DGDB, our present knowledge. Conversely, on the protein databases side, a new collection of annotated Drosophila gene it is extremely difficult to extract from SWISS-PROT (4) or PIR sequences, in which binding sites for regulatory proteins are directly visible on the DNA primary (5) databases either a list of proteins which interact with a given sequence and hyperlinked both to GIF-DB and TRANS- gene or a list of genes controlled by a given regulatory protein. FAC database entries. The TRANSFAC database (8) gives some precise structural data for transcription factors and their known binding sites. Using again Drosophila as an example, we found in TRANSFAC 40 INTRODUCTION target genes for transcription factors and a total of 322 correspon- Direct and specific molecular interactions, involving DNA, RNA ding DNA-binding sites from this species, a better result than the and proteins play an essential role in all known biological one obtained with GenBank. Even in this case however, data processes. Three major types of interactions, i.e., protein–DNA, essential for the understanding of transcription factor function in protein–RNA and protein–protein interactions account for the their specific biological contexts are missing such as: the great majority of known biological macromolecular interactions. developmental stage at which interaction occurs, the phenotype Several general databases exist for each of the three types of of animals in which the transcription factor is absent or mutated, informational macromolecules, such as GenBank (1), EMBL (2), the biological result of the interaction (gene activation or and DDBJ (3) databases for DNA and RNA sequences, repression), the organisation of the cis-regulatory region and the SWISS-PROT (4) and PIR (5) databases for protein sequences experimental evidence for interaction. As far as protein–RNA and the PDB database (6) for molecular 3D structures. Many and protein–protein interactions are being considered, and more specialised databases exist for specific families of genes, although some very specific databases do exist, such as the RNAs and proteins and this issue of Nucleic Acids Research MHCPEP database of MHC binding peptides for instance (9), provides the reader with an up-to-date collection of such data on these interactions are also not easy to extract from existing databases. Despite this abundance of information sources, data databases. *To whom correspondence should be addressed. Tel: +33 4 91 82 90 55; Fax: +33 4 91 82 06 82; Email: jacq@lgpd.univ-mrs.fr +Present address: Biocomputing 3D modelling unit, European Molecular Biology Laboratory, D-69012 Heidelberg, Germany
85 Nucleic Acids Nucleic Acids Research, Research,1994, 1998,Vol. Vol.22, 26,No. No.11 85 We are interested in the process of pattern formation in several molecules of the same transcription factor on different Drosophila and in understanding the basis of specific identity cis-regulatory binding sites of a given gene will be considered as acquisition by the different body parts (10–12). Different classes one interaction only, in which the individual molecular inter- of genes involved in the segmentation process (maternal, gap, action events represent sub-interaction components. The same is pair-rule and segment polarity genes) divide the embryo along the true for the interaction between an RNA molecule and several antero-posterior axis into repeated homologous units (13,14) identical RNA-binding protein molecules or also for the inter- which will develop specific identities and morphogenetic features action of several copies of two different proteins within a under the control of homeotic genes (15). Specific interactions multimeric complex. within and between these gene families are essential for the establishment of a correct body pattern. Being able to access, THE GIF-DB DATABASE query and manipulate the data on these developmental genes and their functional interactions within specific regulatory networks Purpose and leading concepts of GIF-DB is now recognised as an important need for developmental and GIF-DB, the Gene Interactions in the Fly Database, is a WWW molecular biologists studying gene regulation. From a more database which aims at providing a repository for data on gene general point of view, a basic knowledge of all other known interactions involved in Drosophila embryonic development and macromolecular interactions in Drosophila would certainly help the regulatory networks in which they are involved. The first to integrate our knowledge on the structure and function of the version of the database appeared on the WWW in October 1995 individual genes into a unified and physiological view of the and the concepts, database organisation and entry format of organism. To achieve these goals, the development of new GIF-DB have previously been described (16). Briefly, four main computer and information science tools is needed, among which leading concepts and goals were considered to elaborate GIF-DB. that of interaction databases would probably be one of the most (i) Finding a relatively simple way to represent the various and useful. complex knowledge we presently have on gene molecular In this paper, we describe the concepts, organisation, content interactions during embryonic development of Drosophila. and use of two Drosophila interaction databases: GIF-DB, which (ii) Obtaining a high level of integration with other databases. focuses on developmental molecular interactions and FlyNets, a (iii) Classifying all molecular interactions in one of three major new general Drosophila interaction database. Finally, SOS- interaction types (protein–DNA, protein–RNA or protein– DGDB, a new collection of Drosophila DNA sequences in which protein interactions). (iv) Defining a generic mode of interaction binding sites for regulatory proteins are annotated on the primary representation which could potentially be used for the description sequence and hyperlinked to GIF-DB and TRANSFAC database of nearly any gene interaction, whatever the biological process entries is presented. and the organism in which they occur may be. INTERACTION DEFINITIONS AND CLASSIFICATION Database organisation and entry format Gene molecular interactions should not to be mistaken for genetic In order to fulfill the above requirements, we have developed a interactions. The latter are more general and include both indirect generic structured hypertext format. The GIF-DB interaction and direct interactions. Our working definition for a gene database, which makes use of this format, is a collection of molecular interaction is the following: there is a direct molecular hypertext files, each of them describing an interaction between interaction between gene A and gene B if gene A or one of its two molecular partners as discussed above. Each entry contains products (i.e., mRNA or protein) physically interacts at the biological information which has been arranged into an ‘EMBL- molecular level with gene B or one of its products (mRNA or like’ or ‘SWISS-PROT-like’ model format. Wherever possible, protein). Among the six different molecular interaction types symbols and nomenclature supposed to be familiar to drosophi- which could then theoretically be considered we have focused on lists, geneticists, biochemists and molecular biologists are used to three major types of interactions only, which are by far the most describe the interactions and some conventions used in FlyBase, documented ones, whatever the organism being considered: the genetic and molecular Drosophila database (17) have been protein–DNA interactions (type I), protein–RNA interactions followed. All scientific data found in GIF-DB comes from the (type II) and protein–protein interactions (type III). literature. The information extracted from different articles is In order to simplify the management of data on interactions, compiled (and synthetized if necessary), verified and entered in and in accordance with the above definition of interaction types, DEXIFLY (Horn et al., manuscript submitted), a Drosophila all molecular interactions in GIF-DB and FlyNets databases will relational database built using the 4th Dimension program be described as binary interactions (i.e., interactions occuring (ACI, Inc.) on a MacIntosh computer. The HTML files constitut- between two molecular partners). This could be viewed as a ing GIF-DB are then automatically generated from this database. limitation if one considers what is already known about the Each entry in the GIF-DB database is composed of lines and complexity of gene interactions. However, and within certain different types of lines (each having its own format) are used to limits, any complex interaction which involves more than two record the various types of gene interaction information which partners (interaction between a DNA sequence and several make up the entry. As is the case for the EMBL and SWISS- different transcription factors, or between several proteins into a PROT databases, each line in a GIF-DB entry begins with a multimeric complex, for instance) could be split up into several two-character line code indicating the type of information binary interactions in order to be described. This binary point of contained in the line. Wherever possible, we have tried to use the view is also well adapted to our current experimental approach of linetypes already established by the EMBL and SWISS-PROT genetic and molecular interactions, in which two entities only are databases, but due to the specificity of the GIF-DB database, usually studied at a time. It has to be noted that the binding of many new linetypes had to be introduced. Some of the original
86 Nucleic Acids Research, 1998, Vol. 26, No. 1 linetypes deserve a few comments. In particular, information about the cis-regulatory regions in protein–DNA interaction entries is given with an increasingly detailed view by the group of RR (Regulatory Region location), RS (number and strength of Regulatory Sites) and SS (regulatory Sites Sequence) lines. Several lines contain hypertext pointers towards other databases and at the moment, links towards FlyBase, GenBank, EMBL and SWISS-PROT databases are supported. For the sake of clarity, the 40 different linetypes in a GIF-DB entry have been arranged into five zones: the ENTRY zone, the EFFECTOR zone, the TARGET zone, the INTERACTION zone and the REFER- ENCES zone. More details on the database general organisation, entry format, the different linetypes an the on-line user manual for the database can be found in the first description of GIF-DB (16). Recent developments Version 2.0 of GIF-DB (January 1997) contains 25 entries and each of them is ∼4–6 pages long, with an average of 6–10 associated bibliographic references. Ten new entries have been added and the majority of Version 1.2 entries have been updated. The major change in Version 2.0 is the creation of a collection of annotated DNA sequences linked to GIF-DB (Fig. 1). This collection of sequences has been named SOS-DGDB (Sites On Sequences Drosophila Gene DataBase). A click on any site in the Figure 1. Example of an annotated sequence from the SOS-DGDB database. SS lines of a given GIF-DB entry opens the corresponding A part of the double-stranded genomic sequence of the hunchback gene is displayed. Lower case letters correspond to upstream control sequences and SOS-DGDB sequence entry and points directly at the position of pale blue capital letters at the bottom of the figure correspond to exon I of the the selected site. This possibility allows the user to obtain an gene. Sequence binding sites for different transcription factors are shown as integrated view of the sites for all different trans-acting factors capital colored letters in the sequence. The name of the DNA-binding protein acting upon a cis-regulatory region within the context of the and numbering of the binding site are indicated either above or below the sequence of the target gene. From each binding site, two corresponding site; blue G/T bracketed letters located besides the factors’ name correspond to activable links to either GIF-DB database (G) or TRANSFAC hyperlinks are available which point either back to the GIF-DB database (T) entries. database or to the corresponding site in the TRANSFAC (8) database (if available). At present, our SOS-DGDB collection (Version 1.0) contains sequences for 16 genes and ∼200 binding sites have been color-highlighted on the sequences. Different these interactions (which we estimate to represent a few thousand colors are used to discriminate between the several families of ones) with the precision level that GIF-DB entries have presently transcription factors. As has been extensively discussed in reached is now impossible, since elaboration of a GIF-DB entry Arnone and Davidson (18), identification and analysis of many represents, on average, 20–30 hours of work (including critical cis-regulatory elements is of central importance for understand- reading and analysis of the original papers). We therefore decided ing the function of regulatory networks and of the genomic to build another interaction database with less information fields program for development. The development of databases like than GIF-DB, and for which part of the elaboration task could be SOS-DGDB is a step in that direction, but new specific tools automatized in the future. This database is called the FlyNets devoted to the analysis and comparison of regulatory sequences database and our goal is to progressively integrate in it data about have still to be created. the majority of known molecular Drosophila interactions. THE FlyNets DATABASE Database organisation and entry format Purpose and leading concepts The FlyNets database shares many common features with the GIF-DB database: the definition of three major interaction classes A database like GIF-DB could gain added value if the develop- and the general organisation of entries into five zones have been mental networks that it contains could also be viewed and conserved. This is also true for almost all FlyNets linetypes which analysed in the physiological context of all other regulatory are the same as in GIF-DB, for the conventions adopted in the line networks occurring in the organism. Although a large number of contents (and described in the on-line FlyNets-primer document), molecular interactions constitutive of these networks have been for the line length (80 characters) and for the format of the studied, no database or even a simple list of these interactions is references. The two main differences are the number of linetypes as yet available. Flybase (17) recently started to compile genetic supported and the format of the line headers. A total of 31 linetypes interactions (M. Ashburner, personal communication). Collect- (instead of 40 in GIF-DB) are presently supported. The comments ing and organising data about all Drosophila direct molecular line in FlyNets now regroups data which were present in several interactions for which published experimental evidence is different comment lines in GIF-DB. This line is organised in a way available would therefore represent a major advance in our similar to that of the SWISS-PROT CC line in which different types knowledge on interaction networks. However, describing all of comments are arranged in as many sub-comments. The second
87 Nucleic Acids Nucleic Acids Research, Research,1994, 1998,Vol. Vol.22, 26,No. No.11 87 INTERACTIVE WWW ACCESS TO THE INTERACTION DATABASES GIF-DB and FlyNets databases can be accessed using the World Wide Web through the GIFTS (Gene Interaction in the Fly Transworld Server) WWW server in Marseille. To access the WWW, one needs a WWW browser such as Netscape Navi- gator (from Netscape Communications Corp.) and a link to the Internet. The URL (Uniform Resource Locator, the addressing system used in the WWW) of the GIFTS Server is http://gifts. univ-mrs.fr/GIFTS_home_page.html . In addition to giving an access to these databases, the GIFTS server also provides services such as GIN, a series of annotated pages to help navigate on the Internet and BLASTula, a specialised service giving an integrated access to more than 40 different Blast analysis sites in the world. Many of these Blast servers operate on collections of new DNA sequences from the different genome projects which are not yet integrated in the EMBL, GenBank and DDBJ general databases. There are two different ways to access GIF-DB data once the connection with the server is established: either through a hypertext list of all available entries accessible from the GIF-DB home page or through a query search program. The search can be performed either on the entire database or in any one of the 40 different data lines of all entries. At the moment, FlyNets data is accessible through the hypertext list of entries only, and a query program will soon be available. CONCLUSIONS AND FUTURE PROSPECTS Interaction databases such as GIF-DB and FlyNets provide a simple and straightforward way to make functional links between specific entries from different molecular databases. Such func- tional links are a useful complement to the structural links Figure 2. Example of a typical FlyNets entry. The information content of the (present as database cross-references) existing between many entry is distributed among several lines which have been grouped into five EMBL (or GenBank) entries and their SWISS-PROT (or different zones (Entry, Effector, etc.). Each line starts with an explicit header PIR-International) corresponding translational products. followed by the corresponding data. Hyperlinks towards EMBL, SWISS- After that the different genome projects will have provided us PROT and FlyBase are shown in blue or red. Hyperlinks appearing within the comments and the references point to FlyBase publication reports. The headers with extensive catalogs of genes and proteins for several of the five zones are hyperlinks towards the FlyNets database primer, an on-line organisms, it will be essential to describe how and with which reference manual explaining the conventions used in the database. other molecules these components establish specific interactions, a knowledge which cannot be deduced from their sequences or structures. In what is now called the ‘post-genome’ era, both new experimental methods and new bioinformatics concepts and tools are needed to gradually paint the complicated picture of biological pathways. In this respect, on the informatics side, difference with GIF-DB is the format of linetypes headers: each line interaction databases, such as GIF-DB and FlyNets, as well as in a GIF-DB entry begins with a two-character line code indicating metabolic databases such as EcoCyc (19) or KEGG (20) for the type of information contained in the line, as is the case for the instance, are likely to play increasingly important roles in the near EMBL and SWISS-PROT databases. Since several users of future. We have recently been aware of the existence of GeNet, GIF-DB having found difficulty in memorising the signification of a gene networks database (21) in which Drosophila segmentation a two-letter code for 40 linetypes, we have decided to adopt a networks are also described. Some concepts, originally devel- different convention in FlyNets. The line headers now have a oped in the GIF-DB database, have been introduced in the GeNet GenBank-like format and are explicit words or group of words with database. A nice feature of this database is the presence of a maximum length of 20 characters (e.g. identificator, creation date, interactive schematic representations of regulatory pathways target function, authors, etc.). As is the case for GIF-DB, the linked to gene entries. In an evolutionary perspective, building information necessary to build FlyNets entries is extracted from interaction databases for different organisms would be extremely different scientific articles, compiled, verified and entered into a interesting since it would provide a means to see to what extent relational database (F.Horn, unpublished) from which the HTML homologous genes are working through homologous regulatory files constituting FlyNets are then automatically generated. Version pathways. 1.0 of FlyNets (January 1997) contains ∼70 interaction entries. A Within the next few years, we plan to offer new possibilities typical example of a FlyNets entry is shown in Figure 2. within GIF-DB and FlyNets through the addition of a few new
88 Nucleic Acids Research, 1998, Vol. 26, No. 1 linetypes and the adjunction of hyperlinks towards other data- 3 Tateno,Y. and Gojobori,T. (1997) Nucleic Acids Res., 25, 14–17. [See also bases. Among the scheduled improvements are: links towards this issue Nucleic Acids Res. (1998) 26, 16–20.] 4 Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31–36. [See Medline PubMed abstracts, the Interactive Fly, a cyberspace also this issue Nucleic Acids Res. (1998) 26, 38–42.] guide to Drosophila genes and their roles in development (22) 5 George,D.G., Dodson,R.J., Garavelli,J.S., Haft,D.H., Hunt,L;T., and Flyview (23), a database on expression patterns of Drosophi- Marzec,C.R., Orcutt,B.C., Sidman,K.E., Srinivasarao,G.Y., Yeh,L.S.L., la genes. Part of our data on interactions have now been included et al. (1997) Nucleic Acids Res., 25, 24–27. [See also this issue Nucleic in KNIFE, a knowledge base presently under development (24), Acids Res. (1998) 26, 27–32.] within which graphical representations of interactions and 6 Abola,E.E., Bernstein,F.C. and Koetzle,T.F. (1988) In Lesk,A.M. (ed.) Computational Molecular Biology. Sources and Methods for Sequence regulatory networks are automatically generated. This represents Analysis. Oxford University Press, Oxford, UK. pp. 69–81. a first step towards the simulation of some aspects of the dynamic 7 Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, behavior of developmental genetic regulatory networks. 114–128. Finally, many efforts will also be devoted to the problem of 8 Wingender,E., Kel,A.E., Kel,O.V., Karas,H., Karas,H., Heinemeyer,T. interaction data acquisition. Recent results (Pillet et al., manu- Dietze,P., Knüppel,R., Romaschenko,A.G. and Kolchanov,N.A. (1997) script in preparation) have shown that it is possible, using textual Nucleic Acids Res., 25, 265–268. [See also this issue Nucleic Acids Res. (1998) 26, 362–367.] statistics techniques, to retrieve in a semi-automatic way a list of 9 Brusic,V., Rudy,G., Kyne,A.P. and Harrison,L.C. (1997) Nucleic Acids interactions from a collection of article mini-abstracts found in Res., 25, 269–271. [See also this issue Nucleic Acids Res. (1998) 26, the FlyBase database. We are presently working to improve the 368–371.] efficiency of our methodology and plan to apply it to extract 10 Fasano,L., Röder,L., Cor,N., Alexandre,E., Vola,C., Jacq,B. and information on molecular interactions from Medline abstracts. Kerridge,S. (1991) Cell, 64, 63–79. 11 Röder,L., Vola,C. and Kerridge,S. (1992) Development, 115, 1017–1033. 12 Alexandre,E., Graba,Y., Fasano,L., Gallet,A., Perrin,L., De Zulueta,P., CITING AND CONTACTING GIF-DB, FlyNets OR Pradel,J., Kerridge,S. and Jacq,B. (1996) Mech. Dev., 59, 191–204. SOS-DGDB 13 Gaul,U. and Jäckle,H. (1990) Adv. Genet., 27, 489–504. 14 Nüsslein-Volhard,C. and Wieschaus,E. (1980) Nature, 287, 795–801. If you use GIF-DB, FlyNets or SOS-DGDB as tools in your 15 Lewis,E.B. (1978) Nature, 276, 565–570. published research work, please cite this paper. Comments and 16 Jacq,B., Horn,F., Janody,F., Gompel,N., Serralbo,O. Mohr,E., Leroy,C., inquiries about GIF-DB, FlyNets or SOS-DGDB are welcome Bellon,B., Fasano,L., Laurenti,P. and Röder,L. (1997) Nucleic Acids Res., and should be sent to Bernard Jacq (e-mail: jacq@lgpd.univ- 25, 67–71. mrs.fr). 17 The FlyBase Consortium (1997) Nucleic Acids Res., 25, 63–66. [See also this issue Nucleic Acids Res. (1998) 26, 85–88.] 18 Arnone,M.I. and Davidson, E.H. (1997) Development, 124, 1851–1864. ACKNOWLEDGEMENTS 19 Karp,P., Riley,M., Paley,S., Pelligrini-Toole,A. and Krummenacker,M (1997) Nucleic Acids Res., 25, 43–50. [See also this issue Nucleic Acids We would like to thank J. Euzenat, F. Rechenmann, L. Quoniam, Res. (1998) 26, 50–53.] L. Fasano and M. Djabali for helpul discusssions on computer and 20 Goto,S., Bono,H., Ogata,H., Fujibuchi,W., Nishioka,T. and Kanehisa,M. biological issues about interactions. This work has been sup- (1996) In Hunter,L. and Klein,T. (eds), Proceedings of the Pacific ported by an ACC-SV grant from the MESR (Ministre de Symposium on Biocomputing ‘96. World Scientific Publishing Co., l’enseignement et de la recherche) to B.J. and F.Rechenmann and Singapore, pp. 175–186. 21 GeNet, Spirov,A., A gene Networks database. URL by the CNRS. http://www.iephb.ru/∼spirov/genet00.html 22 Brody,T.B., The Interactive fly. A cyberspace guide to Drosophila genes and their roles in development. Purdue University. URL http://sdb.bio. REFERENCES purdue.edu/fly/aimain/1aahome.htm 1 Benson,D.A., Boguski,M., Lipman,D.J. and Ostell,J. (1997) Nucleic Acids 23 Janning,W., et al. Flyview. A Drosophila Image Database. University of Res., 25, 1–6. [See also this issue Nucleic Acids Res. (1998) 26, 1–7.] Münster. URL http://pbio07.uni-muenster.de/ 2 Stoesser,G., Sterk,P., Tuli,M.A., Stoehr,P.J. and Cameron,G.N. (1997) 24 Euzenat,J., Chemla,C. and Jacq,B. (1997) In Proceedings of the Fifth Nucleic Acids Res., 25, 7–13. [See also this issue Nucleic Acids Res. International Conference on Intelligent Systems for Molecular Biology. (1998) 26, 8–15.] AAAI Press, Menlo Park, CA, pp 108–119.
You can also read