Finding function through structural genomics Lawrence Shapiro* and Tim Harris

Finding function through structural genomics
Lawrence Shapiro* and Tim Harris†

The recent availability of whole-genome sequences and large numbers of protein-coding regions from high-throughput cDNA analysis has fundamentally changed experimental biology. These efforts have provided huge databases of protein sequences, many of which are of unknown function. Deciphering the functions of these myriad proteins presents a major intellectual challenge.

Addresses
*Structural Biology Program, Department of Physiology and Biophysics, Mount Sinai School of Medicine, 1425 Madison Avenue, New York, NY 10029, USA; e-mail: shapiro@anguilla.physbio.mssm.edu
†Structural Genomix Inc., 10505 Roselle Street, San Diego, CA 92121, USA; e-mail: tim@stromix.com

Current Opinion in Biotechnology 2000, 11:31–35

0958-1669/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.

Abbreviations
BLAST   basic local alignment search tool
HIT     histidine triad
PTB     phosphotyrosine binding
SH2     src homology-2
TNF     tumor necrosis factor

Introduction

While the revolution in genomics has provided comprehensive lists of newly discovered proteins, a technological revolution in crystallography (particularly multiwavelength anomalous diffraction [MAD] analysis [1,2] and the advent of new synchrotron sources enabling the use of undulator radiation [3,4•]) has enabled a quantum leap in the rapidity with which structures can be determined. It is therefore now possible to determine high-resolution structures of proteins with greatly improved efficiency. The central question addressed in this review is this: can structural knowledge enable us to efficiently answer early-stage biological questions of protein function? Structural biology has focused mainly on the details of protein function, asking the question ‘how does the protein achieve its function?’ Can we now use protein structure to answer the more basic question ‘what does the protein do?’

Traditionally, structural studies have come near the ‘end’ of the biological discovery process. After the function of a protein or protein family has been well characterized by biochemistry, cell biology, and molecular biology studies, the crystal structure has been a logical ‘ultimate’ step for understanding the atomic basis of these molecular functions. Examples of this approach are almost too numerous to mention, for they include most of structural biology: trypsin was known to be a protease, and its specificity and kinetics were well understood long before the structure was determined [5]. The signaling function of tyrosine kinases had been well documented before the first crystal structure revealed the molecular details of their function [6]. And the functions of the molecular chaperones GroEL [7] and DnaK [8] were well understood before structural studies revealed the details of their function.

Is this detail-uncovering role the limit of what structural studies can give to biology? We think that structural studies can now be gainfully employed as part of early-phase biological discovery, to answer the question ‘what is the function of this protein?’ This is the defining feature of structure-based functional genomics.

Almost universally, when a new cDNA sequence is determined, it is compared, using the basic local alignment search tool (BLAST), with cDNAs corresponding to proteins of known function [9]. A similarity ‘hit’ is often used to categorize the biological function of the protein deriving from that cDNA: a kinase-related sequence is usually thought of as a kinase, and biological studies on such a molecule would generally begin by looking for kinase function. Three-dimensional (3D) structures, however, provide advantages over sequence data in two distinct areas. Firstly, 3D structural information can provide new insights into biology by elucidating relatedness to proteins of known biological function, which could not have been arrived at by sequence analysis alone. Secondly, structural information can identify binding motifs and catalytic centers, even for proteins without a known biological function. These, to our minds, are the two main principles of the structure-based functional genomics approach. Here we present a brief review of the current state of structure-based functional genomics, which uses protein structure as a cornerstone in a strategy for determining protein function.

Aha! Functional relationships derived through structural similarity

Sequence conservation is, in general, much weaker than structural conservation. A good example of this can be seen in comparing the oxygen carriers from blood and muscle, hemoglobin and myoglobin. Although these proteins are not clearly related in sequence (e.g. a BLAST search with hemoglobin will not elicit a myoglobin ‘hit’), they are nonetheless closely related in 3D structure [10]. Thus, we can imagine the hypothetical situation of discovering myoglobin for the first time as a muscle-specific cDNA, but not knowing its function. Determining its 3D structure and comparing it with the database of known proteins might lead us to exclaim: ‘Aha! This must be an oxygen carrier!’

A real-world example of this is illustrated in a study of AdipoQ, a protein secreted exclusively from adipocytes, and related to the C1q family of innate immunity proteins,

             32   Analytical biotechnology

Figure 1

An example of unexpected similarity between proteins with undetectable sequence homology, the adipocyte-secreted factor AdipoQ (Acrp30) and TNFs [11•]. (a) A ribbon diagram comparison of the AdipoQ and TNFα trimers. (b) Superposition of Cα traces from monomers of AdipoQ (white) and CD40 ligand, a TNF family cytokine (PDB entry 1ALY). The level of structural similarity is on a par with that between different TNFs. Note that the corresponding sidechains shown take on remarkably similar orientations and thus conserve the overall packing of the cores in the two molecules. (c) Structure-based sequence alignment between the TNFs CD40L, TNFα, and TNFβ and the C1q-like domains from C1qA and AdipoQ. Residues that are identical in all five proteins are shaded red, and those that are conserved in four of the five are shaded blue. β-Strand regions are indicated for AdipoQ by green arrows above the sequence alignment, and below the alignment in blue for CD40L. Sequence identity between C1q-like domains and TNFs is negligible (e.g. 9% between AdipoQ and TNFα, with the alignment shown here). Sequence-based searches such as BLAST have proved unable to identify relatedness between C1q and TNF family members; however, using residue patterns identified in the structure-based alignment enables the identification of potential new TNF/C1q family members.
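The shading rule and the percent-identity figure quoted in the Figure 1 caption can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `shade_columns` and `percent_identity` are hypothetical, the five toy sequences are invented and are not the real CD40L, TNFα, TNFβ, C1qA, or AdipoQ sequences, "conserved in four of the five" is approximated as the same residue appearing in exactly four sequences, and identity is computed only over columns where both sequences have a residue (other conventions for the denominator exist).

```python
from collections import Counter


def shade_columns(aligned):
    """Label each alignment column 'red', 'blue', or '' (unshaded).

    'red'  = the same residue in every sequence
    'blue' = the same residue in all but one sequence (an assumed
             stand-in for "conserved in four of the five")
    """
    n = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        if top == n:
            labels.append("red")
        elif top == n - 1:
            labels.append("blue")
        else:
            labels.append("")
    return labels


def percent_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# Invented toy alignment of five sequences (NOT real protein sequences).
toy_alignment = [
    "GKLY-FA",
    "GRLYSFS",
    "GKIYTFT",
    "GQLY-WV",
    "GELYAFH",
]

labels = shade_columns(toy_alignment)
identity = percent_identity(toy_alignment[0], toy_alignment[1])
```

With a real structure-based alignment, a low pairwise identity alongside a handful of 'red' and 'blue' columns is the pattern the caption describes: too little overall identity for a sequence search to detect, yet a residue pattern that can be used to look for further family members.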

which at the time were of unknown structure [11•]. AdipoQ was discovered in a high-throughput sequencing screen of cDNAs enriched in adipocytes. The crystal structure of AdipoQ revealed clear structural similarity to the TNF family of cytokines (Figure 1). This led to the hypothesis that AdipoQ might function as a cell signaling factor, which now appears to be correct. The similarity to TNF family proteins was not and could not have been found without the structural data. In this case, one key crystal structure transformed our understanding of the biology of two protein families (C1qs and TNFs) and their role in the evolution of the immune system, demonstrated the existence of a TNF superfamily, and resulted in the identification of new members of this class of cytokine. All this came from determining the crystal structure of one (well-chosen) protein of unknown function.

Several important cases of deriving functional hypotheses from structural similarity have arisen in the course of ‘classical’ structural biology studies. Good examples of these include proteins involved in phosphotyrosine binding. Several different types of protein domains are known to function in binding to phosphotyrosine, including src homology-2 (SH2) and phosphotyrosine binding (PTB)
domains, which are not structurally related. When the structure of the first PTB domain was determined [12], the surprising discovery was made that, despite their apparently unrelated sequence, these proteins were structurally similar to pleckstrin homology (PH) domains, which bind negatively charged phospholipids to aid in membrane localization of some proteins. The authors formed the hypothesis that PTB domains might serve a similar function, and then demonstrated through biochemical and molecular biological experiments that phospholipid binding is in fact a natural function of these domains.

Structural similarity can sometimes be a very strong indicator of similar function. For example, the crystal structure of the amino-terminal domain from the human signaling protein Cbl revealed the presence of a ‘cryptic’ SH2 domain [13]. The sequence of this SH2 domain was so divergent that its existence had not been previously suspected. The conservation of unique and important structural elements, however, clearly identified it as a member of the SH2 domain family. SH2 domains feature a conserved arginine residue that protrudes from an interior position to coordinate the phosphate of bound phosphopeptides. This arginine is also found in the Cbl SH2 domain, in a structural context that is nearly identical to that of other SH2 domains. These similarities cannot be inferred from sequence comparisons alone.

Clearly, similar structure will not always imply a similar function; but what are the parameters of this relationship? Common evolutionary origins might suggest an increased likelihood of shared functional characteristics, because the evolutionary precursors must then have shared a common function. Structural similarity alone cannot establish evolutionary relatedness in the absence of significant sequence identity. Structure-based alignments can sometimes reveal conservation of key residues, and other data, such as coincidence of intron position and phase, can also suggest common origins [14]. Nonetheless, functional insights based on structural relationships can only rise to the level of hypotheses, and these hypotheses must be tested by direct functional experimentation.

Functional insight from analysis of novel protein structures

The best, and most famous, example of structural insight leading to functional knowledge comes not from the structure of a protein, but of DNA. Watson’s and Crick’s famous discovery of the double helix led to the greatest biological insight of our era: it did not escape their notice that the structure ‘immediately suggests a possible copying mechanism for the genetic material’ [15]. Of course it was known at the time that DNA was the genetic material, so this may not be a good analogy for many of the potential targets presented by high-throughput screening. Many proteins identified through genetic techniques, positional cloning, and differential expression analysis are, however, implicated in a biological or disease process by the very nature of their identification, and structural data may provide key information for deducing their functions.

Even when applied to proteins with no clear functional connections, 3D structure determination may allow identification of potential binding motifs and catalytic centers. Clearly, this is sometimes difficult, but it has been successfully demonstrated in a number of cases. A good example is the histidine triad (HIT) family of proteins [16–18]. HIT proteins, which were identified through genomic analysis, form a large and ubiquitous family, but, as yet, have no known specific biological function. Structure solution of several HIT family members identified common structural motifs, which were used to identify a catalytic center and nucleotide-binding sites. These structural studies showed that HIT family proteins are nucleotide hydrolases, and even enabled identification of compounds that could inhibit the catalytic function of the HIT proteins. Their in vivo substrate is still unknown. Nonetheless, future biological investigation of this protein family will be fundamentally altered by the knowledge provided by the structure-based functional genomics approach.

Structure-based assignment of function can also be aided by visualizing cofactors in the electron density of experimentally determined structures. For example, Kim and co-workers [19•] unexpectedly found bound ATP molecules in the Methanococcus jannaschii protein MJ0577. These ATP molecules were scavenged from the Escherichia coli expression host, and co-purified with the recombinantly expressed protein. The hypothesis was formed that the molecule might function as an ATPase, and biochemical experiments showed that this was in fact correct. In this case, the biochemical function could be assigned, but the specific biological function performed by this ATPase is still unknown. MJ0577 cannot function as an ATPase in isolation, but only in the presence of a soluble extract from M. jannaschii. This led the authors to suggest that MJ0577 is a factor-dependent ATPase that could potentially function as a molecular switch in an as yet unidentified cellular process. Again, we see that the structure-based functional genomics approach can lead to hypotheses that can be tested by simple biochemical experiments. We cannot expect this approach to solve the entirety of most functional problems, but it may often provide important clues to guide further analysis. The assignment of a specific functionality to a protein, such as the ATPase activity of MJ0577, suggests that similar functionality will persist in other proteins that are related in primary sequence. Thus, this knowledge propagates through the database, and may add key information to other problems as well.

Other types of biological studies often provide functional links between proteins; for example, genetic epistasis analysis can suggest the connections and order of interactions in a signaling pathway. Thus, understanding the function of one of these proteins has the potential to reverberate throughout the entire chain. In one recent real-world example, a few key crystal structures have strongly affected understanding of the control mechanisms of clathrin-mediated endocytosis [20].

Picking targets wisely may efficiently lead to medical and agricultural benefits

Many disease genes and genes important in biological processes have been identified through methods that implicate them in specific functions. For example, positional cloning can demonstrate that mutation of a particular gene results in a disease phenotype. This does not, however, necessarily provide information on the biochemical function of the encoded protein. Rational development of therapeutics requires this functional knowledge, and thus such proteins are ideal targets for analysis by structure-based functional genomics. New efforts in pharmacogenomics, particularly those relying on single nucleotide polymorphism (SNP) data [21], also stand to reap tremendous benefits from a structural treatment. Effects of SNPs within protein-coding regions surely derive from changes in protein structure or functionality, and these can be fully understood only in the light of high-resolution structures. The functional consequences of many gene alterations can be inferred by mapping them to the 3D structure of the encoded protein. A good example is glucokinase, an enzyme involved in the first steps of glucose metabolism. Mutations in the gene were found to be co-inherited with a form of maturity-onset diabetes of the young (MODY). Biochemical analysis shows that the mutations affect the Km and Vmax of the enzyme, and structural studies show that these mutations map to amino acids lining the active site. Another example is the p53 protein, which is involved in many human cancers. A majority of cancer-causing mutations in p53 map to the region of the protein involved in DNA binding [22]. These observations are entirely dependent on 3D structural knowledge, and it is clear that they have strong implications for our understanding of the underlying disease.

Analysis of proteins from pathogens may also provide a unique opportunity for the application of this technique. As of October 1999, the complete sequences of 24 microbial genomes had been determined, and sequencing projects on about double that number were underway (see: http://www.tigr.org). In addition, comprehensive ‘knockout’ projects have been initiated to identify those genes that are essential for viability. Mapping the functions of these essential genes is an important first step in applying this information toward the development of new antimicrobial drugs. The availability of high-resolution structures also adds important data for identifying new drug targets. Whereas a particular protein may be of vital importance to the life cycle of a microbe, the best drug targets are usually enzymes with active sites that can be targeted for therapeutic intervention. This knowledge comes as a by-product of structural studies, and can be used in the selection of targets for further investigation.

Genome sequencing projects are also underway for several species of plants, including major food crops such as rice, maize, tomato, and potato (see http://www.tigr.org). As we begin to understand the functions of more genes from these species, the prospects for producing new types of genetically modified strains will increase accordingly [23]. Also, understanding new mechanisms of plant growth control and pathways of natural immunity might lead to new plant husbandry improvements that do not depend on genetic alteration.

Conclusion: what the future may hold

The examples that we have given in this review come mainly from ‘classical’ structural biology studies. Currently, there are several projects underway (both commercial and academic) to begin structure determination on a genome-wide scale, but the full impact of these efforts will probably not be felt for another year or two [24•]. These projects, in addition to their scale, are non-classical in that targets are not chosen based on their known biological interest, but rather are often chosen based on a lack of specific biological knowledge. These structural genomics projects have several different aims, among them to construct an extensive map of the folds that natural proteins adopt. Structure-based functional genomics will grow out of these efforts, whether or not it is their primary mission.

Although the number of crystal and NMR structures deposited each year in the Protein Data Bank (PDB; http://www.rcsb.org) has grown exponentially, the number of structures unrelated to other proteins has not [24•]. This is due in part to the practice of structural biologists looking ever deeper at biological problems, and thus studying complexes, mutants, and relatives of proteins of already known structure. Also, the universe of protein structures is finite, and chance occurrences of structural similarity have become increasingly common. This latter trend would appear to be strongly positive for the task of deriving function from these structures.

Functional genomics approaches in general give correlative and not definitive data about the function of new genes. Hybridization array experiments, for example, might demonstrate that a particular gene (and thus its protein product) is expressed in step with the cell cycle, implying a possible role as a ‘cell-cycle gene’. Genetic approaches, hypothetically, might give confirmatory data showing that mutations in the gene affect the cell cycle. Neither of these experiments, however, uncovers the protein’s biochemical role. When the database of protein structures (with functions ascribed) is large enough, the biochemical function of any protein with a 3D structure can be inferred and tested. This ‘function by structure’ model of genomics will become easier and more rewarding as time goes on, and the public domain (PDB) and commercial protein data banks increase in size and comprehensiveness.
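The ‘function by structure’ idea can be sketched as a simple lookup: compare a new structure against a database of structures with ascribed functions, and treat the functions of sufficiently similar entries as hypotheses to be tested experimentally. The Python sketch below is only an illustration under stated assumptions. The `Hit` record, the `propose_hypotheses` function, the similarity scores, and the cutoff are all hypothetical; the review does not specify any scoring scheme, and real structure-comparison programs report their own measures of similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    entry: str       # identifier of a database structure (hypothetical)
    score: float     # assumed similarity score; higher means more similar
    function: str    # function ascribed to that entry ("" if unknown)


def propose_hypotheses(hits, min_score):
    """Return the best-scoring hit for each annotated function.

    Only hits at or above min_score with a non-empty annotation are
    considered. The result is a ranked list of hypotheses, not a
    definitive functional assignment.
    """
    best = {}
    for hit in hits:
        if hit.score < min_score or not hit.function:
            continue
        current = best.get(hit.function)
        if current is None or hit.score > current.score:
            best[hit.function] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Invented example data: scores and entries are placeholders.
example_hits = [
    Hit("entry_A", 8.0, "cytokine"),
    Hit("entry_B", 9.5, "cytokine"),
    Hit("entry_C", 3.0, "kinase"),      # below the cutoff, ignored
    Hit("entry_D", 7.0, ""),            # similar but unannotated, ignored
    Hit("entry_E", 6.0, "ATPase"),
]

hypotheses = propose_hypotheses(example_hits, min_score=5.0)
for h in hypotheses:
    print(f"test for {h.function} activity (closest match: {h.entry})")
```

The output is deliberately a list of candidate experiments rather than an answer, in keeping with the point made above that structural relationships rise only to the level of hypotheses.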
References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
     • of special interest
     •• of outstanding interest

1. Hendrickson WA, Horton JR, LeMaster DM: Selenomethionyl proteins produced for analysis by multiwavelength anomalous diffraction (MAD): a vehicle for direct determination of three-dimensional structure. EMBO J 1990, 9:1665-1672.

2. Hendrickson WA: Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science 1991, 254:51-58.

3. Shapiro L, Fannon AM, Kwong PD, Thompson A, Lehmann MS, Grübel G, Legrand J-F, Als-Nielsen J, Colman DR, Hendrickson WA: Structural basis of cell-cell adhesion by cadherins. Nature 1995, 374:327-337.

4. • Walsh MA, Dementieva I, Evans G, Sanishvili R, Joachimiak A: Taking MAD to the extreme: ultrafast protein structure determination. Acta Crystallogr D Biol Crystallogr 1999, 55:1168-1173.
   An excellent review on recent advances in synchrotron instrumentation and techniques that will probably provide the basis for much of the high-throughput structure determination efforts of structural genomics projects.

5. Stroud RM, Kay LM, Dickerson RE: The crystal and molecular structure of DIP-inhibited bovine trypsin at 2.7 Angstrom resolution. Cold Spring Harb Symp Quant Biol 1972, 36:125-140.

6. Hubbard SR, Wei L, Ellis L, Hendrickson WA: Crystal structure of the tyrosine kinase domain of the human insulin receptor [see comments]. Nature 1994, 372:746-754.

7. Braig K, Otwinowski Z, Hegde R, Boisvert DC, Joachimiak A, Horwich AL, Sigler PB: The crystal structure of the bacterial chaperonin GroEL at 2.8 Å [see comments]. Nature 1994, 371:578-586.

8. Zhu X, Zhao X, Burkholder WF, Gragerov A, Ogata CM, Gottesman ME, Hendrickson WA: Structural analysis of substrate binding by the molecular chaperone DnaK. Science 1996, 272:1606-1614.

9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.

10. Aronson HE, Royer WE Jr, Hendrickson WA: Quantification of tertiary structural conservation despite primary sequence drift in the globin fold. Protein Sci 1994, 3:1706-1711.

11. • Shapiro L, Scherer PE: The crystal structure of a complement-1q family protein suggests an evolutionary link to tumor necrosis factor. Curr Biol 1998, 8:335-338.
   This work provides an early example of putative functional assignment by structure-based functional genomics. A protein (AdipoQ/Acrp30) identified by high-throughput sequencing from a fat cell-specific library was subjected to X-ray crystallographic analysis. The structure revealed similarity to TNF-family cytokines, and thus the hypothesis was formed that this protein might function as a cytokine.

12. Zhou MM, Ravichandran KS, Olejniczak EF, Petros AM, Meadows RP, Sattler M, Harlan JE, Wade WS, Burakoff SJ, Fesik SW: Structure and ligand recognition of the phosphotyrosine binding domain of Shc. Nature 1995, 378:584-592.

13. Meng W, Sawasdikosol S, Burakoff SJ, Eck MJ: Structure of the amino-terminal domain of Cbl complexed to its binding site on ZAP-70 kinase [see comments]. Nature 1999, 398:84-90.

14. Shapiro L, Kwong PD, Fannon AM, Colman DR, Hendrickson WA: Considerations on the folding topology and evolutionary origin of cadherin domains. Proc Natl Acad Sci USA 1995, 92:6793-6797.

15. Watson JD, Crick FHC: Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 1953, 171:737-738.

16. Lima CD, D’Amico KL, Naday I, Rosenbaum G, Westbrook EM, Hendrickson WA: MAD analysis of FHIT, a putative human tumor suppressor from the HIT protein family. Structure 1997, 5:763-774.

17. Lima CD, Klein MG, Hendrickson WA: Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science 1997, 278:286-290.

18. Klein MG, Yao Y, Slosberg ED, Lima CD, Doki Y, Weinstein IB: Characterization of PKCI and comparative studies with FHIT, related members of the HIT protein family. Exp Cell Res 1998, 244:26-32.

19. • Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R, Kim SH: Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci USA 1998, 95:15189-15193.
   This paper shows one of the results of a pilot structural genomics project on M. jannaschii. ATP molecules unexpectedly co-purified with the subject protein, and were revealed in the electron density of the high-resolution crystal structure. The authors hypothesized a potential ATPase activity, and their hypothesis was vindicated by biochemical experiments.

20. Marsh M, McMahon HT: The structural era of endocytosis. Science 1999, 285:215-220.

21. Masood E: As consortium plans free SNP map of human genome. Nature 1999, 398:545-546.

22. Cho Y, Gorina S, Jeffrey PD, Pavletich NP: Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 1994, 265:346-355.

23. Somerville C, Somerville S: Plant functional genomics. Science 1999, 285:380-383.

24. • Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, Lin D, Sali A, Studier FW, Swaminathan S: Structural genomics: beyond the human genome project. Nat Genet 1999, 23:151-157.
   An excellent review describing the current state of pilot structural genomics projects, and laying out the bioinformatics and experimental processes required for structural genomics work.