Finding function through structural genomics
Lawrence Shapiro* and Tim Harris†
The recent availability of whole-genome sequences and large numbers of protein-coding regions from high-throughput cDNA analysis has fundamentally changed experimental biology. These efforts have provided huge databases of protein sequences, many of which are of unknown function. Deciphering the functions of these myriad proteins presents a major intellectual challenge.

Addresses
*Structural Biology Program, Department of Physiology and Biophysics, Mount Sinai School of Medicine, 1425 Madison Avenue, New York, NY 10029, USA; e-mail: shapiro@anguilla.physbio.mssm.edu
†Structural Genomix Inc., 10505 Roselle Street, San Diego, CA 92121, USA; e-mail: tim@stromix.com

Current Opinion in Biotechnology 2000, 11:31–35
0958-1669/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.

Abbreviations
BLAST  basic local alignment search tool
HIT  histidine triad
PTB  phosphotyrosine binding
SH2  src homology-2
TNF  tumor necrosis factor

Introduction
While the revolution in genomics has provided comprehensive lists of newly discovered proteins, a technological revolution in crystallography (particularly multiwavelength anomalous diffraction [MAD] analysis [1,2] and the advent of new synchrotron sources enabling the use of undulator radiation [3,4•]) has enabled a quantum leap in the rapidity with which structures can be determined. It is now therefore possible to determine high-resolution structures of proteins with greatly improved efficiency. The central question addressed in this review is this: can structural knowledge enable us to efficiently answer early-stage biological questions of protein function? Structural biology has focused mainly on the details of protein function, asking the question ‘how does the protein achieve its function?’ Can we now use protein structure to answer the more basic question ‘what does the protein do?’

Traditionally, structural studies have come near the ‘end’ of the biological discovery process. After the function of a protein or protein family has been well characterized by biochemistry, cell biology, and molecular biology studies, the crystal structure has been a logical ‘ultimate’ step for understanding the atomic bases of these molecular functions. Examples of this approach are almost too numerous to mention, for they include most of structural biology: trypsin was known to be a protease, and its specificity and kinetics were well understood long before the structure was determined [5]. The signaling function of tyrosine kinases had been well documented before the first crystal structure revealed the molecular details of their function [6]. And the functions of the molecular chaperones GroEL [7] and DnaK [8] were well understood before structural studies revealed the details of their function.

Is this detail-uncovering role the limit of what structural studies can give to biology? We think that structural studies can now be gainfully employed as part of early-phase biological discovery, to answer the question ‘what is the function of this protein?’ This is the defining feature of structure-based functional genomics.

Almost universally, when a new cDNA sequence is determined, it is compared, by basic local alignment search tool (BLAST) searching, with cDNAs corresponding to proteins of known function [9]. A similarity ‘hit’ is often used to categorize the biological function of the protein deriving from that cDNA: a kinase-related sequence is usually thought of as a kinase, and biological studies on such a molecule would generally begin with the overall idea to look for kinase function. Three-dimensional (3D) structures, however, provide advantages over sequence data in two distinct areas. Firstly, 3D structural information can provide new insights into biology by elucidating relatedness to proteins of known biological function, which could not have been arrived at by sequence analysis alone. Secondly, structural information can identify binding motifs and catalytic centers, even for proteins without a known biological function. These, to our minds, are the two main principles of the structure-based functional genomics approach. Here we present a brief review on the current state of structure-based functional genomics, which uses protein structure as a cornerstone in a strategy for determining protein function.

Aha! Functional relationships derived through structural similarity
Sequence conservation is, in general, much weaker than structural conservation. A good example of this can be seen in comparing the oxygen carriers from blood and muscle, hemoglobin and myoglobin. Although these proteins are not clearly related in sequence (e.g. a BLAST search with hemoglobin will not elicit a myoglobin ‘hit’), they are nonetheless closely related in 3D structure [10]. Thus, we can imagine the hypothetical situation of discovering myoglobin for the first time as a muscle-specific cDNA, but not knowing its function. Determining its 3D structure and comparing it to the database of known proteins might lead us to exclaim: ‘Aha! This must be an oxygen carrier!’

A real-world example of this is illustrated in a study of AdipoQ, a protein secreted exclusively from adipocytes, and related to the C1q family of innate immunity proteins,
which at the time were of unknown structure [11•]. AdipoQ was discovered in a high-throughput sequencing screen of cDNAs enriched in adipocytes. The crystal structure of AdipoQ revealed clear structural similarity to the TNF family of cytokines (Figure 1). This led to the hypothesis that AdipoQ might function as a cell signaling factor, which now appears to be correct. The similarity to TNF family proteins was not and could not have been found without the structural data. In this case, one key crystal structure transformed our understanding of the biology of two protein families (C1qs and TNFs) and their role in the evolution of the immune system, demonstrated the existence of a TNF superfamily, and resulted in the identification of new members of this class of cytokine. All this from determining the crystal structure of one (well-chosen) protein of unknown function.

Figure 1
An example of unexpected similarity between proteins with undetectable sequence homology, the adipocyte secreted factor AdipoQ (Acrp30) and TNFs [11•]. (a) A ribbon diagram comparison of the AdipoQ and TNFα trimers. (b) Superposition of Cα traces from monomers of AdipoQ (white) and CD40 ligand, a TNF family cytokine (PDB entry 1ALY). The level of structural similarity is on a par with that between different TNFs. Note that the corresponding sidechains shown take on remarkably similar orientations and thus conserve the overall packing of the cores in the two molecules. (c) Structure-based sequence alignment between the TNFs CD40L, TNFα, and TNFβ and the C1q-like domains from C1qA and AdipoQ. Residues that are identical in all five proteins are shaded red, and those that are conserved in four of the five are shaded blue. β-Strand regions are indicated for AdipoQ by green arrows above the sequence alignment, and below the alignment in blue for CD40L. Sequence identity between C1q-like domains and TNFs is negligible (e.g. 9% between AdipoQ and TNFα, with the alignment shown here). Sequence-based searches such as BLAST have proved unable to identify relatedness between C1q and TNF family members; however, using residue patterns identified in the structure-based alignment enables the identification of potential new TNF/C1q family members.

Several important cases of deriving functional hypotheses from structural similarity have arisen in the course of ‘classical’ structural biology studies. Good examples of these include proteins involved in phosphotyrosine binding. Several different types of protein domains are known to function in binding to phosphotyrosine, including src homology-2 (SH2) and phosphotyrosine binding (PTB)
domains, which are not structurally related. When the structure of the first PTB domain was determined [12], the surprising discovery was made that, although apparently unrelated in sequence, these domains were structurally similar to pleckstrin homology (PH) domains, which bind negatively charged phospholipids to aid in membrane localization of some proteins. The authors formed the hypothesis that PTB domains might serve a similar function, and then demonstrated through biochemical and molecular biological experiments that phospholipid binding is in fact a natural function of these domains.

Structural similarity can sometimes be a very strong indicator of similar function. For example, the crystal structure of the amino-terminal domain from the human signaling protein Cbl revealed the presence of a ‘cryptic’ SH2 domain [13]. The sequence of this SH2 domain was so divergent that its existence had not been previously suspected. The conservation of unique and important structural elements, however, clearly identified it as a member of the SH2 domain family. SH2 domains feature a conserved arginine residue that protrudes from an interior position to coordinate the phosphate of bound phosphopeptides. This arginine is also found in the Cbl SH2 domain, in a structural context that is nearly identical to that of other SH2 domains. These similarities cannot be inferred from sequence comparisons alone.

Clearly, similar structure will not always imply a similar function; but what are the parameters of this relationship? Common evolutionary origins might suggest an increased likelihood of shared functional characteristics, because the evolutionary precursors must then have shared a common function. Similar structures alone cannot establish evolutionary relatedness in the absence of significant sequence identity. Structure-based alignments can sometimes reveal conservation of key residues, and other data, such as coincidence of intron position and phase, can also suggest common origins [14]. Nonetheless, functional insights based on structural relationships can only rise to the level of hypotheses, and these hypotheses must be tested by direct functional experimentation.

Functional insight from analysis of novel protein structures
The best, and most famous, example of structural insight leading to functional knowledge comes not from the structure of a protein, but of DNA. Watson and Crick’s famous discovery of the double helix led to the greatest biological insight of our era: it did not escape their notice that the structure ‘immediately suggests a possible copying mechanism for the genetic material’ [15]. Of course it was known at the time that DNA was the genetic material, so this may not be a good analogy for many of the potential targets presented by high-throughput screening. Many proteins identified through genetic techniques, positional cloning, and differential expression analysis are, however, implicated in a biological or disease process by the very nature of their identification, and structural data may provide key information for deducing their functions.

Even when applied to proteins with no clear functional connections, 3D structure determination may allow identification of potential binding motifs and catalytic centers. Clearly, this is sometimes difficult, but it has been successfully demonstrated in a number of cases. A good example is the histidine triad (HIT) family of proteins [16–18]. HIT proteins, which were identified through genomic analysis, form a large and ubiquitous family, but as yet have no known specific biological function. Structure solution of several HIT family members identified common structural motifs, which were used to identify a catalytic center and nucleotide-binding sites. These structural studies showed that HIT family proteins are nucleotide hydrolases, and even enabled identification of compounds that could inhibit the catalytic function of the HIT proteins. Their in vivo substrate is still unknown. Nonetheless, future biological investigation of this protein family will be fundamentally altered by the knowledge provided by the structure-based functional genomics approach.

Structure-based assignment of function can also be aided by visualizing cofactors in the electron density of experimentally determined structures. For example, Kim and co-workers [19•] unexpectedly found bound ATP molecules in the Methanococcus jannaschii protein MJ0577. These ATP molecules were scavenged from the Escherichia coli expression host, and co-purified with the recombinantly expressed protein. The hypothesis was formed that the molecule might function as an ATPase, and biochemical experiments showed that this was in fact correct. In this case, the biochemical function could be assigned, but the specific biological function performed by this ATPase is still unknown. MJ0577 cannot function as an ATPase in isolation, but only in the presence of a soluble extract from M. jannaschii. This led the authors to suggest that MJ0577 is a factor-dependent ATPase that could potentially function as a molecular switch in an as yet unidentified cellular process. Again, we see that the structure-based functional genomics approach can lead to hypotheses that can be tested by simple biochemical experiments. We cannot expect this approach to solve the entirety of most functional problems, but it may often provide important clues to guide further analysis. The assignment of a specific functionality to a protein, such as the ATPase activity of MJ0577, suggests that similar functionality will persist in other proteins that are related in primary sequence. Thus, this knowledge propagates through the database, and may add key information to other problems as well.

Other types of biological studies often provide functional links between proteins; for example, genetic epistasis analysis can suggest the connections and order of interactions in a signaling pathway. Thus, understanding the function of one of these proteins has the potential to reverberate throughout the entire chain. In one recent real-world example, a few key crystal structures have strongly affected understanding of the control mechanisms of clathrin-mediated endocytosis [20].

Picking targets wisely may efficiently lead to medical and agricultural benefits
Many disease genes and genes important in biological processes have been identified through methods that implicate them in specific functions. For example, positional cloning can demonstrate that mutation of a particular gene results in a disease phenotype. This does not, however, necessarily provide information on the biochemical function of the encoded protein. Rational development of therapeutics requires this functional knowledge, and thus such proteins are ideal targets for analysis by structure-based functional genomics. New efforts in pharmacogenomics, particularly those relying on single nucleotide polymorphism (SNP) data [21], also stand to reap tremendous benefits from a structural treatment. Effects of SNPs within protein-coding regions surely derive from changes in protein structure or functionality, and these can be fully understood only in the light of high-resolution structures. The functional consequences of many gene alterations can be inferred by mapping them to the 3D structure of the encoded protein. A good example is glucokinase, an enzyme involved in the first steps of glucose metabolism. Mutations in the gene were found to be co-inherited with a form of maturity-onset diabetes of the young (MODY). Biochemical analysis shows that the mutations affect the Km and Vmax of the enzyme, and structural studies show that these mutations map to amino acids lining the active site. Another example is the p53 protein, which is involved in many human cancers. A majority of cancer-causing mutations in p53 map to the region of the protein involved in DNA binding [22]. These observations are entirely dependent on 3D structural knowledge, and it is clear that they have strong implications for our understanding of the underlying disease.

Analysis of proteins from pathogens may also provide a unique opportunity for the application of this technique. As of October 1999, the complete sequences of 24 microbial genomes had been determined, and sequencing projects on about double that number were underway (see: http://www.tigr.org). In addition, comprehensive ‘knockout’ projects have been initiated to identify those genes that are essential for viability. Mapping the functions of these essential genes is an important first step in applying this information toward the development of new anti-microbial drugs. The availability of high-resolution structures also adds important data for identifying new drug targets. Whereas a particular protein may be of vital importance to the life cycle of a microbe, the best drug targets are usually enzymes with active sites that can be targeted for therapeutic intervention. This knowledge comes as a by-product of structural studies, and can be used in the selection of targets for further investigation.

Genome sequencing projects are also underway for several species of plants, including major food crops such as rice, maize, tomato, corn, and potato (see http://www.tigr.org). As we begin to understand the functions of more genes from these species, the prospects for producing new types of genetically modified strains will increase accordingly [23]. Also, understanding new mechanisms of plant growth control and pathways of natural immunity might lead to new plant husbandry improvements that do not depend on genetic alteration.

Conclusion: what the future may hold
The examples that we have given in this review come mainly from ‘classical’ structural biology studies. Currently, there are several projects underway, both commercial and academic, to begin structure determination projects on a genome-wide scale, but the full impact of these efforts probably won’t be felt for the next year or two [24•]. These projects, in addition to their scale, are non-classical in that targets are not chosen based on their known biological interest, but rather are often based on a lack of specific biological knowledge. These structural genomics projects have several different aims, among them to construct an extensive map of the folds that natural proteins adopt. Structure-based functional genomics will grow out of these efforts, whether or not it is their primary mission.

Although the number of crystal and NMR structures deposited each year in the protein data bank (PDB; http://www.rcsb.org) has grown exponentially, the number of structures unrelated to other proteins has not [24•]. This is due in part to the practice of structural biologists looking ever deeper at biological problems, and thus studying complexes, mutants, and relatives of proteins of already known structure. Also, the universe of protein structures is finite, and chance occurrences of structural similarity have become increasingly common. This latter trend would appear to be strongly positive for the task of deriving function from these structures.

Functional genomics approaches in general give correlative and not definitive data about the function of new genes. Hybridization array experiments, for example, might demonstrate that a particular gene (and thus its protein product) is expressed in step with the cell cycle, implying a possible role as a ‘cell-cycle gene’. Genetic approaches, hypothetically, might give confirmatory data showing that mutations in the gene affect the cell cycle. Neither of these experiments, however, uncovers the protein’s biochemical role. When the database of protein structures (with functions ascribed) is large enough, the biochemical function of any protein with a 3D structure can be inferred and tested. This ‘function by structure’ model of genomics will become easier and more rewarding as time goes on, and the public domain (PDB) and commercial protein data banks increase in size and comprehensiveness.
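
As a small illustration of the sequence-identity figures quoted in this review (for example, the 9% identity between AdipoQ and TNFα given with Figure 1), the following Python sketch shows one common way such a number can be computed from a pairwise alignment. It is not the authors' procedure: the function name percent_identity, the choice to count only columns where both sequences have a residue, and the toy sequences are all assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: identity is normalized by the number of ungapped columns.
    Other conventions (e.g. dividing by the shorter sequence length) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # columns where those residues match
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical toy alignment; NOT real AdipoQ, TNF, or globin sequence data.
toy_a = "MKT-AYLLGV"
toy_b = "MRTGAW-LSV"
print(percent_identity(toy_a, toy_b))  # 5 matches over 8 ungapped columns -> 62.5
```

A value computed this way depends entirely on the alignment supplied. With a structure-based alignment of the kind shown in Figure 1, a very low identity can coexist with close 3D similarity, which is the situation the review describes.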

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Hendrickson WA, Horton JR, LeMaster DM: Selenomethionyl proteins produced for analysis by multiwavelength anomalous diffraction (MAD): a vehicle for direct determination of three-dimensional structure. EMBO J 1990, 9:1665-1672.

2. Hendrickson WA: Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science 1991, 254:51-58.

3. Shapiro L, Fannon AM, Kwong PD, Thompson A, Lehmann MS, Grübel G, Legrand J-F, Als-Nielsen J, Colman DR, Hendrickson WA: Structural basis of cell-cell adhesion by cadherins. Nature 1995, 374:327-337.

4. • Walsh MA, Dementieva I, Evans G, Sanishvili R, Joachimiak A: Taking MAD to the extreme: ultrafast protein structure determination. Acta Crystallogr D Biol Crystallogr 1999, 55:1168-1173.
An excellent review on recent advances in synchrotron instrumentation and techniques that will probably provide the basis for much of the high-throughput structure determination efforts of structural genomics projects.

5. Stroud RM, Kay LM, Dickerson RE: The crystal and molecular structure of DIP-inhibited bovine trypsin at 2.7 Angstrom resolution. Cold Spring Harb Symp Quant Biol 1972, 36:125-140.

6. Hubbard SR, Wei L, Ellis L, Hendrickson WA: Crystal structure of the tyrosine kinase domain of the human insulin receptor [see comments]. Nature 1994, 372:746-754.

7. Braig K, Otwinowski Z, Hegde R, Boisvert DC, Joachimiak A, Horwich AL, Sigler PB: The crystal structure of the bacterial chaperonin GroEL at 2.8 Å [see comments]. Nature 1994, 371:578-586.

8. Zhu X, Zhao X, Burkholder WF, Gragerov A, Ogata CM, Gottesman ME, Hendrickson WA: Structural analysis of substrate binding by the molecular chaperone DnaK. Science 1996, 272:1606-1614.

9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.

10. Aronson HE, Royer WE Jr, Hendrickson WA: Quantification of tertiary structural conservation despite primary sequence drift in the globin fold. Protein Sci 1994, 3:1706-1711.

11. • Shapiro L, Scherer PE: The crystal structure of a complement-1q family protein suggests an evolutionary link to tumor necrosis factor. Curr Biol 1998, 8:335-338.
This work provides an early example of putative functional assignment by structure-based functional genomics. A protein (AdipoQ/Acrp30) identified by high-throughput sequencing from a fat cell-specific library was subjected to X-ray crystallographic analysis. The structure revealed similarity to TNF-family cytokines, and thus the hypothesis was formed that this protein might function as a cytokine.

12. Zhou MM, Ravichandran KS, Olejniczak EF, Petros AM, Meadows RP, Sattler M, Harlan JE, Wade WS, Burakoff SJ, Fesik SW: Structure and ligand recognition of the phosphotyrosine binding domain of Shc. Nature 1995, 378:584-592.

13. Meng W, Sawasdikosol S, Burakoff SJ, Eck MJ: Structure of the amino-terminal domain of Cbl complexed to its binding site on ZAP-70 kinase [see comments]. Nature 1999, 398:84-90.

14. Shapiro L, Kwong PD, Fannon AM, Colman DR, Hendrickson WA: Considerations on the folding topology and evolutionary origin of cadherin domains. Proc Natl Acad Sci USA 1995, 92:6793-6797.

15. Watson JD, Crick FHC: Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 1953, 171:737-738.

16. Lima CD, D’Amico KL, Naday I, Rosenbaum G, Westbrook EM, Hendrickson WA: MAD analysis of FHIT, a putative human tumor suppressor from the HIT protein family. Structure 1997, 5:763-774.

17. Lima CD, Klein MG, Hendrickson WA: Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science 1997, 278:286-290.

18. Klein MG, Yao Y, Slosberg ED, Lima CD, Doki Y, Weinstein IB: Characterization of PKCI and comparative studies with FHIT, related members of the HIT protein family. Exp Cell Res 1998, 244:26-32.

19. • Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R, Kim SH: Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci USA 1998, 95:15189-15193.
This paper shows one of the results of a pilot structural genomics project on M. jannaschii. ATP molecules unexpectedly co-purified with the subject protein, and were revealed in the electron density of the high-resolution crystal structure. The authors hypothesized a potential ATPase activity, and their hypothesis was vindicated by biochemical experiments.

20. Marsh M, McMahon HT: The structural era of endocytosis. Science 1999, 285:215-220.

21. Masood E: As consortium plans free SNP map of human genome. Nature 1999, 398:545-546.

22. Cho Y, Gorina S, Jeffrey PD, Pavletich NP: Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 1994, 265:346-355.

23. Somerville C, Somerville S: Plant functional genomics. Science 1999, 285:380-383.

24. • Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, Lin D, Sali A, Studier FW, Swaminathan S: Structural genomics: beyond the human genome project. Nat Genet 1999, 23:151-157.
An excellent review describing the current state of pilot structural genomics projects, and laying out the bioinformatics and experimental processes required for structural genomics work.