Theory and Practice in Quantitative Genetics - University of ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Theory and Practice in Quantitative Genetics Daniëlle Posthuma1, A. Leo Beem1, Eco J. C. de Geus1, G. Caroline M. van Baal1, Jacob B. von Hjelmborg 2, Ivan Iachine3, and Dorret I. Boomsma1 1 Department of Biological Psychology,Vrije Universiteit Amsterdam,The Netherlands 2 Institute of Public Health, Epidemiology, University of Southern Denmark, Denmark 3 Department of Statistics, University of Southern Denmark, Denmark ith the rapid advances in molecular biology, the near and phenotypic data from eight participating twin registries W completion of the human genome, the development of appropriate statistical genetic methods and the availability of will be simultaneously analysed. In this paper the main theoretical foundations underly- the necessary computing power, the identification of quantita- tive trait loci has now become a realistic prospect for ing quantitative genetic analyses that are used within quantitative geneticists. We briefly describe the theoretical bio- the genomEUtwin project will be described. In addition, metrical foundations underlying quantitative genetics. These an algebraic translation from theoretical foundation theoretical underpinnings are translated into mathematical to advanced structural equation models will be made that equations that allow the assessment of the contribution of can be used in generating scripts for statistical genetic soft- observed (using DNA samples) and unobserved (using known genetic relationships) genetic variation to population variance in ware packages. quantitative traits. Several statistical models for quantitative genetic analyses are described, such as models for the classi- Observed, Genetic, and Environmental Variation cal twin design, multivariate and longitudinal genetic analyses, The starting point for gene finding is the observation of extended twin analyses, and linkage and association analyses. population variation in a certain trait. This “observed”, or For each, we show how the theoretical biometrical model can be translated into algebraic equations that may be used to gen- phenotypic, variation may be attributed to genetic and erate scripts for statistical genetic software packages, such as environmental causes. Genetic and environmental effects Mx, Lisrel, SOLAR, or MERLIN. For using the former program interact when the same variant of a gene differentially a web-library (available from http://www.psy.vu.nl/mxbib) has affects the phenotype in different environments. been developed of freely available scripts that can be used to About 1% of the total genome sequence is estimated to conduct all genetic analyses described in this paper. code for protein and an additional but still unknown per- centage of the genome is involved in regulation of gene expression. Human individuals differ from one another by “Genetic factors explain x% of the population variance in about one base pair per thousand. If these differences occur trait Y” is an oft heard outcome of quantitative genetic within coding or regulatory regions, phenotypic variation studies. Usually this statement derives from (twin) family in a trait may result. The different effects of variants research that exploits known genetic relationships to esti- (“alleles”) of the same gene is the basis of the model that mate the contribution of unknown genes to the observed underlies quantitative genetic analysis. variance in the trait. It does not imply that any specific genes that influence the trait have been identified. Given the rapid Quantifying Genetic and Environmental Influences: advances made in molecular biology (Nature Genome Issue, The Quick and Dirty Approach February 15, 2001; Science Genome Issue, February 16, In human quantitative genetic studies, genetic and environ- 2001), the near completion of the human genome and the mental sources of variance are separated using a design that development of sophisticated statistical genetic methods includes subjects of different degrees of genetic and envi- (e.g., Dolan et al., 1999a, 1999b; Fulker et al., 1999; ronmental relationship (Fisher, 1918; Mather & Jinks, Goring, 2000; Terwilliger & Zhao, 2000), the identification 1982). A widely used design compares phenotypic resem- of specific genes, even for complex traits, has now become a blance of monozygotic (MZ) and dizygotic (DZ) twins. realistic prospect for quantitative geneticists. To identify Since MZ twins reared together share part of their environ- genes, family studies, specifically twin family studies, again ment and 100% of their genes (but see Martin et al., appear to have great value, for they allow simultaneous 1997), any resemblance between them is attributed to these modelling of observed and unobserved genetic variation. As a “proof of principle”, genomEUtwin will perform genome- wide genotyping in twins to target genes for the complex Address for correspondence: Daniëlle Posthuma, Vrije Universiteit, traits of stature, body mass index (BMI), coronary artery Department of Biological Psychology, van der Boechorststraat 1, disease and migraine. To increase power, epidemiological 1081 BT, Amsterdam, The Netherlands. Email: danielle@psy.vu.nl Twin Research Volume 6 Number 5 pp. 361–376 361 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Daniëlle Posthuma, A. Leo Beem, Eco J. C. de Geus, G. Caroline M. van Baal, Jacob B. von Hjelmborg, Ivan Iachine, and Dorret I. Boomsma two sources of resemblance. The extent to which MZ twins can be obtained by subtracting the MZ correlation from do not resemble each other is ascribed to unique, non- unit correlation (e 2 = 1 – rMZ). shared environmental factors, which also include These intuitively simple rules are described in textbooks measurement error. Resemblance between DZ twins reared on quantitative genetics and can be understood without together is also ascribed to the sharing of the environment, knowledge of the relative effects and location of the actual and to the sharing of genes. DZ twins share on average genes that influence a trait, or the genotypic effects on phe- 50% of their segregating genes, so any resemblance notypic means. These point estimates, however, depend on between them due to genetic influences will be lower than the accuracy of the MZ and DZ correlation estimates and for MZ pairs. The extent to which DZ twins do not resem- the true causes of variation of a trait in the population. For ble each other is due to non-shared environmental factors small sample sizes (i.e., most of the time) they may be and to non-shared genetic influences. grossly misleading. Knowledge of the underlying biometri- Genetic effects at a single locus can be partitioned into cal model becomes crucial when one wants to move beyond additive (i.e., the effect of one allele is added to the effect of these twin-based heritability estimates, for instance, to add another allele) or dominant (the deviation from purely information of multiple additional family members or additive effects) effects, or a combination. The total simultaneously estimate from a number of different rela- amount of genetic influence on a trait is the sum of the tionships the magnitude of genetic variance in the additive and dominance effects of alleles at multiple loci, population. plus variance due to the interaction of alleles at different loci (epistasis; Bateson, 1909). The expectation for the phe- The Classical Biometrical Model: notypic resemblance between DZ twins due to genetic From Single Locus Effects on a Trait Mean influences depends on the underlying (and usually to the Decomposition of Observed Variation unknown) mode of gene action. If all contributing alleles in a Complex Trait act additively and there is no interaction between them Although within a population many different alleles may within or between loci, the correlation of genetic effects in exist for a gene (e.g., Lackner et al., 1991), for simplicity DZ twins will be on average 0.50. However, if some alleles we describe the biometrical model assuming one gene with act in a dominant way the correlation of genetic dominance two possible alleles, allele A1 and allele A2. By convention, effects will be 0.25. The presence of dominant gene action allele A1 has a frequency p, while allele A2 has frequency q, thus reduces the expected phenotypic resemblance in DZ and p + q = 1. With two alleles there are three possible twins relative to MZ twins. Epistasis reduces this similarity genotypes: A1A1, A1A2, and A2A2 with genotypic freq- even further, the extent depending on the number of loci uencies p 2, 2pq, and q 2, respectively, under random mating. involved and their relative effect on the phenotype (Mather The genotypic effect on the phenotypic trait (i.e., the geno- & Jinks, 1982). Depending on the nature of the types of typic value) of genotype A1A1, is called “a” and the effect familial relationships within a dataset, additive genetic, of genotype A2A2 “-a”. The midpoint of the phenotypes of dominant genetic, and shared and non-shared environmen- the homozygotes A1A1 and A2A2 is by convention 0, so a tal influences on a trait can be estimated. For example, is called the increaser effect, and –a the decreaser effect. employing a design including MZ and DZ twins reared The effect of genotype A1A2 is called “d”. If the mean together allows decomposition of the phenotypic variance into components of additive genetic variance, non-shared genotypic value of the heterozygote equals the midpoint of environmental variance, and either dominant genetic vari- the phenotypes of the two homozygotes (d = 0), there is no ance or shared environmental variance. Additive and dominance. If allele A1 is completely dominant over allele dominant genetic and shared environmental influences are A2, effect d equals effect a. If d ≠ 0 and the two alleles confounded in the classical twin design and cannot be esti- produce three discernable phenotypes of the trait, d is mated simultaneously. Disentangling the contributions of unequal to a. This is also known as the classical biometrical shared environment and genetic dominance effects requires model (Falconer & Mackay, 1996; Mather & Jinks, 1982) additional data from, for example, twins reared apart, half- (see Figure 1). sibs, or non-biological relatives reared together. Statistical derivations for the contributions of single and Similarity between two (biologically or otherwise multiple genetic loci to the population mean of a trait are related) individuals is usually quantified by covariances or given in several of the standard statistical genetic textbooks correlations. Twice the difference between the MZ and DZ (e.g., Falconer & Mackay, 1996; Lynch & Walsh, 1998; correlations provides a quick estimate of the proportional Mather & Jinks, 1982), and some of these statistics are contribution of additive genetic influences (a2) to the phe- summarized in Table 1. notypic variation in a trait (a 2 = 2[r M Z – r D Z ]). The The genotypic contribution of a locus to the popula- proportional contribution of the dominant genetic influ- tion mean of a trait is the sum of the products of the ences (d2) is obtained by subtracting four times the DZ frequencies and the genotypic values of the different geno- correlation from twice the MZ correlation (d 2 = 2r MZ types. Complex traits such as height or weight, are assumed – 4rDZ). An estimate of the proportional contribution of the to be influenced by the effects of multiple genes. Assuming shared environmental influences (c2) to the phenotypic vari- only additive and independent effects of all of these loci, ation is given by subtracting the MZ correlation from twice the expectation for the population mean (µ) is the sum of the DZ correlation (c2 = 2rDZ – rMZ). The proportional con- the contributions of the separate loci, and is formally tribution of the non-shared environmental influences (e2) Σ expressed as µ = a(p – q) + 2 dpq Σ 362 Twin Research October 2003 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Genotyping in GenomEUtwin A2 A2 O A1 A2 A1 A1 Figure 1 Graphical illustration of the genotypic values for a diallelic locus. Table 1 Summary of Genotypic Values, Frequencies, and Dominance Deviation for Three Genotypes A1A1, A1A2, and A2A2 Genotype A1A1 A1A2 A2A2 Genotypic value a d –a Frequency p2 2pq q2 Frequency x value a p2 2dpq –a q2 Deviation from the population mean 2q(a – dp) a(q – p) + d(1 – 2pq) –2p(a + dq) Dominance deviation –2q2d 2dpq –2p2d Decomposition of Phenotypic Variance variance component contains the effect of d. Even if a = 0, Although Figure 1 and Table 1 lack environmental effects, VA is usually greater than zero (except when p = q). Thus, quantitative geneticists assume that the individual pheno- although VA represents the variance due to the additive type (P) is a function of both genetic (G) and influences, it is not only a function of p, q, and a, but also environmental effects (E): P = G + E, where E refers to the of d. The consequences are that, except in the rare situation environmental deviations, which have an expected average where all contributing loci are diallelic with p = q, VA is value of zero. This equation does not include the term usually greater than zero. Models that decompose the phe- GxE, and thereby assumes no interaction between the notypic variance into components of VD and VE only, are genetic effects and the environmental effects. therefore biologically implausible. When more than one The variance of the phenotype, which itself is defined locus is involved and it is assumed that the effects of these by G + E, is given by VP = VG + VE + 2covGE where VP rep- loci are uncorrelated and there is no interaction (i.e., no resents the variance of the phenotypic values, VG represents epistasis), the VGs of each individual locus may be summed the variance of the genotypic values, VE represents the vari- to obtain the total genetic variances of all loci that influ- ance of the environmental deviations, and covGE represents ence a trait (Fisher, 1918; Mather, 1949). In most human the covariance between G and E. GE-covariance or GE- quantitative genetic models the observed VP of a trait is not correlation can be modelled in a twin design that includes modelled directly as a function of p, q, a, d and environ- the parents of twins (e.g., Boomsma & Molenaar, 1987a ; mental deviations (as all of these are usually unknown), but Fulker, 1988) or in a design that includes actual measure- instead is modelled by comparing the observed resemblance ments of the relevant genetic and environmental factors. between pairs of differential, known genetic relatedness, For simplicity we assume that VP = VG + VE. Statistically the such as MZ and DZ pairs. Ultimately p, q, a, d and envi- total genetic variance (VG) can be obtained by the standard ronmental deviations are the parameters quantitative formula for the variance: σ2 = Σ fi(xi – µ)2, where fi geneticists hope to “quantify”. Comparing the observed resemblance in MZ twins and denotes the frequency of genotype i, xi denotes the corre- sponding mean of that genotype (as given in Table 1) and DZ twins allows decomposition of observed variance in a µ denotes the population mean. Thus, VG = p2[2q(a – dp)] 2 trait into components of VA, VD (or VC, for shared environ- + 2pq[a(q – p) + d(1 – 2pq)] 2 + q2 [–2p(a + dq)] 2. Which mental variation which is not considered at this point), and can be simplified to VG = 2pq[a + d(q – p)] 2 + (2pqd)2 = VA VE . As MZ twins share 100% of their genome, the expecta- + VD (see e.g., Falconer & Mackay, 1996). tion for their covariance is COVMZ = VA + VD . If the phenotypic value of the heterozygous genotype The expectation for DZ twins is less straightforward: as lies midway between A1A1 and A2A2 (i.e., the effect of d DZ twins share on average 50 per cent of their genome as in Figure 1 equals zero), the total genetic variance simplifies stated earlier, they share half of the genetic variance that is to 2pqa2. If d is not equal to zero, the “additive” genetic transmitted from the parents (i.e., 1/2 VA). As VD is not Twin Research October 2003 363 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Daniëlle Posthuma, A. Leo Beem, Eco J. C. de Geus, G. Caroline M. van Baal, Jacob B. von Hjelmborg, Ivan Iachine, and Dorret I. Boomsma transmitted from parents to offspring it is less obvious matrices X, Y, and Z of dimensions 1 × 1, containing the where the coefficient of sharing for the dominance devia- path coefficients x, y, and z, respectively. The matrix algebra tions (i.e., the 0.25 mentioned earlier) derives from. If two notation for VP is XX T + YY T + ZZ T, where T denotes the members of a DZ twin pair share both of their alleles at a transpose of the matrix (and corresponds to tracing for- single locus they will have the same coefficient for d. If they wards through a path, see Neale & Cardon, 1992). The share no alleles or just one parental allele they will have no expectation for the MZ covariance is XXT + YYT and the similarity for the effect of d. The probability that two expectation for the DZ covariance is 0.5XXT + 0.25YYT. members of a DZ pair have received two identical alleles Including additional siblings in this design is straightfor- from both parents is the coefficient of similarity for d ward, and in this diagram the expectation for sib pair between them. The probability is 1/2 that two siblings (or covariance is also 0.5XXT + 0.25YYT (but one should note DZ twins) receive the same allele from their father, and the that it is not necessary to assume the same model for sib-sib probability is 1/2 that they have received the same allele covariance as for DZ covariance). from their mother. Thus, the probability that they have The inclusion of non-twin siblings (if available) will not received the same two ancestral alleles is 1/2 × 1/2 = 1/4 , and only enhance statistical power (Dolan et al., 1999b; the expectation for the covariance in DZ twins is COVDZ = 1 Posthuma & Boomsma, 2000), but also provides the /2 VA + 1/4 VD . opportunity to test several assumptions, such as whether Path Analysis and the covariance between DZ twins equals the covariance Structural Equation Modelling between non-twin siblings (which is often assumed, but can now be tested), whether the means and variances in The expectations for variances and covariances of MZ twins are similar to the means and variances observed in twins and DZ twins or sib pairs reared together may also be siblings, or whether twin-sib covariance is different from inferred from a path diagram (Wright, 1921, 1934), which is an often convenient non-algebraic representation of sib-sib covariance, across males and females. models such as are discussed here (see e.g., Neale & As in practice some families may for example consist Cardon, 1992, for a brief introduction into path analysis) of a twin pair and six additional siblings while other fami- and which can be translated directly into structural equa- lies consist of twins, the correlational method cannot tions. The parameters of these equations can be estimated be applied to estimate genetic and environmental contribu- by widely available statistical software (e.g., Mx and Lisrel). tions to the variance. Fortunately, these so-called non- Structural Equation Modelling (SEM) has several advan- rectangular (or unbalanced nested) data structures can be tages over merely comparing the MZ and DZ correlations handled with ease using a SEM approach. (Eaves, 1969; Jinks & Fulker, 1970). If the model assump- Threshold Models for Categorical Twin Data tions are valid, SEM produces parameter estimates with known statistical properties, while the correlational method So far we have considered quantitative traits. Several merely allows parameter calculation. SEM thus also allows observed traits, however, are measured on a non-continu- determination of confidence intervals and of standard ous scale, such as dichotomous traits (e.g., disease vs. no errors of parameter estimates and quantifies how well the disease; smoking vs. non-smoking) or ordinal phenotypes specified model describes the data. One can either directly (e.g., underweight/normal weight/overweight/obesity/ derive structural equations from a theoretical model, or use severe obesity), yielding summary counts in contingency path analysis as a non-algebraic intermediary to derive the tables instead of means and variances/covariances. structural equations. Reducing continuous scores like BMI to a categorical score like obese/non-obese should be avoided, as the statistical Extended Twin Design power to detect significant effects is much lower in categor- A convenient feature of SEM is the flexible handling of ical analyses (Neale et al., 1994).Contingency tables unbalanced data structures. This enables the relative easy typically contain the number of (twin) pairs (for each incorporation of data from a variable number of family zygosity group) for each combination (e.g., concordant members. In Figure 2 a path diagram is drawn for a uni- non-smokers, concordant smokers, discordant on smoking). variate trait measured in families consisting of a twin pair Because of the inherent polygenic background of complex and one additional sibling. As additive, dominant and traits, these data are often treated by assuming that an shared environmental effects are confounded in the twin design, the path diagram includes additive genetic influ- underlying quantitative liability exists with one or more ences (A), dominant genetic influences (D) and non-shared thresholds, depicting the categorization of subjects. environmental influences (E), but not shared environmental Although the liability itself cannot be measured, a stan- influences (C). Note that a path diagram for shared environ- dard-normal distribution is assumed for the liability. The mental influences can be obtained by substituting 0.25 (for thresholds (z-values in the standard normal distribution) the DZ correlation for dominant genetic influences) for 1.00 are chosen in such a way that the area under the standard (for the DZ correlation for shared environmental influences) normal curve between two thresholds (or from minus infin- and replacing the D with C for a latent shared environmen- ity to the first threshold, and from the last threshold to tal factor. infinity) reflects the prevalence of that category. To rewrite the model depicted in Figure 2 into struc- For one variable measured on single subjects, the preva- tural equations using matrix algebra we introduce three lence of category i is given by: 364 Twin Research October 2003 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Theory and Practice in Quantitative Genetics Figure 2 Path diagram representing the resemblance between MZ or DZ twins and an additional sibling, for additive genetic influences (A), dominant genetic influences (D), and non-shared environmental influences (E), for a univariate trait. The latent factors A, D, and E have unit variance, and x, y, and z represent the respective path coefficients from A, D, and E to the phenotype P. The path coefficients can be regarded as standardized regression coefficients. The phenotypic variance (VP) for the trait (the same for both members of a twin pair) equals x2 + y2 + z2 which equals VA + VD + VE. By applying the tracing rules of path analysis, the covariance between DZ twins (and sib pairs) is traced as 0.50x2 + 0.25y2, which equals 1/2 VA + 1/4 VD. The covariance between MZ twins is traced as x2 + y2 which equals VA + VD. ti ti si φ(v)dv φ(v)dv ti–1 ti–1 si–1 were ti-1 = –∞ for i = 0 and ti = ∞ for i = p for p categories, with ti-1, si-1 = –∞ for i = 0 and ti , si = ∞ for i = p. In this and φ(v) is the normal probability density function case, 1 2 φ(v) = ⏐2πΣ⏐–n/2 × e – — 2 v i T × Σ–1 × vi , e –0.5v , — 2π where Σ is the predicted correlation matrix. where π = 3.14. Although the underlying bivariate distribution cannot be observed, its shape depends on the correlation between the two liability distributions. A high correlation results in This can easily be extended to twin data, where one vari- a relatively low proportion of discordant twin pairs, able is available for both members of a twin pair (e.g., whereas a low correlation results in a much higher propor- obesity yes/no in twin 1 and twin 2). In this case, the tion. Based on the contingency tables one may calculate underlying bivariate normal probability density function tetrachoric (for dichotomous traits) or polychoric (for will be characterized by two liabilities (that can be con- ordinal traits) twin correlations between the trait measured strained to be the same) and by a correlation between them: in twin 1 and the trait measured in twin 2 for each zygosity Twin Research October 2003 365 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Daniëlle Posthuma, A. Leo Beem, Eco J. C. de Geus, G. Caroline M. van Baal, Jacob B. von Hjelmborg, Ivan Iachine, and Dorret I. Boomsma group. Subsequently, genetic analyses can be conducted data when the same subject is assessed repeatedly in time. that use these correlations in the same way as is done with Figure 3 is a path diagram for a bivariate design (two mea- continuous data. The practical extension of this method to surements per subject; four measures for a pair of twins or the analysis of more than two categorical variables may not siblings). The corresponding matrix algebra expressions for be straightforward, especially if many variables are involved. the expected MZ or DZ variances and covariances are the The method then needs an asymptotic weight matrix (see same as for the univariate situation, except that the dimen- Neale & Cardon, 1992) and the tetrachoric or polychoric sions of matrices X, Y, and Z are no longer 1×1. An often correlations should be estimated simultaneously, not pair- convenient form for those matrices is lower triangular of wise, as the latter method often produces improper dimensions n × n (where n is the number of variables correlation matrices. The computations may then become assessed on a single subject; in Figure 3, n = 2). The sub- very time consuming. The preferred method for categorical scripts of the path coefficients correspond to matrix multivariate designs is to conduct analyses directly on all elements (i.e., xij denotes the matrix element in the i-th row available raw data, fitting each twin or sib pair individually and j-th column of matrix X). The path coefficients sub- using the multivariate normal probability density functions scripted by 21 reflect the variation that both measured given above. Again, the underlying correlation matrix of phenotypes have in common. For example, if the path family data can be decomposed into genetic and environ- mental influences in the same way as can be done for denoted by x21 is not equal to zero, this suggests that there continuous data. are some genes that influence both phenotypes. Thus, multivariate genetic designs allow the decompo- Multivariate Analysis of Twin Data sition of an observed correlation between two variables into The univariate model can easily be extended to a multivari- a genetic and an environmental part. This can be quantified ate model when more than one measurement per subject is by calculating the genetic and environmental correlations available (Boomsma & Molenaar, 1986, 1987b; Eaves & and the genetic and environmental contributions to the Gale, 1974; Martin & Eaves, 1977) or for longitudinal observed correlation. Figure 3 Path diagram representing the resemblance between MZ or DZ twins, for additive genetic influences (A), dominant genetic influences (D), and non-shared environmental influences (E), in a bivariate design. 366 Twin Research October 2003 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Genotyping in GenomEUtwin The additive, dominance and environmental (co)vari- trait that in turn influences the second trait. Or, there may ances can be represented as elements of the symmetric be genes that act in a pleiotropic way (i.e., they influence matrices A = XXT, D = YYT, and E = ZZT. They contain both traits but neither trait influences the other). Genetic the additive genetic, dominance, and non-shared environ- correlations do not distinguish between these situations, mental variances respectively on the diagonals for variables but merely provide information on the nature of the causes 1 to n. X, Y and Z are known as the Cholesky decomposi- of covariation between two traits. tion of the matrices A, D and E, which assures that these matrices are nonnegative definite. The latter is required for Longitudinal Analysis of Twin Data variance-covariance matrices. The aim of longitudinal analysis of twin data is to consider The genetic correlation between variables i and j (rgij) is the genetic and environmental contributions to the dynam- derived as the genetic covariance between variables i and j ics of twin pair responses through time. In this case the (denoted by element ij of matrix A; aij) divided by the phenotype is measured at several distinct time points for square root of the product of the genetic variances of vari- each twin in a pair. To analyse such data one must take the ables i (aii) and j (ajj): serial correlation between the consequent measurements aij of the phenotype into consideration. The classical genetic rgij = —— . analysis methods described in previous sections are aimed ai i ×ajj at the analysis of a phenotype measured at a single point Analogously, the environmental correlation (reij) between in time and provide a way of estimating the time-specific variables i and j is derived as the environmental covariance heritability and variability of environmental effects. How- between variables i and j divided by the square root of the ever, these methods are not able to handle serially correlated product of the environmental variances of variables i and j: longitudinal data efficiently. eij To deal with these issues the classic genetic analysis reij = —— . methods have been extended to investigate the effects of eii ×ejj genes and environment on the development of traits over The phenotypic correlation r is the sum of the product of time (Boomsma & Molenaar, 1987b; McArdle 1986). the genetic correlation and the square roots of the standard- Methods based on the Cholesky factorization of the covari- ized genetic variances (i.e., the heritabilities) of the two ance matrix of the responses treat the multiple phenotype phenotypes and the product of the environmental correla- measurements in a multivariate genetic analysis framework tion and the square roots of the standardized environmental (as discussed under “Multivariate Analysis of Twin Data”). variances of the two phenotypes: “Markov chain” (or “Simplex”) model methods (Dolan, 1992; Dolan et al., 1991) provide an alternative account of r = rgij × — aii (aii + eii) × — ajj (ajj + ejj) change in covariance and mean structure of the trait over time. In this case the Markov model structure implies that future values of the phenotype depend on the current trait + reij × — × e (a + e ) ii ii — e (a + e ) ii jj jj jj values alone, not on the entire past history. Methods of function-valued quantitative genetics (Pletcher & Geyer, 1999) or the genetics of infinite-dimensional characters (i.e., observed correlation is the sum of the genetic (Kirkpatrick & Heckman, 1989) have been developed for contribution and the environmental contribution). situations where it is necessary to consider the time variable on a continuous scale. The aim of these approaches is to The genetic contribution to the observed correlation investigate to what extent the variation of the phenotype at between two traits is a function of the two sets of genes that different times may be explained by the same genetic and influence the traits and the correlation between these two environmental factors acting at the different time points sets. However, a large genetic correlation does not imply a and to establish how much of the genetic and environmen- large phenotypic correlation, as the latter is also a function tal variation is time-specific. of the heritabilities. If these are low, the genetic contribu- An alternative approach for the analysis of longitudinal tion to the observed correlation will also be low. twin data is based on random growth curve models (Neale If the genetic correlation is 1, the two sets of genetic & McArdle, 2000). The growth curve approach to genetic influences overlap completely. If the genetic correlation is analysis was introduced by Vandenberg & Falkner (1965) less than 1, at least some genes are a member of only one of who first fitted polynomial growth curves for each subject the sets of genes. A large genetic correlation, however, does and then estimated heritabilities of the components. These not imply that the overlapping genes have effects of similar methods focus on the rate of change of the phenotype (i.e., magnitude on each trait. The overlapping genes may even its slope or partial derivative) as a way to predict the level at act additively for one trait and show dominance for the a series of points in time. It is assumed that the individual second trait. A genetic correlation less than 1 therefore phenotype trajectory in time may be described by a para- cannot exclude that all of the genes are overlapping metric growth curve (e.g., linear, exponential, logistic etc.) between the two traits (Carey, 1988). Similar reasoning up to some additive measurement error. The parameters of applies to the environmental correlation. the growth curve (e.g., intercept and slope, also called Genetic correlations do not provide information on the latent variables) are assumed to be random and individual- direction of causation. In fact, genes may influence one specific. However, the random intercepts and slopes may be Twin Research October 2003 367 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Daniëlle Posthuma, A. Leo Beem, Eco J. C. de Geus, G. Caroline M. van Baal, Jacob B. von Hjelmborg, Ivan Iachine, and Dorret I. Boomsma dependent within a pair of twins because of genetic and if L denotes the vector of the random growth curve para- shared environmental influences on the random coeffi- meters, the matrix form of the linear growth curve model cients. The basic idea of the method is that the mean and is: Y = DL + E ε11 α1 () covariance structure of the latent variables determines the where () () … 1 1 expected mean and covariance structure of the longitudinal phenotype measurements and one may therefore estimate β ( ) L = 1 , D = F 0 , F = 1 2 , E = 1n α2 0 F ε ε21 … … the characteristics of the latent variable distribution based β2 … 1 n on the longitudinal data. ε2n The random growth curve approach shifts focus of the genetic analysis towards the new phenotypes — the para- Note, that the linear growth curve model actually repre- meters of the growth curve model. This framework permits sents a Structural Equation Model with latent variables αi to investigate new questions concerning the nature of and βi (and εit’s) with loadings of the latent variables on the genetic influence on the dynamic characteristics of the phe- observed responses Yit given by either 1 or t . This implies, notype, such as the rate of change. If the random that this model may be analysed using the general SEM parameters of the growth curve would be observed, they techniques. In particular, parameter estimation may be might have been analysed directly using the classical carried out via the Maximum Likelihood method under methods of multivariate genetic analysis. However, their latent nature requires a more elaborate statistical approach. multivariate normality assumptions using the fact that the Since the growth curve model may be formulated in terms moment structure (mY, ΣY) of Y can be expressed in terms of the mean and covariance structure of the random para- of m, Σ and Var(εit), where m = E(L), Σ = Cov(L, L): mY = meters one might simply take the specification of the mean Dm, ΣY = D Σ DT + Σε, where Σε = Cov(E, E). and covariance structure of a multivariate phenotype as pre- As described previously in the section on multivariate dicted by the classical methods of multivariate genetic genetic analysis, the two-dimensional phenotype (αi , βi) analysis and plug it in into the growth curve model. The may be analysed by modelling the covariance matrix Σ for resulting two-level latent variable model would then allow MZ and DZ twins using the Cholesky factorisation for multivariate genetic analysis of the random coefficients. approach. The two-level model construction leads to a In the following sections we consider the bivariate parameterisation of the joint likelihood for the trait in linear growth curve model applied to longitudinal twin terms of the variance components, the respective mean data using age as timescale. The approach may be extended vectors and residual variances. This yields estimates of the to other parametric growth curves (e.g., exponential, logis- two heritability values of αi and βi (and respective variabili- tic etc.) using first-order Taylor expansions and the ties of the environmental effects) and also estimates of resulting mean and covariance structure approximations correlations between the genetic and environmental com- (Neale & McArdle, 2000). We also describe a method for ponents of α i and β i , as described earlier under obtaining the predicted individual random growth curve “Multivariate Analysis of Twin Data”. parameters using the empirical Bayes estimator. These pre- Predicting the Random Intercepts and Slopes dictions may be useful for selection of most informative pairs for subsequent linkage analysis of the random inter- In the following, we briefly describe a method for obtaining cepts and slopes. the predicted individual random growth curve parameters. These predictions may be useful for selection of most infor- Linear Growth Curve Model mative pairs for subsequent linkage analysis of the random A simple implementation of the random effects approach intercepts and slopes. may be carried out using linear growth curve models. In The prediction of the individual random growth curve this case each individual is characterized by a random inter- parameters may be given by the empirical Bayes estimates, cept and a random slope, which are considered to be the that is, the conditional expectation of the random growth new phenotypes. In a linear growth curve model the con- curve parameters given the measurement Y = y. As noted tinuous age-dependent trait (Y1t , Y2t) for sib 1 and sib 2 are above, Y and L are assumed multivariate normal, i.e., Y ~ assumed to follow a linear age-trajectory given the random MVN(mY; ΣY) and L ~ MVN(m; Σ). The joint distribution slopes and intercepts with some additive measurement of Y and L is given by error: Yit = αi + βi t + εit , for sibling i, where i = 1, 2, t denotes the timepoint (t = 1, 2, …, n), αi and βi are the individual (random) intercept and slope of sib i, respec- tively, and εit is a zero-mean individual error residual, which is assumed to be independent from αi and βi. The aim of []L ∼MVN Y {[ m mY ][ , Σ ΣDT DΣ ΣY ]} . the study is then the genetic analysis of the individual inter- By multivariate Gaussian theory, the predicted random cepts ( α i ) and slopes ( β i ). The model may easily be growth curve parameters of a pair with measurement Y = y ^ extended to include covariates. may then be given by L(y) = m + ΣDT Σ–1Y (y – mY) for esti- Assume the trait (Yit ) is measured for the two sibs at t = mated parameters. The estimator is the best linear 1, 2, …, n. The measurements on both twins at all time unbiased predictor of the individual random growth curve points may be written in vector form as Y = (Y11,…., Y1n, parameters (see Harville, 1976). Furthermore, an assess- Y21, …, Y2n)T (where T denotes transposition). Furthermore, ment of the error in estimation is provided by the variance 368 Twin Research October 2003 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Genotyping in GenomEUtwin ^ ^ of the difference L – L given by: Var(L – L|Y) = Σ – ΣDT ment (cultural transmission) relevant for a certain trait Σ–1Y DΣ. (Eaves et al., 1977). Effects of cultural transmission can be measured using a twin design that also includes the parents Additional Components of Variance of twins (Boomsma & Molenaar, 1987a; Fulker, 1988). In the aforementioned models the absence of effects of Active GE-correlation is the situation where subjects of a genes × environment interaction, of a genes–environment cor- certain genotype actively select environments that are corre- relation and of assortative mating was assumed. Using the lated with that genotype. Reactive GE-correlation refers to appropriate design these effects can relatively easy be incor- the effects of reactions from the environment evoked by an porated in structural equation models. individual’s genotype. The presence of a positive GE-corre- G×E interaction occurs when the effects of the environ- lation leads to an increase in the phenotypic variance. It is ment are conditional on an individual’s genotype, such as difficult to measure GE-correlation, however, as active and when some genotypes are more sensitive to the environ- reactive GE-correlation necessitate the direct measurement ment than other genotypes. Genetic studies on crops and of these influences (Falconer & Mackay, 1996). Falconer animal breeding experiments have shown that G×E interac- and Mackay (1996) state that GE-correlation is best tion is extremely common (see summary in Lynch & regarded as part of the genetic variance because “… the Walsh, 1998. pp. 657–686). However, in general G×E non-random aspects of the environment are a consequence interaction accounts for less than 20% of the variance of a of the genotypic value …” trait in the population (Eaves, 1984; Eaves et al., 1977). Assortative mating refers to a correlation between the G×E interaction can be modelled according to two phenotypic values of spouses (e.g., Willemsen et al., 2002). main methods. In the first method the presence of G×E Assortative mating can be based on social homogamy (i.e., interaction is explored by studying the same trait in two preferential mating within one’s own social class) or may environments (or at two time points). A genetic correlation occur when mate selection is based on a certain phenotype between the two measurements that is less than one indi- (P; which in turn is a function of G and E). Phenotypic cates the presence of G×E interaction (Boomsma & assortative mating tends to increase additive genetic varia- Martin, 2002; Falconer, 1952). However, a genetic correla- tion (and therefore in the overall phenotypic variation; tion equal to one does not need to imply the absence of Lynch & Walsh, 1998) and consequently increases the G×E interaction (see Lynch & Walsh, 1998). In human resemblance between parents and offspring as well as the quantitative genetic analyses it is often not possible to resemblance among siblings and DZ twins. Statistically, it control the environmental or genetic influences, unless the may conceal the presence of non-additive genetic effects and specific genotype and specific environmental factors are overestimate the influence of additive genetic factors (Eaves explicitly measured (see Boomsma et al., 1999; Dick et al., et al., 1989; Cardon & Bell, 2000; Carey, 2002; Heath et 2001; Kendler & Eaves, 1986; Rose et al., 2001; but see al., 1984; Posthuma et al., in press). Assortative mating is Molenaar et al., 1990, 1999). known to exist for traits such as intelligence, exercise behav- In a second method the presence of G×E interaction is ior and body height and weight (e.g., Aarnio et al., 1997; explored by correlating the MZ intra-pair differences and Boomsma, 1998; Vandenberg, 1972), but is to a large extent MZ pair sums (Jinks & Fulker, 1970). Assuming that MZ an unexplored topic in most human populations. twin similarity is purely genetic, a relation between MZ means and standard deviations suggests the presence of Gene Finding G×E interaction. In addition to the estimation of the effects of unmeasured G×E interaction is often not included in quantitative genes and environmental influences on traits, SEM can genetic models. However, if the true world does include also be used to test the effects of measured genetic and envi- GxE interaction, assuming its absence may lead to biased ronmental factors. In the following sections we will estimates of G and E (Eaves et al., 1977). For example if concentrate on the detection of the actual genes that influ- G×E interaction was truly gene by non-shared environment ence a trait. Two methods are currently employed: interaction, a model without G×E interaction will result in Quantitative trait loci (QTL) linkage analysis and associa- overestimation of the effects of the non-shared environ- tion analysis. ment. If, however, G×E interaction was interaction between genes and shared environmental influences, QTL Linkage Analysis assuming its absence will result in overestimation of the Quantitative trait loci (QTL) linkage analysis establishes effect of genes on the phenotype, as well as in overestima- relationships between dissimilarity or similarity in a quanti- tion of the influence of the shared environmental on the tative trait in genetically related individuals and their phenotype. The separate detection of these two biased dissimilarity or similarity in regions of the genome. If such effects in the presence of genes by shared environmental a relationship can be established with sufficient statistical interaction necessitates the inclusion of twin pairs reared confidence, then one or more genes in those regions are apart (Eaves et al., 1977; Jinks & Fulker, 1970). possibly involved in trait (dis)similarity among individuals. GE-correlation occurs when the genotypic and environ- Linkage analysis depends on the co-segregation (i.e., a mental values are correlated. Three different forms of violation of Mendel’s law of independent assortment which GE-correlation have been described (Plomin et al., 1977; applies only to inheritance of different chromosomes) of Scarr & McCartney, 1983). Passive GE-correlation occurs alleles at a marker and a trait locus (Ott, 1999). If a pair of for example when parents transmit both genes and environ- offspring has received the same haplotype from a parent in Twin Research October 2003 369 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Daniëlle Posthuma, A. Leo Beem, Eco J. C. de Geus, G. Caroline M. van Baal, Jacob B. von Hjelmborg, Ivan Iachine, and Dorret I. Boomsma a certain region of the genome, the pair is said to share the unambiguously known, it is usually estimated probabilisti- parent’s alleles in that region identical by descent (IBD). cally from the specific allele pattern across chromosomes of Since offspring receive their haplotypes from two parents, two or more siblings (Abecasis et al., 2002; Kruglyak et al., the pair can share 0, 1 or 2 alleles IBD at a certain locus in 1996). The estimate of π is referred to as π, ^ and can be cal- a region. The IBD status of a pair is usually estimated for a culated as (Sham, 1998): π = /2 pIBD1 + pIBD2. ^ 1 number of markers with (approximately) known location We call the correlation between the dominance values along the genome and is then used as the measure of within a population of sib pairs δ, which in the population ^ genetic similarity at the marker. The IBD status at a marker can be estimated as: δ = pIBD2, where pIBD2 is the proportion is informative for the IBD status at any other locus along of sib pairs that share two alleles IBD in this population. the chromosome as long as the population recombination This can be incorporated in a path diagram (Figure 4). fraction between the marker and the locus is less than 0.5. The path coefficients v and w (Figure 4) and the rela- In that case the IBD status at the marker and the locus are tive contribution of the factors Am and Dm to the correlated in the population and hence similarity at the phenotypic variation are a function of the recombination marker is informative for similarity at the locus. The locus fraction between the marker and the trait locus and the may be a gene or may be located near a gene. If variation in magnitude of the genetic effects of the trait locus. the gene and variation in the trait are related (i.e., the gene Relatively small effects of the factors Am and Dm can thus is a QTL), then variation in the IBD status at the locus and either reflect a situation with small effects at the trait locus thus also at the marker will be related to variation in trait and a small recombination fraction (close to zero) or may similarity. Identity by descent is distinct from identity is reflect large effects at the trait locus in combination with a distinct from identical by state (IBS), which denotes the large (close to 0.5) recombination fraction between the number of physically identical alleles that a pair of off- marker and the trait locus. spring has received, and that have not necessarily been The expectation for the variance is in algebraic terms x2 inherited from the same parent. Table 2 gives all possible + y2 + v2 + w2 + z2 , the expectation for the covariance sib pairs from a A1A2 by A1A2 mating and their IBD / among MZ twins is x2 + y2 + v2 + w2 , and for the covari- IBS status. ^ ance among DZ twins is 1/2 x 2 + 1/4 y 2 + πv ^ 2 + δw2 . A commonly made distinction is between parametric Translating this into matrix algebraic terms we introduce and nonparametric models for linkage analysis. Parametric models, which are discussed at length by Ott (1999), matrix Q (i.e., the product of matrices V and VT, represent- require a fairly detailed specification of population charac- ing the additive genetic influences on the phenotype at the teristics of the gene, such as allele frequencies and marker site) and matrix R (i.e., the product of matrices W penetrances. Nonparametric models are not nonparametric and WT, representing the dominant genetic influences on in the usual statistical meaning of nonparametric, but these the phenotype at the marker site). Written in matrix are parametric models that require fewer assumptions than algebra the expectation for the variance equals XXT + YYT + nonparametric linkage models. Parametric models use a ZZT + VVT + WWT (or A + D + E + Q + R), the expecta- fairly simple relationship between the contribution of a tion for the covariance of MZ twins equals XXT + YYT + QTL to the covariance of the trait values of pairs of indi- VVT + WWT (or A + D + Q + R), and the expectation for viduals, the pairs’ IBD status at a certain location on the the covariance of DZ twins equals 0.5XXT + 0.25YYT + ^ ^ genome and the recombination fraction between the locus πVVT + δ WWT (or 0.5A + 0.25D + πQ ^ ^ + δ R). and the QTL. This relationship is usually expressed as a Testing whether the elements of matrices V and W are function of the proportion πi of alleles shared identical by statistically different from zero provides a test for linkage at descent, π i = i/2 for i = 0, 1, 2. For sibling pairs, for a particular marker position. In a genome screen this test is instance, the contribution to the covariance given the pro- conducted for each marker along the genome. Those portion πi is f(θ) πi σ 2, where σ 2 is the QTL’s contribution marker positions for which the 2 difference exceeds a to the population trait variance, θ is the recombination certain critical value are believed to be linked to a QTL. ^ fraction between the locus and the QTL, and the monoto- Apart from calculating π's ^ and δ's to model the linkage nic function f(θ) equals 0 and 1 for recombination fractions component, one may also apply a mixture model. In this 0.5 and 0, respectively. Since IBD status is not always model for each sib pair three models (for IBD = 0, IBD = Table 2 IBD / IBS Status from all Possible Sib Pairings from Parental Mating Type A1A2 (father) A1A2(mother) Sib 1 A1A1 A1A2 A2A1 A2A2 Sib 2 A1A1 2/2 1/1 1/1 0/0 A1A2 1/1 2/2 0/2 1/1 A2A1 1/1 0/2 2/2 1/1 A2A2 0/0 1/1 1/1 2/2 Note: A1A2 (Father in bold) × A1A2 (Mother in normal text). 370 Twin Research October 2003 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
Genotyping in GenomEUtwin Figure 4 Path diagram representing the resemblance between MZ or DZ twins, for background additive genetic influences (A), background dominant genetic influences (D), additive genetic influences due to the marker site (Am), dominant genetic influences due to the marker site (Dm), and non- shared environmental influences (E), in a univariate design. 1, IBD = 2) are fitted to the data that are weighted by their By a simple rewriting of the likelihood of the joint dis- relative probabilities. tribution under normality, Wright (1997) demonstrated Apart from these variance components methods for that the difference and sum together carry all the informa- linkage analyses, other statistical methods for conducting a tion in the variance and covariance of the joint QTL linkage analysis have been proposed. Haseman and distribution. He suggested regression approaches using Elston (1972) developed a now classical model for QTL both the difference and the sum. Subsequently several linkage analysis. In the HE model the squared difference of authors (Drigalenko, 1998; Forrest, 2001; Sham & Purcell, the trait values y1 and y2 of siblings 1 and 2 is regressed on 2001; Sham et al., 2002; Visscher & Hopper, 2001; Xu et the value of πi at a particular location on the genome. The al., 2000) proposed regression methods that use the infor- average of the squared difference in a given population mation in both the squared sum and the squared difference equals var (y1 ) + var (y2 ) – 2 cov (y1, y2 ). As the assignment for inference about a QTL effect. These methods were of individuals’ trait values to y1 or y2 is usually arbitrary this becomes 2var (y) – 2 cov (y1, y2 ). With a few additional shown to be nearly as powerful as the likelihood-based vari- assumptions only cov (y1, y2) varies as a function of the IBD ance component methods. The methods generally have the status at the locus, although it does also depend on the vari- advantage that the computations are easy and fast, which is ances of other QTL’s. Then the regression equation for the of some importance if the models are fitted for many loca- squared difference is µ - f (θ)2 σ 2πi, where πi is the regres- tions along the genome. In contrast, the variance sor and the constant µ contains variances associated with components methods can be quite time consuming and the total environmental and genetic effects, including the may, moreover, yield estimates that do not maximize the QTL. A statistically significant negative estimated regres- likelihood, which is required for the validity of the distribu- sion weight is suggestive of a QTL at or near the locus. tion theory on which inferences about the QTL are based. Twin Research October 2003 371 Downloaded from https://www.cambridge.org/core. UQ Library, on 10 Feb 2020 at 19:38:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1375/twin.6.5.361
You can also read