HOW MANY TAXA MUST BE SAMPLED TO IDENTIFY THE ROOT NODE OF A LARGE CLADE?
Syst. Biol. 45(2):168-173, 1996

MICHAEL J. SANDERSON
Section of Evolution and Ecology, University of California, Davis, California 95616, USA; E-mail: mjsanderson@ucdavis.edu

Downloaded from http://sysbio.oxfordjournals.org/ by guest on November 8, 2015

Abstract. The importance of choice of taxa in phylogenetic analysis has been explored mainly with reference to its effect on the accuracy of tree estimation. Taxon sampling can also introduce other kinds of errors. Even if the sampled topology agrees with the true topology, it may not include the true root node of a clade, a node that is of interest for many reasons. Using a simple Yule model for the diversification process, the probability of identifying this node is derived under random sampling of taxa. For large clades, the minimum sample size needed to be 95% confident of identifying the root node is approximately 40 and is independent of the size of the clade. If rates of diversification differ in the two sister groups descended from the root node, the minimum sample size needed increases markedly. If these two sister groups are so different in diversity that a Yule model would be rejected by conventional diversification tests, then the necessary sample size is an order of magnitude greater than when diversification is homogeneous. [Diversification; phylogeny; branching; speciation; Yule model; taxon sampling.]

The recent publication of a very large phylogenetic analysis of seed plants based on chloroplast rbcL data (Chase et al., 1993) has raised a number of interesting questions about phylogenetic analyses of large clades. Among these questions are computational issues related to reconstructing optimal trees using heuristic algorithms (Rice et al., 1995) and the choice of taxon sampling scheme for groups that are either large or poorly understood phylogenetically. The rbcL analysis included nearly 500 sequences, a remarkable and possibly record-setting number but one that samples barely 0.2% of seed plant diversity. Other similarly large clades probably will remain sparsely sampled by systematists for the foreseeable future. How much sampling is enough in groups that are exceptionally species rich?

Naturally, we must first agree on "enough for what?" Most recent work on taxon sampling has focused on whether or not the sampled taxa are sufficient to reconstruct the true sample tree (e.g., Lecointre et al., 1993). The true sample tree is the tree of sampled taxa remaining after unsampled taxa are pruned from the true tree of all the taxa. The issue is accuracy of estimation of the true tree by this sample tree. This issue is important but difficult because the conditions under which phylogenetic algorithms give the correct answer even when all taxa have been sampled are even now understood only in a few special cases, usually for small numbers of taxa (e.g., Huelsenbeck and Hillis, 1993). When taxa are omitted, as they commonly are in higher level analyses, the effect of omission is less clear. For example, there has been considerable support for the idea that "long" branches should be broken up by sampling additional taxa to prevent an unwanted trip into the "Felsenstein zone" of inconsistency (Felsenstein, 1978). J. Kim (pers. comm.), however, has recently shown conditions under which more intense taxon sampling actually will increase the likelihood of inconsistency.

However, there are also phylogenetic issues that depend on taxon sampling but are more or less decoupled from the accuracy of the estimated tree. One of these issues is the identification of the root node, or most recent common ancestor, of a clade. Even if one knew that the phylogeny for some sample of taxa were correct, one might not be sure that the root node of that sample tree was the same as the root node of the tree consisting of all representatives of the sampled clade (Fig. 1). Sometimes the root node would be nested well within
the phylogeny of the entire clade. Yet the identification of the true root node is essential. Outgroup analysis rests on the assumption that the "sample" root node represents the real root node. Otherwise, the reconstruction of ancestral states at the sample root node might be rather different from the states that would be reconstructed at the real root node. One method of dealing with large phylogenies is to synthesize root states for large clades that can then be used as terminal taxa. The synthesis of root states for a clade may be biased if the node taken as the root is actually much more apical in the tree than the real root node of the entire clade. Although [...] of itself has rarely been discussed. More generally, the identification of the root node amounts to a restriction on the possible phylogenetic relationships of taxa not yet sampled; they will be descendants of that root node rather than sister taxa or more distant relatives. Identification of the root node therefore represents real progress toward understanding a large clade.

FIGURE 1. Illustration of the effect of taxon sampling on the identification of the true root node (RN) of a clade. The sample root node (SRN) is identified when just the three taxa, $\omega_1$, $\omega_2$, and $\omega_3$, are sampled from the entire clade. RN would be identified if and only if at least one taxon from each of the two clades descended from RN were sampled.

In this paper, I derive a simple formula for the probability that the root node of a sample of taxa is the same node as the root node of some larger set of taxa from which the sample was drawn. The formula makes it possible to answer questions such as "what is the probability that the 500 species sampled in the angiosperm rbcL analysis have the same root node as do all 240,000 angiosperms?" The calculation requires two assumptions: (1) a clade has been circumscribed a priori, perhaps on the basis of some shared set of morphological novelties, and (2) diversification (speciation and extinction) occurs according to some model that can be specified. Three such models are examined here; two are extreme and unrealistic, and the third, which is bracketed by the first two, is considerably more reasonable.

DERIVATION

Assume that a sample ($\omega$) of k species is drawn from a collection (C) of N species that form a monophyletic group. Assume that speciation occurs by bifurcation (or equivalently that any apparent polytomies comprise on closer inspection merely remarkably short but non-zero-length branches in a truly bifurcating tree). Let the true phylogeny of C be $\Phi(C)$, and let the true phylogeny of $\omega$ be $\Phi(\omega)$, which is obtained merely by pruning away the unsampled taxa from $\Phi(C)$. We are not concerned with the phylogenies that are reconstructed by some tree-building algorithm, only with the true tree. Denote the root node of a phylogeny by $R(\Phi)$. The condition $R[\Phi(\omega)] = R[\Phi(C)]$ will be met if and only if at least one species from
each of the two sister groups descended from the root node of C is included in the sample, $\omega$. The task is to design a sampling scheme that will ensure this inclusion. However, unless considerable knowledge about relationships in the vicinity of the root node is available, it may be necessary to rely on simple random sampling of species from C. Given random sampling of taxa, it is enough to know the probability, $P(N_1, N_2)$, that the two sister groups will have $N_1$ and $N_2$ taxa, subject to the constraint that $N_1 + N_2 = N$. Then, for each possible observation of $N_1$ and $N_2$, one need only calculate the probability that a random sample will contain at least one species from each of these sister groups. Together this is

$$p = \sum_{N_1=1}^{N-1} P(N_1, N_2)\left[1 - \left(\frac{N_1}{N}\right)^{k} - \left(\frac{N_2}{N}\right)^{k}\right]. \tag{1}$$

The term in brackets is the probability that a random sample of k balls drawn from an urn containing $N_1$ green balls and $N_2$ red balls contains at least one green and one red ball. The probability before it depends on the particular model of diversification chosen (i.e., how likely is it that the urn will contain the colors observed), here the probability of the observed diversities (based on some model of diversification). Finally, the summation is included to consider the mutually exclusive events of each of the different possible observations on diversities. It is assumed that N > 1.

Now consider three different patterns of diversification, ignoring extinction for the moment. First, suppose that speciation is completely homogeneous and clocklike, such that the tree is balanced and the two sister group diversities are always the same at any time. Equation 1 reduces in this trivial case to

$$p = 1 - 2\left(\frac{1}{2}\right)^{k}. \tag{2}$$

This equation is independent of the size of the original clade. To obtain a 95% probability of reconstructing the true root node, k must be ≥ 6 taxa, and for 99%, k ≥ 8 taxa. This number is perhaps startlingly low.

However, an equally extreme model of diversification generates an entirely unbalanced (comblike or pectinate) tree in which each node is the ancestor of a single species in one sister group and all the remaining species not already accounted for in the other sister group. In that case the required probability is

$$p = 1 - \left(\frac{1}{N}\right)^{k} - \left(\frac{N-1}{N}\right)^{k}, \tag{3}$$

a number that is generally small unless k is a large fraction of N. For the angiosperm case, we would need a sample of 237,500 species to be sure that we had identified the root node at the 95% level. This result is obviously not as encouraging as was the last result.

Clearly, these widely divergent results confirm the worst fears of those who object to the use of models in phylogenetic inference. However, no reasonable model of diversification could produce either of these patterns of sister group diversity. Between these two extreme models lies a class of presumably more realistic diversification models. The Yule or pure-birth model, which uses a Poisson process for speciation in each lineage, has been widely used in studies of diversification (Raup, 1985; Nee et al., 1992). Its properties are well understood, and it has provided an adequate fit to many real data sets in applications using both fossil data and data on standing diversity alone (reviewed by Sanderson and Donoghue, 1996). In the present context, it has one very desirable property that leads to a fairly simple reduction of Equation 1. Under a Yule model, every division of the N taxa into two sister groups occurs with equal probability of 1/(N - 1). Thus, an observation of 1 and 99,999 species is as probable as 50,000 and 50,000. This seemingly counterintuitive result (Slowinski and Guyer, 1989) is actually quite reasonable once it is understood that any particular realization of the stochastic process is just as likely or unlikely. Substituting this as the required probability in Equation 1 gives
$$p = \sum_{N_1=1}^{N-1} \frac{1}{N-1}\left[1 - \left(\frac{N_1}{N}\right)^{k} - \left(\frac{N-N_1}{N}\right)^{k}\right] = 1 - \frac{2}{N-1}\sum_{N_1=1}^{N-1}\left(\frac{N_1}{N}\right)^{k}. \tag{4}$$

The last line is an exact result but can be tedious to calculate without a symbolic math program, even for fairly small N. For moderate to large values of N, the sum can be approximated by an integral in the following way:

$$\sum_{N_1=1}^{N-1}\left(\frac{N_1}{N}\right)^{k} \approx \int_{0}^{N-1/2}\left(\frac{n}{N}\right)^{k} dn = \frac{N}{k+1}\left(\frac{N-1/2}{N}\right)^{k+1}, \tag{5}$$

and then

$$p \approx 1 - \frac{2N}{(N-1)(k+1)}\left(\frac{N-1/2}{N}\right)^{k+1}. \tag{6}$$

This approximation is quite good over a broad range of N and k unless N is on the order of ≤ 10 species. For N values greater than about 100, which includes all cases that might reasonably be considered "large" clades, Equation 6 simplifies even further to

$$p \approx 1 - \frac{2}{k+1}. \tag{7}$$

Oddly enough, Equation 7 is independent of the size of the underlying clade, so long as it is large enough for the approximation to hold (if not, use Eq. 6 or Eq. 4). In the angiosperm case, a sample of about 40 taxa is sufficient to guarantee that the root node of all angiosperms has been identified at a confidence level of 95%. The same results obtain if the clade sampled is considerably smaller. One must also sample about 40 species in the legume genus Oxytropis, which has only 300 species. This assault on intuition can be explained because the distribution of relative diversities in the two sister groups descended from the root node is itself largely independent of how many species have evolved at any point in time.

OTHER MODES OF DIVERSIFICATION

It is unlikely that any large clade diversifies homogeneously during its entire history. Indeed, for angiosperms there is evidence of a shift in diversification rate early in its history, one portion of the clade diversifying much more rapidly than its sister group (Sanderson and Donoghue, 1994). Angiosperms include some relatively recent and highly species-rich clades, such as the family Asteraceae (composites), which dates from the Oligocene and contains upwards of 21,000 species (Cronquist, 1981). The calculations above will be affected by diversification shifts primarily if these shifts are preferentially associated with one of the two sister groups descended from the root node. This might happen if there were a shift in rate in one lineage immediately following the first split, if there were one or more shifts in rate (biased toward increases or decreases) anywhere in one of the sister groups, or even if there were more instances of such biased shifts in one group versus the other. Any of these biases will tend to generate trees that are more asymmetric than expected under the homogeneous model. In turn, larger sample sizes will be needed to ensure that at least one species from the smaller sister group is included in any sample.

Computer simulations were run to examine the effect of nonhomogeneous diversification on the sample size necessary for identification of the root node. The two clades descended from the root were allowed to diversify, each according to a Yule model with a different rate parameter. Then the observed species diversities in the two clades generated from this process were used to calculate the probability of correctly identifying the root node (an exact calculation given by the bracketed term in Eq. 1) for progressively larger samples of taxa. When the fraction among 1,000 simulations indicated 95% confidence for a
particular sample size, the increase in sample size was halted and its value reported. Figure 2 is a plot of this necessary sample size versus the difference in rate in the two sister clades descended from the root. This rate difference is expressed in terms of the expected species diversity in the clade, which is proportional to $e^{\text{rate}}$. These results indicate that differences in rate are important determinants of sample size. If one clade is five times larger than its sister clade, the sample size needed for 95% confidence in the identification of the root doubles to around 100, whereas if it is 20 times larger the needed sample size increases to nearly 400 species. This difference in species diversity is on the order of what can be detected by conventional tests for differences in diversification rate, such as Slowinski and Guyer's (1989) null model test (reviewed by Sanderson and Donoghue, 1996). Preliminary data on the phylogeny of a group coupled with such diversification tests may help provide guidance about the sample size needed to correctly identify the root node.

FIGURE 2. Plot of necessary taxon sample size in the case when two sister groups descended from the root node are diversifying at different rates. The ratio of the two rates is indicated along the horizontal axis (Ratio of Expected Species Diversity of Larger to Smaller Clade) and is measured in terms of the expected diversities of the two clades after a fixed interval of time. Results are based on computer simulation of a Yule model with different rate parameters in each clade. At a ratio of 1 (homogeneous branching), the sample size is the same as that predicted by the analytical results derived in the text (approximately 40 species).

Extinction can be included as a component in the diversification process in a fairly straightforward way. Slowinski and Guyer (1993) showed that random extinction does not alter the probability of the observed sister group diversities, $P(N_1, N_2)$, under a Yule model. If extinction is biased toward one or the other sister group descended from the root node, then the sample size under a random sampling scheme would have to be increased for much the same reason as outlined in the preceding paragraph.

RANDOM SAMPLING AND ITS ALTERNATIVE

A sample size of about 40 represents at best a lower bound on the sample size necessary for identifying the root node in a large clade. Nonhomogeneous diversification rates tend to increase that number. Although sampling of 40 taxa is not unreasonable, the prospect of sampling 2,000 species to obtain a confidence level of 99.9% (see Eq. 5) or even 500 species in cases in which nonhomogeneous diversification is suspected is still beyond the scale of typical phylogenetic investigations. Only some kind of nonrandom sampling can reduce the sample size. For example, systematic sampling (in the statistical sense) based on prior knowledge of relationships in the clade might help. Studies of higher level relationships of angiosperms do not commonly sample Asteraceae (a recent group) in proportion to its species diversity, otherwise about 1 in 12 taxa in such an analysis would belong to that family. Instead, such studies attempt to increase the representation of "basal" taxa. Basal taxa are separated from the root node by fewer nodes than are other taxa. This approach is fine as long as sampling of basal taxa increases the likelihood of sampling species descended from both sister groups of the root node. Sampling of basal taxa does not guarantee this evenhandedness, but it does tend to decrease the probability that a sample will draw most of its representatives from some particularly species-rich clade descended from one of the root node's sister groups. Alternatively, one could use higher taxa as the sampling units on the assumption that most of the shifts in diversification are accounted for in the diversity differences observed in those higher taxa. Thus, random sampling of 40 families of angiosperms (supposing we had circumscribed a set of monophyletic families) may be a more effective way to avoid bias than sampling of 40 species.

ACKNOWLEDGMENTS

This paper was prompted by discussions of taxon sampling held at the Green Plant Phylogeny Research Coordination Group workshop at the University of California-Berkeley (June 1995, organized by M. Buchheim, B. Mishler, and R. Chapman), especially comments by Jim Doyle on the importance of looking at "both sides of the root node." I thank two anonymous reviewers for helpful suggestions about the manuscript.

REFERENCES

CHASE, M. W., D. E. SOLTIS, R. G. OLMSTEAD, D. MORGAN, D. H. LES, B. D. MISHLER, M. R. DUVALL, R. A. PRICE, H. G. HILLS, Y.-L. QIU, K. A. KRON, J. H. RETTIG, E. CONTI, J. D. PALMER, J. R. MANHART, K. J. SYTSMA, H. J. MICHAELS, W. J. KRESS, K. G. KAROL, W. D. CLARK, M. HEDREN, B. S. GAUT, R. K. JANSEN, K.-J. KIM, C. F. WIMPEE, J. F. SMITH, G. R. FURNIER, S. H. STRAUSS, Q.-Y. XIANG, G. M. PLUNKETT, P. S. SOLTIS, S. M. SWENSEN, S. E. WILLIAMS, P. A. GADEK, C. J. QUINN, L. E. EGUIARTE, E. GOLENBERG, G. H. LEARN, JR., S. W. GRAHAM, S. C. H. BARRETT, S. DAYANANDAN, AND V. A. ALBERT. 1993. Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80:528-580.

CRONQUIST, A. 1981. An integrated system of classification of flowering plants. Columbia Univ. Press, New York.

FELSENSTEIN, J. 1978. Cases in which parsimony and compatibility will be positively misleading. Syst. Zool. 27:401-410.

HUELSENBECK, J. P., AND D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.

LECOINTRE, G., H. PHILIPPE, H. L. V. LE, AND H. L. GUYADER. 1993. Species sampling has a major impact on phylogenetic inference. Mol. Phylogenet. Evol. 2:205-224.

NEE, S., A. O. MOOERS, AND P. H. HARVEY. 1992. Tempo and mode of evolution revealed from molecular phylogenies. Proc. Natl. Acad. Sci. USA 89:8322-8326.

RAUP, D. M. 1985. Mathematical models of cladogenesis. Paleobiology 11:42-52.

RICE, K. A., M. J. DONOGHUE, AND R. G. OLMSTEAD. 1995. A reanalysis of the large rbcL dataset. Am. J. Bot. 82(suppl.):157-158. (Abstr.)

SANDERSON, M. J., AND M. J. DONOGHUE. 1994. Shifts in diversification rate with the origin of angiosperms. Science 264:1590-1593.

SANDERSON, M. J., AND M. J. DONOGHUE. 1996. Reconstructing shifts in diversification on phylogenetic trees. Trends Ecol. Evol. 11:15-20.

SLOWINSKI, J. B., AND C. GUYER. 1989. Testing the stochasticity of patterns of organismal diversity: An improved null model. Am. Nat. 134:907-921.

SLOWINSKI, J. B., AND C. GUYER. 1993. Testing whether certain traits have caused amplified diversification: An improved method based on a model of random speciation and extinction. Am. Nat. 142:1019-1024.

Received 25 July 1995; accepted 5 January 1996
Associate Editor: David Cannatella
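
The sample sizes quoted in the text for the balanced and Yule cases can be checked numerically. What follows is a minimal Python sketch and not part of the original analysis. It assumes that the bracketed term of Equation 1 has the form 1 - (N1/N)^k - (N2/N)^k, that is, that the k draws are treated as independent, which is the form that reduces to p = 1 - 2(1/2)^k for a balanced tree. All function names are hypothetical and chosen for illustration only.

```python
def p_both_groups(n1: int, n2: int, k: int) -> float:
    """Probability that k random draws include both sister groups.

    Assumption: draws are independent, so the chance that all k fall in
    one group of size n_i is (n_i / N)^k.
    """
    n = n1 + n2
    return 1.0 - (n1 / n) ** k - (n2 / n) ** k


def p_root_balanced(k: int) -> float:
    """Balanced tree: both sister groups always hold half of the species."""
    return 1.0 - 2.0 * 0.5 ** k


def p_root_yule(n: int, k: int) -> float:
    """Yule model: every split (n1, n - n1) has probability 1 / (n - 1)."""
    total = sum(p_both_groups(n1, n - n1, k) for n1 in range(1, n))
    return total / (n - 1)


def p_root_large_clade(k: int) -> float:
    """Large-clade limit of the Yule result, 1 - 2 / (k + 1)."""
    return 1.0 - 2.0 / (k + 1)


def min_sample_size(prob_fn, target: float, max_k: int = 100_000) -> int:
    """Smallest k >= 2 for which prob_fn(k) reaches the target probability."""
    for k in range(2, max_k + 1):
        if prob_fn(k) >= target:
            return k
    raise ValueError("target probability not reached within max_k draws")


if __name__ == "__main__":
    print("balanced, 95%:", min_sample_size(p_root_balanced, 0.95))
    print("balanced, 99%:", min_sample_size(p_root_balanced, 0.99))
    print("Yule, N = 300, 95%:",
          min_sample_size(lambda k: p_root_yule(300, k), 0.95))
    print("large-clade limit at k = 40:", round(p_root_large_clade(40), 4))
```

Under these assumptions the balanced-tree thresholds come out at 6 and 8 taxa, as in the text, and the exact Yule sum for a 300-species clade stays within about 0.01 of the large-clade limit at k = 40. The exact sum is evaluated directly here rather than through the integral approximation, which is cheap for clades of a few hundred species.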