HOW MANY TAXA MUST BE SAMPLED TO IDENTIFY THE ROOT NODE OF A LARGE CLADE?

Syst. Biol. 45(2):168-173, 1996

                                               MICHAEL J. SANDERSON
                   Section of Evolution and Ecology, University of California, Davis, California 95616, USA;
                                              E-mail: mjsanderson@ucdavis.edu

        Abstract.—The importance of choice of taxa in phylogenetic analysis has been explored mainly
        with reference to its effect on the accuracy of tree estimation. Taxon sampling can also introduce
        other kinds of errors. Even if the sampled topology agrees with the true topology, it may not
        include the true root node of a clade, a node that is of interest for many reasons. Using a simple
        Yule model for the diversification process, the probability of identifying this node is derived under
        random sampling of taxa. For large clades, the minimum sample size needed to be 95% confident
        of identifying the root node is approximately 40 and is independent of the size of the clade. If
        rates of diversification differ in the two sister groups descended from the root node, the minimum
        sample size needed increases markedly. If these two sister groups are so different in diversity
        that a Yule model would be rejected by conventional diversification tests, then the necessary
        sample size is an order of magnitude greater than when diversification is homogeneous.
        [Diversification; phylogeny; branching; speciation; Yule model; taxon sampling.]

Downloaded from http://sysbio.oxfordjournals.org/ by guest on November 8, 2015

   The recent publication of a very large phylogenetic analysis of seed plants based on
chloroplast rbcL data (Chase et al., 1993) has raised a number of interesting questions about
phylogenetic analyses of large clades. Among these questions are computational issues related
to reconstructing optimal trees using heuristic algorithms (Rice et al., 1995) and the choice
of taxon sampling scheme for groups that are either large or poorly understood
phylogenetically. The rbcL analysis included nearly 500 sequences, a remarkable and possibly
record-setting number but one that samples barely 0.2% of seed plant diversity. Other
similarly large clades probably will remain sparsely sampled by systematists for the
foreseeable future. How much sampling is enough in groups that are exceptionally species rich?
   Naturally, we must first agree on "enough for what?" Most recent work on taxon sampling
has focused on whether or not the sampled taxa are sufficient to reconstruct the true sample
tree (e.g., Lecointre et al., 1993). The true sample tree is the tree of sampled taxa
remaining after unsampled taxa are pruned from the true tree of all the taxa. The issue is
accuracy of estimation of the true tree by this sample tree. This issue is important but
difficult because the conditions under which phylogenetic algorithms give the correct answer
even when all taxa have been sampled are even now understood only in a few special cases,
usually for small numbers of taxa (e.g., Huelsenbeck and Hillis, 1993). When taxa are
omitted, as they commonly are in higher level analyses, the effect of omission is less clear.
For example, there has been considerable support for the idea that "long" branches should be
broken up by sampling additional taxa to prevent an unwanted trip into the "Felsenstein zone"
of inconsistency (Felsenstein, 1978). J. Kim (pers. comm.), however, has recently shown
conditions under which more intense taxon sampling actually will increase the likelihood of
inconsistency.
   However, there are also phylogenetic issues that depend on taxon sampling but are more or
less decoupled from the accuracy of the estimated tree. One of these issues is the
identification of the root node, or most recent common ancestor, of a clade. Even if one knew
that the phylogeny for some sample of taxa were correct, one might not be sure that the root
node of that sample tree was the same as the root node of the tree consisting of all
representatives of the sampled clade (Fig. 1). Sometimes the root node would be nested well within
the phylogeny of the entire clade. Yet the identification of the true root node is essential.
Outgroup analysis rests on the assumption that the "sample" root node represents the real
root node. Otherwise, the reconstruction of ancestral states at the sample root node might be
rather different from the states that would be reconstructed at the real root node. One
method of dealing with large phylogenies is to synthesize root states for large clades that
can then be used as terminal taxa. The synthesis of root states for a clade may be biased if
the node taken as the root is actually much more apical in the tree than the real root node
of the entire clade. Although the [...] of itself has rarely been discussed. More generally,
the identification of the root node amounts to a restriction on the possible phylogenetic
relationships of taxa not yet sampled; they will be descendants of that root node rather than
sister taxa or more distant relatives. Identification of the root node therefore represents
real progress toward understanding a large clade.
   In this paper, I derive a simple formula for the probability that the root node of a
sample of taxa is the same node as the root node of some larger set of taxa from which the
sample was drawn. The formula makes it possible to answer questions such as "what is the
probability that the 500 species sampled in the angiosperm rbcL analysis have the same root
node as do all 240,000 angiosperms?" The calculation requires two assumptions: (1) a clade
has been circumscribed a priori, perhaps on the basis of some shared set of morphological
novelties, and (2) diversification (speciation and extinction) occurs according to some
model that can be specified. Three such models are examined here; two are extreme and
unrealistic, and the third, which is bracketed by the first two, is considerably more
reasonable.

   FIGURE 1. Illustration of the effect of taxon sampling on the identification of the true
root node (RN) of a clade. The sample root node (SRN) is identified when just the three taxa,
ω1, ω2, and ω3, are sampled from the entire clade. RN would be identified if and only if at
least one taxon from each of the two clades descended from RN were sampled.

                                     DERIVATION

   Assume that a sample (ω) of k species is drawn from a collection (C) of N species that
form a monophyletic group. Assume that speciation occurs by bifurcation (or equivalently that
any apparent polytomies comprise on closer inspection merely remarkably short but
non-zero-length branches in a truly bifurcating tree). Let the true phylogeny of C be Φ(C),
and let the true phylogeny of ω be Φ(ω), which is obtained merely by pruning away the
unsampled taxa from Φ(C). We are not concerned with the phylogenies that are reconstructed by
some tree-building algorithm, only with the true tree. Denote the root node of a phylogeny by
R(Φ). The condition R[Φ(ω)] = R[Φ(C)] will be met if and only if at least one species from
each of the two sister groups descended from the root node of C is included in the sample, ω.
The task is to design a sampling scheme that will ensure this inclusion. However, unless
considerable knowledge about relationships in the vicinity of the root node is available, it
may be necessary to rely on simple random sampling of species from C. Given random sampling
of taxa, it is enough to know the probability, P(N1, N2), that the two sister groups will
have N1 and N2 taxa, subject to the constraint that N1 + N2 = N. Then, for each possible
observation of N1 and N2, one need only calculate the probability that a random sample will
contain at least one species from each of these sister groups. Together this is

$$
p = \sum_{N_1=1}^{N-1} P(N_1, N_2)
    \left[ 1 - \frac{\binom{N_1}{k} + \binom{N_2}{k}}{\binom{N}{k}} \right]
\tag{1}
$$

The term in brackets is the probability that a random sample of k balls drawn from an urn
containing N1 green balls and N2 red balls contains at least one green and one red ball. The
probability before it depends on the particular model of diversification chosen (i.e., how
likely is it that the urn will contain the colors observed), here the probability of the
observed diversities (based on some model of diversification). Finally, the summation is
included to consider the mutually exclusive events of each of the different possible
observations on diversities. It is assumed that N > 1.
   Now consider three different patterns of diversification, ignoring extinction for the
moment. First, suppose that speciation is completely homogeneous and clocklike, such that the
tree is balanced and the two sister group diversities are always the same at any time.
Equation 1 reduces in this trivial case to

$$
p = 1 - 2\left(\frac{1}{2}\right)^{k}.
\tag{2}
$$

This equation is independent of the size of the original clade. To obtain a 95% probability
of reconstructing the true root node, k must be ≥6 taxa, and for 99%, k ≥ 8 taxa. This number
is perhaps startlingly low.
   However, an equally extreme model of diversification generates an entirely unbalanced
(comblike or pectinate) tree in which each node is the ancestor of a single species in one
sister group and all the remaining species not already accounted for in the other sister
group. In that case the required probability is

$$
p = 1 - \frac{\binom{N-1}{k}}{\binom{N}{k}} = \frac{k}{N},
\tag{3}
$$

a number that is generally small unless k is a large fraction of N. For the angiosperm case,
we would need a sample of 237,500 species to be sure that we had identified the root node at
the 95% level. This result is obviously not as encouraging as was the last result.
   Clearly, these widely divergent results confirm the worst fears of those that object to
the use of models in phylogenetic inference. However, no reasonable model of diversification
could produce either of these patterns of sister group diversity. Between these two extreme
models lies a class of presumably more realistic diversification models. The Yule or
pure-birth model, which uses a Poisson process for speciation in each lineage, has been
widely used in studies of diversification (Raup, 1985; Nee et al., 1992). Its properties are
well understood, and it has provided an adequate fit to many real data sets in applications
using both fossil data and data on standing diversity alone (reviewed by Sanderson and
Donoghue, 1996). In the present context, it has one very desirable property that leads to a
fairly simple reduction of Equation 1. Under a Yule model, every division of the N taxa into
two sister groups occurs with equal probability of 1/(N - 1). Thus, an observation of 1 and
99,999 species is as probable as 50,000 and 50,000. This seemingly counterintuitive result
(Slowinski and Guyer, 1989) is actually quite reasonable once it is understood that any
particular realization of the stochastic process is just as likely or unlikely. Substituting
this as the required probability in Equation 1 gives

$$
\begin{aligned}
p &= \sum_{N_1=1}^{N-1} \frac{1}{N-1}
     \left[ 1 - \frac{\binom{N_1}{k} + \binom{N_2}{k}}{\binom{N}{k}} \right] \\
  &= 1 - \frac{1}{N-1} \sum_{N_1=1}^{N-1}
     \frac{\binom{N_1}{k} + \binom{N_2}{k}}{\binom{N}{k}} \\
  &= 1 - \frac{2}{N-1} \sum_{i=1}^{N-1} \frac{\binom{i}{k}}{\binom{N}{k}}.
\end{aligned}
\tag{4}
$$

The last line is an exact result but can be tedious to calculate without a symbolic math
program, even for fairly small N. For moderate to large values of N, the sum can be
approximated by an integral in the following way:

$$
\sum_{i=1}^{N-1} \frac{\binom{i}{k}}{\binom{N}{k}}
\approx \sum_{i=1}^{N-1} \left(\frac{i}{N}\right)^{k}
\approx \int_{0}^{N} \left(\frac{n}{N}\right)^{k} dn - \frac{1}{2}
= \frac{N}{k+1} - \frac{1}{2}
\tag{5}
$$

and then

$$
p \approx 1 - \frac{2}{N-1} \left( \frac{N}{k+1} - \frac{1}{2} \right).
\tag{6}
$$

This approximation is quite good over a broad range of N and k unless N is on the order of
≤10 species. For N values greater than about 100, which includes all cases that might
reasonably be considered "large" clades, Equation 6 simplifies even further to

$$
p \approx 1 - \frac{2}{k+1}.
\tag{7}
$$

Oddly enough, Equation 7 is independent of the size of the underlying clade, so long as it is
large enough for the approximation to hold (if not, use Eq. 6 or Eq. 4). In the angiosperm
case, a sample of about 40 taxa is sufficient to guarantee that the root node of all
angiosperms has been identified at a confidence level of 95%. The same results obtain if the
clade sampled is considerably smaller. One must also sample about 40 species in the legume
genus Oxytropis, which has only 300 species. This assault on intuition can be explained
because the distribution of relative diversities in the two sister groups descended from the
root node is itself largely independent of how many species have evolved at any point in
time.

                          OTHER MODES OF DIVERSIFICATION

   It is unlikely that any large clade diversifies homogeneously during its entire history.
Indeed, for angiosperms there is evidence of a shift in diversification rate early in its
history, one portion of the clade diversifying much more rapidly than its sister group
(Sanderson and Donoghue, 1994). Angiosperms include some relatively recent and highly
species-rich clades, such as the family Asteraceae (composites), which dates from the
Oligocene and contains upwards of 21,000 species (Cronquist, 1981). The calculations above
will be affected by diversification shifts primarily if these shifts are preferentially
associated with one of the two sister groups descended from the root node. This might happen
if there were a shift in rate in one lineage immediately following the first split, if there
were one or more shifts in rate (biased toward increases or decreases) anywhere in one of the
sister groups, or even if there were more instances of such biased shifts in one group versus
the other. Any of these biases will tend to generate trees that are more asymmetric than
expected under the homogeneous model. In turn, larger sample sizes will be needed to ensure
that at least one species from the smaller sister group is included in any sample.
   Computer simulations were run to examine the effect of nonhomogeneous diversification on
the sample size necessary for identification of the root node. The two clades descended from
the root were allowed to diversify, each according to a Yule model with a different rate
parameter. Then the observed species diversities in the two clades generated from this
process were used to calculate the probability of correctly identifying the root node (an
exact calculation given by the bracketed term in Eq. 1) for progressively larger samples of
taxa. When the fraction among 1,000 simulations indicated 95% confidence for a
particular sample size, the increase in sample size was halted and its value reported. Figure
2 is a plot of this necessary sample size versus the difference in rate in the two sister
clades descended from the root. This rate difference is expressed in terms of the expected
species diversity in the clade, which is proportional to e^rate. These results indicate that
differences in rate are important determinants of sample size. If one clade is five times
larger than its sister clade, the sample size needed for 95% confidence in the identification
of the root doubles to around 100, whereas if it is 20 times larger the needed sample size
increases to nearly 400 species. This difference in species diversity is on the order of what
can be detected by conventional tests for differences in diversification rate, such as
Slowinski and Guyer's (1989) null model test (reviewed by Sanderson and Donoghue, 1996).
Preliminary data on the phylogeny of a group coupled with such diversification tests may help
provide guidance about the sample size needed to correctly identify the root node.

   FIGURE 2. Plot of necessary taxon sample size in the case when two sister groups descended
from the root node are diversifying at different rates. The ratio of the two rates is
indicated along the horizontal axis and is measured in terms of the expected diversities of
the two clades after a fixed interval of time. Results are based on computer simulation of a
Yule model with different rate parameters in each clade. At a ratio of 1 (homogeneous
branching), the sample size is the same as that predicted by the analytical results derived
in the text (approximately 40 species). [Horizontal axis label: Ratio of Expected Species
Diversity of Larger to Smaller Clade.]

   Extinction can be included as a component in the diversification process in a fairly
straightforward way. Slowinski and Guyer (1993) showed that random extinction does not alter
the probability of the observed sister group diversities, P(N1, N2), under a Yule model. If
extinction is biased toward one or the other sister group descended from the root node, then
the sample size under a random sampling scheme would have to be increased for much the same
reason as outlined in the preceding paragraph.

                        RANDOM SAMPLING AND ITS ALTERNATIVE

   A sample size of about 40 represents at best a lower bound on the sample size necessary
for identifying the root node in a large clade. Nonhomogeneous diversification rates tend to
increase that number. Although sampling of 40 taxa is not unreasonable, the prospect of
sampling 2,000 species to obtain a confidence level of 99.9% (see Eq. 5) or even 500 species
in cases in which nonhomogeneous diversification is suspected is still beyond the scale of
typical phylogenetic investigations. Only some kind of nonrandom sampling can reduce the
sample size. For example, systematic sampling (in the statistical sense) based on prior
knowledge of relationships in the clade might help. Studies of higher level relationships of
angiosperms do not commonly sample Asteraceae (a recent group) in proportion to its species
diversity, otherwise about 1 in 12 taxa in such an analysis would belong to that family.
Instead, such studies attempt to increase the representation of "basal" taxa. Basal taxa are
separated from the root node by fewer nodes than are other taxa. This approach is fine as
long as sampling of basal taxa increases the likelihood of sampling species descended from
both sister groups of the root node. Sampling of basal taxa does not guarantee this
evenhandedness, but it does tend to decrease the probability that a sample will draw most of
its representatives from some particularly species-rich clade descended from one of the root
node's sister groups. Alternatively, one could use higher taxa as the sampling units on the
assumption that most of the shifts in diversification are accounted for in the diversity
differences observed in those higher taxa. Thus, random sampling of 40 families of
angiosperms (supposing we had circumscribed a set of monophyletic families) may be a more
effective way to avoid bias than sampling of 40 species.

                                  ACKNOWLEDGMENTS

   This paper was prompted by discussions of taxon sampling held at the Green Plant Phylogeny
Research Coordination Group workshop at the University of California-Berkeley (June 1995,
organized by M. Buchheim, B. Mishler, and R. Chapman), especially comments by Jim Doyle on
the importance of looking at "both sides of the root node." I thank two anonymous reviewers
for helpful suggestions about the manuscript.

                                     REFERENCES

CHASE, M. W., D. E. SOLTIS, R. G. OLMSTEAD, D. MORGAN, D. H. LES, B. D. MISHLER, M. R. DUVALL,
  R. A. PRICE, H. G. HILLS, Y.-L. QIU, K. A. KRON, J. H. RETTIG, E. CONTI, J. D. PALMER,
  J. R. MANHART, K. J. SYTSMA, H. J. MICHAELS, W. J. KRESS, K. G. KAROL, W. D. CLARK,
  M. HEDREN, B. S. GAUT, R. K. JANSEN, K.-J. KIM, C. F. WIMPEE, J. F. SMITH, G. R. FURNIER,
  S. H. STRAUSS, Q.-Y. XIANG, G. M. PLUNKETT, P. S. SOLTIS, S. M. SWENSEN, S. E. WILLIAMS,
  P. A. GADEK, C. J. QUINN, L. E. EGUIARTE, E. GOLENBERG, G. H. LEARN, JR., S. W. GRAHAM,
  S. C. H. BARRETT, S. DAYANANDAN, AND V. A. ALBERT. 1993. Phylogenetics of seed plants: An
  analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80:528-580.
CRONQUIST, A. 1981. An integrated system of classification of flowering plants. Columbia
  Univ. Press, New York.
FELSENSTEIN, J. 1978. Cases in which parsimony and compatibility will be positively
  misleading. Syst. Zool. 27:401-410.
HUELSENBECK, J. P., AND D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon
  case. Syst. Biol. 42:247-264.
LECOINTRE, G., H. PHILIPPE, H. L. V. LE, AND H. L. GUYADER. 1993. Species sampling has a major
  impact on phylogenetic inference. Mol. Phylogenet. Evol. 2:205-224.
NEE, S., A. O. MOOERS, AND P. H. HARVEY. 1992. Tempo and mode of evolution revealed from
  molecular phylogenies. Proc. Natl. Acad. Sci. USA 89:8322-8326.
RAUP, D. M. 1985. Mathematical models of cladogenesis. Paleobiology 11:42-52.
RICE, K. A., M. J. DONOGHUE, AND R. G. OLMSTEAD. 1995. A reanalysis of the large rbcL dataset.
  Am. J. Bot. 82(suppl.):157-158. (Abstr.)
SANDERSON, M. J., AND M. J. DONOGHUE. 1994. Shifts in diversification rate with the origin of
  angiosperms. Science 264:1590-1593.
SANDERSON, M. J., AND M. J. DONOGHUE. 1996. Reconstructing shifts in diversification on
  phylogenetic trees. Trends Ecol. Evol. 11:15-20.
SLOWINSKI, J. B., AND C. GUYER. 1989. Testing the stochasticity of patterns of organismal
  diversity: An improved null model. Am. Nat. 134:907-921.
SLOWINSKI, J. B., AND C. GUYER. 1993. Testing whether certain traits have caused amplified
  diversification: An improved method based on a model of random speciation and extinction.
  Am. Nat. 142:1019-1024.

Received 25 July 1995; accepted 5 January 1996
Associate Editor: David Cannatella
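
The simulation described under OTHER MODES OF DIVERSIFICATION and summarized in Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used for the paper. Clade sizes are drawn from the geometric distribution that a Yule process started from one lineage produces after a fixed time; the expected diversity of the smaller clade (`mean_small`) and the diversity `ratio` are hypothetical inputs; and "95% confidence" is read here as the bracketed term of Equation 1, averaged over the simulated clade pairs, reaching 0.95. All function names are illustrative.

```python
import math
import random


def yule_clade_size(mean_size: float, rng: random.Random) -> int:
    """Draw a clade size from a geometric distribution on 1, 2, ... with the
    given mean (the size distribution of a Yule clade at a fixed time)."""
    p = 1.0 / mean_size
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return 1 + int(math.log(u) / math.log1p(-p))


def log_falling(n: int, k: int) -> float:
    """Natural log of n! / (n - k)!."""
    return math.lgamma(n + 1) - math.lgamma(n - k + 1)


def p_both_sampled(n1: int, n2: int, k: int) -> float:
    """Bracketed term of Eq. 1: probability that k species sampled at random
    from n1 + n2 include at least one species from each sister group."""
    n = n1 + n2
    if k >= n:
        return 1.0  # the whole clade is sampled and both groups are non-empty
    denom = log_falling(n, k)
    miss = 0.0
    for m in (n1, n2):
        if m >= k:  # only then can all k sampled species come from one group
            miss += math.exp(log_falling(m, k) - denom)
    return 1.0 - miss


def needed_sample_size(ratio: float, mean_small: float = 2000.0,
                       n_sims: int = 1000, conf: float = 0.95,
                       seed: int = 1) -> int:
    """Smallest k whose average probability of spanning the root, over n_sims
    simulated pairs of sister clades, is at least conf."""
    rng = random.Random(seed)
    clades = [(yule_clade_size(mean_small, rng),
               yule_clade_size(mean_small * ratio, rng))
              for _ in range(n_sims)]

    def mean_p(k: int) -> float:
        return sum(p_both_sampled(a, b, k) for a, b in clades) / n_sims

    # mean_p is non-decreasing in k, and k = 1 can never span the root,
    # so a doubling search followed by bisection finds the threshold.
    lo, hi = 1, 2
    while mean_p(hi) < conf:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mean_p(mid) >= conf:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for ratio in (1.0, 5.0, 20.0):
        print(ratio, needed_sample_size(ratio))
```

With a larger number of simulated clade pairs, the values returned for ratios of 1, 5, and 20 should fall near those reported in the text (about 40, around 100, and nearly 400); with only 1,000 pairs they fluctuate by several species from one seed to the next.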