A Method for Estimating the Relative Importance of Characters in Cladistic Analyses


Society of Systematic Biologists

Author(s): David DeGusta
Source: Systematic Biology, Vol. 53, No. 4 (Aug., 2004), pp. 529-532
Published by: Oxford University Press for the Society of Systematic Biologists
Stable URL: http://www.jstor.org/stable/4135422


Syst. Biol. 53(4):529-532, 2004
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150490470320

DAVID DEGUSTA

Department of Anthropological Sciences, Building 360, Stanford University, Stanford, CA 94305-2117, USA; E-mail: [email protected]

A method of character importance ranking (CIR) is proposed here as a means for estimating the relative importance of characters in cladistic analyses, especially those based on morphological features. CIR uses the weighting variable to incrementally remove one character at a time from the analysis, and then evaluates the impact of the removal on the shape of the cladogram. The greater the impact, the more important the character. The CIR method for determining which characters drive the shape of a particular cladogram has several applications. It identifies the characters with the strongest (though not necessarily most accurate) signal in a cladistic analysis; it permits the informed prioritization of characters for further investigation via genetic, developmental, and functional approaches; and it highlights characters whose definition, scoring, independence, and variation should be reviewed with particular care. The application of CIR reveals that at least some cladograms depend entirely on a single character. [Character analysis; cladogram shape; importance; parsimony.]

The cladograms generated by a particular cladistic analysis obviously depend, in part, on the characters used in the analysis. It is unlikely, though, that all characters have equal influence on the shape of the resulting cladogram, even in an unweighted analysis. A variety of measures exist that examine the quality of individual characters relative to a given cladogram (e.g., their congruence, consistency, and level of homoplasy). However, there appears to be a paucity of methods that examine the influence of individual characters on the shape of a given cladogram, apart from their quality. For example, the removal of one particular character from an analysis may not change the most parsimonious cladogram at all, and thus this character has little influence on cladogram shape. In contrast, the removal of another character from the analysis may result in changes at several nodes, rendering it quite important in this sense (separate from quality measures). A new method, termed character importance ranking (CIR), is proposed here for quantifying the relative importance of the characters used in a cladistic analysis.

CIR is useful because it measures the degree to which a cladogram depends on individual characters and identifies those characters, thus facilitating further study of the most influential characters. Problems with important characters (e.g., functional correlations or nonheritability) will, by definition, have a greater confounding effect on a cladogram than will such problems with relatively unimportant characters. Although the results of additional study of the important characters may eventually lead to modifications in character lists and definitions, CIR is not proposed as a means for selecting or weighting characters, nor for discriminating between good characters and bad characters. CIR simply identifies those characters with the greatest impact on the shape of a given cladogram: it evaluates the strength of a character's signal, but not the accuracy of that signal. As such, it is conceptually distinct from indices of character quality (e.g., the consistency index).

Although CIR was developed for morphological parsimony analyses, and is presented here mainly in that regard, it is theoretically applicable to all types of data (though the computation of CIR is probably impractical for large DNA sequence databases). With further work, it could be productively extended for application to likelihood analyses.

METHODS

The CIR method calculates an importance value for each character in a given cladistic analysis. CIR does so by using the weighting variable to incrementally remove a single character at a time from an analysis (e.g., the character weight is set at 0.8, then 0.6, 0.4, 0.2, and finally 0). For each new weight of the character, the most parsimonious tree is generated. Then, using the symmetric difference metric (which measures the number of splits present in one tree but not the other; Steel and Penny, 1993), the topographic distance is measured between the original most parsimonious tree and each of the trees produced by the incremental downweighting. The mean of these distances is taken to be the importance value of that character, as it represents the effects on tree shape of the downweighting and incremental removal of that character from the analysis. A character that can be entirely removed without altering the cladogram will have an importance value of 0. If setting the weight of a character to slightly less than 1 induces major changes in the cladogram, that character will have a very high importance value. For the purposes of this paper, then, important characters are defined as those with the greatest impact on the shape of the resulting cladogram (i.e., those with the largest importance values as calculated by CIR). It is important to keep in mind that CIR only considers the impact of a character's downweighting on the shape of the cladogram, which is here termed importance as a shorthand.

The importance value of a character represents the mean topographic distance between the original cladogram and the cladograms produced by variably weighting that character. However, absolute importance values are generally not comparable across analyses, due to differences in the parameters of the analyses (i.e., number of most parsimonious cladograms) as well as the dependence of most distance measures on the number of operational taxonomic units (Steel and Penny, 1993). In other words, a character with an importance value of 2 in one analysis is not necessarily twice as important as a character in a different analysis with a value of 1, though this would be the case if both characters were from the same analysis. The rank ordering of characters by importance value is, of course, always comparable across analyses, and this should be the focus of interpretation.

Because the goal of CIR is to establish the relative importance of the characters in a given analysis, the existence of multiple best (equally parsimonious) cladograms generally does not preclude its application, though more than about 10 equally parsimonious trees makes for tedious comparisons. Similarly, whatever choices of settings (i.e., search algorithm, assumptions, etc.) were made in a given analysis can be maintained in the process of character importance ranking.
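The calculation of a single importance value can be illustrated with a minimal Python sketch. It is not the PAUP* implementation: it assumes that each tree is represented simply as a set of groupings (each grouping a frozenset of taxon names), and the four-taxon trees, the helper names, and the sequence of trees recovered at each weight are all invented for illustration.

```python
from statistics import mean


def symmetric_difference(tree_a, tree_b):
    """Count the groupings that occur in exactly one of the two trees."""
    return len(tree_a ^ tree_b)


def importance_value(original_tree, downweighted_trees):
    """Mean distance between the original tree and each downweighted tree."""
    return mean([symmetric_difference(original_tree, t) for t in downweighted_trees])


def clade(*taxa):
    """Hypothetical helper: one grouping of taxa."""
    return frozenset(taxa)


# Invented example: the original tree groups A with B,
# whereas the alternative tree groups B with C.
original = {clade("A", "B"), clade("A", "B", "C")}
alternative = {clade("B", "C"), clade("A", "B", "C")}

# Invented trees recovered at weights 0.8, 0.6, 0.4, 0.2, and 0.
trees_for_character = [original, original, alternative, alternative, alternative]

# Distances are 0, 0, 2, 2, 2, so the mean is 1.2.
print(importance_value(original, trees_for_character))
```

A character whose downweighting never changes the tree would receive a value of 0 under this sketch, in keeping with the definition above.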

The incremental downweighting of an individual character in CIR allows for a more refined estimate of the character's importance than the simple inclusion/exclusion of the character. For example, imagine that the complete removal of character 1 from an analysis results in a particular change X to the resulting most parsimonious tree, and that the complete removal of character 2 from the analysis also produces the same change X. In terms of inclusion/exclusion, then, characters 1 and 2 would have identical importance values. But perhaps just downweighting character 1 to slightly less than its original weight of 1.0 (say a weight of 0.8) also produces change X to the tree, whereas a similar downweighting of character 2 produces no change. In terms of importance, as defined here, character 1 is more important than character 2, a difference detectable only by incremental downweighting rather than simple inclusion versus exclusion.

For some cladistic analyses, CIR will demonstrate that there are no individual important characters (i.e., all characters have equal, and zero, importance values). Although this result obviously precludes the prioritization of characters for further investigation, it does document that the cladogram in question is independent of any single character, which is useful information. A more general limitation is that CIR does not identify characters that may only be important in combination with others. In other words, it is theoretically possible that two characters might fail to alter the best cladogram when removed individually, but their removal in tandem may have a major effect on the cladogram.

Applying CIR to a cladistic analysis requires the data set for the analysis and the particular settings used. The CIR method is implemented as two Nexus files for PAUP* 4.0 (Swofford, 1998). By using the CIR PAUP files in conjunction with a spreadsheet program, character importance values can be obtained almost automatically. The generation of importance values typically takes between 30 minutes and 2 hours, depending on data set size and processor speed. The general algorithm for CIR is given in the Appendix. The CIR PAUP* 4.0 files, along with detailed instructions and the data sets analyzed here, can be downloaded from http://www.stanford.edu/~degusta.

Subsequent to the development of CIR, a related method was located in the literature: the sequential character removal (SCR) method (Davis, 1993; Davis et al., 1993). The purpose of SCR, however, is completely different from that of CIR: SCR provides an index of stability for individual clades. In SCR, separate cladistic analyses are conducted of all possible data sets derived by the removal of individual characters and character combinations of successively increasing number. The resulting clade stability index (CSI) is the ratio of the minimum number of characters whose removal causes the collapse of that clade to the total number of informative characters (Davis, 1993).

SCR is not aimed at identifying important characters. It also employs only the complete removal of multiple characters, rather than CIR's incremental downweighting of single characters. SCR is computationally intensive, rendering the full implementation of the method impractical for even small data sets (Davis, 1993), whereas CIR is computationally practical for all save the very largest data sets.
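The difference in computational burden can be made concrete by counting the parsimony searches each approach implies. The following Python sketch rests on two simplifying assumptions that are not taken from Davis (1993): that CIR uses five weight steps per character (as in the example weights 0.8 to 0), and that SCR examines every combination of up to a chosen number of removed characters without stopping early. The function names are hypothetical.

```python
from math import comb


def cir_analyses(n_characters, n_weight_steps=5):
    """One search per character per weight step."""
    return n_characters * n_weight_steps


def scr_analyses(n_characters, max_removed):
    """One search per combination of 1..max_removed removed characters."""
    return sum(comb(n_characters, k) for k in range(1, max_removed + 1))


# A 21-character data set, the size of the Graham et al. (1991) matrix.
print(cir_analyses(21))      # 105
print(scr_analyses(21, 3))   # 21 + 210 + 1330 = 1561
```

Under these assumptions the CIR count grows linearly with the number of characters, whereas the SCR count grows combinatorially with the number of characters removed together.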

MATERIALS

CIR is applied here to three cladistic analyses to illustrate the utility of this method: Graham et al.'s (1991) analysis of charophycean green algae (9 OTUs, 21 morphological characters), Chamberlain and Wood's (1987) analysis of fossil hominids (9 OTUs, 90 morphological characters), and Anderberg's (1986) analysis of Pegolettia plants (9 OTUs, 19 morphological characters). The details of the cladistic analyses used, including the data sets, can be found in the original publications and will not be repeated here. These three analyses are representative of CIR results based on the circa 20 cladistic analyses to which CIR has been applied so far.

RESULTS

Applying CIR to the cladistic analysis of charophycean green algae by Graham et al. (1991) demonstrates that 4 characters are more important (values of 0.3 and 0.4) than the remaining 17 characters (values of 0), as detailed in Table 1. Three of these four important characters are characteristics of the zygote: eggs retained (12), zygote retained (13), and substantial enlargement of the zygote (14). Graham et al. (1991) note that these three characters might be collapsed into a single multistate ordered character, but justify their division on the grounds of clarity and caution. The results of CIR demonstrate that the Graham et al. (1991) cladogram depends on this decision, whereas this is not the case for similar decisions about other characters in their analysis. Indeed, excluding two of these characters (12 and 13) results in eight equally parsimonious trees, many with a topology notably different from that of the one original most parsimonious cladogram. Although CIR does not bear on the question of whether these characters should be collapsed or not, it has spotlighted the critical importance of this decision.

TABLE 1. Importance scores, tree steps, and consistency index (CI) for characters in Graham et al. (1991).

    Character no.   CIR score   Tree steps   CI
    6               0.43        1            1
    13              0.43        1            1
    12              0.33        1            1
    14              0.33        1            1
    1               0           1            1
    2               0           1            0
    3               0           2            1
    4               0           1            1
    5               0           3            0.33
    7               0           1            1
    8               0           1            1
    9               0           1            1
    10              0           1            1
    11              0           1            1
    15              0           1            1
    16              0           1            1
    17              0           3            0.33
    18              0           1            1
    19              0           1            1
    20              0           1            1
    21              0           1            1

Applying CIR to the cladistic analysis of Pegolettia by Anderberg (1986) revealed five characters as being more important (values of 1.5 to 3.7) than the remaining 14 characters (values of 0), as shown in Table 2. Of these five characters (7 to 11), two relate to the presence or absence of crystal clumps (either on anthers or style-branches), suggesting that this feature is of substantial phylogenetic importance for this group. As such, further investigation of the comparative morphology, ontogeny, and formation of crystal clumps may directly impact estimations of Pegolettia phylogeny.

TABLE 2. Importance scores, tree steps, and consistency index (CI) for characters in Anderberg (1986).

    Character no.   CIR score   Tree steps   CI
    11              3.67        1            1
    7               3.63        1            1
    9               3.29        2            0.5
    10              1.9         2            0.5
    8               1.5         2            0.5
    1               0           Constant     Constant
    2               0           Constant     Constant
    3               0           Constant     Constant
    4               0           1            1
    5               0           1            1
    6               0           1            1
    12              0           3            0.33
    13              0           1            1
    14              0           1            1
    15              0           1            1
    16              0           1            1
    17              0           1            1
    18              0           1            1
    19              0           1            1

Applying CIR to the cladistic analysis of fossil hominids by Chamberlain and Wood (1987) demonstrates that six characters are more important (values of 0.7 to 0.9) than the remaining 84 characters (values of 0), as outlined in Table 3. The two most important characters are mental foramen height and the height of the foramen supraspinosum, for which Chamberlain and Wood (1987) used the measurements reported in Chamberlain (1987). Given the dependency of the cladogram on these characters, the raw measurements for these two variables were reviewed. This examination revealed that Chamberlain (1987) lists measurements of supraspinosum height for a number of fossils that do not preserve any trace whatsoever of this feature (e.g., AL 128-23, AL 288-1i, CKT Jaw K) and includes highly problematic measurements of mental foramen height as well. If either of those two characters, or both, are removed from the cladistic analysis, the resulting cladograms are all markedly different from that of the original analysis (as predicted by CIR). These clear errors in scoring could have been located without CIR, but the use of CIR identified the characters for which a careful review was most crucial.

TABLE 3. Importance scores, tree steps, and consistency index (CI) for selected characters from Chamberlain and Wood (1987). All important characters are shown paired, where possible, with similar unimportant characters.

    Character no.   CIR score   Tree steps   CI
    52              0.86        4            0.5
    2               0           4            0.5
    53              0.86        3            0.667
    6               0           3            0.667
    57              0.67        2            0.5
    8               0           2            0.5
    61              0.67        4            0.75
    22              0           4            0.75
    62              0.86        3            1
    81              0.67        2            1
    19              0           2            1

DISCUSSION

The preceding examples demonstrate how CIR can identify issues regarding character definition (as for Graham et al., 1991) and character scoring (as for Chamberlain and Wood, 1987), as well as suggesting characters whose further study may be particularly valuable for systematics (as for Anderberg, 1986). Based on those cases, as well as the application of CIR to an additional circa 20 analyses (data not shown), it is possible to draw some general conclusions regarding the nature of the CIR method. There is no correlation between the importance of a character (in terms of CIR) and the consistency of that character with the particular cladogram (Tables 1 to 3), confirming theoretical expectations. The number of characters identified as important (i.e., importance value greater than 0) does not seem to scale significantly with the ratio of number of characters to number of OTUs. In any case, the number of important characters identified for an analysis is of much less interest than which characters they are (as illustrated above). The approach of averaging the topographic distances obtained for the incremental downweighting of a particular character was found to be a reasonable representation of the tree shape change, as the rank order of the important characters was not altered by using a threshold approach (see Appendix) or several other alternatives. Comparison of these results with those obtained by simple exclusion of characters shows that a more refined estimate of importance (i.e., fewer characters with identical importance values) is obtained with incremental downweighting.

Even so, it is important to keep in mind that CIR only considers the impact of a character on the shape of a cladogram. It is clear that characters identified as important may be those that provide widespread support (in terms of supporting many different nodes), or they may be extreme characters that strongly influence the cladogram shape in a particular fashion, or they may be characters that maintain a weakly supported node, or they may be characters that maintain a balance between competing cladograms. By examining the distribution of important characters on particular trees, it is apparent that although they sometimes involve deep nodes, they sometimes involve terminal nodes. Such characters may or may not have anything in common beyond altering the cladogram (which they may do in a wide variety of ways, both within an analysis and across analyses). The crucial point is that important characters directly influence the result of the analysis, and therefore merit more immediate scrutiny than those characters which do not. Such further scrutiny should include consideration of the specific ways in which the tree is altered (local versus global changes), and the reasons for such alterations (data conflict versus lack of data). At this point, it does not seem possible to predict, a priori, which characters will be important, which supports the use of CIR as a nonredundant index.
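The comparison with simple exclusion noted above can be illustrated with a small Python sketch. The distances are invented, not drawn from any of the analyses reported here: both hypothetical characters change the tree by the same amount when fully removed, but only the first does so at every intermediate weight. The function names are likewise hypothetical.

```python
from statistics import mean

# Invented distances from the original tree at weights 0.8, 0.6, 0.4, 0.2, and 0.
distances = {
    "character 1": [2, 2, 2, 2, 2],
    "character 2": [0, 0, 0, 0, 2],
}


def exclusion_score(ds):
    """Simple exclusion: only the distance at weight 0 is considered."""
    return ds[-1]


def incremental_score(ds):
    """Incremental downweighting: mean distance over all weights."""
    return mean(ds)


for name, ds in distances.items():
    print(name, exclusion_score(ds), incremental_score(ds))
```

In this toy case the exclusion scores are tied, whereas the incremental scores separate the two characters, which is the sense in which incremental downweighting yields fewer identical importance values.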

CONCLUSIONS

By systematically identifying those characters that directly influence the resulting cladogram, CIR spotlights those characters most in need of careful review. As demonstrated here, such review often results in significant alterations to the preferred cladogram. CIR also identifies the extent to which cladistic analyses depend on single characters, documenting that (for most cladistic analyses) not all characters contribute equally to the shape of the cladogram. By identifying the characters with the strongest (though not necessarily most accurate) signal in a cladistic analysis, CIR sets up a feedback loop between characters and trees that may prove useful in refining estimates of phylogeny.

ACKNOWLEDGMENTS

For helpful discussions and assistance, I thank W. Henry Gilbert, Ken Angielczyk, F. Clark Howell, John Hutchinson, Diogo Meyer, Kevin Padian, James Parham, Alan Shabel, and Tim White, all at the University of California, Berkeley. I also thank an anonymous reviewer and associate editor Dan Faith for many helpful comments. This work was supported in part by the Laboratory for Human Evolutionary Studies of the Museum of Vertebrate Zoology, U. C. Berkeley.

REFERENCES

Anderberg, A. 1986. The genus Pegolettia (Compositae, Inuleae). Cladistics 2:158-186.
Chamberlain, A. T. 1987. A taxonomic review and phylogenetic analysis of Homo habilis. Ph.D. dissertation, University of Liverpool.
Chamberlain, A. T., and B. A. Wood. 1987. Early hominid phylogeny. J. Hum. Evol. 16:119-133.
Davis, J. I. 1993. Character removal as a means for assessing stability of clades. Cladistics 9:201-210.
Davis, J. I., M. W. Frohlich, and R. J. Soreng. 1993. Cladistic characters and cladogram stability. Syst. Bot. 18:188-196.
Graham, L. E., C. F. Delwiche, and B. D. Mishler. 1991. Phylogenetic connections between the 'green algae' and the 'bryophytes.' Adv. Bryol. 4:213-244.
Steel, M. A., and D. Penny. 1993. Distributions of tree comparison metrics: Some new results. Syst. Biol. 42:126-141.
Swofford, D. L. 1998. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods) Version 4.0b4. Sinauer Associates, Sunderland, Massachusetts.

First submitted 4 December 2002; reviews returned 15 August 2003; final acceptance 7 March 2004
Associate Editor: Dan Faith

APPENDIX

CHARACTER IMPORTANCE RANKING ALGORITHM

1. Variables are defined as follows.
   a. Start with a cladistic analysis that uses characters C1 through Cn.
   b. Let each character have a corresponding weighting variable W1 through Wn.
   c. Let To be the one best cladogram that results from an analysis of characters C1 through Cn (if there are multiple equally parsimonious cladograms, use each one sequentially and compare the results).
   d. Let IMPx be the importance value of character Cx.
2. Let Wx assume a range of values between 1 and 0 (e.g., 0.9, 0.8, ..., 0).
   a. If the original analysis used character weighting, then either multiply each of the values between 1 and 0 by the original weight, or discard the original weighting to establish the importance of unmodified characters.
3. Let all other W = 1.
4. For each value of Wx, determine the most parsimonious cladogram(s).
5. For each of those cladograms, calculate the distance (D) between it and To. Different distance measures can be employed, but the current implementation uses the PAUP* 4.0 symmetric difference metric (Steel and Penny, 1993).
6. IMPx = mean of all D calculated in step 5.
   a. In the current implementation, the mean is taken for all the distances. It may be more precise to first calculate the mean of each set of distances (if there is more than one) for each weighting value, and then take the mean of those averages. Preliminary empirical findings suggest, however, that these two approaches yield very similar results in practice.
   b. Alternately, a threshold approach could be used whereby IMPx equals the maximum value of Wx such that D_Wx is greater than some constant.
7. Repeat steps 2 through 6 for each W (i.e., x = 1 to n). In other words, the weighting variable for each character is manipulated in turn.
8. Rank all characters by IMP value.

N.B. The CIR PAUP* 4.0 files, along with detailed instructions and data sets, can be downloaded from http://www.stanford.edu/~degusta.
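The steps above can be sketched in Python as follows. This is an illustration of the algorithm's control flow only, not the PAUP* implementation. It assumes a single best original tree, represents each tree as a set of groupings, and takes the parsimony search as a caller-supplied function; the names `rank_characters`, `search`, and `toy_search` are hypothetical, and `toy_search` is an invented stand-in for a real parsimony program.

```python
from statistics import mean

WEIGHT_STEPS = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0)


def symmetric_difference(tree_a, tree_b):
    """Count the groupings that occur in exactly one of the two trees."""
    return len(tree_a ^ tree_b)


def rank_characters(n_characters, search, weight_steps=WEIGHT_STEPS):
    """Return (character number, importance value) pairs, most important first.

    `search(weights)` is assumed to return a list of most parsimonious trees.
    """
    base_weights = [1.0] * n_characters
    original = search(base_weights)[0]           # step 1c (one best tree assumed)
    importance = {}
    for x in range(n_characters):                # step 7: each character in turn
        distances = []
        for w in weight_steps:                   # step 2: range of weights for Wx
            weights = list(base_weights)         # step 3: all other W = 1
            weights[x] = w
            for tree in search(weights):         # step 4: best tree(s) at this weight
                distances.append(symmetric_difference(tree, original))  # step 5
        importance[x + 1] = mean(distances)      # step 6: mean of all distances
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)  # step 8


def toy_search(weights):
    """Invented stand-in: the tree changes once character 1 falls below 0.5."""
    ab = {frozenset("AB"), frozenset("ABC")}
    bc = {frozenset("BC"), frozenset("ABC")}
    return [bc] if weights[0] < 0.5 else [ab]


print(rank_characters(3, toy_search))
```

With the toy search, only character 1 receives a nonzero importance value. The alternatives in steps 6a and 6b would replace the single `mean(distances)` call with a mean of per-weight means or with a threshold rule, respectively.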
