

Mutation Research 435 (1999) 171–213
www.elsevier.com/locate/dnarepair

Community address: www.elsevier.com/locate/mutres

A phylogenomic study of DNA repair genes, proteins, and processes

Jonathan A. Eisen *, Philip C. Hanawalt
Department of Biological Sciences, Stanford University, Stanford, CA 94305-5020, USA

Received 8 April 1999; received in revised form 18 August 1999; accepted 24 August 1999

Abstract

The ability to recognize and repair abnormal DNA structures is common to all forms of life. Studies in a variety of species have identified an incredible diversity of DNA repair pathways. Documenting and characterizing the similarities and differences in repair between species has important value for understanding the origin and evolution of repair pathways as well as for improving our understanding of phenotypes affected by repair (e.g., mutation rates, lifespan, tumorigenesis, survival in extreme environments). Unfortunately, while repair processes have been studied in quite a few species, the ecological and evolutionary diversity of such studies has been limited. Complete genome sequences can provide potential sources of new information about repair in different species. In this paper, we present a global comparative analysis of DNA repair proteins and processes based upon the analysis of available complete genome sequences. We use a new form of analysis that combines genome sequence information and phylogenetic studies into a composite analysis we refer to as phylogenomics. We use this phylogenomic analysis to study the evolution of repair proteins and processes and to predict the repair phenotypes of those species for which we now know the complete genome sequence. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: DNA repair; Molecular evolution; Phylogenomics; Gene duplication and gene loss; Orthology and paralogy; Comparative genomics

1. Introduction

Genomic integrity is under constant threat in all species. These threats come in many forms (e.g., agents that damage DNA, spontaneous chemical changes, and errors in DNA metabolism), lead to a variety of alterations in the normal DNA structure (e.g., single- and double-strand breaks, chemically modified bases, abasic sites, bulky adducts, inter- and intra-strand cross-links, and base-pairing mismatches) and have many direct and indirect effects on cells and organisms (e.g., mutations, genetic recombination, the inhibition or alteration of cellular processes, chromosomal aberrations, tumorigenesis, and cell death). Given this diversity of threats and their effects, it is not surprising that there is a corresponding diversity of DNA repair processes. Overall, repair pathways have been found that can repair just about any type of DNA abnormality. The cellular functions of all known repair pathways are also diverse. These functions include the correction of replication errors, resistance to killing by DNA damaging agents, chromosome duplication and segregation, cell cycle control, generation of antibody diversity in vertebrates, regulation of interspecies recombination, meiotic and mitotic recombination, transcription or replication elongation, and tumor suppression. The diversity of DNA repair pathways can be readily seen by comparing and contrasting different pathways. For example, some pathways are able to repair only a single type of abnormality; others are quite broad and are able to repair many abnormalities. Similarly, some pathways are simple, involving single enzymes and single steps; others are highly complex, involving many steps and dozens of enzymes working in concert. In addition, some pathways have single functions while others have roles in a variety of cellular processes.

* Corresponding author. The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. Tel.: +1-301-838-3507; fax: +1-301-838-0208; e-mail: [email protected]

0921-8777/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0921-8777(99)00050-6

The diversity of specificity, functions, and complexity of repair pathways is best understood by comparing mechanisms of action among pathways. Such comparisons are simplified by the division of repair processes into three major classes based on general mechanism of action: direct repair (in which abnormalities are chemically reversed), recombinational repair (in which homologous recombination is used to repair abnormalities) and excision repair (in which a section of the DNA strand containing an abnormality is removed and a repair patch is synthesized using the intact strand as a template). Within each of these classes there are multiple types and sometimes even subtypes of repair. For example, there are dozens of different subtypes of base excision repair (BER), which itself is one of three main types of excision repair (the other two being nucleotide excision repair (NER) and mismatch excision repair (MMR)).

The diversity of DNA repair pathways outlined above is the diversity of all known repair processes in all species. One aspect of this overall diversity is that found within species. For example, Escherichia coli, Saccharomyces cerevisiae and humans each exhibit all the major classes of repair, and multiple types and even subtypes of each class. It is likely that most or even all species also have many classes and types of repair. The within-species diversity allows a species to recognize and repair many types of abnormalities and also provides redundancy since there is overlap among many pathways. Another aspect of the diversity of DNA repair is that due to differences between species. These interspecific differences come in two forms. First, although all species have many types of repair, the exact repertoire of types differs between species. For example, although all species studied have BER, the particular types of abnormal bases that are repaired by BER differ greatly. Similarly, photoreactivation (PHR) is found in some species, such as E. coli and yeast, but not others, such as humans [1]. Second, there are differences within particular types and subtypes of repair between species. For example, in those species that have been found to have MMR, the particular mismatches that are best repaired are highly species specific [2]. Differences in specificity exist in almost every type of repair even between closely related species.

Differences in the specificity and types of repair such as those described above can have profound biological effects. For example, it has been suggested that the accelerated mutation rate in mycoplasmas may be due in part to deficiencies in DNA repair [3,4]. Examples of other phenotypes and features that may be variable between individuals, strains or species due to differences in repair include cancer rates [5], lifespan [6,7], pathogenesis [8–10], codon usage and GC content [11,12], evolutionary rates [13], survival in extreme environments [14], speciation [15,16], and diurnal/nocturnal patterns [17]. Thus, to understand differences in any of these phenotypes, it is useful to understand differences in repair.

Characterization of repair in different species is also of great use in understanding the evolution of repair proteins and processes. This is important not just because repair is a major cellular process, but also because information about the evolution of repair provides a useful perspective for comparative repair studies. In general, an evolutionary perspective is useful in any comparative study because it allows a focus on how and why similarities and differences arose rather than the simple identification and characterization of similarities and differences [18]. For studies of DNA repair, we believe an evolutionary perspective is the key to understanding differences in repair between species, as well as the mechanisms and functions of particular repair processes [19–21].


Unfortunately, comparative and evolutionary studies of DNA repair processes have been limited because of the lack of detailed studies of repair in a wide ecological and evolutionary diversity of species [19]. Recently, a potential new source of comparative repair data has emerged: complete genome sequences. In theory, complete genome sequences should enable the prediction of the phenotype of a particular strain or species, while also providing a wealth of data for comparative analysis. In practice, however, obtaining useful information from complete genome sequences is quite difficult. We have been developing a new approach that combines the analysis of complete genome sequences with evolutionary reconstructions into a composite analysis we refer to as phylogenomics [21–23]. We present here a global phylogenomic analysis of DNA repair proteins and processes. We use this phylogenomic analysis to infer the evolutionary history of repair pathways and the respective proteins that comprise them and to make predictions about the repair phenotypes of species for which genomes have been sequenced. In addition, we discuss the uses of evolutionary analysis in studies of complete genome sequences, the uses of complete genome sequences in studies of evolution, and the advantages of the combined phylogenomic approach.

2. Methods

Our phylogenomic analysis can be divided into a series of steps, with feedback loops between some steps such that initial analyses are subsequently refined (see Table 2 for an outline of methods used). The steps are described below as well as in some previous papers [21–23].

2.1. Presence and absence of homologs

The first major step in phylogenomic analysis is the determination of the presence and absence of homologs of genes of interest in different species. For the analysis here, genes with established roles in DNA repair processes were identified by a comprehensive review of the literature. Likely homologs of these genes were identified by searching a variety of sequence databases using the blast and blast 2 search algorithms [24]. A conservative operational definition of homology (i.e., high threshold of sequence similarity) was used to limit the number of false positive results (i.e., identifying genes as homologs that do not share common ancestry). In some cases, this threshold was lowered if other evidence suggested that homologs were highly divergent (see Section 4). Since this conservative approach might lead to false negatives, iterative search methods (e.g., PSI-blast [24]) and manual methods were used to increase the likelihood of identifying highly divergent homologs of the reference protein. Presence and absence of homologs of genes in particular species were determined by searching (using the above methods) against complete genome sequences (Table 1). Homologs of repair genes that had been cloned from species for which complete genomes were not available were identified by searching against the NR and EST databases at the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). The amino acid sequences of all putative homologs of a particular gene were aligned using the Clustal W program [25]. The alignments were examined manually to assess the reliability of the homology assignments. In addition, block-motifs were made of alignments using the blocks web server (www.blocks.fhcrc.org). These were then used for additional database searches to identify sequences containing motifs similar to those that were aligned together.
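The two-tier homology call described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores, the two cutoffs, and the `supported` set (species for which other evidence justifies the lower threshold) are hypothetical and are not values used in the study.

```python
from typing import Dict, Set

# Hypothetical cutoffs on a similarity score (for example a BLAST bit score).
STRICT_THRESHOLD = 100.0   # conservative cutoff, limits false positives
RELAXED_THRESHOLD = 50.0   # lower cutoff, applied only with supporting evidence


def call_presence(best_scores: Dict[str, float],
                  supported: Set[str]) -> Dict[str, bool]:
    """Return a presence/absence call for a homolog in each species."""
    calls = {}
    for species, score in best_scores.items():
        if score >= STRICT_THRESHOLD:
            calls[species] = True            # passes the conservative cutoff
        elif score >= RELAXED_THRESHOLD and species in supported:
            calls[species] = True            # divergent homolog, other evidence
        else:
            calls[species] = False
    return calls


# Illustrative best-hit scores for one query gene (made-up numbers).
best_scores = {"species_1": 250.0, "species_2": 62.0,
               "species_3": 55.0, "species_4": 12.0}
supported = {"species_2"}
calls = call_presence(best_scores, supported)
print(calls)
```

Keeping the relaxed cutoff conditional on independent evidence mirrors the text: the threshold is lowered only where divergence is expected, so the default call stays conservative.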

2.2. Evolutionary relationships among homologs

The second major step in phylogenomic analysis is the characterization of the evolutionary relationships among all homologs of each gene. To do this, phylogenetic trees were generated for each group of homologs from the sequence alignments (excluding poorly conserved regions) by the neighbor-joining and parsimony methods of the PAUP* program [26]. The robustness of phylogenetic patterns was assessed using bootstrapping and by comparing phylogenetic trees generated with different algorithms.
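The resampling idea behind bootstrapping can be illustrated with a small Python sketch. It is deliberately simplified and does not reproduce PAUP*: instead of rebuilding a tree for every replicate, it asks how often a given pair of sequences remains the most similar pair when alignment columns are resampled with replacement. The toy alignment and all names are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(aln: dict) -> frozenset:
    """Names of the two sequences with the smallest p-distance."""
    pair = min(combinations(sorted(aln), 2),
               key=lambda p: p_distance(aln[p[0]], aln[p[1]]))
    return frozenset(pair)


def bootstrap_support(aln: dict, pair: frozenset,
                      replicates: int = 200, seed: int = 0) -> float:
    """Fraction of resampled alignments in which `pair` is the closest pair."""
    rng = random.Random(seed)
    length = len(next(iter(aln.values())))
    hits = 0
    for _ in range(replicates):
        # Draw alignment columns with replacement (one bootstrap replicate).
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[i] for i in cols)
                     for name, seq in aln.items()}
        hits += closest_pair(resampled) == pair
    return hits / replicates


# Toy alignment of four hypothetical homologs (equal-length strings).
aln = {"geneA": "MKTLLVAGGS", "geneB": "MKTLLVAGGT",
       "geneC": "MRSLIVSGAT", "geneD": "MRSIIVTGAS"}
support_ab = bootstrap_support(aln, frozenset({"geneA", "geneB"}))
support_ac = bootstrap_support(aln, frozenset({"geneA", "geneC"}))
print(support_ab, support_ac)
```

In a real analysis each replicate would be passed to a tree-building method and support would be counted per clade; the column resampling step is the same.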

2.3. Inference of evolutionary events

Table 1
Completely or nearly completely sequenced genomes analyzed

Species | Classification | Size (mb) | # Orfs | Reference

Bacteria
E. coli K-12 | Proteobacteria (γ) | 4.60 | 4288 | [172]
Haemophilus influenzae Rd KW20 | Proteobacteria (γ) | 1.83 | 1743 | [30]
Rickettsia prowazekii Madrid E | Proteobacteria (α) | 1.1 | ~834 | [173]
Helicobacter pylori 26695 | Proteobacteria (ε) | 1.67 | 1590 | [174]
He. pylori J99 | Proteobacteria (ε) | 1.67 | 1590 | [175]
Campylobacter jejuni NCTC 1168 | Proteobacteria (ε) | 1.70 | n/a | [176]
Bacillus subtilis 168 | Low GC Gram + | 4.20 | 4100 | [177]
Mycoplasma genitalium G-37 | Low GC Gram + | 0.58 | 470 | [178]
Mycop. pneumoniae M129 | Low GC Gram + | 0.82 | 679 | [179]
Mycobacterium tuberculosis H37Rv | High GC Gram + | 4.41 | ~4000 | [180]
Borrelia burgdorferi B31 | Spirochete | 1.44 | 1283 | [181]
Treponema pallidum Nichols | Spirochete | 1.14 | 1041 | [182]
Chlamydia trachomatis serovar D | Chlamydia | 1.05 | n/a | [183]
Synechocystis sp. PCC6803 | Cyanobacteria | 3.57 | 3168 | [184]
Deinococcus radiodurans R1 | Deinococcus/Thermus | 3.20 | 3193 | [185]
Thermotoga maritima MSB8 | Thermotogales | 1.80 | 1877 | [171]
Aquifex aeolicus VF5 | Aquificaceae | 1.55 | 1512 | [186]

Archaea
Methanococcus jannaschii DSM 2661 | Euryarchaeota | 1.66 | 1738 | [187]
Methanobacterium thermoautotrophicum ΔH | Euryarchaeota | 1.75 | 1855 | [188]
Pyrococcus horikoshii OT3 | Euryarchaeota | 1.80 | ~2000 | [189]
Archaeoglobus fulgidus VC-16, DSM4304 | Euryarchaeota | 2.18 | 2436 | [190]

Eukaryote
S. cerevisiae S288C | Fungi | 13.0 | 5885 | [191]

In the third major step in phylogenomic analysis, four main events in the history of each gene family (gene origin, gene duplication, lateral gene transfer and gene loss) are inferred. The first step in identifying these events involves determining evolutionary distribution patterns (EDPs) for each gene. EDPs, which are determined by overlaying gene presence/absence information onto an evolutionary tree of species, reveal a great deal about the evolutionary history of particular genes (see Table 3). For example, if a gene is present in only one subsection of the species tree, then it likely originated in that subsection. However, some EDPs do not have a single likely mechanism of generation and thus require further analysis before being used to identify specific evolutionary events. For example, an uneven distribution pattern (scattered presence and absence throughout the species tree) can be explained either by lateral transfer to the species with an unexpected presence of the gene or by gene loss in species with an unexpected absence. Ascertaining which event occurred can usually be accomplished by comparing the gene tree to the species tree and testing for congruence. If there has been a lateral transfer in the past, then the species tree and the gene tree should be incongruent (i.e., they should have different branching topology). In contrast, if there has been a gene loss in the past, then the gene and species trees should be congruent except that some species will not be represented in the gene tree. This comparison of gene tree, species tree, and EDPs was used to identify likely cases of gene duplication, loss, and lateral transfer in the history of every DNA repair gene. Then parsimony reconstruction methods were used to determine the likely timing of gene origin, loss, duplication and transfer events. In this analysis, we used the MacClade computer program [27] to attempt to identify the evolutionary scenario that requires the fewest events to arrive at the current distribution patterns. Since this type of analysis is not commonly used for molecular data, an example (for tracing gene loss) is presented in Fig. 1.
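As a rough illustration of the "fewest events" criterion, the following Python sketch applies unweighted Fitch parsimony to a presence/absence character on a rooted binary species tree. It is not a reimplementation of MacClade; the tree topology, species labels, and presence data are hypothetical and chosen only to show a single inferred loss.

```python
def fitch(tree, presence):
    """Fitch parsimony on a rooted binary tree written as nested 2-tuples.

    Returns (possible states at this node, minimum number of state changes
    in the subtree below it). Leaves are species names looked up in `presence`.
    """
    if isinstance(tree, str):                       # leaf: observed state
        return {presence[tree]}, 0
    left_set, left_changes = fitch(tree[0], presence)
    right_set, right_changes = fitch(tree[1], presence)
    common = left_set & right_set
    if common:                                      # children can agree
        return common, left_changes + right_changes
    # Children disagree: one gain or loss is needed on a branch below.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical species tree: two bacterial groups plus an archaeon/eukaryote pair.
species_tree = ((("bact_1", "bact_2"), ("bact_3", "bact_4")),
                ("arch_1", "euk_1"))
# Gene found everywhere except bact_3 (made-up distribution pattern).
presence = {"bact_1": True, "bact_2": True, "bact_3": False,
            "bact_4": True, "arch_1": True, "euk_1": True}

root_states, n_events = fitch(species_tree, presence)
print(root_states, n_events)
```

Here the most parsimonious scenario has the gene present at the root and one change, which corresponds to a loss in the lineage leading to `bact_3`. Fitch parsimony weights gains and losses equally; a weighted scheme would be needed if, for example, lateral transfer were considered less likely than loss.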

An essential component in the identification of gene loss, duplication, origin and transfer is the species tree. Unfortunately, there is no general consensus concerning the relationships among all of the species analyzed here. For the analysis described here, a species tree based upon the Ribosomal Database Project trees [28] was used. In this tree, Archaea, bacteria and eukaryotes are each monophyletic and Archaea are a sister group to eukaryotes. In the sections on specific repair pathways, the possible effects of alternative species trees are discussed.

2.4. Refining homology groups

In the fourth major step in phylogenomic analysis, evolutionary analysis is used to refine the list of presence and absence of homologs of particular genes. For the analysis reported here, this involved two types of refinement. First, if gene duplication events were identified, genes were divided into groups of orthologs and paralogs, and then presence and absence was determined only for orthologs of the query gene. In addition, in some cases, gene trees were used to subdivide gene families into evolutionarily distinct subfamilies, and then presence and absence was determined only for homologs in the same subfamily as the query gene.
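One way to picture the ortholog/paralog refinement is the Python sketch below. It assumes a gene tree whose internal nodes have already been labeled as duplication or speciation events; that labeling, the tuple encoding, and the gene names are all assumptions made for illustration. Two genes are called paralogs if their last common ancestor is a duplication node and orthologs otherwise.

```python
def leaves(node):
    """All gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every pair of genes by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "paralogs" if event == "duplication" else "orthologs"
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label   # this node is the pair's LCA
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


def orthologs_of(query, tree):
    """Genes whose relationship to `query` is orthology."""
    pairs = classify_pairs(tree)
    return sorted(next(iter(pair - {query}))
                  for pair, label in pairs.items()
                  if query in pair and label == "orthologs")


# Hypothetical family: one duplication followed by a speciation in each copy.
gene_tree = ("duplication",
             ("speciation", "sp1_geneA", "sp2_geneA"),
             ("speciation", "sp1_geneB", "sp2_geneB"))
pairs = classify_pairs(gene_tree)
print(orthologs_of("sp1_geneA", gene_tree))
```

With this labeling, presence of a homolog of `sp1_geneA` in species 2 would be scored from `sp2_geneA` only, not from the paralogous `sp2_geneB`.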

2.5. Functional predictions and functional evolution

Fig. 1. Demonstration of using EDPs to trace gene gain and loss. An evolutionary tree of the relationships among some representatives of the bacteria, Archaea and eukaryotes is shown. Presence of genes in these species is indicated by a colored box at the tip of the terminal branches of the tree. Gain and loss of the gene is inferred through parsimony reconstruction techniques. Within the bacterial part of the tree, we divide the species into major phyla but have collapsed the branches joining the different phyla to indicate that the relationships among these phyla are ambiguous.

The fifth major step in phylogenomic analysis involves studies of functional evolution for individual gene families. Functional evolution was studied by overlaying information on gene functions onto the gene trees. Then parsimony reconstruction methods were used to trace changes of function over evolutionary time (this was done in much the same way as for presence and absence of genes described above). Tracing functional evolution is an important component of making functional predictions for both ancestral genes and uncharacterized genes as described [22] and phenotypic predictions for species. For example, if there have been many functional changes in the history of a particular gene or gene family, then the identification of the presence of homologs of such genes in a species is not sufficient information to predict the presence of a particular activity. Thus, tracing functional changes helps prevent incorrect predictions of function. In addition, tracing functional changes can also greatly improve the chances of making correct functional predictions for ancestral genes and uncharacterized genes [22]. Such functional predictions are made based on the position of the gene of interest in the gene tree relative to genes with known functions and based on inferring evolutionary events such as gene duplications that may identify groups of genes with similar functions [22]. Specific functional predictions made for repair genes are discussed in the sections on the different repair pathways. It is important to note that all studies of functional evolution and predictions of gene function should use only experimental information on gene functions and not database annotation. Thus, for our analysis we made extensive use of the primary literature on gene functions. We apologize for being unable to cite all sources here.
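The idea of predicting function from a gene's position in the gene tree can be sketched as follows. This is a simplified stand-in for the approach described in [22], not that method itself: the rule used here (take the smallest clade that contains the query and at least one characterized gene, and predict only if those genes agree) is an assumption, as are the tree and the function labels.

```python
def leaves(node):
    """All gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def predict(tree, target, known):
    """Predict the function of `target`, or return None if it is ambiguous.

    `known` maps experimentally characterized genes to their function.
    """
    if isinstance(tree, str) or target not in leaves(tree):
        return None
    # Prefer the smallest (deepest) clade that yields a prediction.
    for child in tree:
        result = predict(child, target, known)
        if result is not None:
            return result
    functions = {known[g] for g in leaves(tree) if g in known and g != target}
    # Predict only when all characterized genes in the clade agree.
    return functions.pop() if len(functions) == 1 else None


# Hypothetical gene tree with two functionally distinct subfamilies.
subtree = ((("g1", "orfX"), "g2"), ("g3", ("g4", "orfY")))
gene_tree = (subtree, "orfZ")
known = {"g1": "activity A", "g2": "activity A",
         "g3": "activity B", "g4": "activity B"}

print(predict(gene_tree, "orfX", known))
print(predict(gene_tree, "orfY", known))
print(predict(gene_tree, "orfZ", known))
```

The third call returns `None` because `orfZ` branches outside both subfamilies, so the characterized genes in its smallest enclosing clade disagree. This is the case the text warns about: homology alone is not enough to predict an activity when function has changed within the family.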

2.6. Pathway analysis

The last major step in phylogenomic analysis involves comparing and combining the results of analyses of different genes and pathways. One aspect of this is the comparison of the presence and absence of all the genes in a pathway in different species. If genes in a pathway are always present or absent as a unit (i.e., no gene in the pathway is ever present without the other genes), this suggests a conserved association among these genes. If the genes are not always present together, there are multiple possible explanations, including that the pathway is found in the different species but that some genes have been replaced by non-orthologous genes [29]; that the pathway does not function the same way in all species; or that the genes do not work together as was thought. The presence and absence of genes in different species was studied for all repair pathways. Another important aspect of pathway analysis is determining if there are any correlated evolutionary events for different genes in a pathway (e.g., gene loss or duplication). Such correlated events lend extra support to a conserved association among genes, especially if correlated events occurred multiple times in different lineages. Correlated events were studied for all repair pathways. A third component of phylogenomic analysis of pathways involves comparing functional evolution between and within pathways. For example, if a particular activity evolved only once, then the presence or absence of the gene(s) required for that activity can be used as a good estimator of the presence and absence of the activity. If a particular activity evolved separately many times, then there may be many as yet uncharacterized genes that can also provide that activity. Therefore, even if a species does not encode any of the genes known to have that activity, one should not conclude that the species does not have that activity.
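The "present or absent as a unit" check lends itself to a short sketch. The Python below assumes a simple mapping from species to the set of pathway genes found in its genome; the species names, gene names, and the three-way labels are hypothetical.

```python
from typing import Dict, List, Set


def pathway_profile(presence: Dict[str, Set[str]],
                    pathway: List[str]) -> Dict[str, str]:
    """Label each species 'complete', 'absent', or 'partial' for a pathway."""
    profile = {}
    for species, genes in presence.items():
        found = [gene in genes for gene in pathway]
        if all(found):
            profile[species] = "complete"
        elif not any(found):
            profile[species] = "absent"
        else:
            profile[species] = "partial"    # some pathway genes without others
    return profile


def inherited_as_unit(profile: Dict[str, str]) -> bool:
    """True if no species carries only part of the pathway."""
    return "partial" not in profile.values()


# Made-up presence data for a three-gene pathway.
pathway = ["geneA", "geneB", "geneC"]
presence = {
    "species_1": {"geneA", "geneB", "geneC"},
    "species_2": set(),
    "species_3": {"geneA", "geneC"},
}
profile = pathway_profile(presence, pathway)
print(profile, inherited_as_unit(profile))
```

A "partial" species is not in itself evidence that the pathway is missing. As the text notes, the absent gene may have been replaced by a non-orthologous gene, or the pathway may work differently in that species.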

3. Results and discussion

The publication in 1995 of the first complete genome sequence of a free-living organism initiated a new phase of biology research [30]. Currently, more than 20 complete genome sequences are publicly available (Table 1) and it is likely that there will be hundreds more available within a few years. These genome sequences provide an unprecedented window into the biology of the species that have been sequenced as well as into the evolution of life on the planet. To make the most out of genome sequences, both for studies of the biology of species and for studies of evolution, we believe that evolutionary reconstructions and genome analysis should be integrated into a single composite approach, which we refer to as phylogenomics [21,22].

The first reason to combine evolutionary reconstructions and genomics is that evolutionary analysis can greatly improve what can be learned from genome sequences. In general, an evolutionary perspective is useful in any comparative biological study because it allows one to go beyond identifying what is similar or different between species and to focus instead on understanding how and why such similarities and differences may have arisen [18,31]. The benefits of an evolutionary perspective are well known in some aspects of comparative biology such as comparative physiology and ecology [18,31]. Although it is not well recognized, an evolutionary perspective has also been quite useful in many aspects of comparative molecular biology, including making functional predictions [22], inferring mutation processes [32], determining secondary and tertiary structures of ribosomal RNA [33] and proteins [34], making motif-patterns for conserved proteins [35], and in sequence searching algorithms [36]. All such methods can be of use in comparative genomic analysis as well.

Just as evolutionary methods can benefit comparative genomic studies, genome analysis is incredibly useful in studies of evolution. One aspect of this is the wealth of comparative data provided by genome sequences, which allow studies of the evolutionary relationships among species in a way never before possible. It is the completeness of complete genome sequences that allows one to address questions never before possible in evolutionary studies. For example, one can now analyze codon usage of all genes in a genome and compare this between species [37]. Complete genomes can also be used to compare and contrast the evolution of different pathways both within and between species.

The reason for a composite phylogenomic approach is that there are feedback loops between genome analysis and evolutionary reconstruction such that they are impossible to separate in some cases and in most other cases they can be combined for a mutual benefit. For example, in the inference of gene loss, complete genome information is required to show that a species does not encode any homologs of the gene thought to have been lost. Evolutionary analysis is then required to show that an ancestor of the species without the gene likely had the gene. Similarly, in the inference of gene duplication, genome analysis is required to determine the number of homologs of a particular gene in different species. Then evolutionary analysis is needed to divide the homologs into orthologs and paralogs. Finally, genome analysis is required to determine the presence and absence of the different orthologs. There are many other areas in which genome and evolutionary analysis can be combined for mutual benefit, including making functional predictions for individual genes [22], predicting species phenotype, and tracing the evolution of pathways. We have incorporated many of these into our phylogenomic analysis (see Table 2 and Section 2).

Here we apply our phylogenomic approach to the study of DNA repair processes. We have divided our analysis into two main subsections. In the first subsection, we discuss our results on a pathway by pathway basis. For each pathway, we review what is known about the pathway and the proteins in that pathway in the species in which the pathway is best characterized. Then we discuss what is known about this pathway in other species. Finally, we present the results of our phylogenomic analysis as well as results of other comparative or evolutionary studies of this pathway, such as the recently published comprehensive analysis of DNA repair domains [38]. In the second subsection, we discuss our results from a broader perspective, looking at all repair pathways together. To simplify our discussion, we have summarized our results in a few ways. In Fig. 3 we have traced the inferred gain and loss of repair genes onto an evolutionary tree of the species. In Table 6 we have sorted the repair genes by pathway and by the inferred timing of the origin of each gene.

3.1. Direct repair

3.1.1. PHR

PHR is a general term used to refer to the ability of cells to make use of visible light to reverse the toxic effects of UV irradiation. PHR has been found in bacteria, Archaea and eukaryotes. Despite the highly general way that PHR is defined, all characterized enzymatic PHR processes involve a similar type of direct repair of UV irradiation-induced DNA lesions [39]. Therefore, the term PHR is frequently used more narrowly to refer to this type of DNA repair. Two different types of PHR have been discovered: the most common one involving the reversal of cyclobutane pyrimidine dimers (CPDs) and the other involving the reversal of 6–4 pyrimidine–pyrimidone photoproducts (6–4s). In addition, PHR processes differ from each other in their action spectrum, the wavelength of light required for peak activity, and the particular cofactor used to facilitate energy transfer [39]. Despite the different substrates, all PHR processes are quite similar: all are single step processes that have similar mechanisms and all enzymes that perform PHR (known as photolyases) are homologous.


Table 2. Components of phylogenomic analysis

Component | How is it determined? | Uses of this component

Gene analysis
(1) Database of genes of interest. | Personal choice, characterized genes. | Similarity searches (2).
(2) Searching for homologs. | Blast, PSI-blast, BLOCKS. Set homology threshold. | Presence/absence (4); gene tree (7).
(3) Functional predictions. | Overlay known functions of genes onto gene tree. | Prediction of phenotypes (6); functional evolution.

Genome analysis
(4) Gene presence/absence in species. | Searches (2) of complete genome sequences. Some refinement from evolutionary analysis (7, 10). | Evolutionary analysis (8, 10).
(5) Correlated presence/absence. | Analyze presence/absence (4) in different species. | Functional predictions (3), pathway evolution (11).
(6) Phenotype predictions. | Combine functional predictions (3), presence/absence (4) and pathway evolution (11). | Identify universal activities.

Evolutionary analysis
(7) Gene trees. | Set homology threshold for searches (2) and use phylogenetic analysis of all homologs. | Presence/absence (4); identifying evolutionary events (10), functional predictions (3).
(8) EDPs. | Overlay gene presence/absence (4) onto species tree. | Identifying gene evolutionary events; pathway evolution.
(9) Congruence. | Compare gene tree (7) to species tree. | Distinguish lateral transfer from other events (8).
(10) Gene evolution events. | Analysis of gene tree (7), congruence (9) and EDPs (8). | Pathway evolution (11), correlated and convergent events, presence/absence (4); functional predictions (3).
(11) Pathway evolution. | Integrate gene evolution (10), evolutionary distribution (8), correlated presence/absence (5). | Phenotype predictions (6); functional predictions (3).

The comparison of photolyases is somewhat complicated by the fact that some photolyase homologs do not repair any lesions but instead function as blue-light receptors [40]. Comparative sequence analysis reveals that the photolyase gene family can be divided into two subfamilies, referred to as class I (or PhrI) and class II (or PhrII) [39]. Class I includes the photolyases of E. coli, Halobacterium halobium and yeast, as well as the blue-light receptors from plants and a human gene with no known function [1]. Class II includes the photolyases from Myxococcus xanthus, M. thermoautotrophicum, goldfish and marsupials.

Most species for which complete genome sequences are available do not encode any photolyase homolog, and those that do encode either a PhrI or a PhrII, but not both. It is important to note that some species for which the complete genomes are not available encode both PhrI and PhrII homologs (e.g., Arabidopsis thaliana). Phylogenetic trees of the photolyase gene family (ours and those of Ref. [39]), and the fact that both PhrI and PhrII are found in each of the major domains of life (Table 4), suggest that the two gene families are the result of an ancient duplication event. Thus, we conclude that the last common ancestor of all life encoded both a PhrI and a PhrII and that the uneven distribution pattern of these genes is best explained by gene loss events in some lineages. For example, a PhrI gene loss likely occurred recently in the H. influenzae lineage since many other γ-Proteobacteria (including E. coli, Neisseria gonorrhoeae, and Salmonella typhimurium) encode a PhrI. It is possible that gene loss has occurred in humans as well. Marsupials encode a PhrII, but a PhrII has not yet been found in humans. The rampant loss of PhrI and PhrII genes is not particularly surprising since many species may have switched from high to low UV irradiation environments and thus may not have much use for PHR. In addition, since both 6–4s and CPDs can be repaired by other pathways such as NER, PHR is not absolutely necessary for repair of these lesions. Some gene duplication has also occurred in the photolyase gene family; for example, Synechocystis sp. encodes two PhrIs. In addition, it is likely that there have been some lateral gene transfers of Phr genes: A. thaliana encodes a PhrI that likely was transferred from the chloroplast genome (data not shown).

By tracing the evolution of functions of photolyase homologs, we conclude that the ancestral Phr protein was a photolyase and thus that the last common ancestor could perform PHR. Photolyase genes may have been more important in the early evolution of life since there was no ozone layer then to attenuate the intense solar UV flux [41]. This analysis also shows that the blue-light receptors descended from photolyases and thus have lost PHR activity but retained the ability to absorb blue light [39]. Our analysis also shows that there have been multiple cases of change of function between CPD and 6–4 specificity. Because the history of photolyases is filled with functional changes and loss of function, we believe that the presence of a photolyase homolog in a species cannot be used to unambiguously predict the presence of PHR activity or its nature (e.g., CPD vs. 6–4).

The specific origin of photolyase enzymes is difficult to determine since the photolyase gene family does not show any obvious homology to any other proteins. However, it is useful to recognize that limited photolyase activity can be provided by a tripeptide sequence (Lys-Trp-Lys) [42–44], suggesting that a photolyase protein could have evolved relatively easily early in evolution.

3.1.2. Alkylation reversal

A common form of damage to DNA bases occurs when alkyl groups (especially methyl and ethyl groups) are covalently linked to DNA bases. One way that cells repair this damage is by transferring the alkyl group off the DNA, a form of direct repair known as alkyltransfer repair [45,46]. Alkyltransfer repair has been found in bacteria, Archaea and eukaryotes [47]. All alkyltransfer repair processes are highly similar. First, all are catalyzed by a single protein which transfers the alkyl group from O-6-alkyl guanine in the DNA to itself in a suicide process (the protein is never used again). In addition, comparative sequence analysis reveals that all alkyltransferases share a highly conserved core domain and thus are all homologs [48,49]. The comparison of alkyltransferase proteins is somewhat complicated because some contain additional domains (Fig. 2). For example, in E. coli the Ogt protein contains only the alkyltransferase domain while the Ada protein contains the alkyltransferase domain and a transcriptional regulatory domain. Ada uses the second domain as part of an inducible response to alkylation damage.

Our analysis shows that many but not all species encode alkyltransferase homologs (Table 4). Since alkyltransferase homologs are found in at least some species from each of the major domains of life (Table 4), we conclude that they are ancient proteins and were present in the last common ancestor of all organisms. Thus, the absence of an alkyltransferase homolog from some species (e.g., D. radiodurans, the two mycoplasmas, Synechocystis sp., R. prowazekii, and B. burgdorferi) is likely due to gene loss. The two alkyltransferases in E. coli are likely the result of gene duplication and domain shuffling. Specifically, we infer that in the γ-Proteobacteria there was a duplication into two alkyltransferase genes and subsequently the transcriptional regulatory domain was added onto the Ada protein. Interestingly, Gram-positive bacteria also encode a two-domain protein with an Ada transcriptional-regulatory domain, but in this case the Ada domain is fused to an alkyl glycosylase domain (see Fig. 2).

Fig. 2. Schematic diagram of an alignment of alkyltransferase genes.

Since all characterized members of this gene family function as alkyltransferases, we conclude that the presence of an alkyltransferase homolog in a species likely indicates the presence of alkyltransferase activity. Thus, the last common ancestor was likely able to perform alkylation repair. In addition, since no other proteins have been found to have this activity, we conclude that the absence of an alkyltransferase homolog likely indicates the absence of alkyltransferase activity. However, the species without an alkyltransferase homolog likely are still able to repair alkylation damage: some encode alkylation glycosylases for BER and all encode genes for NER (see below).
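The inference used for the alkyltransferases (a homolog present in at least one species from each domain is treated as ancient, and its absence elsewhere is read as gene loss) can be sketched as a small rule. This is only an illustration of the reasoning, not the procedure used to build Table 4: the species names and presence/absence values below are hypothetical placeholders, and the rule deliberately ignores lateral transfer, which the full analysis evaluates with gene trees.

```python
from collections import defaultdict

DOMAINS = ("Bacteria", "Archaea", "Eukarya")

# Toy presence/absence table: species -> (domain, encodes a homolog?).
# Hypothetical placeholder entries, NOT data taken from Table 4.
TOY_TABLE = {
    "bacterium_1": ("Bacteria", True),
    "bacterium_2": ("Bacteria", False),
    "archaeon_1": ("Archaea", True),
    "eukaryote_1": ("Eukarya", True),
    "eukaryote_2": ("Eukarya", False),
}


def infer_ancient_and_losses(table):
    """Return (is_ancient, losses) for one gene family.

    is_ancient is True when at least one species in every domain encodes
    a homolog. Only in that case are the species lacking a homolog
    reported as inferred gene losses.
    """
    present_in_domain = defaultdict(bool)
    for domain, has_gene in table.values():
        present_in_domain[domain] = present_in_domain[domain] or has_gene

    is_ancient = all(present_in_domain[d] for d in DOMAINS)
    if not is_ancient:
        return False, []
    losses = sorted(sp for sp, (_, has_gene) in table.items() if not has_gene)
    return True, losses


is_ancient, losses = infer_ancient_and_losses(TOY_TABLE)
print(is_ancient, losses)
```

A rule this simple would misread a recent lateral transfer as ancient origin, which is why the text combines presence/absence patterns with phylogenetic trees before drawing conclusions.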

3.1.3. DNA ligation

DNA ligation (the process of joining together two separate DNA strands) is required for replication, recombination and all forms of excision repair, and when used to repair DNA strand breaks, is a form of direct repair. DNA ligation is usually performed by a single ligase enzyme, although accessory proteins frequently aid in the process. The ligases that are used for DNA repair can be divided into two apparently unrelated families. Ligase-Is, which have been found and characterized in many bacterial species (e.g., DnlJ of E. coli), are all NAD-dependent [50]. Ligase-IIs, which have been found in many viruses, Archaea and eukaryotes [51], are all ATP-dependent. Multiple Ligase-IIs with similar but not completely overlapping functions have been found in many eukaryotes.

Our comparative analysis shows that, of the species analyzed here, all bacteria and only bacteria encode a Ligase-I, some bacteria encode a Ligase-II, all Archaea encode a Ligase-II, and all eukaryotes encode multiple Ligase-IIs. We therefore conclude that Ligase-Is originated early in bacterial evolution. We also conclude that Ligase-IIs originated in a common ancestor of Archaea and eukaryotes, and that subsequently there were duplications in eukaryotes and lateral transfers to some bacteria and viruses. Although functional information is not available for any of the bacterial Ligase-IIs, given the functional conservation among members of this gene family in eukaryotes and Archaea, we suggest that they act as ligases. Perhaps they provide an alternative type of ligase function to the universal bacterial Ligase-Is that are found in these species. Since all species encode a homolog of one of the two ligase families and since all characterized members of these gene families are ligases, it is likely that all species have ligation activity.

3.2. MMR

The ability to recognize and repair mismatches in DNA has been well documented in many species. Since mismatches can be generated in many ways, processes that repair mismatches have many functions, including the repair of some types of DNA damage, the regulation of recombination, and, perhaps most importantly, the prevention of mutations due to replication errors [52]. Mismatches can be repaired by three main mechanisms: by BER glycosylases which recognize specific mismatches (discussed in Section 3.5); by a general mismatch excision repair process (referred to here as mismatch repair or MMR) that can repair many types of mismatches; and by a variant of the general MMR process that uses endonucleases specific for certain mismatches as well as many of the proteins involved in general MMR.

MMR has been found in a wide diversity of species and has been best characterized in E. coli, in which it works in the following way [52]. First, the MutS protein binds to a mismatch or a small unpaired loop and, with the cooperation of MutL, the region is targeted for excision repair. The newly replicated strand (and thus the strand containing the replication error) is targeted for repair by the fact that it will not yet have been methylated by the Dam protein. This lack of methylation makes the newly replicated strand the target of the MutH endonuclease which, when activated by the MutS–MutL complex, cuts the unmethylated strand at GATC sites near the mismatch. Various exonucleases and the UvrD helicase complete the excision of the target strand and a very large repair patch is resynthesized using the intact strand as a template.
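The strand-targeting step described above depends on locating GATC sites near the mismatch. As a rough illustration, the sketch below scans a sequence for GATC and reports the site closest to a given mismatch position. The sequence, the function names, and the "closest site" rule are assumptions made for this example; methylation state and the actual distance limits of MutH activation are not modeled.

```python
def find_gatc_sites(seq):
    """Return the 0-based start position of every GATC in seq."""
    sites = []
    start = seq.find("GATC")
    while start != -1:
        sites.append(start)
        start = seq.find("GATC", start + 1)
    return sites


def nearest_gatc(seq, mismatch_pos):
    """Return the GATC start position closest to mismatch_pos, or None."""
    sites = find_gatc_sites(seq)
    if not sites:
        return None
    return min(sites, key=lambda site: abs(site - mismatch_pos))


# Hypothetical sequence of the newly replicated strand, written 5'->3'.
toy_seq = "TTGATCAAACCCGGGTTTAGATCAA"
print(find_gatc_sites(toy_seq))
print(nearest_gatc(toy_seq, 8), nearest_gatc(toy_seq, 15))
```

In the cell the choice of strand comes from the absence of Dam methylation, not from the sequence alone, so a fuller model would carry a methylation flag for each GATC on each strand.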

While the overall scheme of general MMR is similar between species, not all details are identical [5,53,54]. For example, while all species exhibit strand specificity, the mechanism of strand recognition is different between species. In addition, there are many differences in the post-cleavage steps between species. However, there is a conserved core of general MMR: homologs of the E. coli MutS and MutL proteins are absolutely required for MMR in all species [5,53]. MutS (and its homologs) are always responsible for the recognition step and MutL (and its homologs) have an as of yet poorly characterized structural role. Some of the unusual features of MutS and MutL homologs in different species are what led us to explore phylogenomic methods [21]. For example, eukaryotes encode multiple functionally distinct homologs of MutS and MutL, only some of which participate in MMR. In addition, some species encode MutS homologs but not MutL homologs. We showed that this is explained by the finding that the MutS family is composed of two major subfamilies (MutS1 and MutS2) and only those proteins in the MutS1 subfamily are involved in MMR. This functional information is supported by the finding that all species either encode homologs of both MutS1 and MutL or neither. Thus, in the species with only a MutS homolog and no MutL, the MutS is always a MutS2.
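The correlated presence/absence argument (species encode both MutS1 and MutL, or neither, while MutS2 is uninformative for MMR) can be written as a simple check on a per-species gene inventory. The inventories below are hypothetical placeholders, and the category labels are introduced here only for illustration.

```python
def classify_mmr(profile):
    """Classify each species from its set of gene-family names.

    A species is predicted to have general MMR only if it encodes both
    MutS1 and MutL. MutS2 is ignored because, per the text, it does not
    function in MMR. A species with exactly one of the two genes would
    break the reported correlation and is flagged for inspection.
    """
    result = {}
    for species, genes in profile.items():
        has_muts1 = "MutS1" in genes
        has_mutl = "MutL" in genes
        if has_muts1 and has_mutl:
            result[species] = "MMR predicted"
        elif not has_muts1 and not has_mutl:
            result[species] = "no MMR predicted"
        else:
            result[species] = "inconsistent"
    return result


# Hypothetical inventories, not taken from Table 4.
toy_profile = {
    "species_A": {"MutS1", "MutL", "MutS2"},
    "species_B": {"MutS2"},
    "species_C": set(),
}
print(classify_mmr(toy_profile))
```

Treating every MutS homolog as equivalent would wrongly place species_B among the MMR-proficient species, which is the pitfall the subfamily distinction avoids.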

The origins of the MutS1 and MutL proteins are difficult to determine with certainty from the currently available data. In particular, the presence of homologs of these two genes in bacteria and eukaryotes but not Archaea needs to be explained. Some evidence suggests that MutS1 and MutL were present in the last common ancestor of all species. The arguments for why MutS1 is likely ancient have been previously discussed [21]. We propose that MutL is also ancient because we have isolated a clone of a portion of a MutL homolog from the archaeon Haloferax volcanii [19]. Thus, we conclude that the absence of MutS1 and MutL from the Archaea analyzed here is due to gene loss. An alternative theory suggests that all eukaryotic MutS homologs were transferred to the nucleus from the mitochondrial genome [38]. We do not believe this is correct because the eukaryotic MutS and MutL homologs do not all branch in evolutionary trees close to the R. prowazekii MutS and MutL homologs. However, the eukaryotic MSH1 genes do branch in MutS family trees next to the R. prowazekii MutS homolog. Thus, we propose that the MSH1 genes were transferred from the mitochondrial genome, which is consistent with experiments that show that these genes function in mitochondrial MMR. Interestingly, there is a MutS homolog encoded by the mitochondrial genome of a coral species but this is a gene in the MutS2 subfamily and likely does not function in MMR.

Whether or not MutS1 and MutL are ancient, since homologs of these genes are found in most bacteria, we conclude that the ancestor of all bacteria encoded MutS1 and MutL homologs. Thus, we infer that the absence of these genes from some species is due to gene loss. Tracing gene loss events shows that loss of MutS1 and MutL has occurred many times in the history of bacteria including in the mycoplasmal lineage (they are absent from the mycoplasmas but present in other low-GC Gram-positive species), the ε-Proteobacterial lineage (they are absent from C. jejuni and He. pylori but present in other Proteobacteria), and the Mycob. tuberculosis lineage [21]. Since the function of MutS1 and MutL homologs is highly conserved, we conclude that species with homologs of these likely have MMR. Since no other proteins are known to perform general MMR we conclude that species without homologs of these genes (He. pylori, C. jejuni, Mycob. tuberculosis, the two mycoplasmas, and the Archaea) do not have MMR.

That there have been multiple parallel losses of the MutL and MutS1 genes suggests either that these genes are particularly unstable and are easily lost, or that there is some advantage to the loss of these genes. We believe that the latter explanation is more likely and that MMR genes might have been lost to increase the mutation rate. Such an increased mutation rate should allow a species/strain to more readily evolve in response to unstable changing environments [8,10,55,56]. In particular, absence of MMR would result in a very high mutation rate in microsatellite sequences, which in turn could contribute to generating diversity in antigen proteins of these species [57]. In addition, since MMR plays a role in other processes, such as the regulation of interspecies recombination, differences in MMR could also affect these processes [58].

The limited distribution of MutH homologs supports experimental evidence that only close relatives of E. coli use methyl-directed strand recognition. Interestingly, MutH is closely related to the restriction enzymes Sau3AI from Staphylococcus aureus [59] and LlaKR2I from Lactococcus lactis [60]. We propose that the mutH methylation-based system evolved from a restriction modification system. This suggests that other species may have co-opted separate restriction systems for strand recognition. This may explain why many species encode a Dam homolog but not a MutH homolog. In addition it may also explain the interaction of a methyl CpG binding endonuclease (MED1) with the MutL homolog MLH1 in humans [61].

Interestingly, the Vsr mismatch endonuclease that is involved in specific mismatch repair of GT mismatches also has many functional and structural similarities to restriction enzymes [62]. As with MutH, the Vsr system also appears to be of recent origin in the Proteobacterial lineage.

3.3. NER

NER is a generalized repair process that allows cells to remove many types of bulky DNA lesions [63,64]. The overall scheme of NER, which is highly conserved between species, works in the following way: recognition of DNA damage; cleavage of the strand containing the damage (usually on both the 5′ and 3′ sides of the lesion); removal of an oligonucleotide containing the damage; resynthesis of a repair patch to fill the gap; and ligation to the contiguous strand at the end of the gap. Since the biochemical details of NER are quite different between bacteria and eukaryotes, we have divided the analysis into multiple sections, first summarizing NER studies in bacteria and eukaryotes, then comparing the origins of the eukaryotic and bacterial system, and finally analyzing what this analysis suggests about NER in Archaea.

3.3.1. Bacterial NER: UvrABCD pathway

NER in bacteria has been best characterized in E. coli, in which it works in the following way [65,66]. First, a homodimer of UvrA recognizes the putative lesion and recruits UvrB to aid in the verification that a lesion exists. UvrA leaves the site and UvrB then recruits UvrC, revealing a cryptic endonuclease activity to produce dual incisions 12–13 nucleotides apart bracketing the lesion. The UvrD helicase, in concert with DNA polymerase I, removes the damaged oligonucleotide, and a repair patch is synthesized by pol I that is then sealed into place by DNA ligase. An accessory protein, Mfd, is involved in targeting NER to the transcribed strand of actively transcribing genes, a subpathway known as transcription coupled repair (TCR) [67,68]. Homologs of UvrABCD are required for NER in all bacterial species studied. Although Mfd homologs have been found in many species, other than E. coli the function has only been studied in Bac. subtilis. As in E. coli, in Bac. subtilis, Mfd is involved in TCR. However, the Bac. subtilis gene may also be involved in recombination [69,70].
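The excision and resynthesis steps described above can be mimicked on strings: cut on both sides of the lesion, discard the damaged oligonucleotide, and copy the gap from the intact strand. The sketch below is a toy model only. The incision offsets (8 positions on the 5′ side and 4 on the 3′ side, which together remove 13 nucleotides) are an assumption chosen to fall within the 12–13 nucleotide spacing mentioned in the text, and the sequence and function names are hypothetical.

```python
# Watson-Crick pairing used to copy the undamaged strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Complement a strand position by position (no reversal)."""
    return "".join(COMPLEMENT[base] for base in strand)


def excise_and_resynthesize(damaged, template, lesion_pos, upstream=8, downstream=4):
    """Remove an oligonucleotide around the lesion and refill the gap.

    damaged    : damaged strand, with the lesion marked as 'X'
    template   : intact partner strand, aligned index by index with damaged
    lesion_pos : 0-based index of the lesion in damaged
    upstream / downstream : assumed incision offsets (not from the text)
    Returns (repaired_strand, number_of_nucleotides_replaced).
    """
    start = max(0, lesion_pos - upstream)
    end = min(len(damaged), lesion_pos + downstream + 1)
    patch = complement(template[start:end])  # repair synthesis
    repaired = damaged[:start] + patch + damaged[end:]  # ligation
    return repaired, end - start


intact = "ATGCGTACGTTAGCCTAGGCTAAC"  # hypothetical undamaged sequence
template = complement(intact)
damaged = intact[:10] + "X" + intact[11:]  # lesion at index 10
repaired, excised_length = excise_and_resynthesize(damaged, template, lesion_pos=10)
print(repaired == intact, excised_length)
```

Because the patch is copied from the partner strand, the original sequence is restored without any knowledge of what the lesion was, which is why this scheme handles so many lesion types.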


3.3.2. Eukaryotic NER: XP pathways

NER in eukaryotes has been most thoroughly studied in yeast and humans [63,71]. In humans, multiple proteins are involved in the initial damage recognition steps, including XPA, RPA, XPE and XPC. The helicase activities are provided by those of XPB and XPD in the basal transcription factor TFIIH, which serves dual functions in transcription and NER. In NER, TFIIH forms a bubble to enable separate flap endonucleases XPG and an XPF-ERCC1 heterodimer to produce incisions 3′ and 5′ of the lesion, respectively, about 30 nucleotides apart. Repair replication is then carried out by the same proteins required for genomic replication, including RPA, RFC, PCNA and DNA polymerase δ/ε.

Although NER is highly conserved among eukaryotes (the names of yeast homologs of the human proteins are given in Table 4), some major differences exist among eukaryotes in targeting NER to particular parts of the genome. For example, the CSA protein in humans is involved in TCR but its putative ortholog in yeast is not. Similarly, XPC in humans is required for global genome repair (GGR) but Rad4, the XPC ortholog in yeast, is not. Instead, in yeast Rad7 and Rad16 are required for GGR but orthologs of these have not yet been found in humans. There are even more subtle differences in targeting lesions between humans and rodents. In particular, humans and rodents are nearly identical in the repair of 6–4 photoproducts but rodents do not carry out efficient global repair of CPDs as well as humans, evidently because they lack inducible up-regulation of NER.

3.3.3. Comparison of bacterial and eukaryotic NER

One major difference between the bacterial and eukaryotic NER systems is that many more proteins are needed to carry out each step in eukaryotic compared to bacterial NER. However, even more striking is that, despite the overall similarity of biochemical mechanism of each of the steps, the bacterial and eukaryotic NER systems appear to be of completely separate origins. For example, UvrA has no homology to any of the damage recognition proteins of eukaryotes. Similarly, the early initiation steps in eukaryotes require many proteins yet none of these share a direct common ancestry with any of the bacterial NER proteins. In some cases, the eukaryotic and bacterial NER systems have separately recruited similar proteins for particular functions. For example, UvrC and Ercc1 have similar activities and share a similar motif, but are probably not homologs. In the early initiation steps eukaryotes use the 3′–5′ and 5′–3′ helicases encoded by XPB and XPD, respectively, while bacteria use the distantly related helicase UvrB to carry out the analogous activity. In addition, for TCR, eukaryotes and bacteria each use proteins in the helicase family that are not helicases, but these proteins (CSB and MFD) are not particularly closely related to each other [20].

3.3.4. Origins of bacterial NER

Our comparative analysis shows that orthologs of the UvrABCD proteins are found in all the bacterial species analyzed (Table 4). Therefore, we infer that these genes were present in the common ancestor of all bacteria. Surprisingly, orthologs of UvrA, UvrB, UvrC and UvrD are also found in the archaeon M. thermoautotrophicum. Since the genes for UvrA, UvrB and UvrC are located next to each other in the M. thermoautotrophicum genome, Aravind et al. [38] suggest that these were transferred to M. thermoautotrophicum as a single unit. However, as of yet these genes have not been found together in any bacterial species, so we believe the alternative remains possible: that UvrA, UvrB, UvrC and UvrD were present in a common ancestor of bacteria and Archaea and were then lost in some Archaeal lineages. Orthologs of Mfd are found in all bacteria except the mycoplasmas and Aq. aeolicus. Therefore, Mfd likely originated near the beginning of bacterial evolution and was then lost from the mycoplasmal and Aq. aeolicus lineages. Since the functions of UvrA, UvrB, UvrC and UvrD are conserved in many bacteria, we conclude that all the bacteria analyzed here as well as the bacterial common ancestor can/could perform NER in much the same way as does E. coli. Since Mfd is absolutely required for TCR in E. coli and Bac. subtilis it is likely that the species without Mfd cannot perform TCR.

The specific origins of these proteins help in understanding the origins of the bacterial NER process. UvrA is a member of the ABC transporter family of proteins [72]. All proteins in this family for which functions are known (other than UvrA) are involved in transport across membranes [73], although it is important to note that for transport an additional membrane spanning domain is required. Based on this relationship to ABC transporters, we propose that bacterial NER evolved from a system that transported toxins out of the cell (a function that many of the ABC transporters such as the MDR proteins still have). We further propose that bacterial NER may still have a transport function: transporting DNA damage-containing oligonucleotides out of the cell. Evidence for this includes that NER is associated with the bacterial membrane [74], that some UvrA homologs are possibly involved in transporting DNA damaging antibiotics out of the cell [75], and that some species are known to export DNA repair products out of the cell [14]. The origins of UvrB are also revealing. UvrB is a member of the helicase superfamily of proteins, most closely related to Mfd and RecG. The relationship of UvrB and Mfd is of particular interest since both interact with UvrA. Maybe the original NER system only used one protein to interact with UvrA and a gene duplication event allowed Mfd and UvrB to diverge in function. Our analysis shows that UvrD and UvrC are also part of large multigene families and thus arose by gene duplication as well. UvrD is also a member of the helicase superfamily and is part of a subfamily that includes the RecB, rep and helicase IV proteins of bacteria and RadH from yeast. UvrC likely shares a common ancestry with homing endonucleases from mitochondrial introns and with a family of uncharacterized proteins found in many bacteria (see also Ref. [38]). Thus, all the proteins involved in bacterial NER originated by gene duplication events rather than by invention of new proteins.

3.3.5. Origins of eukaryotic NER

Our comparative analysis reveals that most of the proteins involved in eukaryotic NER are only found in eukaryotes and thus likely evolved during eukaryotic history. However, some bacteria do encode likely orthologs of some of the eukaryotic NER proteins. For example, homologs of Rad25 are found in two bacterial species, homologs of CSB are found in many bacteria, and many bacteria encode a protein DinG that is probably an ortholog of XPD. The functions of these proteins in bacteria are unknown. In addition, homologs of these and some other genes are found in Archaea (see below).

3.3.6. Archaeal NER

While NER has been studied in detail in bacteria and eukaryotes, there have been only very limited studies in Archaea [19,76]. Our comparative genomic analysis sheds little light on NER in Archaea. XPF/Rad1 and XPG/Rad2 homologs are found in all Archaea, suggesting that these genes originated in a common ancestor of Archaea and eukaryotes. However, the functions of these genes in Archaea are hard to predict and there are a few reasons to think that they may not function in NER. First, XPF works in concert with Ercc1 in NER, but no Ercc1 homologs are found in any of the Archaea. In addition, some of the eukaryotic XPF homologs do not function only in NER. For example, the XPF homolog in yeast (RAD1) also functions in recombination. Different functions for the Archaeal XPF homologs are also suggested by the fact that the Archaeal XPF homologs have likely functional helicase motifs while the eukaryotic genes have degenerate helicase motifs [77]. It is also not possible to predict conclusively the functions of the Archaeal XPG homologs since they are not much more similar to XPG than to other members of the FEN1 family with different functions [78]. The Archaeal XPB/Rad25 and CSB/Rad26 homologs are likely not involved in NER either. Interestingly, the one archaeon in which a NER-like process has been characterized in vitro [79] is the one that encodes UvrABCD orthologs. With the separate origin of the bacterial and eukaryotic NER systems, we believe it is likely that the Archaea without UvrABCD homologs have an Archaeal-specific NER system made up in part of genes yet to be characterized.

3.4. Alternative excision repair

A novel mechanism for the initiation of excision repair of UV-induced photoproducts has been reported in Neurospora crassa and in Schizosaccharomyces pombe. In this process, the UV dimer endonuclease protein (UVDE) introduces an incision immediately 5′ of the lesion [80–82]. Following incision, the subsequent steps of repair are thought to occur just as in the "normal" NER described above, although the specific details are not known. Homologs of UVDE are also found in the bacteria Bac. subtilis and D. radiodurans. In D. radiodurans, the UVDE homolog likely corresponds to the UV endonuclease b, a UV damage specific endonuclease active in uvrA mutants. Since the bacterial NER system is so different from the eukaryotic system (see above), it would be interesting to see if the UVDE homologs in bacteria also work in conjunction with NER. There is some recent evidence that the UVDE homologs of some species may also work on AP sites [83].

3.5. BER

In BER, damaged or altered bases are detached from the DNA backbone by DNA glycosylases that cleave the glycosylic bond [84]. Subsequently, the backbone of the DNA is incised by an abasic-site endonuclease, the sugar is removed, and a repair patch of a single or a few nucleotides is synthesized using the base opposite the excised base as a template. In this section, we discuss the evolution of different DNA base glycosylases.

3.5.1. Uracil DNA glycosylases (UDG or UNG)

Uracil can appear in DNA via two routes: incorporation during replication and spontaneous deamination of cytosine. While the incorporation during replication can be limited by controlling the dUTP pools (such as with dUTPase), the deamination of cytosine is spontaneous and cannot be readily controlled. This deamination is potentially mutagenic because replication will lead to an adenine being incorporated opposite the uracil, rather than the guanine that should have been incorporated opposite the cytosine. A variety of proteins have been found to act as uracil DNA glycosylases, including homologs of the E. coli Ung protein, glyceraldehyde-3-phosphate dehydrogenase [85], a cyclin-like protein [86], and the MUG protein (see below). We focus here on homologs of Ung since these apparently provide the major uracil DNA glycosylase activity for most species [87]. Ung homologs have been characterized in many bacterial and eukaryotic species, as well as in many viruses (mostly herpes-related viruses), and these proteins have strikingly similar structures and functions.
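The mutagenic consequence of cytosine deamination described above can be traced in a small Python sketch. This is a toy model and not part of the original analysis: the sequence is arbitrary, the function names `deaminate` and `copy_strand` are hypothetical, and strands are written position-aligned (strand polarity is ignored) to keep the bookkeeping simple.

```python
# Toy model of an unrepaired C -> U deamination followed by two rounds of copying.
# Assumption: position i of the copy pairs with position i of the template
# (polarity ignored); uracil pairs with adenine, as thymine does.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def deaminate(strand: str, pos: int) -> str:
    """Replace the cytosine at `pos` with uracil."""
    if strand[pos] != "C":
        raise ValueError("this sketch only deaminates cytosine")
    return strand[:pos] + "U" + strand[pos + 1:]


def copy_strand(template: str) -> str:
    """Return the strand synthesized opposite `template`, position by position."""
    return "".join(PAIR[base] for base in template)


original = "ATGCCA"                    # arbitrary example sequence
damaged = deaminate(original, 3)       # "ATGUCA"
first_copy = copy_strand(damaged)      # adenine is placed opposite the uracil
second_copy = copy_strand(first_copy)  # the adenine now templates a thymine

print(first_copy)   # TACAGT (an undamaged template would give TACGGT)
print(second_copy)  # ATGTCA: the original C at position 3 has become T
```

If the uracil is not excised before replication, the original G:C pair ends up as an A:T pair, which is why uracil glycosylase activity matters for mutation avoidance.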

Our comparative analysis shows that Ung homologs are found in eukaryotes and many bacteria, but not in any of the Archaea analyzed. Since Ung is found in a wide diversity of bacteria, we conclude that the bacterial ancestor encoded a Ung homolog and that the absence of Ung homologs from some bacteria (T. pallidum, Synechocystis sp., R. prowazekii, and Aq. aeolicus) is due to gene loss. Our phylogenetic analysis suggests that the animal Ung homologs were transferred from the mitochondrial genome (they branch within the Proteobacterial Ung homologs). The possibility of a mitochondrial transfer is supported by the finding that an alternatively spliced form of the human Ung functions in the mitochondria [88]. However, since no Ung sequence is yet available from the α-Proteobacteria, which are thought to be the closest living relatives of the mitochondria, we cannot conclusively resolve the origin of the eukaryotic Ung genes. Due to the high degree of functional conservation among characterized Ung homologs, it is likely that the species with Ung homologs have uracil glycosylase activity. However, the absence of an Ung homolog should not be used to imply the absence of uracil glycosylase activity, because many other proteins have some uracil glycosylase activity. The absence of uracil glycosylase activity would be particularly surprising in thermophiles like Aq. aeolicus and the Archaea since the deamination of cytosine increases with increasing temperature. One possibility is that these species have a novel means of preventing or limiting deamination. More likely, however, these species have an alternative protein that acts as a uracil DNA glycosylase. Although these species do not encode a MUG homolog (see below), some do encode a novel G:U glycosylase that was originally described in Th. maritima [89]. This enzyme may explain the uracil glycosylase activity found in many thermophiles [90].

3.5.2. G:U and G:T mismatch glycosylase (MUG)

The first protein in this family to be characterized was the thymine DNA glycosylase (TDG) of humans [91,92]. This protein was originally shown to cleave the glycosylic bond of thymine from G–T mismatches but was subsequently found to also cleave the uracil from G–U mismatches. Subsequently, homologs of this protein were found in other mammals as well as some bacteria. The E. coli protein is called the mismatch-specific uracil DNA glycosylase (MUG) although it works on both G–U and G–T mismatches like the human protein. These proteins are likely used for the repair of deamination of cytosine and methyl-cytosine, which will lead to G–U and G–T mismatches, respectively. Since these proteins can cleave uracil from DNA, they can be confused with uracil DNA glycosylases. Until more sequences are available for these genes, the evolutionary history of this gene family cannot be determined accurately.

3.5.3. MutY–Nth family

The MutY and Nth proteins of E. coli are both DNA glycosylases and, although they are homologs of each other, they have quite different substrate specificity and cellular functions [47,93]. MutY cleaves the glycosylic bond of adenine from G:A, C:A, 8-oxo-G:A or 8-oxo-A:A base pairs [94]. Its primary role is protection against mutations due to oxidative damage of guanine [95]. Nth has a very broad specificity and excises a variety of damaged pyrimidines. Homologs of MutY and Nth have been cloned from many species and all that have been characterized are DNA glycosylases. Some of these are clearly MutY-like or Nth-like in sequence and function (e.g., the MutY [96,97] and Nth [98] of mammals). However, many have quite different specificity than the E. coli proteins, including the pyrimidine dimer glycosylase of Micrococcus luteus [99], the yeast NTG1 and NTG2 (that excise similar substrates to the E. coli Nth as well as ring-opened purines, the formamidopyrimidines (FAPY)), the GT mismatch repair enzyme of the Archaea M. thermoformicum [100], and a methyl-purine glycosylase from Th. maritima [101].

Our comparative analysis shows that all species except the two mycoplasmal species encode at least one member of the MutY–Nth gene family. We attempted unsuccessfully to use phylogenetic analysis to divide this gene family into subfamilies of orthologs. Some proteins are clearly more related to MutY or to Nth than others are, but there is no obvious, well-supported subdivision. Therefore, we list the MutY–Nth gene family together without attempting to distinguish orthologs of these two proteins. Since this family is so widespread, we conclude that it is ancient, and thus that the last common ancestor encoded at least one MutY–Nth-like protein. Thus, the absence of a MutY–Nth-like gene from the mycoplasmas is likely due to gene loss. However, since our phylogenetic analysis was ambiguous and since the activity is not conserved among these proteins, we cannot infer any activity other than a broad "glycosylase" activity for the ancestral protein. For similar reasons we also cannot reliably predict the functions of any of the MutY–Nth family members for which functions are not known. The MutY–Nth family is distantly related to the Ogg and AlkA glycosylases (see below). Thus, all three of these gene families likely descended from a single ancestral glycosylase gene. Since some species encode three or four members of this gene family, there must have been some more recent duplications in this gene family.

3.5.4. Fpg–Nei family

The Fpg protein in E. coli (also known as MutM) excises damaged purines (including 8-oxo-G and FAPY) from DNA [102]. Its primary function is the protection against mutation due to oxidative DNA damage [103]. Homologs of Fpg have been isolated from a variety of bacterial species and all that have been characterized have functions similar to that of the E. coli protein [104,105]. Somewhat surprisingly, when the Nei protein was cloned, it was found to be a homolog of Fpg [106,107]. Nei is a glycosylase that excises thymine glycol and dihydrothymine. Thus, the Nei–Fpg family has a great deal of functional diversity, while exhibiting a common theme of the repair of DNA damage due to reactive oxygen species.

Our comparative analysis shows that although members of the Fpg–Nei family are found in many bacterial species, they are not found in Archaea and the only gene found in eukaryotes (that of A. thaliana) is likely derived from the chloroplast genome [108]. Therefore, this family is of bacterial origin. Our phylogenetic analysis of the members of this family has allowed us to divide it into clear Fpg and Nei orthologous groups (therefore, they are listed separately in Table 4). Of the proteins in the family, most are orthologs of Fpg. The distribution of Fpg orthologs suggests that Fpg was present in the ancestor of most bacteria. Therefore, the absence of Fpg from some species (He. pylori, the spirochetes and Aq. aeolicus) is likely due to gene loss. Since Fpg proteins have similar activities between species, the presence of an Fpg homolog likely indicates the presence of FAPY- and 8-oxo-G glycosylase activity. The origin of Nei is somewhat less clear. Only one species other than E. coli (Mycob. tuberculosis) has been found to encode a likely ortholog of Nei. We do not find evidence for a Nei ortholog in cyanobacteria as found by Aravind et al. [38]. It is not possible to determine if there was a lateral transfer between these two lineages or if there was a gene duplication in the common ancestor and subsequent gene loss of Nei from many species.

3.5.5. Ogg1 and 2

The Ogg1 and Ogg2 proteins of yeast are homologous and both act as 8-oxo-G glycosylases [109]. Ogg1 excises 8-oxo-G if it is opposite cytosine or thymine and Ogg2 if opposite guanine or adenine. Although these proteins have similar substrate specificity to Fpg proteins, and both are β-lyases like Fpg, they are not homologs of Fpg despite initial reports. As mentioned above, they may be distantly related to the MutY–Nth family and to AlkA. Homologs of Ogg1 and Ogg2 have been cloned from humans [110–112]. Some isoforms of these function in the nucleus and others in the mitochondria [113]. Our comparative analysis reveals that a homolog of Ogg1 is present in M. thermoautotrophicum, but not in the other Archaea or any bacteria analyzed here. Aravind et al. [38] suggest that Ogg orthologs are found in some bacterial species, but we cannot find evidence for this. We conclude that an Ogg homolog was present in the eukaryotic common ancestor. It is not possible to determine if M. thermoautotrophicum obtained its Ogg protein by lateral transfer, or if Ogg originated prior to the divergence of Archaeal and eukaryotic ancestors and then was subsequently lost from some Archaeal lineages.

3.5.6. Alkylation glycosylases

Alkylation glycosylases can be divided into three gene families [47,114,115]. One includes AlkA of E. coli (also known as TagII) and MAG of yeast. AlkA can excise many alkyl-base lesions (e.g., 3-me-A, 3-me-G, 7-me-G and 7-me-A), and a variety of other damaged bases including hypoxanthine. The AlkA homolog in yeast, MAG, has a similar broad specificity. A second family includes TagI of E. coli and its homologs in other bacteria. TagI is highly specific for 3-methyl-adenine (3-me-A), although it can also remove 3-methyl-guanine (3-me-G), but with much lower efficiency. The third family includes the MPG proteins of mammals that, like AlkA and MAG, have a broad specificity.

Our comparative analysis shows that each of the three alkylation glycosylase families has an uneven distribution pattern. TagI is only found in a limited number of species and thus likely evolved within bacteria. AlkA homologs are found in many species of bacteria, Archaea and eukaryotes. Thus, we conclude that the AlkA family is ancient and that its absence from some species is due to gene loss. MPG homologs are found in many eukaryotes (including many species not listed in Table 4) and some bacteria. The origins of the MPG family are not clear.

Since the functions of homologs of each of the three alkylation glycosylase families are highly conserved between species, we conclude that the presence of one of these genes indicates the likely presence of alkylation glycosylase activity. It is likely that those species with AlkA or MPG homologs can repair many different types of alkylation damage. Many species (the mycoplasmas, the spirochetes, Aq. aeolicus and Meth. jannaschii) do not encode a homolog of any of these glycosylases. Given that alkylation glycosylases have apparently evolved separately many times, it is possible that these species have novel alkylation glycosylases. However, since alkylation damage can be repaired by other pathways (e.g., NER and alkyltransferases), these species may still be adequately protected from alkylation damage without an alkylation glycosylase.

3.5.7. T4 endonuclease V

The DENV protein (also known as endonuclease V) of T4 phage is a glycosylase that acts specifically on UV irradiation-induced CPDs. This protein may serve as a back-up system for the host's NER enzymes (it can functionally complement mutants in bacteria or eukaryotes with deficiencies in the early steps in NER). Homologs of DENV have been cloned in a paramecium virus and phage RB70 [116], but the activities of these are not known. DENV homologs are not present in any of the complete genome sequences.

3.6. AP endonucleases (abasic site endonucleases)

AP endonucleases, which cleave the DNA backbone at sites at which bases are missing, are required for BER and for the repair of base loss [117]. There are two distinct families of AP endonucleases. One includes the Xth protein of E. coli, RRP1 of Drosophila melanogaster, and the APE1/BAP1/HAP1 proteins of mammals [118]. The other includes the Nfo protein of E. coli and the APN1 protein of yeast. Some other proteins can serve as AP endonucleases, but usually these activities are part of base glycosylases (e.g., Nth and DENV) that do not function as AP endonucleases on their own. In addition, Xth is distantly related to the p150 proteins of LINE elements, although it is not clear if these proteins have similar activities.

Our comparative analysis shows that members of the Xth/APE1 family are found in most species. The Nfo/APN1 family has a more limited distribution, although representatives are found in all domains of life, suggesting that these proteins are also ancient. Since both gene families are likely ancient, the absence of either gene from a particular species is likely due to gene loss. Interestingly, although each gene has been lost many times in different lineages, all species encode a homolog of one of the two AP endonucleases. Thus, the loss of one of the two is tolerable, but loss of both is not. Since all characterized members of these gene families function as AP endonucleases, we conclude that AP endonuclease activity is universal. This is not surprising in view of the high frequency of spontaneous depurination of DNA.
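The observation that every species retains at least one of the two AP endonuclease families amounts to a simple coverage check over a presence/absence table. The Python sketch below illustrates that check under stated assumptions: the species names and gene sets are hypothetical placeholders (they are not taken from Table 4), and `species_lacking_all` is a name introduced here for illustration.

```python
from typing import Dict, Iterable, List, Set


def species_lacking_all(presence: Dict[str, Set[str]],
                        families: Iterable[str]) -> List[str]:
    """Return, sorted, the species that encode no member of any listed family."""
    wanted = set(families)
    return sorted(sp for sp, genes in presence.items() if not (genes & wanted))


# Hypothetical presence data, for illustration only.
toy_presence = {
    "species_A": {"Xth", "Nfo"},  # both families retained
    "species_B": {"Xth"},         # Nfo family lost
    "species_C": {"Nfo"},         # Xth family lost
}

# An empty list means every species keeps at least one AP endonuclease family.
print(species_lacking_all(toy_presence, ["Xth", "Nfo"]))  # []
```

A non-empty result would flag species in which both families are missing, which is the case the analysis above reports as never observed.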

3.7. Recombination and recombinational repair

Homologous recombination is required for a variety of DNA repair and repair-related activities [119–121]. Before discussing the role of homologous recombination in repair, it is useful to review some of the details of homologous recombination in general. Homologous recombination can be divided into four main steps: (a) initiation (during which the substrate for recombination is generated); (b) strand pairing and exchange; (c) branch migration; and (d) branch resolution. Different pathways within a species often differ from each other in the first step (initiation) and the last steps (migration and resolution) but use the same mechanism and proteins for the pairing and exchange step. For example, in E. coli, there are at least four pathways for the initiation of recombination: the RecBCD, RecE, RecF and SbcCD pathways. These pathways generate substrates that are used by RecA to catalyze the pairing and exchange steps. The branch migration and resolution steps are then carried out by either the RuvABC, RecG or Rus pathways.
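The modular organization described for E. coli (several initiation pathways, one shared pairing step, several migration and resolution pathways) can be written down as a small data structure. The sketch below only enumerates the formal combinations of the components named in the text. It makes no claim that every combination is used in vivo, and the variable names are illustrative.

```python
from itertools import product

# Components named in the text for E. coli.
INITIATION = ["RecBCD", "RecE", "RecF", "SbcCD"]
PAIRING_AND_EXCHANGE = ["RecA"]
MIGRATION_AND_RESOLUTION = ["RuvABC", "RecG", "Rus"]

# Every formal route: one initiation pathway, the shared RecA step,
# and one migration/resolution pathway.
routes = list(product(INITIATION, PAIRING_AND_EXCHANGE, MIGRATION_AND_RESOLUTION))

for initiation, pairing, resolution in routes:
    print(f"{initiation} -> {pairing} -> {resolution}")

print(len(routes))  # 12
```

Because the pairing step has a single entry, all twelve routes pass through RecA, which mirrors the point that pathways differ mainly in their first and last steps.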

One form of damage that can be repaired by homologous recombination is the double-strand break (DSB). DSBs can be created by many agents including reactive oxygen species, restriction enzymes and normal cellular processes like VDJ recombination. It is important to note that DSBs can also be repaired by non-homologous end joining (NHEJ) (discussed in more detail below). In E. coli and yeast, the majority of the repair of DSBs is carried out by homologous recombination pathways, although in yeast some DSBs are also repaired by NHEJ. In contrast, in humans, most of the repair of DSBs is carried out by NHEJ, although some homologous recombination-based repair is also performed.

Homologous recombination is also used in many species to repair post-replication daughter strand gaps (DSGs). When DNA is being replicated, if the polymerase encounters a DNA lesion, it has three choices: replicate the DNA anyway, and risk that the lesion might be miscoding; stop replication and wait for repair; or leave a gap in the daughter strand and continue replication a little bit further downstream. In E. coli, the choice depends on the type of lesion, but frequently gaps are left in the daughter strand. In such cases, it is no longer possible to perform excision repair on the lesion because there is no intact template to allow for the repair synthesis step. However, such gaps can be repaired by daughter-strand gap repair (DSGR) in which homologous recombination with an undamaged homologous section of DNA is used to provide a patch for the unreplicated daughter strand section [122]. Thus, although DSGR does not remove the instigating DNA damage, it is still a form of DNA repair.


Homologous recombination can be used to repair a variety of other DNA abnormalities such as interstrand cross-links. Below we discuss different pathways for homologous recombination, focusing on those known to be involved in some type of DNA repair. In Table 4, and below, the proteins are categorized by the stage in which they participate in the recombination process.

3.7.1. Initiation pathways

3.7.1.1. RecBCD pathway. The primary pathway for the initiation of homologous recombination in E. coli is the RecBCD pathway (see Ref. [123] for review). This pathway is used for the majority of chromosomal recombination (such as during Hfr mating) and for the repair of DSBs. The initiation steps for this pathway require primarily the RecB, RecC and RecD proteins, although other proteins such as PriA may also be required. Together, RecB, RecC and RecD make up an exonuclease/helicase complex that is used to assemble a substrate for RecA-mediated recombination.

Our comparative analysis shows a limited distribution of RecB, RecC and RecD orthologs (they are only found in some enterobacteria, Mycob. tuberculosis, and possibly in B. burgdorferi). Based on this, we conclude that the RecBCD pathway evolved relatively recently within bacteria. The timing of the origin of the RecBCD pathway is somewhat ambiguous. The pathway could have evolved within the Proteobacteria and Mycob. tuberculosis could have received it by lateral transfer. Alternatively, the pathway could have been present in the common ancestor of high-GC Gram-positive species and Proteobacteria, and its absence from many Proteobacteria and possibly the low-GC Gram-positive species would have to be due to gene loss.

The finding that most species either have orthologs of all three or of none of these proteins suggests that these proteins have a conserved affiliation with each other. Analysis of the individual proteins suggests that this complex may have an ancestry in other recombination and repair functions. RecB and RecD are both in the helicase superfamily of proteins and both are closely related to proteins with recombination or repair roles (RecB is related to UvrD and AddA; RecD is related to TraA, which is involved in DNA transfer in Agrobacterium tumefaciens) [124,125]. Functionally similar complexes that are composed of proteins that are not orthologs of RecBCD have been described and isolated from many bacterial species [126]. Interestingly, the proteins in some of these complexes are related to proteins in the RecBCD complex, even though they are not orthologous [38].
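The "all three or none" pattern noted for RecB, RecC and RecD can be tested mechanically once ortholog calls are available. The sketch below is a minimal illustration, assuming each genome is summarized as a set of ortholog names. The genomes shown are hypothetical placeholders rather than data from Table 4, and `complex_status` is a helper name introduced here.

```python
from collections import Counter
from typing import Iterable

RECBCD = ("RecB", "RecC", "RecD")


def complex_status(genes_present: Iterable[str],
                   subunits: Iterable[str] = RECBCD) -> str:
    """Classify a genome as encoding 'all', 'none', or 'partial' subunits."""
    wanted = set(subunits)
    found = wanted & set(genes_present)
    if found == wanted:
        return "all"
    return "partial" if found else "none"


# Hypothetical genomes, for illustration only.
toy_genomes = {
    "species_A": {"RecA", "RecB", "RecC", "RecD"},
    "species_B": {"RecA"},
    "species_C": {"RecA", "RecB"},
}

tally = Counter(complex_status(genes) for genes in toy_genomes.values())
print(dict(tally))  # {'all': 1, 'none': 1, 'partial': 1}
```

A tally dominated by "all" and "none" with few "partial" genomes is the pattern that suggests the subunits are gained and lost as a unit.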

3.7.1.2. RecF pathway: DSGR initiation in bacteria. In E. coli, the RecF pathway is responsible for most plasmid recombination, for daughter-strand gap repair, for some replication-related functions [127] and for a process known as thymineless death [128,129]. This pathway has only a limited role in "normal" homologous recombination, accounting for less than 1% of the recombination in E. coli. The proteins involved in recombination initiation in this pathway are RecF, RecJ, RecN, RecO, RecR and RecQ [121]. Not all of these proteins are required for every function of the pathway. For example, RecF, RecJ and RecQ are required for thymineless death while RecF, RecR and RecO are evidently required for replication restart functions [127]. In addition, some of the proteins in this pathway are involved in other repair pathways. For example, RecJ can be used as an exonuclease in MMR if other exonucleases are defective.

Homologs of some of the proteins in the RecF pathway have been characterized in a variety of species. RecF and RecJ homologs in many bacteria have similar functions to the E. coli proteins. RecQ homologs have been characterized in many eukaryotic species and many of these have been shown to be helicases [130,131], like the E. coli RecQ. However, their cellular functions are not well understood and it is not clear if they use their helicase activity in similar ways as the E. coli RecQ. The yeast RecQ homolog SGS1 is involved in the maintenance of chromosome stability, possibly through interaction with topoisomerases during recombination [132]. Humans encode at least three RecQ homologs and all that is known about their function is that Werner's syndrome is caused by a defect in one of these [133] and Bloom's syndrome is caused by a defect in another [134].

Our comparative analysis of proteins in the RecF pathway was somewhat limited by difficulty identifying orthologs of many of the proteins. In particular, orthologs of RecN and RecO were difficult to identify because they show only limited conservation between species. Using low stringency searches we were able to identify more distantly related homologs of RecO and RecN. This was helpful for identifying RecO orthologs but did not work well for RecN because the lower stringency searches pulled up many homologs in most species. In particular, we were unable to distinguish whether the RecN-like proteins of the mycoplasmas and B. burgdorferi were RecN orthologs or paralogs.

Our analysis shows that, in contrast to many other repair pathways, the proteins in the RecF pathway do not have strongly correlated distribution patterns between species. For example, orthologs of RecJNR are found in He. pylori while orthologs of RecFOQ are not. Eukaryotes encode orthologs of RecQ but not of any other proteins in this pathway. Thus, if other species have a RecF-like pathway, it cannot work in the same way as in E. coli. One possibility is that other species have replaced some genes in the RecF pathway by non-orthologous gene displacement (see Ref. [29] for a description of non-orthologous gene displacement). It is known that some of the functions of RecJ in E. coli can be complemented by other 5′ exonucleases such as RecD. Alternatively, it is possible that the functions of the RecF pathway are specific for E. coli and that other species do not have a similar pathway. We also note that we do not find any evidence for multiple RecR orthologs in any species as suggested by Aravind et al. [38]. It is possible that this suggestion was due to the fact that in some species RecR orthologs are referred to as RecM.

Since most of the genes in the RecF pathway are found in a wide diversity of bacteria, we conclude that they were present in the bacterial common ancestor. Since RecQ orthologs are found in eukaryotes and RecJ orthologs are found in some Archaeal species (see Table 3 and Ref. [38]), RecQ and RecJ may have originated somewhat before the origins of bacteria. It is not possible to determine the ultimate origins of many of the RecF pathway proteins, but at least three of the proteins originated by some type of gene duplication (RecQ within the Dead family of the helicase superfamily and RecN and RecF within the SMC superfamily) [135,136].

3.7.1.3. RecE pathway: alternative initiation pathway in bacteria. The RecE recombination pathway of E. coli is only activated in recB recC, sbcA mutants. This pathway requires many of the proteins in the RecF pathway, as well as two additional proteins, RecE and RecT [137–139]. These additional proteins are both encoded by a cryptic lambda phage. RecE is an exonuclease that can generate substrates for recombination either by RecT or by RecA. RecT may be able to catalyze strand invasion without RecA [140]. Our comparative analysis shows that the species distribution of these proteins is extremely limited. RecT is found in some low-GC Gram-positive bacteria. RecE homologs are not found in any species other than E. coli. The presence of these genes on a cryptic phage may reflect a recent lateral transfer between species.

Table 3
Evolutionary distribution patterns (EDPs)

| Type of pattern (a) | Description | Likely explanations | How to resolve ambiguities? |
|---|---|---|---|
| Universal | All species have the gene. | Gene is ancient and probably universally required in all species. | n/a |
| Uniform presence | Gene is in only one evolutionary lineage. | Gene originated in that lineage. | n/a |
| Uniform absence | Gene is missing from one lineage. | Gene lost in that lineage. | n/a |
| Uneven | Presence/absence scattered through tree. | Gene loss or lateral transfer. | Compare gene tree vs. species tree. |
| Multicopy | Multiple homologs in some species. | Gene duplication or lateral transfer. | Compare gene tree vs. species tree. |

(a) Determined by overlaying presence/absence of genes onto evolutionary tree of species.
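The pattern labels of Table 3 can be assigned programmatically from per-species copy counts. The Python sketch below is a deliberately simplified illustration, not the procedure used in this study: it treats lineages as flat named groups rather than clades on a real species tree, it checks for multiple copies first, it assumes the gene is present in at least one species, and all species and lineage names are hypothetical.

```python
from typing import Dict


def classify_edp(copies: Dict[str, int], lineage_of: Dict[str, str]) -> str:
    """Assign one Table 3 pattern label from per-species homolog counts."""
    if any(n > 1 for n in copies.values()):
        return "Multicopy"
    present = {sp for sp, n in copies.items() if n > 0}
    if present == set(copies):
        return "Universal"
    # Group species by lineage, then find lineages that are fully
    # gene-positive, fully gene-negative, or mixed.
    lineages: Dict[str, set] = {}
    for sp in copies:
        lineages.setdefault(lineage_of[sp], set()).add(sp)
    full = [name for name, members in lineages.items() if members <= present]
    empty = [name for name, members in lineages.items() if not (members & present)]
    mixed = len(lineages) - len(full) - len(empty)
    if mixed == 0 and len(full) == 1:
        return "Uniform presence"
    if mixed == 0 and len(empty) == 1:
        return "Uniform absence"
    return "Uneven"


# Hypothetical species grouped into three lineages.
lineage_of = {"b1": "bacteria", "b2": "bacteria",
              "a1": "archaea", "a2": "archaea",
              "e1": "eukaryotes", "e2": "eukaryotes"}

only_bacteria = {"b1": 1, "b2": 1, "a1": 0, "a2": 0, "e1": 0, "e2": 0}
scattered = {"b1": 1, "b2": 0, "a1": 1, "a2": 0, "e1": 1, "e2": 1}

print(classify_edp(only_bacteria, lineage_of))  # Uniform presence
print(classify_edp(scattered, lineage_of))      # Uneven
```

As Table 3 notes, the "Uneven" and "Multicopy" labels do not by themselves distinguish gene loss or duplication from lateral transfer; that still requires comparing the gene tree with the species tree.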

3.7.1.4. SbcBCD pathway. The SbcB, SbcC and SbcD proteins were all identified as genes that, when defective, led to the suppression of the phenotype of recBC mutants [121]. SbcB is an exonuclease and is also known as exonuclease I, exoI or Xon. When it is defective, the RecE and RecF pathways are revealed. SbcC and SbcD together make up an exonuclease that cleaves hairpin structures. The main function of the SbcCD complex is thought to be the elimination of long cruciform or palindromic sequences, which would remove sequences that may interfere with DNA replication [141]. Homologs of SbcC and SbcD have not been characterized in many bacteria. However, these proteins do share some sequence similarity to the yeast Rad50 and MRE11 and may be distant homologs of these proteins (discussed below) [142].

Our comparative analysis shows that SbcB homologs are found only in E. coli and H. influenzae and thus this protein apparently originated within the γ-Proteobacteria. Homologs of SbcC and SbcD are present in many bacteria and are always present together. Thus, the interaction of these proteins appears to have been conserved over time. Given the likely homology of these proteins to MRE11/RAD50 (which are found in eukaryotes and Archaea), we believe SbcC and SbcD are ancient proteins. Thus, the absence of these genes from some species is likely due to gene loss.
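When a gene is inferred to be ancient, its absences are attributed to loss, and the minimum number of independent losses can be counted on a species tree. The sketch below illustrates that reasoning under explicit assumptions that are not part of the original analysis: the gene was present in the root ancestor, it is never regained or laterally transferred, the tree is a nested tuple of hypothetical species names, and the function names are introduced here for illustration.

```python
def count_losses(tree, present):
    """Return (clade_has_gene, losses_within_clade) for a nested-tuple tree."""
    if isinstance(tree, str):  # a leaf is a species name
        return tree in present, 0
    results = [count_losses(child, present) for child in tree]
    if not any(has for has, _ in results):
        # The whole clade lacks the gene; the caller counts this as one loss.
        return False, 0
    losses = sum(n for _, n in results)
    # Each child clade that lacks the gene entirely is one loss event.
    losses += sum(1 for has, _ in results if not has)
    return True, losses


def minimum_losses(tree, present):
    """Minimum number of losses, assuming the root ancestor had the gene."""
    has, losses = count_losses(tree, present)
    return losses if has else 1


# Hypothetical five-species tree, for illustration only.
toy_tree = ((("sp1", "sp2"), "sp3"), ("sp4", "sp5"))

print(minimum_losses(toy_tree, {"sp1", "sp4", "sp5"}))  # 2 (sp2 and sp3)
print(minimum_losses(toy_tree, {"sp4", "sp5"}))         # 1 (ancestor of sp1-sp3)
```

A scattered absence pattern that requires many independent losses is the situation in which lateral transfer becomes a competing explanation worth testing with a gene tree.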

3.7.1.5. Rad52 pathway: DSBR in eukaryotes. The primary pathway for homologous recombination in yeast is the Rad52 pathway. This pathway is used for mitotic and meiotic recombination as well as for DSB repair. Although the exact biochemical details of this pathway are not completely worked out, it is known that the initiation step depends on three proteins, MRE11, Rad50 and XRS2, which form a distinct exonucleolytic complex [143,144]. It is believed that this complex functions to induce DSBs for mitotic and meiotic recombination and that it may alter DSBs caused by DNA damage to allow them to be repaired by homologous recombination. Genetic studies have found that these genes are also involved in NHEJ (see below). A similar exonucleolytic complex that is also involved in homologous recombination and NHEJ has been identified in humans. This complex is composed of five proteins including homologs of MRE11 and Rad50 but not XRS2 [145]. Defects in one of the other proteins in the human complex (NBS1) lead to the Nijmegen breakage syndrome [145].

Our comparative analysis shows that homologs of MRE11 and Rad50 are found in all the Archaea analyzed (although the Archaeal Rad50 homologs may not be orthologs of the eukaryotic Rad50s; our phylogenetic analysis was ambiguous). Since these genes are related to the SbcC and SbcD genes of bacteria [142], we conclude that the SbcC/Rad50 and SbcD/MRE11 proteins are ancient proteins. XRS2 appears to be of recent origin in yeast since homologs are not yet found in any other species.

3.7.1.6. Strand pairing and recombinases. The RecA protein, which catalyzes strand pairing and invasion, is required for all homologous recombination in E. coli. Comparative studies have shown that homologous recombination depends on RecA homologs in many other bacterial species as well as in Archaea (RadA) and eukaryotes (Rad51 and DMC1) [146–148]. There is some divergence of function in eukaryotes. Rad51 is the recombinase for the majority of mitotic recombination and both Rad51 and DMC1 are used for aspects of meiotic recombination.

Table 4
Presence and absence of repair gene homologs in complete genome sequences

In those cases in which a species encoded a gene for which homology to the gene of interest was ambiguous, we indicated ±. If a gene was found in any other species within bacteria, Archaea or eukaryotes, this is listed in the "ANY" column. For those genes that were part of multigene families, we used phylogenetic analysis to divide the family into subfamilies and groups of orthologs and paralogs (see Section 3). If subfamilies could be determined unambiguously, we only identify presence and absence of a homolog within the same subfamily as the search gene. If subfamilies could not be determined unambiguously, we listed the number of homologs of a particular gene (e.g., MutY–Nth). In cases of relatively recent gene duplications, presence of multiple homologs (++ for two and ++³ for more) was indicated for a few species if a limited number of species encoded multiple orthologs of a gene. If lateral transfers were identified, this is indicated in the Comments column. Additional details can be found in Section 4.
a The first step in BER involves glycosylases. See text for details on other steps. Some of these glycosylases also have AP lyase or dRPase activity.
b Functions similarly to AP-Endonuclease but biochemical activity is AP lyase (in conjunction with role in BER).
c Many exonucleases can serve this role in mismatch repair.
d RecBCD complex (ExoV) has many activities including dsDNA and ssDNA exonuclease and endonuclease, ATPase, helicase, and Chi-site recognition.

The comparative analysis of RecA and its homologs is somewhat complicated by the fact that RecA is part of a multigene family that includes the proteins mentioned above, as well as SMS in bacteria, RadB in Archaea, and Rad55 and Rad57 in eukaryotes. Our analysis shows that the proteins that act as recombinases (RecA, RadA, Rad51 and DMC1) are all orthologs of RecA and thus we focus on these here. Our comparative analysis shows that all species for which complete genomes are available encode orthologs of RecA. RecA is the only repair gene with such a universal distribution. Since all characterized RecA orthologs are recombinases, this suggests that all these species have recombinase activity. The universal presence also suggests that recombinase activity is fundamental to life. However, there have been reports of some mycoplasma species encoding defective RecA proteins (and possibly thus being defective in all homologous recombination) [149,150]. The presence of multiple functionally distinct orthologs of RecA in eukaryotes is likely due to a duplication and functional divergence early in eukaryotic evolution [148]. A few other features of RecA evolution are worth mentioning.

ŽSome bacteria also encode two RecA orthologs e.g.,.Myx. xanthus , although it is not clear if both are

w xfunctional 151 . Phage T4 also encodes a RecAortholog, UvsX, which also has recombinase activityw x152,153 . In addition, A. thaliana encodes two bac-terial-like RecAs in its nuclear genome. One ofthese, which functions in the chloroplast, likely was

w xtransferred from the chloroplast genome 146 . Per-haps the other functions in the mitochondria.

3.7.2. Branch migration and resolution

In E. coli, many pathways have been identified that can perform branch migration and/or resolution. In the RuvABC pathway, which is likely the main branch migration and resolution pathway, RuvA binds to Holliday junctions, RuvB is a helicase that catalyzes branch migration and RuvC is a resolvase. The RecG protein catalyzes branch migration and possibly Holliday junction resolution as well [154]. The RusA protein, the expression of which is normally suppressed, can also serve as a junction resolvase [155–157]. It is encoded by a defective prophage DLP12 and is similar to a protein in phage 82. Homologs of RuvABC and RecG have been found in many other bacteria and shown to function in similar ways [158]. Little is known about the proteins required for migration and resolution in eukaryotes. CCE1 appears to be involved in resolution in yeast mitochondria [159]. It has been suggested that Rad54 may be involved in branch migration in the Rad52 pathway.

Our comparative analysis reveals that RuvABC, RecG and RusA orthologs are only found in bacteria. Thus, these pathways all likely evolved within bacteria. Since RusA has a limited distribution, we conclude that it evolved recently. Since RuvABC and RecG are found in a wide diversity of bacterial species, we conclude that they evolved early in bacterial evolution. Since the functions of RuvABC and RecG are conserved across species, those species with either RuvABC or RecG likely have branch migration and resolution abilities. While RuvA and RuvB orthologs are found in all bacterial species, many of these species do not encode a RuvC ortholog. Two of these species that do not encode RuvC orthologs do encode RecG orthologs. Perhaps, as in E. coli, RecG can replace some of the functions of RuvC in these species [160]. Three of the species that do not encode a RuvC ortholog (the two Mycoplasmas and B. burgdorferi) also do not encode RecG or any other resolvase homolog. Whether these species encode alternative resolvases remains to be determined. However, since resolvase activity has apparently evolved many times, it is possible that these species have novel resolvase proteins.

The origins of each of these proteins reveal some clues to the origins of branch migration and resolution activities. RuvC is somewhat similar in structure to RNaseH1 [158] so it is possible that these proteins share a common ancestor. The branch migration activities are a little more constrained in that all of them appear to use some DNA helicase protein. However, the particular helicases used are quite different. RecG is closely related to Mfd and UvrB (see Section 3.3) while RuvB is particularly closely related to an uncharacterized group of RuvB-like proteins found in many bacterial species.

3.8. NHEJ

In mammals, most of the repair of DSBs is carried out without homologous recombination by NHEJ [161,162]. In this process, DSBs are simply restitched back together. Thus, NHEJ is in essence a form of direct repair although we discuss it here because many of the genes involved are also used in recombinational repair. Genetic studies have shown that at least four proteins (XRCC4–7) are specifically required for NHEJ in humans. Together XRCC5–7 make up the DNA-dependent protein kinase complex composed of Ku80/86 (XRCC5), Ku70 (XRCC6) and DNA-PKcs (XRCC7). This complex likely functions by binding to DNA ends and stimulating DNA ligase activity. Putative homologs of Ku70 and Ku86 have been identified in yeast and these have been found to be involved in the repair of DSBs by NHEJ [163]. In addition, as mentioned above, MRE11 and Rad50 are also involved in NHEJ in humans and yeast [143]. As discussed above, in yeast, most of the repair of DSBs is carried out by homologous recombination-based pathways [164].


Our comparative analysis shows that there are no homologs of XRCC4 or any of the three subunits of DNA-PK in Archaea or bacteria. Therefore, this pathway most likely evolved in eukaryotes [165]. Our analysis also shows that the sequence similarity between the yeast and mammalian proteins is very limited. Although it is likely that these proteins are homologous, the low level of sequence similarity suggests that they also may have many functional differences. An ortholog of DNA-PKcs is not found in yeast.

3.9. DNA replication

Most repair pathways require some DNA synthesis as part of the repair process. In some cases, specific polymerases are used only for repair. In other cases, the normal replication polymerases are used for repair synthesis. Since the evolution of polymerases has been reviewed elsewhere, it will not be discussed in detail here [166]. Obviously, all species are able to replicate their DNA in some way and thus should be able to perform repair synthesis. The specific types of polymerases used may help determine the accuracy of repair synthesis.

3.10. Inducible responses

3.10.1. LexA and the SOS system in bacteria

The SOS system in E. coli is an inducible response to a variety of cellular stresses, including DNA damage [167]. A key component of the SOS system is the LexA transcription repressor. In response to stresses such as DNA damage, the RecA protein is activated to become a coprotease and assists the autocatalytic cleavage of LexA. When LexA is cleaved, it no longer functions as a transcription repressor, and the genes that it normally represses are induced. The induction of these LexA-regulated SOS genes is a key component of the SOS system.

SOS-like processes have been documented in a wide variety of bacterial species. Those that have been characterized function like the E. coli system, with regulation of SOS genes by LexA homologs, although sometimes different sets of genes are repressed by LexA and different "SOS-boxes" are used [167]. Our comparative analysis suggests that LexA appeared near the origin of bacteria since it is found in a wide diversity of bacterial species (Table 4). Thus, the absence of LexA from some species is likely due to gene loss. Since the function of the characterized LexAs is highly conserved, we conclude that those species that encode LexA likely have an SOS system. However, since there are many ways to regulate responses to external stimuli, it is possible that the species without a LexA homolog have co-opted another type of transcription regulator to control an SOS-like response.

LexA is part of a large multigene family that includes many other proteins that are also cleaved when RecA is activated, including many transcriptional repressors of phage and the SOS-mutagenesis protein UmuD. Since UmuD is found in only a few γ-Proteobacteria, we propose that it evolved from a LexA-like protein.

Table 5

(a) DNA repair genes present in all or most bacteria

| Process | In all bacteria | In most bacteria |
|---|---|---|
| NER | UvrABCD | UvrABCD |
| Holliday junction resolution | – | RuvABC |
| Recombination | RecA | RecA; RecJ, RecG |
| Replication | PolA, C | PolA, C; PriA; SSB |
| Ligation | Ligase I | Ligase I |
| Transcription-coupled repair | – | Mfd |
| BER | – | Ung, MutY–Nth |
| AP endonuclease | – | Xth |
| Single-strand binding protein | SSB | SSB |

(b) DNA repair genes present in bacteria or eukaryotes but not both

| Process | Only in bacteria | Only in eukaryotes |
|---|---|---|
| Transcription-coupled repair | Mfd | CSB, CSA |
| Mismatch strand recognition | MutH | – |
| NER | UvrABCD^a | XPs, TFIIH, etc. |
| Recombination initiation | RecBCD, RecF | KU, DNA-PK |
| Holliday junction resolution | RuvABC | CCE1 |
| Base excision | Fpg–Nei, TagI | – |
| Inducible responses | LexA | p53 |

a Also found in some Archaea.

Table 6
Origin of DNA repair genes and pathways

| Pathway | Ancient | Evolved within bacteria | Evolved in Archaea–eukaryota lineage | Evolved within Archaea | Evolved within Eukaryota | Ambiguous origin | General mechanisms conserved? | Comments |
|---|---|---|---|---|---|---|---|---|
| Photoreactivation | PhrI, PhrII | – | – | – | – | – | Yes | Specificity varies between species. PhrI and PhrII genes lost many times. Also some lateral transfer and duplication. |
| Alkyltransfer | Ogt | Ada | – | – | – | – | Yes | Addition of Ada domain to Ada protein occurred in bacteria. |
| Base excision repair | Ung?, MutY/Nth, AlkA | Fpg/Nei, TagI | Ogg | – | – | 3MG, GT MMR | Yes | Ung may have originated in bacteria. Specificity varies greatly between species for MutY–Nth, AlkA, and others. Many cases of gene loss. |
| AP endonucleases | Xth, Nfo | – | – | – | – | – | Yes | Many cases of gene loss of Xth and Nfo. All species have one or the other. |
| Nucleotide excision repair | – | UvrABCD | Rad1, Rad2 | – | All other euk. NER proteins | Rad25 | Yes/No | UvrABCD in M. thermoautotrophicum (Archaea) probably by lateral transfer. |
| Transcription-coupled repair | – | Mfd | – | – | CSA, CSB | – | ? | Mfd missing from some bacteria. |
| General mismatch repair | MutLS? | MutH, Dam, Vsr | – | – | dup MutS, dup MutL | – | Yes/No | Strand recognition systems and exonucleases differ between species. Many cases of loss of MutLS genes. Duplication in eukaryotes allows use of heterodimers. |
| Recombination initiation | SbcCD | AddAB, RecBCD, RecFJNOR, RecET, SbcB | – | – | dup RecQ | RecQ | No | Many cases of gene loss in bacteria. RecF pathway genes not always present together. |
| Recombinase | RecA | RecT? | – | – | dup RecA | – | Yes | Lateral transfer from chloroplast to plant nucleus has occurred. RecT is of phage origin. |
| Branch migration | – | RuvAB, RecG | – | – | – | – | Yes/No | RuvAB and RecG missing from some bacteria. |
| Branch resolution | – | RuvC, Rus, RecG | – | – | CCE1 | – | Yes/No | CCE1 may function in mitochondria. Rus is likely of phage origin and is only found in a few species. |
| Other recombination | – | – | – | – | Rad52–59, XRS2 | – | – | – |
| Non-homologous end joining | – | – | – | – | XRCC4, Ku70/86, DNA-PKcs | – | – | – |
| Ligation | – | LigI | LigII | – | – | – | Maybe | – |
| Induction | – | LexA | – | – | p53 | – | No | – |
| Other | MutT, UmuC, SMS? | SSB | – | – | RFAs | Dut | – | Eukaryotic SSB came from mitochondria. |

3.10.2. p53 in animals

Inducible responses to DNA damage have also been found in some eukaryotes. One gene that is involved in inducible responses in animals is p53. One of p53's activities is transcriptional activation, and this activity is stimulated by the presence of DNA damage. In fact, it has been shown that the efficient NER of UV-induced DNA damage in human cells requires activation of p53, and that the mode of action involves the regulated expression of the p48 gene that is a component of XPE [168]. Homologs of p53 have only been found in animals, suggesting that this inducible system evolved after animals diverged from other eukaryotes. In addition, some species encode multiple p53 homologs, suggesting that there was a duplication in this gene family early in the evolution of animals.

4. Discussion II — the big picture

In the preceding sections we have focused the discussion on what the phylogenomic analysis reveals about specific repair proteins and pathways. We believe it is also important to take a "big picture" approach and consider all of the pathways together. One reason to take such a global approach is that the different pathways overlap a great deal in their specificity. For example, CPDs can be repaired by PHR, NER, BER (by T4EV), and can be tolerated through recombinational repair. In fact, it is rare for a particular lesion to be repaired only by one pathway. Another reason for the big picture approach is that some repair genes function in multiple pathways. In addition, it is useful to compare and contrast the evolution of different genes and pathways to identify unusual features of any pathway.

4.1. Distribution patterns of particular genes

Examination of the distribution of all DNA repair genes reveals that only one DNA repair gene, RecA, is found in every species analyzed here. The universality of RecA suggests both that it is an ancient gene, and that its activity is irreplaceable (at least for these species). Since many DNA repair genes have important cellular functions, we were surprised that only one gene was present in all species. There are, however, many genes that are found in all or most members of some of the major domains of life. For example, in Table 5a we list those repair genes found in all or most bacteria. Focusing only on genes found in all members of a particular group can be somewhat misleading because it does not reveal whether these genes are found in other groups. For example, it is important to realize that, although UvrABCD are found in all bacteria, they are also found in one Archaeal species. Therefore, we also present a comparison of genes that are found in bacteria but not eukaryotes and the converse, genes that are found in eukaryotes but not bacteria (Table 5b).
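The kind of distribution comparison described in this section can be sketched with a few set operations. The snippet below is only a minimal illustration and not the procedure used to build Table 5: the species labels and the small presence/absence data set are hypothetical placeholders, chosen to mirror the pattern described in the text (RecA in every genome, UvrA in all bacteria plus one archaeal species, Mfd only in bacteria, CSB only in eukaryotes). The function names are likewise assumptions.

```python
# Hypothetical presence/absence data: species label -> (domain, genes found).
# These labels and gene sets are illustrative placeholders, not Table 4 data.
GENOMES = {
    "bacterium_1": ("bacteria", {"RecA", "UvrA", "Mfd"}),
    "bacterium_2": ("bacteria", {"RecA", "UvrA"}),
    "archaeon_1": ("archaea", {"RecA", "UvrA"}),
    "archaeon_2": ("archaea", {"RecA"}),
    "eukaryote_1": ("eukaryotes", {"RecA", "CSB"}),
}


def fraction_with_gene(gene, domain):
    """Fraction of the species in `domain` whose genome encodes `gene`."""
    members = [genes for dom, genes in GENOMES.values() if dom == domain]
    return sum(gene in genes for genes in members) / len(members)


def distribution(gene):
    """Per-domain fraction of species encoding `gene`."""
    domains = sorted({dom for dom, _ in GENOMES.values()})
    return {dom: fraction_with_gene(gene, dom) for dom in domains}


def universal_genes():
    """Genes present in every genome of the data set."""
    return set.intersection(*(genes for _, genes in GENOMES.values()))
```

Reporting the fraction for every domain, rather than only "in all bacteria", is what exposes cases such as UvrA, which is universal in bacteria here but also appears in one archaeal genome.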

4.2. Timing of origin of repair genes: ancient, old, and recent

It is informative to compare and contrast the origins of different repair genes to look for general patterns as well as to attempt to infer the repair processes of common ancestors of particular groups or of all life. It is important to note that in all our analysis of repair gene origins we may be underestimating gene loss. For example, we have inferred that UvrA, UvrB, UvrC and UvrD originated in bacteria or in a common ancestor of bacteria and Archaea, because these genes are found in all bacteria and one archaeal species, but not in any eukaryote. However, it is possible that the common ancestor of all life encoded these genes and that they were lost early in eukaryotic evolution. Our estimates of gene origin are thus very conservative; many genes could have originated even earlier than we suggest. Nevertheless, with this caveat we still have inferred that many repair genes were present in the last common ancestor (Table 6). Based on our functional evolution studies for these genes, we conclude that early in the evolution of life many DNA repair activities were present including PHR, alkyltransfer, recombination, AP endonuclease, a few DNA glycosylases, and MMR. Interestingly, most of these ancient repair pathways have been lost from at least one evolutionary lineage (Fig. 3). Thus, these genes are not absolutely required for survival in all species. The ancient origin of many repair helicases and the nature of the particular helicases being used led Aravind et al. [38] to suggest that many of the DNA repair helicases evolved from RNA helicases that functioned in an RNA world. This theory is one of the first links made between repair pathways and the RNA world. Another possible link between repair and early evolution was suggested by Lewis and Hanawalt [169].

Fig. 3. Evolutionary gain and loss of DNA repair genes. The gain and loss of repair genes is traced onto an evolutionary tree of the species for which complete genome sequences were analyzed. Gain and loss were inferred by methods described in the main text. Origins of repair genes (+) are indicated on the branches while loss of genes (−) is indicated alongside the branches. Gene duplication events are indicated by a "d" while possible lateral transfers are indicated by a "t".

Our analysis suggests that many other repair genes originated at or near the origins of major evolutionary groups (Fig. 3). Interestingly, in many cases, genes with similar functions originated separately in different groups (e.g., UvrABCD vs. XPs, RuvABC vs. CCE1, Ligase I vs. Ligase II). Perhaps most surprisingly to us, we found that many repair genes are of very recent origin (e.g., MutH, SbcB, Rus, RecBCD, RecE and AddAB). Thus, new repair processes continue to originate in different lineages.
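The conservative rule for dating a gene's origin can be made concrete with a small sketch. The function below is an illustrative simplification and not the authors' method: it looks only at which domains contain a homolog and ignores the gene trees and lateral-transfer evidence that the analysis relies on. The category labels loosely follow the column headings of Table 6; the function name and input format are assumptions.

```python
def infer_origin(domains_with_gene):
    """Classify a gene's origin from the set of domains that encode it.

    `domains_with_gene` is a set drawn from
    {"bacteria", "archaea", "eukaryotes"}. The rule is deliberately
    conservative: a gene is called ancient only when it is seen in all
    three domains, so early losses make an origin look later than it was.
    """
    found = frozenset(domains_with_gene)
    if found == {"bacteria", "archaea", "eukaryotes"}:
        return "ancient"
    if found == {"bacteria"}:
        return "evolved within bacteria"
    if found == {"archaea"}:
        return "evolved within Archaea"
    if found == {"eukaryotes"}:
        return "evolved within eukaryotes"
    if found == {"archaea", "eukaryotes"}:
        return "evolved in Archaea-eukaryote lineage"
    # Any other pattern (for example bacteria plus Archaea) cannot be placed
    # on a single branch without invoking gene loss or lateral transfer.
    return "ambiguous"
```

For a distribution like that of UvrABCD (all bacteria plus one archaeal species), this naive rule returns "ambiguous". That matches the two alternatives the text gives for UvrABCD, and extra evidence such as the probable lateral transfer noted in Table 6 is needed to choose between them.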

4.3. Mechanism of origin of repair genes

DNA repair genes have originated by a variety of mechanisms. One common means is by gene duplication. Perhaps the best example of this comes from the helicase superfamily of proteins [170]. Members of this superfamily are involved in almost every repair pathway and in many cases multiple distinct superfamily members are used in a single pathway (e.g., UvrB, UvrD and Mfd in NER). It is important to note, however, that not all proteins in this superfamily have helicase activity and that the motifs are actually indicative of DNA-dependent ATPase activity. Since all helicase motif containing proteins are homologous, each distinct gene must have been created by a separate gene duplication event. Thus, gene duplication has allowed repair pathways to copy helicases used for other functions and then use them in slightly different ways. Other examples of gene duplication in the history of repair genes are given in Table 7. The particular details of each duplication reveal a great deal about each protein. For example, many eukaryotic repair genes were created by gene duplication events within the SNF2 family (which itself is a member of the helicase superfamily). This suggests that this particular family has an activity very useful for repair in eukaryotes [20].

Another mechanism of origin of repair genes is by co-opting genes from other functions. For example, MutH may have descended from a restriction enzyme. New repair genes have also originated by gene fusion or fusion of domains [38]. For example, SMS is a fusion between Lon and RecA domains and Ada is a fusion between alkyltransferase and transcription regulatory domains (Fig. 2). A final way that a species can get new repair activities is by lateral gene transfer. Since this does not involve creation of new repair genes, it is discussed in a separate section below.

Table 7
Gene duplications in the history of DNA repair genes

| When duplication occurred | Duplicated genes |
|---|---|
| Ancient | SNF2 family; MutS1–MutS2; RecA–SMS; PhrI–PhrII; MutY–Nth; Early helicase evolution |
| In eukaryotes | Rad23a–Rad23b in animals; RecQL–Bloom's–Werner's in animals; SNF2 family massive duplication; Rad51–DMC1; MSH1–6 (MutS family); PMS1–MLH1–MLH2 (MutL family); Rad52–Rad59; polB family; Ligase family II |
| In bacteria | Fpg–Nei; UvrB–Mfd–RecG; UvrA; LexA–UmuD; Ada–Ogt in Proteobacteria; Phr in some cyanobacteria; UvrD–Rep–RecB; RecA1–RecA2 in Myx. xanthus |

4.4. Gene loss

Our analysis shows that gene loss is a frequent event in the history of DNA repair (Fig. 3). As discussed above, it is important to remember that we may be underestimating the total amount of gene loss in the history of repair. In some cases, whole pathways have been lost as a unit (e.g., MutLS, SbcCD). Such correlated loss of multiple genes in a pathway suggests that the functional association among these genes is conserved between species. In other cases single genes or only parts of pathways are lost (e.g., components of the RecF pathway). The number of times a particular gene or pathway has been lost can also be informative. For example, that MutL and MutS1 have been lost many times in separate lineages supports the suggestion that there is sometimes a selective advantage to losing these genes. Limitations on gene loss can also be informative. For example, our analysis shows that the last common ancestor of life encoded two AP endonucleases, Nfo and Xth. Some lineages have lost Nfo, some have lost Xth, but none have lost both, supporting the suggestion that AP endonuclease activity is required for all species. Differences in gene loss between lineages are even more striking. For example, there has been extensive loss of repair genes in the mycoplasmal lineage (e.g., Table 8, Fig. 3). Not surprisingly, loss of repair genes is more common in species or lineages with small genomes. This points to a problem in drawing too many conclusions about the likelihood of gene loss in general from the analysis of currently available genome sequences. Many species have been picked for genome sequencing because their genomes are small and thus may have undergone more gene loss than an average species.

Table 8
DNA repair genes that were lost in the mycoplasmal lineage

| Process | Protein |
|---|---|
| BER | MutY–Nth, AlkA |
| Recombination initiation | RecF pathway, SbcCD |
| Recombination resolution | RecG, RuvC |
| Mismatch repair | MutLS |
| TCR | Mfd |
| Induction | LexA |
| Direct repair | PhrI, Ogt |
| AP endonuclease | Xth |
| Other | MutT, Dut, PriA, SMS |
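Counting loss events on a species tree, as discussed in this section, can be illustrated with a short sketch. The code below assumes a Dollo-style model (a single gain, no lateral transfer) and is not the authors' procedure, which is described elsewhere in the paper. It places the gain at the smallest clade containing every species that carries the gene, then counts the fewest losses needed inside that clade. The tree and species labels are hypothetical.

```python
# Hypothetical species tree as nested tuples; tips are species labels.
EXAMPLE_TREE = ((("sp_A", "sp_B"), "sp_C"), ("sp_D", "sp_E"))


def leaves(tree):
    """Return the set of species labels at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def count_losses(tree, carriers):
    """Fewest loss events below a node whose ancestor carried the gene."""
    if carriers.isdisjoint(leaves(tree)):
        return 1  # whole clade lacks the gene: one loss on its stem branch
    if isinstance(tree, str):
        return 0  # a tip that still carries the gene
    return sum(count_losses(child, carriers) for child in tree)


def single_origin_losses(tree, carriers):
    """Return (clade where the gene arose, minimum number of losses)."""
    carriers = set(carriers)
    if carriers.isdisjoint(leaves(tree)):
        raise ValueError("no species in the tree carries the gene")
    node = tree
    while not isinstance(node, str):
        with_gene = [c for c in node if not carriers.isdisjoint(leaves(c))]
        if len(with_gene) != 1:
            break  # carriers sit in several child clades: gain is at this node
        node = with_gene[0]
    return leaves(node), count_losses(node, carriers)
```

Because the gain is pushed as late as the data allow, this scheme gives a lower bound on losses, in line with the caveat above that gene loss may be underestimated.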

4.5. Lateral gene transfer

New repair activities can be acquired by a particular lineage without the creation of a new gene by the process of lateral transfer. The best examples of this are genes transferred from the chloroplast to the plant nucleus (e.g., RecA, MutM, photolyase). Transfers of genes from the mitochondria to the eukaryotic nucleus also seem likely (e.g., Ung, MSH1). These and other possible cases of lateral transfer are listed with a "t" in Fig. 3. Given that lateral gene transfer appears to be quite common over evolutionary history [171], it is likely organisms could replace lost genes relatively easily by gene transfer.

4.6. Conservation of pathways

Comparisons of the evolution of the different classes of repair reveal a great deal of diversity in how well conserved the classes of repair are. In addition, the ways in which classes of repair differ between species are also variable. The conservation between species can be classified according to the level of homology of the pathways. Some pathways are completely homologous between species (they make use of homologous genes in all species). This is only the case for some of the single enzyme pathways (PHR and alkyltransfer). Interestingly, all the single enzyme pathways are direct repair pathways. Other pathways are partially homologous. For example, some of the proteins involved in MMR are homologous between E. coli and eukaryotes (e.g., MutS and MutL), but others are not (e.g., MutH and UvrD). Finally, there are some pathways that are not homologous at all between species despite performing the same functions. The best example of this is NER in bacteria compared to that in eukaryotes. These systems are apparently of completely separate origins. In addition to different levels of homology, pathways also differ between species by functional divergence of homologs. Examples of this include the divergence of 6–4 and CPD photolyases and the divergence of MSH genes for MMR in eukaryotes.

4.7. Prediction of species phenotypes and universal DNA repair activities

We believe that the key to making functional and phenotypic predictions for any species is an understanding of the evolution of the functions of interest. For example, functional predictions for homologs of repair genes are improved by evolutionary analysis (see Section 2). Such evolutionary functional prediction has been particularly helpful in studies of a variety of repair genes such as MutS, photolyases, many of the BER glycosylases and many of the large helicase-motif containing families. In addition, identifying how many times new genes have evolved with particular functions helps determine whether the absence of particular homologs is meaningful. For example, the fact that uracil-DNA glycosylase activity has evolved many times in non-homologous proteins suggests that the absence of homologs of these proteins cannot be used to predict the absence of uracil glycosylase activity. A similar case can be made for recombination initiation and resolution activities. In contrast, since alkyltransferase activity has apparently only evolved once, it is likely that those species without a homolog of the known alkyltransferase gene family do not have alkyltransferase activity. Similarly, the fact that all generalized MMR systems use homologs of MutS and MutL suggests that the absence of mutL and mutS genes means the absence of general MMR. Finally, it is important to realize that some species may have novel activities that have not been characterized in any species.

Table 9
Predicted DNA repair phenotypes of selected species^a

Bacteria:

| Pathway | Proteins with activity | E. coli | H. influenzae | He. pylori | Bac. subtilis | M. genitalium | Mycop. pneumoniae |
|---|---|---|---|---|---|---|---|
| PHR | PhrI, PhrII | + | - | - | - | - | - |
| Alkyltransfer | Ada/Ogt/MGMT | + | + | + | + | - | - |
| NER | UvrABCD or XPs | + | + | + | + | + | + |
| Transcription-coupling | Mfd or CSA/CSB | + | + | + | + | - | - |
| BER | | + | + | + | + | + | + |
| BER: uracil glycosylase | Ung, GT | + | + | + | + | + | + |
| BER: alkylation glycosylase | AlkA, TagI, MPG | + | + | + | + | - | - |
| BER: misc. damaged bases | MutY/Nth, Fpg/Nei, Ogg | + | + | + | + | + | + |
| AP endonucleases | Xth/APE1, Nfo/APN1 | + | + | + | + | + | + |
| Mismatch repair | MutLS | + | + | - | + | - | - |
| Recombination initiation: RecBCD pathway | RecBCD | + | + | - | - | - | - |
| Recombination initiation: RecF pathway | RecFJNORQ | + | + | ± | + | - | - |
| Recombination initiation: AddAB pathway | AddAB | - | - | - | + | - | - |
| Recombination initiation: SbcCD pathway | SbcC/MRE11, SbcD/Rad50 | + | - | - | + | - | - |
| Recombination: recombinase | RecA, RadA, Rad51 | + | + | + | + | + | + |
| Recombination: branch migration | RuvAB, RecG | + | + | + | + | + | + |
| Recombination: resolution^b | RuvC, Rus, RecG, CCE1 | + | + | + | + | - | - |
| NHEJ | Ku, DNA-PK | - | - | - | - | - | - |
| Ligation | LigI, LigII | + | + | + | + | + | + |

a + = likely present, ± = may be present, - = likely absent, ? = not able to predict well.
b Resolution is possible non-enzymatically so absence of genes does not mean resolution is not possible.
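The reasoning in this section about when a missing homolog is informative can be summarized as a simple decision rule. The sketch below is an interpretation of that reasoning, not the procedure used to fill in Table 9, and it will not reproduce every entry of the table. The function name and its two inputs are assumptions; the output symbols follow the Table 9 legend.

```python
def predict_activity(has_homolog, evolved_once):
    """Return "+", "-" or "?" for one repair activity in one species.

    has_homolog: True if the genome encodes a homolog of a gene known to
        provide the activity.
    evolved_once: True if the activity is thought to have arisen only once
        (e.g. alkyltransferase), False if it has arisen repeatedly in
        non-homologous proteins (e.g. uracil-DNA glycosylase).
    """
    if has_homolog:
        return "+"  # likely present
    if evolved_once:
        return "-"  # single origin: no homolog suggests no activity
    return "?"      # repeated origins: absence of homologs is uninformative
```

For example, `predict_activity(False, True)` gives "-", the alkyltransferase case in the text, while `predict_activity(False, False)` gives "?", the uracil glycosylase case.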

The difficulties in making functional and phenotypic predictions are exacerbated by the biased sampling of the evolutionary tree in studies of DNA repair. For example, there has been very little experimental work on DNA repair in Archaea and what has been done is usually the characterization of homologs of known repair genes. Thus, any repair processes that evolved within Archaea will likely be missed by comparative genomic approaches. Given that many processes appear to have evolved in bacteria or in eukaryotes, it seems very likely that there are also many that have evolved in Archaea. One can easily see the "bias" of model systems by following the gain of repair genes in Fig. 3. Essentially all of the gain events are in the lineages leading up to E. coli, Bac. subtilis, yeast and humans. This is not surprising because almost all the repair genes we analyzed are from these species. These genes must have been present in some ancestor of these species and thus the only place they could have been "gained" is along the lineage leading up to these species. Clearly, repair genes must have originated in other lineages, especially given the evidence that new repair genes have originated relatively recently (see above). For similar reasons, we underestimate the amount of loss of repair genes in these model organisms, since they could not have lost the genes that were first identified in them.

Despite all these potential problems, we have stillŽ .tried to make phenotypic predictions Table 9 . It

should be remembered that all predictions need to beconfirmed by experimental studies. We believe suchpredictions are a useful starting point for designingexperiments on these species and for determining ifthe predicted presence or absence of particular repairactivities can be correlated with any interesting bio-logical properties. For example, the predicted ab-sence of many repair pathways from mycoplasmas isconsistent with the high mutation and evolutionaryrates of mycoplasmas. Thus, we can use the absence

Archaea Eukarya

Mycob. Synecho- B. borg- T. pal- Aq. All M. thermo- Meth. jan- Ar. ful- All S. cere- Ho. Alltuberculosis cystis sp. dorferi lidum aeolicus auto naschii fulgidus Õisiae sapiens

trophicum

y q y y y y q y y y q y yq y y q q y q q q q q q qq q q q q q q ? ? ? q q qq q q q y q ? ? ? ? q q qq q q q q q q q q q q q qq y q y y y y y y y q q qq q y y y y y y q y q q qq q q q q q q q q q q q qq q q q q q q q q q q q qy q q q q y y y y y q q q

q y " y y y y y y y y y y" q y " " y y y y y y y yy y y " y y y y y y y y yy q q q q y q q q q q q qq q q q q q q q q q q q qq q q q q q ? ? ? ? ? ? ?q q y q q y ? ? ? ? q ? ?y y y y y y y y y y q q qq q q q q q q q q q q q q


of certain genes to make some predictions. For example, the presence of UvrABCD but the absence of Mfd from the two mycoplasmas and Aq. aeolicus suggests that these species can perform NER but not the TCR component of it.
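The reasoning behind this kind of prediction can be sketched in code. The example below is only an illustration under stated assumptions: the gene sets are hand-entered stand-ins rather than data taken from Table 9, and `predict_ner` is a hypothetical helper. It encodes the rule that NER is predicted when UvrA, UvrB, UvrC and UvrD are all present, and that TCR is predicted only when Mfd is present as well.

```python
# Core genes of the bacterial NER pathway named in the text.
NER_CORE = {"UvrA", "UvrB", "UvrC", "UvrD"}


def predict_ner(genes):
    """Predict NER and TCR capability from a set of detected gene names.

    This is a hypothetical rule-based helper, not the authors' procedure.
    """
    has_core = NER_CORE <= genes  # subset test: all four core genes present
    return {
        "NER": has_core,
        "TCR": has_core and "Mfd" in genes,  # TCR additionally needs Mfd
    }


# Illustrative presence calls (assumed for this sketch).
profiles = {
    "Aq. aeolicus": {"UvrA", "UvrB", "UvrC", "UvrD"},
    "E. coli": {"UvrA", "UvrB", "UvrC", "UvrD", "Mfd"},
}

predictions = {species: predict_ner(genes) for species, genes in profiles.items()}
for species, result in predictions.items():
    print(species, result)
```

Any such prediction remains a hypothesis about phenotype: a homolog may be present but nonfunctional, or an unrelated protein may supply the missing activity.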

One generalization that can be made from our phenotypic predictions is that, despite the lack of many universal genes, it appears that there are many universal activities. For example, we predict that all species have AP endonuclease activity. However, no AP endonuclease gene is universal because there are two evolutionarily unrelated AP endonuclease families (Nfo or Xth). All species encode at least one of these genes. Similarly, all species encode at least one of the two ligase genes.
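The distinction between universal genes and universal activities can be made concrete with a small sketch. The species labels and presence/absence calls below are invented for illustration, and the function names are assumptions; only the Nfo/Xth example comes from the text.

```python
# Hypothetical presence/absence calls for three unnamed species.
presence = {
    "species_A": {"Xth"},
    "species_B": {"Nfo"},
    "species_C": {"Xth", "Nfo"},
}

# Each activity can be supplied by any one of several unrelated gene families.
families = {"AP endonuclease": {"Nfo", "Xth"}}


def universal_genes(presence):
    """Genes found in every species."""
    return set.intersection(*presence.values())


def universal_activities(presence, families):
    """Activities for which every species encodes at least one family member."""
    return {
        activity
        for activity, genes in families.items()
        if all(genes & found for found in presence.values())
    }


print(universal_genes(presence))
print(universal_activities(presence, families))
```

With these toy calls no single gene is shared by all three species, yet the AP endonuclease activity is still predicted to be universal.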

5. Summary and conclusions

We believe that the analysis reported here can serve as a starting point for experimental studies of repair in species with complete genome sequences and for understanding the evolution of DNA repair proteins and processes. However, it is important to restate some of the caveats to this type of analysis. First, it should be remembered that all functional and phenotypic predictions are just that: predictions. They need to be followed up by experimental analysis. In addition, the species for which complete genome sequences are available are not a random sample of ecological and evolutionary diversity. In particular, many have small genomes (this is the reason they were sequenced) and have likely undergone large-scale gene loss events in the recent past. Thus, they may give a misleading picture of what an average bacterium or archaeon is like.

Despite these limitations, the phylogenomic analysis of DNA repair proteins presented here reveals many interesting details about DNA repair proteins and processes and the species for which complete genome sequences were analyzed. We have identified many examples of gene loss, gene duplication, functional divergence and recent origin of new pathways. All of this information helps us to understand the evolution of DNA repair as well as to predict phenotypes of species based upon their genome sequences. In addition, our analysis helps identify the origins of the different repair genes and has provided a great deal of information about the origins of whole pathways. We believe our analysis also helps identify potentially rewarding areas of future research. There are some unusual patterns that require further exploration, such as the presence of UvrABCD in some Archaea and the limited number of homologs of known repair genes in any of the four Archaea. In addition, the areas with empty spaces in the tree tracing the origin of repair genes may be of interest to determine if novel pathways exist in such lineages. In summary, we believe that this composite phylogenomic approach is an important tool in making sense out of genome sequence data and in understanding the evolution of whole pathways and genomes. Combining genomics and evolutionary analysis into phylogenomics is useful because genome information is useful in inferring evolutionary events and evolutionary information is useful in understanding genomes.

Acknowledgements

We would like to thank M.B. Eisen, M.-I. Benito, J.H. Miller, A.J. Clark, J. Laval, B. van Houten, I. Mellon, P. Warren, R. Woodgate, W.F. Doolittle, J. Hays, S. Henikoff, P. Forterre, D. Botstein, R. Myers, A. Ganesan, J. DiRuggiero, H. Ochman, F. Robb, M. Riley, S. Suzen, M. White, V. Mizrahi, A. Villeneuve, M. Galperin, C.M. Fraser, and J.C. Venter for helpful comments and encouragement. In addition, we would like to thank the two anonymous reviewers for helpful comments and criticisms. This paper was supported by Outstanding Investigator Grant CA44349 from the National Cancer Institute to P. Hanawalt. Supplemental material related to this paper can be found at http://www.tigr.org/~jeisen/Repair/Repair.html. The C. jejuni sequence data was provided by the C. jejuni Sequencing Group at the Sanger Centre and can be obtained from ftp://ftp.sanger.ac.uk.

References

w x1 Y.F. Li, S.T. Kim, A. Sancar, Evidence for lack of DNAphotoreactivating enzyme in humans, Proc. Natl. Acad. Sci.

Ž .U. S. A. 90 1993 4389–4393.

Page 37: A phylogenomic study of DNA repair genes, proteins, and ......Feb 19, 2012  · ods against complete genome sequences Table 1 .. . Homologs of repair genes that had been cloned from

( )J.A. Eisen, P.C. HanawaltrMutation Research 435 1999 171–213 207

w x2 J.A. Eisen, Mechanistic basis of microsatellite instability,Ž .in: D.B. Goldstein, C. Schlotterer Eds. , Microsatellites:

Evolution and Applications, Oxford Univ. Press, Oxford,1999, pp. 34–48.

w x3 J. Labarere, DNA replication and repair, in: J. Maniloff´Ž .Ed. , Mycoplasmas: Molecular Biology and Pathogenesis,American Society For Microbiology, Washington, DC,1992, pp. 309–323.

w x4 K. Dybvig, L.L. Voelker, Molecular biology of mycoplas-Ž .mas, Annu. Rev. Microbiol. 50 1996 25–57.

w x5 P. Modrich, R. Lahue, Mismatch repair in replication fi-delity, genetic recombination, and cancer biology, Annu.

Ž .Rev. Biochem. 65 1996 101–133.w x6 G.A. Cortopassi, E. Wang, There is substantial agreement

among interspecies estimates of DNA repair activity, Mech.Ž .Ageing Dev. 91 1996 211–218.

w x7 D.E. Promislow, DNA repair and the evolution of longevity:Ž .a critical analysis, J. Theor. Biol. 170 1994 291–300.

w x8 J.E. LeClerc, B. Li, W.L. Payne, T.A. Cebula, High muta-tion frequencies among Escherichia coli and Salmonella

Ž .pathogens, Science 274 1996 1208–1211.w x9 I. Matic, M. Radman, F. Taddei, B. Picard, C. Doit, E.

Bingen, E. Denamur, J. Elion, Highly variable mutationrates in commensal and pathogenic Escherichia coli, Sci-

Ž .ence 277 1997 1833–1834.w x10 F. Taddei, I. Matic, B. Godelle, M. Radman, To be a

mutator, or how pathogenic and commensal bacteria canŽ .evolve rapidly, Trends Microbiol. 5 1997 427–428.

w x11 N. Sueoka, Intrastrand parity rules of DNA base composi-tion and usage biases of synonymous codons, J. Mol. Evol.

Ž .40 1995 318–325.w x12 A. Eyre-Walker, DNA mismatch repair and synonymous

Ž .codon evolution in mammals, Mol. Biol. Evol. 11 199488–98.

w x13 P.M. Sharp, D.C. Shields, K.H. Wolfe, W.H. Li, Chromoso-mal location and evolutionary rate variation in enterobacte-

Ž .rial genes, Science 246 1989 808–810.w x14 J.R. Battista, Against all odds: the survival strategies of

Ž .Deinococcus radiodurans, Annu. Rev. Microbiol. 51 1997203–224.

w x15 I. Matic, C. Rayssiguier, M. Radman, Interspecies geneexchange in bacteria: the role of SOS and mismatch repair

Ž .systems in evolution of species, Cell 80 1995 507–515.w x16 P. Sniegowski, Mismatch repair: origin of species?, Curr.

Ž .Biol. 8 1998 R59–R61.w x17 J.E. Cleaver, J.R. Speakman, J.P. Volpe, Nucleotide exci-

sion repair: variations associated with cancer developmentŽ .and speciation, Cancer Surv. 25 1995 125–142.

w x18 J. Felsenstein, Phylogenies and the comparative method,Ž .Am. Nat. 125 1985 1–15.

w x19 J.A. Eisen, PhD Thesis, Stanford University, 1999.w x20 J.A. Eisen, K.S. Sweder, P.C. Hanawalt, Evolution of the

SNF2 family of proteins: subfamilies with distinct se-Ž .quences and functions, Nucleic Acids Res. 23 1995 2715–

2723.w x21 J.A. Eisen, A phylogenomic study of the MutS family of

Ž .proteins, Nucleic Acids Res. 26 1998 4291–4300.

w x22 J.A. Eisen, Phylogenomics: improving functional predic-tions for uncharacterized genes by evolutionary analysis,

Ž .Genome Res. 8 1998 163–167.w x23 J.A. Eisen, D. Kaiser, R.M. Myers, Gastrogenomic delights:

Ž . Ž .a movable feast, Nature Medicine 3 1997 1076–1078.w x24 S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z.

Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search pro-

Ž .grams, Nucleic Acids Res. 25 1997 3389–3402.w x25 J.D. Thompson, D.G. Higgins, T.J. Gibson, Clustal W:

improving the sensitivity of progressive multiple sequencealignment through sequence weighting, position-specific gappenalties and weight matrix choice, Nucleic Acids Res. 22Ž .1994 4673–4680.

w x26 D. Swofford, Phylogenetic Analysis Using ParsimonyŽ .PAUP 3.0d., Illinois Natural History Survey, 1991.

w x27 W.P. Maddison, D.R. Maddison, MacClade 3, Sinauer As-sociates, 1992.

w x28 B.L. Maidak, N. Larsen, M.J. McCaughey, R. Overbeek,G.J. Olsen, K. Fogel, J. Blandy, C.R. Woese, The ribosomal

Ž .database project, Nucleic Acids Res. 22 1994 3485–3487.w x29 E.V. Koonin, A. Mushegian, P. Bork, Non-orthologous

Ž .gene displacement, Trends Genet. 12 1996 334–336.w x30 R.D. Fleischmann, M.D. Adams, O. White, R.A. Clayton,

E.F. Kirkness, A.R. Kerlavage, C.J. Bult, J.F. Tomb, B.A.Dougherty, J.M. Merrick et al., Whole-genome randomsequencing and assembly of Haemophilus influenzae Rd,Science, 269, 1995, 496–498, 507–512.

w x31 T. Dobzhansky, Nothing in biology makes sense except inŽ .the light of evolution, Am. Biol. Teach. 35 1973 125–129.

w x32 W.H. Li, T. Gojobori, M. Nei, Pseudogenes as a paradigmŽ .of neutral evolution, Nature 292 1981 237–239.

w x33 R.R. Gutell, N. Larsen, C.R. Woese, Lessons from anevolving rRNA: 16S and 23S rRNA structures from a

Ž .comparative perspective, Microbiol. Rev. 58 1994 10–26.w x34 N. Goldman, J.L. Thorne, D.T. Jones, Using evolutionary

trees in protein secondary structure prediction and otherŽ .comparative sequence analysis, J. Mol. Biol. 263 1996

196–208.w x35 S. Henikoff, J. Henikoff, Protein family classification based

Ž .on searching a database of blocks, Genomics 19 199497–107.

w x36 M.O. Dayhoff, Atlas of Protein Sequence and Structure,Vol. 5, Supplement 3, National Biomedical Research Foun-dation, Washington, DC, 1978.

w x37 B. Lafay, A.T. Lloyd, M.J. McLean, K.M. Devine, P.M.Sharp, K.H. Wolfe, Proteome composition and codon usagein spirochaetes: species-specific and DNA strand-specific

Ž .mutational biases, Nucleic Acids Res. 27 1999 1642–1649.w x38 L. Aravind, D.R. Walker, E.V. Koonin, Conserved domains

in DNA repair proteins and evolution of repair systems,Ž .Nucleic Acids Res. 27 1999 1223–1242.

w x39 S. Kanai, R. Kikuno, H. Toh, H. Ryo, T. Todo, Molecularevolution of the photolyase-blue-light photoreceptor family,

Ž .J. Mol. Evol. 45 1997 535–548.w x40 S. Zhao, A. Sancar, Human blue-light photoreceptor hCRY2

specifically interacts with protein serinerthreonine phos-

Page 38: A phylogenomic study of DNA repair genes, proteins, and ......Feb 19, 2012  · ods against complete genome sequences Table 1 .. . Homologs of repair genes that had been cloned from

( )J.A. Eisen, P.C. HanawaltrMutation Research 435 1999 171–213208

phatase 5 and modulates its activity, Photochem. Photobiol.Ž .66 1997 727–731.

w x41 C.S. Cockell, Biological effects of high ultraviolet radiationon early Earth — a theoretical evaluation, J. Theor. Biol.

Ž .193 1998 717–729.w x42 J.C. Sutherland, K.P. Griffin, Monomerization of pyrimi-

dine dimers in DNA by tryptophan-containing peptides:Ž .wavelength dependence, Radiat. Res. 83 1980 529–536.

w x43 J. Chen, C.W. Huang, L. Hinman, M.P. Gordon, D.A.Deranleau, Photomonomerization of pyrimidine dimers by

Ž .indoles and proteins, J. Theor. Biol. 62 1976 53–67.w x44 C. Helene, F. Toulme, M. Charlier, M. Yaniv, Photosensi-

tized splitting of thymine dimers in DNA by gene 32protein from phage T 4, Biochem. Biophys. Res. Commun.

Ž .71 1976 91–98.w x45 A.E. Pegg, T.L. Byers, Repair of DNA containing O6-al-

Ž .kylguanine, FASEB J. 6 1992 2302–2310.w x46 A.E. Pegg, M.E. Dolan, R.C. Moschel, Structure, function,

and inhibition of O6-alkylguanine-DNA alkyltransferase,Ž .Prog. Nucleic Acid Res. Mol. Biol. 51 1995 167–223.

w x47 J. Labahn, O.D. Scharer, A. Long, K. Ezaz-Nikpay, G.L.Verdine, T.E. Ellenberger, Structural basis for the excision

Ž .repair of alkylation-damaged DNA, Cell 86 1996 321–329.w x48 M.M. Leclere, M. Nishioka, T. Yuasa, S. Fujiwara, M.

Takagi, T. Imanaka, The O6-methylguanine-DNA methyl-transferase from the hyperthermophilic archaeon Pyrococ-cus sp. KOD1: a thermostable repair enzyme, Mol. Gen.

Ž .Genet. 258 1998 69–77.w x49 M. Skorvaga, N.D.H. Raven, G.P. Margison, Thermostable

archaeal O6-alkylguanine-DNA alkyltransferases, Proc.Ž .Natl. Acad. Sci. U. S. A. 95 1998 6711–6715.

w x50 J. Luo, F. Barany, Identification of essential residues inThermus thermophilus DNA ligase, Nucleic Acids Res. 24Ž .1996 3079–3085.

w x51 A.E. Tomkinson, Z.B. Mackey, Structure and function ofŽ .mammalian DNA ligases, Mutat. Res. 407 1998 1–9.

w x52 P. Modrich, Mechanisms and biological effects of mismatchŽ .repair, Annu. Rev. Genet. 25 1991 229–253.

w x53 R. Kolodner, Biochemistry and genetics of eukaryotic mis-Ž .match repair, Genes Dev. 10 1996 1433–1442.

w x54 J.H. Miller, Spontaneous mutators in bacteria: insights intopathways of mutagenesis and repair, Annu. Rev. Microbiol

Ž .50 1996 625–643.w x55 F. Taddei, M. Vulic, M. Radman, I. Matic, Genetic variabil-

Ž .ity and adaptation to stress, Experientia 83 1997 271–290.w x56 P.D. Sniegowski, P.J. Gerrish, R.E. Lenski, Evolution of

high mutation rates in experimental populations of E. coli,Ž .Nature 387 1997 703–705.

w x57 V. Mizrahi, S.J. Andersen, DNA repair in Mycobacteriumtuberculosis. What have we learnt from the genome se-

Ž .quence?, Mol. Microbiol. 29 1998 1331–1340.w x58 I. Matic, F. Taddei, M. Radman, Genetic barriers among

Ž .bacteria, Trends Microbiol. 4 1996 69–72.w x59 C. Ban, W. Yang, Structural basis for MutH activation in

E. coli mismatch repair and relationship of MutH to restric-Ž .tion endonucleases, EMBO J. 17 1998 1526–1534.

w x60 D.P. Twomey, L.L. McKay, D.J. O’Sullivan, Molecularcharacterization of the Lactococcus lactis LlaKR2I restric-tion–modification system and effect of an IS982 elementpositioned between the restriction and modification genes,

Ž .J. Bacteriol. 180 1998 5844–5854.w x61 A. Bellacosa, L. Cicchillitti, F. Schepis, A. Riccio, A.T.

Yeung, Y. Matsumoto, E.A. Golemis, M. Genuardi, G.Neri, MED1, a novel human methyl-CpG-binding endonu-clease, interacts with DNA mismatch repair protein MLH1,

Ž .Proc. Natl. Acad. Sci. U. S. A. 96 1999 3969–3974.w x62 W. Glasner, R. Merkl, V. Schellenberger, H.J. Fritz, Sub-

strate preferences of Vsr DNA mismatch endonuclease andtheir consequences for the evolution of the Escherichia coli

Ž .K-12 genome, J. Mol. Biol. 245 1995 1–7.w x63 J.H. Hoeijmakers, Nucleotide excision repair. II: From yeast

Ž .to mammals, Trends Genet. 9 1993 211–217.w x64 J.H. Hoeijmakers, Nucleotide excision repair: I. From E.

Ž .coli to yeast, Trends Genet. 9 1993 173–177.w x65 A. Sancar, DNA excision repair, Annu. Rev. Biochem. 65

Ž .1996 43–81.w x66 B. Van Houten, A. Snowden, Mechanism of action of the

Escherichia coli UvrABC nuclease: clues to the damageŽ .recognition problem, BioEssays 15 1993 51–59.

w x67 C.P. Selby, A. Sancar, Structure and function of transcrip-tion-repair coupling factor: I. Structural domains and bind-

Ž .ing properties, J. Biol. Chem. 270 1995 4882–4889.w x68 I. Mellon, P.C. Hanawalt, Induction of the Escherichia coli

lactose operon selectively increases repair of its transcribedŽ .DNA strand, Nature 342 1989 95–98.

w x69 S. Ayora, F. Rojo, N. Ogasawara, S. Nakai, J.C. Alonso,The Mfd protein of Bacillus subtilis 168 is involved in bothtranscription-coupled DNA repair and DNA recombination,

Ž .J. Mol. Biol. 256 1996 301–318.w x70 J.M. Zalieckas, L.V. Wray Jr., A.E. Ferson, S.H. Fisher,

Transcription-repair coupling factor is involved in carboncatabolite repression of the Bacillus subtilis hut and gnt

Ž .operons, Mol. Microbiol. 27 1998 1031–1038.w x71 R.D. Wood, Nucleotide excision repair in mammalian cells,

Ž .J. Biol. Chem. 272 1997 23465–23468.w x72 R.F. Doolittle, M.S. Johnson, I. Husain, B. Van Houten,

D.C. Thomas, A. Sancar, Domainal evolution of a prokary-otic DNA repair protein and its relationship to active-trans-

Ž .port proteins, Nature 323 1986 451–453.w x73 K.J. Linton, C.F. Higgins, The Escherichia coli ATP-bind-

Ž . Ž .ing cassette ABC proteins, Mol. Microbiol. 28 19985–13.

w x74 C.G. Lin, O. Kovalsky, L. Grossman, DNA damage-depen-dent recruitment of nucleotide excision repair and transcrip-tion proteins to Escherichia coli inner membranes, Nucleic

Ž .Acids Res. 25 1997 3151–3158.w x75 N. Lomovskaya, S.K. Hong, S.U. Kim, L. Fonstein, K.

Furuya, R.C. Hutchinson, The Streptomyces peucetius drrCgene encodes a UvrA-like protein involved in daunorubicin

Ž .resistance and production, J. Bacteriol. 178 1996 3225–3238.

w x76 S. McCready, The repair of ultraviolet light-induced DNA

Page 39: A phylogenomic study of DNA repair genes, proteins, and ......Feb 19, 2012  · ods against complete genome sequences Table 1 .. . Homologs of repair genes that had been cloned from

( )J.A. Eisen, P.C. HanawaltrMutation Research 435 1999 171–213 209

damage in the halophilic archaebacteria, Halobacteriumcutirubrum, Halobacterium halobium and Haloferax Õol-

Ž .canii, Mutat. Res. 364 1996 25–32.w x77 J. Sgouros, P.H. Gaillard, R.D. Wood, A relationship be-

tween a DNA-repairrrecombination nuclease family andŽ .archaeal helicases, Trends Biochem. Sci. 24 1999 95–97.

w x78 M.R. Lieber, The FEN-1 family of structure-specific nucle-ases in eukaryotic DNA replication, recombination and

Ž .repair, BioEssays 19 1997 233–240.w x79 M. Ogrunc, D.F. Becker, S.W. Ragsdale, A. Sancar, Nu-

cleotide excision repair in the Third Kingdom, J. Bacteriol.Ž .180 1998 5796–5798.

w x80 P.W. Doetsch, What’s old is new: an alternative DNAŽ .excision repair pathway, Trends Biochem. Sci. 20 1995

384–386.w x81 K.K. Bowman, K. Sidik, C.A. Smith, J.S. Taylor, P.W.

Doetsch, G.A. Freyer, A new ATP-independent DNA en-donuclease from Schizosaccharomyces pombe that recog-nizes cyclobutane pyrimidine dimers and 6–4 photo-

Ž .products, Nucleic Acids Res. 22 1994 3026–3032.w x82 A. Yasui, S.J. McCready, Alternative repair pathways for

Ž .UV-induced DNA damage, BioEssays 20 1998 291–297.w x83 S. Kanno, S. Iwai, M. Takao, A. Yasui, Repair of

apurinicrapyrimidinic sites by UV damage endonuclease; arepair protein for UV and oxidative damage, Nucleic Acids

Ž .Res. 27 1999 3096–3103.w x84 H.E. Krokan, R. Standal, G. Slupphaug, DNA glycosylases

Ž .in the base excision repair of DNA, Biochem. J. 325 19971–16.

w x85 K. Meyer-Siegler, D.J. Mauro, G. Seal, J. Wurzer, J.K.deRiel, M.A. Sirover, A human nuclear uracil DNA glyco-sylase is the 37-kDa subunit of glyceraldehyde-3-phosphate

Ž .dehydrogenase, Proc. Natl. Acad. Sci. U. S. A. 88 19918460–8464.

w x86 S.J. Muller, S. Caradonna, Cell cycle regulation of a humancyclin-like gene encoding uracil-DNA glycosylase, J. Biol.

Ž .Chem. 268 1993 1310–1319.w x87 G. Slupphaug, I. Eftedal, B. Kavli, S. Bharati, N.M. Helle,

T. Haug, D.W. Levine, H.E. Krokan, Properties of a recom-binant human uracil-DNA glycosylase from the UNG geneand evidence that UNG encodes the major uracil-DNA

Ž .glycosylase, Biochemistry 34 1995 128–138.w x88 H. Nilsen, M. Otterlei, T. Haug, K. Solum, T.A. Nagelhus,

F. Skorpen, H.E. Krokan, Nuclear and mitochondrial uracil-DNA glycosylases are generated by alternative splicing andtranscription from different positions in the UNG gene,

Ž .Nucleic Acids Res. 25 1997 750–755.w x89 M. Sandigursky, W.A. Franklin, Thermostable uracil-DNA

glycosylase from Thermotoga maritima a member of aŽ .novel class of DNA repair enzymes, Curr. Biol. 9 1999

531–534.w x90 A. Koulis, D.A. Cowan, L.H. Pearl, R. Savva, Uracil-DNA

glycosylase activities in hyperthermophilic micro-organisms,Ž .FEMS Microbiol. Lett. 143 1996 267–271.

w x91 P. Gallinari, J. Jiricny, A new class of uracil-DNA glycosy-

lases related to human thymine-DNA glycosylase, NatureŽ .383 1996 735–738.

w x92 P. Neddermann, J. Jiricny, Efficient removal of uracil fromG.U mispairs by the mismatch-specific thymine DNA gly-cosylase from HeLa cells, Proc. Natl. Acad. Sci. U. S. A.

Ž .91 1994 1642–1646.w x93 R.C. Manuel, E.W. Czerwinski, R.S. Lloyd, Identification

of the structural and functional domains of MutY, an Es-cherichia coli DNA mismatch repair enzyme, J. Biol. Chem.

Ž .271 1996 16218–16226.w x94 K.G. Au, M. Cabrera, J.H. Miller, P. Modrich, Escherichia

coli mutY gene product is required for specific A-G–C.Gmismatch correction, Proc. Natl. Acad. Sci. U. S. A. 85Ž .1988 9163–9166.

w x95 Y. Nghiem, M. Cabrera, C.G. Cupples, J.H. Miller, ThemutY gene: a mutator locus in Escherichia coli that gener-ates G.C–T.A transversions, Proc. Natl. Acad. Sci. U. S. A.

Ž .85 1988 2709–2713.w x96 J.P. McGoldrick, Y.C. Yeh, M. Solomon, J.M. Essigmann,

A.L. Lu, Characterization of a mammalian homolog of theEscherichia coli MutY mismatch repair protein, Mol. Cell.

Ž .Biol. 15 1995 989–996.w x97 M.M. Slupska, C. Baikalov, W.M. Luther, J.H. Chiang,

Y.F. Wei, J.H. Miller, Cloning and sequencing a humanŽ .homolog hMYH of the Escherichia coli mutY gene whose

function is required for the repair of oxidative DNA dam-Ž .age, J. Bacteriol. 178 1996 3885–3892.

w x98 R. Aspinwall, D.G. Rothwell, T. Roldan-Arjona, C.Anselmino, C.J. Ward, J.P. Cheadle, J.R. Sampson, T.Lindahl, P.C. Harris, I.D. Hickson, Cloning and characteri-zation of a functional human homolog of Escherichia coli

Ž .endonuclease III, Proc. Natl. Acad. Sci. U. S. A. 94 1997109–114.

w x99 C.E. Piersen, M.A. Prince, M.L. Augustine, M.L. Dodson,R.S. Lloyd, Purification and cloning of Micrococcus luteusultraviolet endonuclease, an N-glycosylaserabasic lyase thatproceeds via an imino enzyme-DNA intermediate, J. Biol.

Ž .Chem. 270 1995 23475–23484.w x100 J.P. Horst, H.J. Fritz, Counteracting the mutagenic effect of

hydrolytic deamination of DNA 5-methylcytosine residuesat high temperature: DNA mismatch N-glycosylase Mig.Mthof the thermophilic archaeon Methanobacterium thermoau-

Ž .totrophicum THF, EMBO J. 15 1996 5459–5469.w x101 T.J. Begley, B.J. Haas, J. Noel, A. Shekhtman, W.A.

Williams, R.P. Cunningham, A new member of the endonu-clease III family of DNA repair enzymes that removes

Ž .methylated purines from DNA, Curr. Biol. 9 1999 653–656.

w x102 M.L. Michaels, L. Pham, C. Cruz, J.H. Miller, MutM, aprotein that prevents G.C–T.A transversions, is formami-dopyrimidine–DNA glycosylase, Nucleic Acids Res. 19Ž .1991 3629–3632.

w x103 M. Cabrera, Y. Nghiem, J.H. Miller, mutM, a secondmutator locus in Escherichia coli that generates G.C–T.A

Ž .transversions, J. Bacteriol. 170 1988 5405–5407.

Page 40: A phylogenomic study of DNA repair genes, proteins, and ......Feb 19, 2012  · ods against complete genome sequences Table 1 .. . Homologs of repair genes that had been cloned from

( )J.A. Eisen, P.C. HanawaltrMutation Research 435 1999 171–213210

w x104 P. Duwat, R. de Oliveira, S.D. Ehrlich, S. Boiteux, Repairof oxidative DNA damage in Gram-positive bacteria: the

Ž .Lactococcus lactis Fpg protein, Microbiology 141 1995411–417.

w x105 T. Mikawa, R. Kato, M. Sugahara, S. Kuramitsu, Ther-mostable repair enzyme for oxidative DNA damage fromextremely thermophilic bacterium, Thermus thermophilus

Ž .HB8, Nucleic Acids Res. 26 1998 903–910.w x106 D. Jiang, Z. Hatahet, R.J. Melamede, Y.W. Kow, S.S.

Wallace, Characterization of Escherichia coli endonucleaseŽ .VIII, J. Biol. Chem. 272 1997 32230–32239.

w x107 D. Jiang, Z. Hatahet, J.O. Blaisdell, R.J. Melamede, S.S.Wallace, Escherichia coli endonuclease VIII: cloning, se-quencing, and overexpression of the nei structural gene andcharacterization of nei and nei nth mutants, J. Bacteriol.

Ž .179 1997 3773–3782.w x108 T. Ohtsubo, O. Matsuda, K. Iba, I. Terashima, M. Sekiguchi,

Y. Nakabeppu, Molecular cloning of AtMMH, an Ara-bidopsis thaliana ortholog of the Escherichia coli mutMgene, and analysis of functional domains of its product,

Ž .Mol. Gen. Genet. 259 1998 577–590.w x109 P.A. van der Kemp, D. Thomas, R. Barbey, R. de Oliveira,

S. Boiteux, Cloning and expression in Escherichia coli ofthe OGG1 gene of Saccharomyces cereÕisiae, which codesfor a DNA glycosylase that excises 7,8-dihydro-8-oxoguanine and 2,6-diamino-4-hydroxy-5-N-methylfor-mamidopyrimidine, Proc. Natl. Acad. Sci. U. S. A. 93Ž .1996 5197–5202.

w x110 K. Arai, K. Morishita, K. Shinmura, T. Kohno, S.R. Kim,T. Nohmi, M. Taniwaki, S. Ohwada, J. Yokota, Cloning ofa human homolog of the yeast OGG1 gene that is involvedin the repair of oxidative DNA damage, Oncogene 14Ž .1997 2857–2861.

w x111 J.P. Radicella, C. Dherin, C. Desmaze, M.S. Fox, S. Boi-teux, Cloning and characterization of hOGG1, a humanhomolog of the OGG1 gene of Saccharomyces cereÕisiae,

Ž .Proc. Natl. Acad. Sci. U. S. A. 94 1997 8010–8015.w x112 T.A. Rosenquist, D.O. Zharkov, A.P. Grollman, Cloning

and characterization of a mammalian 8-oxoguanine DNAŽ .glycosylase, Proc. Natl. Acad. Sci. U. S. A. 94 1997

7429–7434.w x113 M. Takao, H. Aburatani, K. Kobayashi, A. Yasui, Mito-

chondrial targeting of human DNA glycosylases for repairŽ .of oxidative DNA damage, Nucleic Acids Res. 26 1998

2917–2922.w x114 W. Xiao, B.L. Chow, L. Rathgeber, The repair of DNA

methylation damage in Saccharomyces cereÕisiae, Curr.Ž .Genet. 30 1996 461–648.

w x115 J. Laval, J. Jurado, M. Saparbaev, O. Sidorkina, Antimuta-genic role of base-excision repair enzymes upon free radi-

Ž .cal-induced DNA damage, Mutat. Res. 402 1998 93–102.w x116 M. Furuta, J.O. Schrader, H.S. Schrader, T.A. Kokjohn, S.

Nyaga, A.K. McCullough, R.S. Lloyd, D.E. Burbank, D.Landstein, L. Lane et al., Chlorella virus PBCV-1 encodes ahomolog of the bacteriophage T4 UV damage repair gene

Ž .denV, Appl. Environ. Microbiol. 63 1997 1551–1556.

w x117 G. Barzilay, I.D. Hickson, Structure and function ofŽ .apurinicrapyrimidinic endonucleases, BioEssays 17 1995

713–719.w x118 C.F. Kuo, C.D. Mol, M.M. Thayer, R.P. Cunningham, J.A.

Tainer, Structure and function of the DNA repair enzymeexonuclease III from E. coli, Ann. N. Y. Acad. Sci. 726Ž .1994 223–234.

w x119 R.D. Camerini-Otero, P. Hsieh, Homologous recombinationproteins in prokaryotes and eukaryotes, Annu. Rev. Genet.

Ž .29 1995 509–552.w x120 A.J. Clark, S.J. Sandler, Homologous genetic recombina-

tion: the pieces begin to fall into place, Crit. Rev. Micro-Ž .biol. 20 1994 125–142.

w x121 S. Kowalczykowski, D. Dixon, A. Eggleston, S. Lauder, W.Rehrauer, Biochemistry of homologous recombination in

Ž .Escherichia coli, Microbiol. Rev. 58 1994 401–465.w x122 P.C. Hanawalt, P.K. Cooper, A.K. Ganesan, C.A. Smith,

DNA repair in bacteria and mammalian cells, Annu. Rev.Ž .Biochem. 48 1979 783–836.

w x123 A.K. Eggleston, S.C. West, Recombination initiation: easyŽ .as A, B, C, D . . . chi?, Curr. Biol. 7 1997 R745–749.

w x124 S.K. Farrand, I. Hwang, D.M. Cook, The tra region of thenopaline-type Ti plasmid is a chimera with elements relatedto the transfer systems of RSF1010, RP4, and F, J. Bacte-

Ž .riol. 178 1996 4233–4247.w x125 J. Alt-Morbe, J.L. Stryker, C. Fuqua, P.L. Li, S.K. Farrand,

S.C. Winans, The conjugal transfer system of Agrobac-terium tumefaciens octopine-type Ti plasmids is closelyrelated to the transfer system of an IncP plasmid anddistantly related to Ti plasmid vir genes, J. Bacteriol. 178Ž .1996 4248–4257.

w x126 M. el Karoui, D. Ehrlich, A. Gruss, Identification of thelactococcal exonucleaserrecombinase and its modulationby the putative Chi sequence, Proc. Natl. Acad. Sci. U. S.

Ž .A. 95 1998 626–631.w x127 J. Courcelle, C. Carswell-Crumpton, P.C. Hanawalt, recF

and recR are required for the resumption of replication atDNA replication forks in Escherichia coli, Proc. Natl.

Ž .Acad. Sci. U. S. A. 94 1997 3714–3719.w x128 K. Nakayama, S. Shiota, H. Nakayama, Thymineless death

in Escherichia coli mutants deficient in the RecF recom-Ž .bination pathway, Can. J. Microbiol. 34 1988 905–907.

w x129 H. Nakayama, K. Nakayama, R. Nakayama, N. Irino, Y.Nakayama, P. Hanawalt, Isolation and genetic characteriza-tion of a thymineless death-resistant mutant of Escherichia

Ž .coli K12: identification of a new mutation recQ1 thatblocks the RecF recombination pathway, Mol. Gen. Genet.

Ž .195 1984 474–480.w x130 M.D. Gray, J.C. Shen, A.S. Kamath-Loeb, A. Blank, B.L.

Sopher, G.M. Martin, J. Oshima, L.A. Loeb, The WernerŽ .syndrome protein is a DNA helicase, Nat. Genet. 17 1997

100–103.w x131 J.K. Karow, R.K. Chakraverty, I.D. Hickson, The Bloom’s

syndrome gene product is a 3X –5X DNA helicase, J. Biol.Ž .Chem. 272 1997 30611–30614.

w x132 P.M. Watt, I.D. Hickson, R.H. Borts, E.J. Louis, SGS1, a

Page 41: A phylogenomic study of DNA repair genes, proteins, and ......Feb 19, 2012  · ods against complete genome sequences Table 1 .. . Homologs of repair genes that had been cloned from

( )J.A. Eisen, P.C. HanawaltrMutation Research 435 1999 171–213 211

homologue of the Bloom’s and Werner’s syndrome genes,is required for maintenance of genome stability in Saccha-

Ž .romyces cereÕisiae, Genetics 144 1996 935–945.w x133 C.E. Yu, J. Oshima, Y.H. Fu, E.M. Wijsman, F. Hisama, R.

Alisch, S. Matthews, J. Nakura, T. Miki, S. Ouais et al.,Positional cloning of the Werner’s syndrome gene, Science

Ž .272 1996 258–262.w x134 N.A. Ellis, J. Groden, T.Z. Ye, J. Straughen, D.J. Lennon,

S. Ciocci, M. Proytcheva, J. German, The Bloom’s syn-drome gene product is homologous to RecQ helicases, Cell

Ž .83 1995 655–666.w x135 T. Hirano, SMC protein complexes and higher-order chro-

Ž .mosome dynamics, Curr. Opin. Cell. Biol. 10 1998 317–322.

w x136 R. Jessberger, C. Frei, S.M. Gasser, Chromosome dynam-ics: the SMC protein family, Curr. Opin. Genet. Dev. 8Ž .1998 254–259.

w x137 A.J. Clark, V. Sharma, S. Brenowitz, C.C. Chu, S. Sandler,L. Satin, A. Templin, I. Berger, A. Cohen, Genetic andmolecular analyses of the C-terminal region of the recEgene from the Rac prophage of Escherichia coli K-12

Ž .reveal the recT gene, J. Bacteriol. 175 1993 7673–7682.w x138 R. Kolodner, S.D. Hall, C. Luisi-DeLuca, Homologous

pairing proteins encoded by the Escherichia coli recE andŽ .recT genes, Mol. Microbiol. 11 1994 23–30.

w x139 K. Kusano, N.K. Takahashi, H. Yoshikura, I. Kobayashi,Involvement of RecE exonuclease and RecT annealing pro-tein in DNA double-strand break repair by homologous

Ž .recombination, Gene 138 1994 17–25.w x140 P. Noirot, R.D. Kolodner, DNA strand invasion promoted

by Escherichia coli RecT protein, J. Biol. Chem. 273Ž .1998 12274–12280.

w x141 J.C. Connelly, L.A. Kirkham, D.R.F. Leach, The SbcCDnuclease of Escherichia coli is a structural maintenance of

Ž .chromosomes SMC family protein that cleaves hairpinŽ .DNA, Proc. Natl. Acad. Sci. U. S. A. 95 1998 7969–7974.

w x142 G.J. Sharples, D.R.F. Leach, Structural and functional simi-larities between the SbcCD proteins of Escherichia coli and

Ž .the Rad50 and Mre11 Rad32 recombination and repairŽ .proteins of yeast, Mol. Microbiol. 17 1995 1215–1217.

w x X X143 T.T. Paul, M. Gellert, The 3 to 5 exonuclease activity ofMre 11 facilitates repair of DNA double-strand breaks, Mol.

Ž .Cell 1 1998 969–979.w x144 J.H.J. Petrini, M.E. Walsh, C. Dimare, X.N. Chen, J.R.

Korenberg, D.T. Weaver, Isolation and characterization ofŽ .the human MRE11 homologue, Genomics 29 1995 80–86.

w x145 J.P. Carney, R.S. Maser, H. Olivares, E.M. Davis, M. LeBeau, J.R. Yates, L. Hays, W.F. Morgan, J.H.J. Petrini, ThehMre11rhRad50 protein complex and Nijmegen breakagesyndrome: linkage of double-strand break repair to the

Ž .cellular DNA damage response, Cell 93 1998 477–486.w x146 J.A. Eisen, The RecA protein as a model molecule for

molecular systematic studies of bacteria: comparison oftrees of RecAs and 16s rRNAs from the same species, J.

Ž .Mol. Evol. 41 1995 1105–1123.w x147 T.M. Gruber, J.A. Eisen, K. Gish, D.A. Bryant, The phylo-

genetic relationships of Chlorobium tepidum and Chlo-

roflexus aurantiacus based upon their RecA sequences,Ž .FEMS Microbiol. Lett. 162 1998 53–60.

[148] N.Y. Stassen, J.M. Logsdon Jr., G.J. Vora, H.H. Offenberg, J.D. Palmer, M.E. Zolan, Isolation and characterization of rad51 orthologs from Coprinus cinereus and Lycopersicon esculentum, and phylogenetic analysis of eukaryotic recA homologs, Curr. Genet. 31 (1997) 144–157.

[149] K.W. King, A. Woodard, K. Dybvig, Cloning and characterization of the recA genes from Mycoplasma pulmonis and M. mycoides subsp. mycoides, Gene 139 (1994) 111–115.

[150] A. Marais, J.M. Bove, J. Renaudin, Spiroplasma citri virus SpV1-derived cloning vector: deletion formation by illegitimate and homologous recombination in a spiroplasmal host strain which probably lacks a functional recA gene, J. Bacteriol. 178 (1996) 862–870.

[151] N. Norioka, M.Y. Hsu, S. Inouye, M. Inouye, Two recA genes in Myxococcus xanthus, J. Bacteriol. 177 (1995) 4179–4182.

[152] P.R. Bianco, R.B. Tracy, S.C. Kowalczykowski, DNA strand exchange proteins: a biochemical and physical comparison, Front. Biosci. 3 (1998) d570–d603.

[153] J.D. Griffith, L.D. Harris, DNA strand exchanges, CRC Crit. Rev. Biochem. 23 (1988) S43–S86.

[154] B. Muller, S.C. West, Processing of Holliday junctions by the Escherichia coli RuvA, RuvB, RuvC and RecG proteins, Experientia 50 (1994) 216–222.

[155] T.N. Mandal, A.A. Mahdi, G.J. Sharples, R.G. Lloyd, Resolution of Holliday intermediates in recombination and DNA repair: indirect suppression of ruvA, ruvB, and ruvC mutations, J. Bacteriol. 175 (1993) 4325–4334.

[156] G.J. Sharples, S.N. Chan, A.A. Mahdi, M.C. Whitby, R.G. Lloyd, Processing of intermediates in recombination and DNA repair: identification of a new endonuclease that specifically cleaves Holliday junctions, EMBO J. 13 (1994) 6133–6142.

[157] S.N. Chan, S.D. Vincent, R.G. Lloyd, Recognition and manipulation of branched DNA by the RusA Holliday junction resolvase of Escherichia coli, Nucleic Acids Res. 26 (1998) 1560–1566.

[158] S.C. West, Processing of recombination intermediates by the RuvABC proteins, Annu. Rev. Genet. 31 (1997) 213–244.

[159] M.J. Schofield, D.M. Lilley, M.F. White, Dissection of the sequence specificity of the Holliday junction endonuclease CCE1, Biochemistry 37 (1998) 7733–7740.

[160] K. Ishioka, H. Iwasaki, H. Shinagawa, Roles of the recG gene product of Escherichia coli in recombination repair: effects of the delta recG mutation on cell division and chromosome partition, Genes Genet. Syst. 72 (1997) 91–99.

[161] P.A. Jeggo, G.E. Taccioli, S.P. Jackson, Menage a trois: double strand break repair, V(D)J recombination and DNA-PK, BioEssays 17 (1995) 949–957.

[162] D.A. Ramsden, M. Gellert, Ku protein stimulates DNA end joining by mammalian DNA ligases: a direct role for Ku in repair of DNA double-strand breaks, EMBO J. 17 (1998) 609–614.


[163] T.E. Wilson, U. Grawunder, M.R. Lieber, Yeast DNA ligase IV mediates non-homologous DNA end joining, Nature 388 (1997) 495–498.

[164] G. Chu, Double strand break repair, J. Biol. Chem. 272 (1997) 24097–24100.

[165] C.T. Keith, S.L. Schreiber, PIK-related kinases: DNA repair, recombination, and cell cycle checkpoints, Science 270 (1995) 50–51.

[166] D.R. Edgell, W.F. Doolittle, Archaea and the origin(s) of DNA replication proteins, Cell 89 (1997) 995–998.

[167] H. Shinagawa, SOS response as an adaptive response to DNA damage in prokaryotes, Experientia 77 (1996) 221–235.

[168] B.J. Hwang, J.M. Ford, P.C. Hanawalt, G. Chu, Expression of the p48 xeroderma pigmentosum gene is p53-dependent and is involved in global genomic repair, Proc. Natl. Acad. Sci. U. S. A. 96 (1999) 424–428.

[169] R.J. Lewis, P.C. Hanawalt, Ligation of oligonucleotides by pyrimidine dimers — a missing ‘link’ in the origin of life?, Nature 298 (1982) 393–396.

[170] A.E. Gorbalenya, E.V. Koonin, A.P. Donchenko, V.M. Blinov, Two related superfamilies of putative helicases involved in replication, recombination, repair and expression of DNA and RNA genomes, Nucleic Acids Res. 17 (1989) 4713–4730.

[171] K.E. Nelson, R.A. Clayton, S.R. Gill, M.L. Gwinn, R.J. Dodson, D.H. Haft, E.K. Hickey, J.D. Peterson, W.C. Nelson, K.A. Ketchum et al., Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima, Nature 399 (1999) 323–329.

[172] F.R. Blattner, G.I. Plunkett, C.A. Bloch, N.T. Perna, V. Burland, M. Riley, J. Collado-Vides, J.D. Glasner, C.K. Rode, G.F. Mayhew et al., The complete genome sequence of Escherichia coli K-12, Science 277 (1997) 1453–1462.

[173] S.G. Andersson, A. Zomorodipour, J.O. Andersson, T. Sicheritz-Ponten, U.C. Alsmark, R.M. Podowski, A.K. Naslund, A.S. Eriksson, H.H. Winkler, C.G. Kurland, The genome sequence of Rickettsia prowazekii and the origin of mitochondria, Nature 396 (1998) 133–140.

[174] J.F. Tomb, O. White, A.R. Kerlavage, R.A. Clayton, G.G. Sutton, R.D. Fleischmann, K.A. Ketchum, H.P. Klenk, S. Gill, B.A. Dougherty et al., The complete genome sequence of the gastric pathogen Helicobacter pylori, Nature 388 (1997) 539–547.

[175] R.A. Alm, L.S. Ling, D.T. Moir, B.L. King, E.D. Brown, P.C. Doig, D.R. Smith, B. Noonan, B.C. Guild, B.L. deJonge et al., Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori, Nature 397 (1999) 176–180.

[176] Sanger-Centre, personal communication.

[177] A. Kunst, N. Ogasawara, I. Moszer, A. Albertini, G. Alloni, V. Azevedo, M. Bertero, P. Bessieres, A. Bolotin, S. Borchert et al., The complete genome sequence of the Gram-positive bacterium Bacillus subtilis, Nature 390 (1997) 249–256.

[178] C.M. Fraser, J.D. Gocayne, O. White, M.D. Adams, R.A. Clayton, R.D. Fleischmann, C.J. Bult, A.R. Kerlavage, G. Sutton, J.M. Kelley et al., The minimal gene complement of Mycoplasma genitalium, Science 270 (1995) 397–403, see comments.

[179] R. Himmelreich, H. Hilbert, H. Plagens, E. Pirkl, B.C. Li, R. Herrmann, Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae, Nucleic Acids Res. 24 (1996) 4420–4449.

[180] S.T. Cole, R. Brosch, J. Parkhill, T. Garnier, C. Churcher, D. Harris, S.V. Gordon, K. Eiglmeier, S. Gas, C.E. Barry III et al., Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence, Nature 393 (1998) 537–544.

[181] C.M. Fraser, S.J. Norris, G.M. Weinstock, O. White, G.G. Sutton, R. Dodson, M. Gwinn, E.K. Hickey, R. Clayton, K.A. Ketchum et al., Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi, Nature 390 (1997) 580–586.

[182] C.M. Fraser, S.J. Norris, G.M. Weinstock, O. White, G.G. Sutton, R. Dodson, M. Gwinn, E.K. Hickey, R. Clayton, K.A. Ketchum et al., Complete genome sequence of Treponema pallidum, the syphilis spirochete, Science 281 (1998) 375–388.

[183] R.S. Stephens, S. Kalman, C. Lammel, J. Fan, R. Marathe, L. Aravind, W. Mitchell, L. Olinger, R.L. Tatusov, Q. Zhao et al., Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis, Science 282 (1998) 754–759.

[184] T. Kaneko, S. Sato, H. Kotani, A. Tanaka, E. Asamizu, Y. Nakamura, N. Miyajima, M. Hirosawa, M. Sugiura, S. Sasamoto et al., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803: II. Sequence determination of the entire genome and assignment of potential protein-coding regions, DNA Res. 3 (1996) 109–136.

[185] O. White, J.A. Eisen, J.F. Heidelberg, E.K. Hickey, J.D. Peterson, R.J. Dodson, D.H. Haft, M.L. Gwinn, W.C. Nelson, D.L. Richardson et al., Complete genome sequencing of the radioresistant bacterium, Deinococcus radiodurans, Science (1999), in press.

[186] G. Deckert, P.V. Warren, T. Gaasterland, W.G. Young, A.L. Lenox, D.E. Grahams, R. Overbeek, M.A. Snead, M. Keller, M. Aujay et al., The complete genome of the hyperthermophilic bacterium Aquifex aeolicus, Nature 392 (1998) 353–358.

[187] C.J. Bult, O. White, G.J. Olsen, L. Zhou, R.D. Fleischmann, G.G. Sutton, J.A. Blake, L.M. Fitzgerald, R.A. Clayton, J.D. Gocayne et al., Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii, Science 273 (1996) 1058–1073.

[188] D.R. Smith, L.A. Doucette-Stamm, C. Deloughery, H. Lee, J. Dubois, T. Aldredge, R. Bashirzadeh, D. Blakely, R. Cook, K. Gilbert et al., Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics, J. Bacteriol. 179 (1996) 7135–7155.

[189] Y. Kawarabayasi, M. Sawada, H. Horikawa, Y. Haikawa, Y. Hino, S. Yamamoto, M. Sekine, S. Baba, H. Kosugi, A. Hosoyama et al., Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3, DNA Res. 5 (1998) 55–76.

[190] H.-P. Klenk, R.A. Clayton, J.-F. Tomb, O. White, K.E. Nelsen, K.A. Ketchum, R.J. Dodson, M. Gwinn, E.K. Hickey, J.D. Peterson et al., The complete genomic sequence of the hyperthermophilic, sulfate-reducing archaeon Archaeoglobus fulgidus, Nature 390 (1997) 364–370.

[191] A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston et al., Life with 6000 genes, Science 274 (1996) 546, 563–567.