| Nucleic Acids Research | Pages |
Conserved domains in DNA repair proteins and evolution of repair systems
Introduction
Approach And Methods
Conserved Domains And Domain Architecture In DNA Repair Proteins
AP endonuclease/ENDO4 superfamily
UvrC endonuclease superfamily (Uri domain)
EndoV endonuclease superfamily
RAD1/ERCC4 endonuclease superfamily and its inactivated derivatives
The RecB nuclease domain family
DNA-Binding Domains
Adaptor domains
Phyletic Distribution And Evolution Of Repair Systems
Repair Proteins Conserved In All Three Superkingdoms Of Life
Repair Proteins And Pathways Confined To Only One Or Two Of The Superkingdoms
Repair systems of bacterial origin
Repair systems of archaeal and eukaryotic origin
Some General Trends In The Evolution Of Repair Systems
The Pressures Of External And Internal Environments
Horizontal Gene Transfer And Differential Gene Loss
Preadaptation: Which Repair Systems Have Been Inherited From The Cenancestor?
Continuing Evolution Of DNA Repair Proteins
Conclusions
References
Note Added In Proof
ABSTRACT
INTRODUCTION
The DNA-based information system of most biological replicators present in the extant world is plagued by the possibility of insult from mutation. Given the vast number of mutagens present in the environment throughout the history of life, as well as the intrinsic error rate of DNA replication, one would imagine a strong selection for systems capable of safeguarding the genetic information. Indeed, the genomes of all cellular lifeforms and several large DNA viruses encode multiple proteins whose function is to repair damaged DNA (1). In spite of the critical need for DNA repair, evolvability, that is, the ability to generate a certain level of uncorrected mutations, also seems to be selected for in the course of evolution. Organisms with an optimal level of evolvability have the best chance to survive environmental changes by virtue of stochastic variations in their genomes, which provide the raw material for natural selection. The complex interplay between the two opposing forces, namely the need for fidelity of transmission of genetic information and the need for evolvability, seems to define the organization of the repair systems.
DNA repair as a whole is a highly complex phenomenon. The repair mechanisms can be classified into several distinct, if not completely independent, major pathways that differ with regard to the level at which the lesions in damaged DNA are reversed or removed by the repair machinery: (i) direct damage reversal (DDR); (ii) base excision repair (BER); (iii) nucleotide excision repair (NER); (iv) mismatch repair (MMR); and (v) recombinational repair (RER). The general picture is further complicated by the existence of specialized, regulated forms of repair, such as the SOS response in bacteria, and by the intimate connection between repair, chromatin dynamics and the cell cycle in eukaryotes.
With the recent accumulation of complete genome sequences, it has become possible to systematically compare the repair systems of the respective organisms. Preliminary comparisons of this kind immediately made it clear that the repair machinery shows considerable variability, in terms of the present and absent genes, even in relatively close bacteria, such as Escherichia coli and Haemophilus influenzae (2). It was of major interest, therefore, to perform a systematic comparative analysis of the genes encoding proteins involved in repair in the three superkingdoms of life (bacteria, archaea and eukaryotes) and in the main bacterial subdivisions. Here we present the results of such an analysis and discuss several previously undetected conserved domains that were uncovered in the process, as well as functional and evolutionary implications of the phyletic distribution of various repair genes.
DNA repair systems and mechanisms have been described in a comprehensive monograph by Friedberg and co-workers (1) as well as in several more recent, excellent reviews dedicated to specific aspects of repair (3-10). In this article, we make no attempt to cover the functional aspects of repair in any depth. Instead, we concentrate on those new facets of our understanding of the relationships between repair proteins and the evolution of repair systems that have been brought about by the comparative analysis of repair systems encoded in completely sequenced genomes. Whenever available, review articles are cited, and experimental work is cited only inasmuch as it has a direct bearing on the conclusions drawn from genome analysis. Even with this focused approach, however, the number of relevant publications is quite substantial, and choices had to be made. We apologize to those researchers whose important work is not cited because of this, or simply by inadvertent but certainly regrettable omission.
APPROACH AND METHODS
Proteins were considered to be involved in DNA repair if, on the basis of literature searches, they were found to meet one or more of the following criteria: (i) a role in repair demonstrated by genetic studies on model organisms, such as E.coli and the yeast Saccharomyces cerevisiae; (ii) a demonstrated role in human repair deficiency syndromes, such as Xeroderma pigmentosum, Cockayne syndrome, Bloom's syndrome, Werner's syndrome and allied diseases; (iii) possession of a biochemical activity compatible with a role in repair and the genetic data. The sequences of repair proteins from E.coli and yeast were subjected to detailed analysis with the SEALS package (11), which allows automated large-scale database searches using the PSI-BLAST program (12) after masking compositionally biased regions in the query sequences with the SEG program (13). The PSI-BLAST program uses the sequences retrieved from the database with a certain cut-off similarity level to construct a position-dependent weight matrix that is used for further iterations of the search, resulting in a significantly increased sensitivity and allowing the detection of subtle sequence similarities. During this iterative search, the random expectation (e) value computed by PSI-BLAST at the first instance when the given sequence is retrieved from the database is a reliable indication of the significance of a match, provided the low complexity regions in the query are appropriately masked. By default, each repair protein sequence from E.coli and yeast was compared to the non-redundant (NR) database at the National Center for Biotechnology Information (NIH, Bethesda) using PSI-BLAST run for three iterations. Further, case-by-case dissection of the protein families was performed where needed using PSI-BLAST searches run to convergence with the sequences of individual domains as queries, as well as motif searches using the MoST program (14).
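The position-dependent weight matrix mentioned above can be illustrated with a minimal sketch. This is not the PSI-BLAST implementation: it assumes a gapless toy alignment, a uniform background frequency and a flat pseudocount, whereas PSI-BLAST applies sequence weighting and data-dependent pseudocounts. The function names `build_pssm` and `score` and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Simplifying assumptions (not PSI-BLAST's actual scheme): all sequences
    are weighted equally, the background frequency is uniform, and a flat
    pseudocount is added to every residue count.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column_scores[aa] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


def score(pssm, segment):
    """Sum the column scores of a segment that has the same length as the matrix."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the matrix length")
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment invented for illustration only.
toy_alignment = ["ERKAASD", "ERKGLSD", "ERKTVSD"]
toy_pssm = build_pssm(toy_alignment)
print(round(score(toy_pssm, "ERKAASD"), 2))
```

In an iterative search, a matrix of this kind would be rebuilt from the sequences retrieved above the cut-off after each pass and then used as the query for the next pass.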
Multiple alignments for the protein families were constructed using the -m4 option of PSI-BLAST, the CLUSTALW program (15) or the Gibbs sampling option of the MACAW program (16,17). Protein secondary structure predictions and structural database threading were performed using the PHD program (18,19). Structural models were manipulated using the Swiss-PdbViewer program. The phyletic distribution of homologous proteins detected by the PSI-BLAST searches was assessed using the Tax_collector program of the SEALS package.
Throughout this analysis, an attempt was made to identify orthologous genes in different genomes. By definition, orthologs are genes (proteins) related by vertical descent or, in other words, direct evolutionary counterparts in different species. By contrast, paralogs have been defined as homologous genes derived by duplication within a species (20,21). This dichotomy does not fully describe the relationships between genes in distantly related genomes. Firstly, due to multiple lineage-specific gene duplications occurring subsequent to the radiation of the respective lineages, orthology generally cannot be described as a one-to-one relationship between these individual genes (22). Secondly, it is common in comparisons of proteins from phylogenetically distant species that the given domain architecture found in one of them has no counterpart in the other genome; instead, certain proteins from the second genome share a homologous domain(s) with the protein in question but otherwise have different domain organizations. Approaches for the identification of likely orthologs in genome comparisons have been described previously (22,23). Briefly, proteins or protein families from different genomes were considered orthologous if they showed the greatest similarity to each other among all proteins encoded by the two genomes and a similar (but not necessarily identical) domain architecture. We tried to distinguish, as clearly as possible, between apparent orthologs with similar domain organizations and non-orthologous proteins sharing one or more conserved domains. This distinction appears critical for reliable prediction of protein functions and for the construction of realistic evolutionary scenarios.
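The similarity criterion described above (two proteins showing the greatest similarity to each other among all proteins encoded by the two genomes) corresponds to a reciprocal best-hit test. The sketch below is a simplified illustration under stated assumptions: similarity scores are supplied as a dictionary, the identifiers and scores are invented, and the additional requirement of a similar domain architecture is not checked. The function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): similarity_score}.
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each protein is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
genome_a_vs_b = {("gA1", "gB1"): 180, ("gA1", "gB2"): 150, ("gA2", "gB3"): 210}
genome_b_vs_a = {("gB1", "gA1"): 175, ("gB2", "gA1"): 148, ("gB3", "gA2"): 205}
print(reciprocal_best_hits(genome_a_vs_b, genome_b_vs_a))
```

Here gB2 has gA1 as its best hit, but gA1 prefers gB1, so gB2 is left unpaired; such a case is compatible with a lineage-specific duplication, where orthology is not a one-to-one relationship and this simple test is insufficient.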
CONSERVED DOMAINS AND DOMAIN ARCHITECTURE IN DNA REPAIR PROTEINS
Escherichia coli and the yeast S.cerevisiae are the two model organisms in which DNA repair has been studied in most detail. The identified repair genes from these species were used as the basis for the comparative analysis of the domain architecture of repair proteins and the phyletic distribution of repair systems (Tables
Table 1.
Table 2.
Figure 1. Domain architectures of selected repair proteins. (A) Helicases and nucleases; (B) ATPases; (C) BRCT domain-containing proteins. The figure is approximately to scale. Crossed symbols indicate domains with disrupted functional motifs. ERCC1 contains an apparently inactivated nuclease domain, whereas Mfd and ERCC4 contain inactivated helicase domains; ERCC4 also contains disrupted HhH motifs. SF1/2 stand for superfamily I and II helicases, respectively; pol, DNA polymerase I catalytic domain; URI, UvrC, intron nuclease domain; UVRBC, a possible adaptor domain shared by UvrB and UvrC; ENDOV, a nuclease domain found in endonuclease V and UVRC; FCL, Fe cluster; HhH, helix-hairpin-helix nucleic acid-binding domain; MdHhH, modified HhH domain; HRD, a predicted nucleic acid binding domain found in some recQ family helicases; RqC, RecQ C-terminal (domain); C2C2 (little finger), predicted small, metal-binding DNA binding module; PHD, a distinct type of Zn finger; ANK, ankyrin repeat; OB, oligonucleotide-binding domain; Nuc, nuclease; BRCT, BRCA1 C-terminal (domain); FHA, forkhead homology-associated (domain); S/T kinase, serine/threonine kinase. The different shapes of nuclease domains indicate the different nuclease (super)families described in the text and Figure 2. In (B) the NusA protein (a transcription factor not involved in repair) is shown to illustrate the conservation of the modified HhH domain that is also found in eukaryotic and archaeal RecA orthologs; NusA contains additional nucleic acid-binding domains, namely KH and S1. Other designations are directly on the figures. Double slash (//) shows that a middle portion was omitted in some large proteins. The proteins are identified by their gene names or names from the SWISS-PROT database, and the source species is indicated after an underline. 
The species abbreviations are: Af, A.fulgidus; Aq, A.aeolicus; Ce, C.elegans; Dm, Drosophila melanogaster; Ec, E.coli; Hs, Homo sapiens; Mj, M.jannaschii; Mta, M.thermoautotrophicum; Mtu, M.tuberculosis; Ph, P.horikoshii; Sc, S.cerevisiae.
A major outcome of comparative sequence analysis is the delineation of novel conserved domains and prediction of their functions as well as discovery of new structural and evolutionary connections between previously identified domains. The sequences and subsequently structures of the main catalytic domains of polymerases, helicases and other ATPases have been characterized in detail in previous studies, and are readily recognizable due to the conservation of diagnostic motifs (e.g. 28-30). Thus the current analysis did not significantly expand these protein superfamilies. An interesting finding, however, is that several well-characterized DNA repair proteins contain domains with statistically significant similarity to helicases but with disrupted functional motifs, which suggests that while retaining the overall structure typical of helicases, they do not possess enzymatic activity. Examples of such apparent inactivation of helicases in repair systems include bacterial RecC and AddB proteins, transcription-repair coupling factor (Mfd or TRCF) and eukaryotic ERCC4 (Fig.
Figure 2. Multiple sequence alignment of previously undetected and expanded domain families of repair proteins. (A) AP endonuclease/ENDO4 superfamily; (B) Uri domain endonuclease family; (C) EndoV endonuclease family; (D) RAD1/ERCC4 endonuclease superfamily; (E) RecB nuclease domain family; (F) PCNA family. The alignments were constructed on the basis of the PSI-BLAST results using the ClustalW program. The left column includes the protein names from the SWISS-PROT database or gene names, and the Gene Identification (GI) numbers (after the underscore). The species abbreviations are: ASFV, African Swine Fever Virus; BPML5, M.leprae bacteriophage 5; BPT4, bacteriophage T4; CHIV, Chilo Iridescent virus; NPV, Nuclear Polyhedrosis virus; PBCV, Paramecium bursaria Chlorella virus; Aa, A.aeolicus; Aae, Alcaligenes eutrophus; Af, A.fulgidus; Amac, Allomyces macrogynus; At, Arabidopsis thaliana; Bb, Borrelia burgdorferi; Bs, B.subtilis; Ce, C.elegans; Celo, Chlorogonium elongatum; Ceug, Chlamydomonas eugametos; Dm, D.melanogaster; Hs, H.sapiens; Ct, Chlamydia trachomatis; Ec, E.coli; Hi, H.influenzae; Hp, Helicobacter pylori; Ll, Lactococcus lactis; Mj, M.jannaschii; Mge, Mycoplasma genitalium; Mhy, Mycoplasma hyorhinis; Mpn, Mycoplasma pneumoniae; Mta, M.thermoautotrophicum; Mtu, M.tuberculosis; Nc, Neurospora crassa; Ngo, Neisseria gonorrhoeae; Pa, Podospora anserina; Pf, Pyrococcus furiosus; Ph, P.horikoshii; Pv, Phaseolus vulgaris; Rsph, Rhodopseudomonas spheroides; Sag, Streptococcus agalactiae; Sc, S.cerevisiae; Sp, S.pombe; Ss, Synechocystis sp.; St, Streptococcus thermophilus; Tp, T.pallidum; Um, Ustilago maydis; Vf, Vicia faba.
In each panel, a consensus derived using the indicated percentage cut-off is shown, and the respective alignment columns are highlighted through differential coloring; b indicates a big residue (E,K,R,I,L,M,F,Y,W), h indicates hydrophobic residues (A,C,F,I,L,M,V,W,Y), s indicates small residues (A,C,S,T,D,N,V,G,P), u indicates tiny residues (G,A,S), p indicates polar residues (D,E,H,K,N,Q,R,S,T), c indicates charged residues (K,R,D,E,H), and - indicates negatively charged residues (D,E). The conserved charged residues that may be directly involved in enzymatic catalysis are indicated by asterisks. The distances from the aligned regions to the protein termini and the distances between the conserved blocks, where more variable regions were omitted, are indicated by numbers. In (F), the secondary structure elements derived from the crystal structure of PCNA are shown underneath the alignment; E indicates extended conformation (β-strand), and H indicates α-helix.
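The consensus notation defined in this legend can be sketched in code. The residue classes below are copied from the legend; everything else is an assumption made for illustration, in particular the rule of reporting a single residue first and otherwise the smallest class that reaches the cut-off, the "." symbol for columns with no consensus, and the function names. The procedure actually used to derive the consensus lines in Figure 2 is not specified in the text.

```python
# Residue classes as listed in the legend; dict order breaks ties between
# classes of equal size (an arbitrary choice made for this sketch).
RESIDUE_CLASSES = {
    "-": set("DE"),          # negatively charged
    "u": set("GAS"),         # tiny
    "c": set("KRDEH"),       # charged
    "b": set("EKRILMFYW"),   # big
    "h": set("ACFILMVWY"),   # hydrophobic
    "s": set("ACSTDNVGP"),   # small
    "p": set("DEHKNQRST"),   # polar
}


def column_consensus(column, cutoff=0.8):
    """Return a consensus symbol for one alignment column (a string of residues)."""
    n = len(column)
    if n == 0:
        raise ValueError("empty column")
    # A single residue that reaches the cut-off is reported as itself.
    for residue in sorted(set(column)):
        if residue.isalpha() and column.count(residue) / n >= cutoff:
            return residue
    # Otherwise report the smallest class covering enough of the column.
    by_size = sorted(RESIDUE_CLASSES.items(), key=lambda item: len(item[1]))
    for symbol, members in by_size:
        covered = sum(1 for residue in column if residue in members)
        if covered / n >= cutoff:
            return symbol
    return "."  # assumed placeholder for "no consensus"


def alignment_consensus(aligned_seqs, cutoff=0.8):
    """Build a consensus string column by column from equal-length sequences."""
    return "".join(
        column_consensus("".join(column), cutoff) for column in zip(*aligned_seqs)
    )


# Toy alignment invented for illustration only.
print(alignment_consensus(["KDI", "KEL", "KDV", "KEA", "KDF"]))
```

Lowering `cutoff` makes the consensus more permissive, which mirrors the use of different percentage cut-offs in the individual panels.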
Nucleases generally tend to be less conserved in evolution than ATPases or polymerases. Some superfamilies, e.g. the 3′→5′ nucleases (32), the 5′→3′/FLAP nuclease superfamily (33), as well as the phosphoesterase superfamily that includes such nucleases as SbcD and Mre11 (34), have been extensively studied. There are, however, many other groups of nucleases that have not been characterized in comparable detail, and in the course of the present analysis, we have delineated four superfamilies of nucleases that, to our knowledge, have not been recognized previously, and identified the likely origin of another major superfamily.
AP endonuclease/ENDO4 superfamily
Bacterial endonuclease IV is a homolog of eukaryotic apurinic endonucleases (35). Representatives of this family of endonucleases were detected in all bacterial, archaeal and eukaryotic species. Unexpectedly, iterative database searches revealed statistically significant similarity (e ~ 10⁻⁴, iteration 3) between this endonuclease family and sugar isomerases (including xylose isomerases, tagatose epimerases and hexulose isomerases) that have the TIM barrel structural fold. The endonucleases and sugar isomerases share several conserved motifs, in particular the [DE]X2H signature as well as four histidines that are conserved in most of the proteins (Fig.
UvrC endonuclease superfamily (Uri domain)
UvrC protein is the endonuclease subunit of the bacterial excision repair complex that also includes the ABC-type ATPase UvrA and the helicase UvrB (41,42). Iterative database searches showed that UvrC contains a domain with statistically significant similarity (e < 10⁻³ at the sixth iteration) to intron-encoded endonucleases and several uncharacterized bacterial, archaeal and viral proteins (we designated this domain Uri after UvRC and Intron-encoded endonucleases). This previously undetected endonuclease family contains a RX3[YH] sequence signature, two conserved tyrosines that typically are separated by 10 residues, and a conserved glutamate (Fig.
EndoV endonuclease superfamily
The endonuclease V (E.coli nfi gene product), which is highly conserved in eukaryotes, showed subtle but statistically significant similarity (e < 10⁻³ in the second PSI-BLAST iteration) to a region of UvrC that is located between the Uri domain and the C-terminal helix-hairpin-helix (HhH) domain. Multiple alignment of the EndoV family with the UvrC sequences showed the conservation of two aspartates and a lysine that may be directly involved in catalysis as well as several potential structural elements (Fig.
RAD1/ERCC4 endonuclease superfamily and its inactivated derivatives
The human ERCC4 protein and its yeast ortholog RAD1 are endonucleases involved in NER (44). Our analysis revealed orthologs of this enzyme in archaea but not in bacteria. Additionally, a second paralog of ERCC4 was detected in the genomes of S.cerevisiae and Schizosaccharomyces pombe and may belong to a novel eukaryotic repair pathway. The only detectable bacterial member of this family is an uncharacterized protein from Mycobacterium tuberculosis. All the (predicted) nucleases of this superfamily contain the strikingly conserved signature ERKX2SD as well as an additional conserved aspartate; the conserved negatively-charged residues are likely to function in metal ion coordination and as nucleophiles in catalysis (Fig.
Further iterative database searches using the nuclease-HhH portion of the ERCC4 family proteins as the query detected a relationship with another family of eukaryotic repair proteins that includes human ERCC1 and its homologs in other eukaryotes, such as yeast RAD10 (Fig.
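The sequence signatures quoted in the preceding subsections ([DE]X2H, RX3[YH] and ERKX2SD) use a compact notation in which X stands for any residue, a number after X gives the repeat count, and brackets list alternatives. A minimal sketch of how such a signature could be translated into a regular expression and located in a sequence follows; the helper names and the toy sequences are hypothetical, and short signatures such as [DE]X2H will match many unrelated proteins by chance, so a match on its own is not evidence of homology.

```python
import re


def signature_to_regex(signature):
    """Translate a signature such as 'ERKX2SD' or '[DE]X2H' into a compiled regex.

    'X<n>' becomes n wildcard positions and a bare 'X' becomes one wildcard.
    The literal ambiguity code X in a sequence is not supported by this sketch.
    """
    pattern = re.sub(r"X(\d+)", lambda m: ".{%s}" % m.group(1), signature)
    pattern = pattern.replace("X", ".")
    return re.compile(pattern)


def find_signature(sequence, signature):
    """Return the 0-based start positions of non-overlapping signature matches."""
    regex = signature_to_regex(signature)
    return [match.start() for match in regex.finditer(sequence)]


# Toy sequences invented for illustration only.
print(find_signature("MAERKTLSDGG", "ERKX2SD"))
print(find_signature("AADKLHAA", "[DE]X2H"))
print(find_signature("GRAAAYG", "RX3[YH]"))
```

In practice, signature matches of this kind are meaningful only in the context of a statistically supported alignment, as in the iterative searches described above.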
The RecB nuclease domain family
The C-terminal portion of the RecB (E.coli) and AddA (Bacillus subtilis) subunits is required for the nuclease activity of the RecBCD and AddABC complexes, respectively (47,48). Sequence analysis performed using PSI-BLAST showed that this domain is present as a stand-alone version in several bacterial, archaeal, eukaryotic and phage proteins, and also is fused to other superfamily I helicases such as yeast DNA helicase 2 and its orthologs from other eukaryotes, in which it is located N-terminal to the helicase domain, in contrast to its location in RecB and AddA (Fig.
DNA-BINDING DOMAINS
All components of the DNA repair machinery must be delivered to the sites of their action on DNA: some bind DNA directly, whereas others rely on protein-protein interactions. Many repair proteins that interact with DNA contain distinct, compact DNA-binding domains that combine with different enzymatic or adaptor domains (Fig.
Some conserved domains in repair proteins are implicated in DNA binding even in the absence of direct experimental characterization for any representative, primarily on the basis of their predicted compact structure, small size and absence of conserved polar residues that could be involved in a catalytic activity. An example of such a predicted nucleic acid-binding domain awaiting experimental corroboration is the HRD domain found in a subset of the RecQ family helicases, e.g. the human Werner's and Bloom's syndrome gene products, and in RNase D (54).
Adaptor domains
The components of the repair machinery typically function in the form of macromolecular complexes that consist of multiple, diverse subunits. Therefore, in addition to DNA-binding domains, adaptor domains, that is domains that mediate protein-protein interactions between the components of repair complexes as well as between repair proteins and other cellular components, have a prominent role in repair. Adaptor domains are particularly important in eukaryotes where repair is intimately connected to the dynamics of chromatin-associated protein complexes and their alteration linked to the progression of the cell cycle, but prokaryotic adaptors also seem to exist. An example of likely bacterial adaptors is the domain shared by the UvrB (C-terminal domain) and UvrC proteins and implicated in the formation of the complex between these proteins (Fig.
Arguably, the most important adaptor domain involved in eukaryotic repair is the BRCT (BRca1 C-terminal) domain that has been detected in a vast variety of proteins involved in repair and cell cycle checkpoint regulation and may provide the critical connections between these processes (56,57; see also the discussion below). The BRCT domain occurs on its own in multiple copies as in yeast RAD9 or combines with a variety of enzymatic and DNA-binding domains as in terminal nucleotidyl transferases (TdT), REV1 and DNA ligases. In those instances where the function of the BRCT domain has been determined experimentally, BRCT domains of different repair proteins, such as DNA ligases III, XRCC1, poly(ADP-ribose) polymerase (PARP) and BRCA1, appear to mediate specific protein-protein interactions (58-60), which provides for the formation of protein complexes involved both in repair and in cell cycle checkpoints.
Examination of the protein sequences that have become available subsequent to the previous analyses of the BRCT domain revealed several interesting new occurrences (Fig.
The list of adaptor domains involved in repair and its interaction with cell cycle checkpoints is growing. The FHA (forkhead homology associated) domain has been detected in a variety of proteins with diverse functions, including protein kinases implicated in DNA damage response (61) and Xrs2 which participates in the repair of double strand breaks (62). The recent demonstration that the FHA domain of the RAD53 kinase interacts with the phosphorylated form of the BRCT protein RAD9 (63) indicates that FHA is a repair-checkpoint adaptor that may recognize phosphorylated proteins, perhaps even specifically phosphorylated BRCT domains. This possibility is of particular interest given the independent evolution of proteins combining the FHA and BRCT domains on at least two occasions (Fig.
The recently described HORMA domain that has been detected in the yeast REV7 protein involved in translesion DNA synthesis and in proteins that participate in the spindle assembly checkpoint and synaptonemal complex formation in meiosis, such as MAD2 and HOP1, is an example of an adaptor with a more limited distribution which, however, may have a critical role in linking repair with the cell cycle (64).
A protein with versatile adaptor functions is the proliferating cell nuclear antigen (PCNA), which was originally identified as the sliding clamp that is required to increase the processivity of eukaryotic DNA polymerase (65). More recently, it has been shown that PCNA is required for NER and MMR and interacts with a variety of repair proteins (65,66). In the course of the present analysis, we showed that PCNA is homologous to a group of proteins involved in repair and DNA damage checkpoints that include yeast RAD17, S.pombe Rad1 and Hus1, REC1 from Ustilago, and their mammalian orthologs (Fig.
Figure 3. A structural model of E.coli endonuclease IV built using the xylose isomerase structure as a template. The structural manipulations were done using the Swiss-PdbViewer program. Using the multiple alignment shown in Figure 2A, a composite target sequence of the AP endonuclease was constructed, with the xylose isomerase structure (PDB code 8XIA) serving as a template. The alignment of this composite sequence with 8XIA was further adjusted so that the energy of the target was globally minimized using a Sippl-like field. The resulting refined alignment was submitted as a PROMODII job, and the model was obtained. The deoxyribose of DNA appears to be positioned in the endonuclease model exactly as the xylose molecule is in xylose isomerase. The strands are colored red, the helices gold, the conserved aspartate orange, the conserved histidines (labeled H1-H5 from the N- to the C-terminus) green.
PHYLETIC DISTRIBUTION AND EVOLUTION OF REPAIR SYSTEMS
The biochemical studies on repair systems have been mostly limited to a few model species, such as E.coli, the yeast S.cerevisiae, and humans. Therefore, analysis of the distribution of orthologs of repair proteins from these organisms in different phylogenetic lineages not only provides the material for evolutionary scenarios but effectively amounts to the reconstruction of the repair systems in poorly studied organisms. Evidently, the completeness and precision of such a reconstruction depend both on the quality of analysis and on the level of conservation of the repair mechanisms between the organisms in question and one of the model species.
The most striking aspect of the phyletic distribution of repair systems that becomes apparent through the comparison of complete protein sets from distant species is that while the repertoire of principal domains involved in repair, such as several distinct types of helicases and nucleases, is to a large extent conserved in all cells, the number of orthologous or even clearly functionally equivalent repair proteins that are shared by all the three superkingdoms is very small. By contrast, there is a much greater number of repair proteins that are conserved in one or two superkingdoms (Tables
REPAIR PROTEINS CONSERVED IN ALL THREE SUPERKINGDOMS OF LIFE
There seem to be no known repair proteins with an identical domain arrangement conserved in bacteria, archaea and eukaryotes. There are, however, a few highly conserved proteins with limited variations of domain architecture, of which the only one encoded in all genomes sequenced so far and apparently truly universal, is the RecA/RadA recombinase, which plays a central role in DNA recombination and RER (73,74). While RecA(RadA) appears to have been vertically transmitted throughout the history of life, its evolution has been accompanied by notable variations on the main theme, the most important being the fusion with a modified HhH domain that is shared by archaea and eukaryotes (Fig.
Another universally conserved domain that is found, however, in significantly different structural and functional contexts in bacteria, on one hand, and in archaea and eukaryotes, on the other hand, is the FLAP nuclease (78-80). In archaea and eukaryotes, these nucleases (e.g. yeast RAD2 and RAD27) cleave recombination and repair intermediates containing overlapping 5′-flaps at sites of nicks; they also possess 5′-3′ exonuclease activity that may be involved in the hydrolysis of these flaps (78,79). The bacterial ortholog of the FLAP endonucleases is the N-terminal, 5′-3′ exonuclease domain of DNA polymerase I (Fig.
Several other repair proteins, though not ubiquitous, are found in most representatives of all three superkingdoms (Table 1). The most striking examples of this kind are the SMC-like ATPases and the associated nucleases. These ATPases (typified by the E.coli SbcC protein) belong to the ABC superfamily but have a large coiled-coil domain inserted between the P-loop and the Mg²⁺-binding motif that together comprise the ATP-binding site. They are seen in almost all complete genomes (Table 1), and in eukaryotes, are involved in ATP-dependent, large-scale modifications of the chromatin structure (82,83). The SMC-like ATPases form complexes with the equally common nucleases of the calcineurin-like phosphoesterase superfamily, such as bacterial SbcD-like proteins and eukaryotic Mre11-like proteins (84-86). It seems likely that this ATPase-nuclease pair was vertically inherited in all life forms with a loss in a few lineages. Other conserved repair proteins found in all three superkingdoms, with a varying degree of representation among specific lineages, include photolyases (phrB gene product in E.coli), endonuclease III (nth and mutY), exonuclease III (xthA), 8-oxo-dGTPase (mutT) and the UmuC protein superfamily. Each of these enzymes is involved in a basic repair function (1 and references therein), but their activities are, in principle, dispensable as each of them is missing in some of the bacterial or archaeal species with small genomes (Table 1).
REPAIR PROTEINS AND PATHWAYS CONFINED TO ONLY ONE OR TWO OF THE SUPERKINGDOMS
The protein families discussed in the previous section represent the relatively small number of cases when homologous domains arranged in similar, if not identical, combinations appear to perform similar functions in repair in all three superkingdoms. By contrast, most of the repair systems have more limited phyletic distribution, which in some instances may suggest plausible scenarios for their evolution.
Repair systems of bacterial origin
Several repair systems are essentially unique to bacteria but some of these additionally are seen in eukaryotes, to the exclusion of the archaea (Table 1), which may suggest horizontal gene transfer, in most cases probably from the mitochondrial genome to the eukaryotic nuclear genome. The UvrABC excisionase, together with the UvrD helicase that is functionally coupled to it, are the principal components of NER in bacteria (4) and are encoded in all bacterial genomes sequenced to date, including the minimal genomes of Mycoplasma. Outside the bacteria, however, this system has been detected in only one archaeon, namely Methanobacterium thermoautotrophicum. Methanobacterium thermoautotrophicum has a complete operon including the uvrA, B and C genes, and UvrD encoded elsewhere in the genome, which strongly suggests horizontal transfer from bacteria. The domain architecture of all three excisionase subunits is conserved throughout bacteria, but the presence of the Uri and EndoV nuclease domains in other contexts (Fig.
The second widespread bacterial repair system is the RuvAB(C) complex, which is the Holliday junction resolvase and the key component of bacterial RER (87,88). Interestingly, RuvC, the endonuclease subunit, is not detectable in Mycoplasma and spirochaetes, suggesting that a distinct nuclease may have been recruited in these bacteria for the participation in Holliday junction resolution. As in the case of the UvrABCD system, each of the Ruv proteins contains well known ancient conserved domains (Table 1) but orthologs of these proteins so far have been detected only in bacteria.
A different phylogenetic pattern was observed among the components of the base MMR system (5,89). This system depends primarily on two proteins containing ATPase domains of different structures, namely MutL (90,91) and MutS (28), both of which are highly conserved among bacteria, though missing in Mycoplasma.
Only the MutS family proteins are seen in the archaea M.thermoautotrophicum (with an additional HhH domain) and Pyrococcus horikoshii. This finding is of particular interest as these are so far the only genomes in which a gene for MutS is not accompanied by a MutL gene, suggesting the possibility of functional uncoupling between these MMR system components. Phylogenetic analysis of the MutS protein sequences shows that a gene duplication resulting in two distinct forms of MutS occurred very early in bacterial evolution (data not shown). This is supported, in particular, by the presence of both forms in bacteria from several major lineages, such as Aquifex aeolicus, B.subtilis and Synechocystis. There is a major expansion of genes encoding MutL and MutS homologs in eukaryotes, with at least five or six members found in each eukaryotic genome. This expansion apparently involves functional diversification, in particular between nuclear and mitochondrial DNA repair. In the course of this analysis, we observed that one of the families of eukaryotic MutS homologs (GMBP1) contains an additional domain (BMB domain in Fig.
Illegitimate recombination in bacteria and eukaryotes is suppressed by the RecQ helicase family members, which accordingly appear to play a major role in the maintenance of chromosomal integrity (93,94). There are two highly conserved RecQ paralogs, which differ by the presence or absence of the putative DNA-binding HRD domain (54); one or both paralogs may be present in the same genome in different bacterial lineages. Multiple orthologs of both of these RecQ-like helicases are detectable in eukaryotes but not in archaea. Remarkably, two human genes that are mutated in hereditary diseases associated with repair defects, namely Bloom's and Werner's syndromes (95,96), encode HRD domain-containing helicases of the RecQ family (Fig.
The only repair protein that is conserved in most bacteria and apparently all archaea, to the exclusion of eukaryotes, is the RecJ 5′-3′ exonuclease, which belongs to the recently identified DHH superfamily of phosphohydrolases (97). The eukaryotic members of this superfamily (e.g. the Drosophila Prune protein) are only distantly related to RecJ and do not seem to be involved in repair. RecJ has been implicated both in RER and in the post-incision removal of 5′-deoxyribose phosphate in BER (98,99) but it appears that the common function of this nuclease underlying its notable conservation in bacteria and archaea remains to be identified. Additional, specifically bacterial repair pathways rely on distinct members of the ABC superfamily of ATPases, such as RecN and RecF, helicases, e.g. RecG (100) and accessory, single-stranded DNA-binding proteins, such as RecO and RecR (101). The evolution of RecR is of particular interest as it is a clear case of recruitment of an enzymatic domain, namely the recently identified common catalytic domain of DNA primases and topoisomerases (Toprim domain; 102), for a non-enzymatic function. Bacteria have evolved a unique regulatory system, which allows them to produce a complex response to DNA damage. This system depends on the DNA-binding transcription regulators LexA (103) and UmuD (104) containing a C-terminal signal peptidase-like domain, which catalyzes RecA-dependent autoproteolysis of these proteins, thus activating the DNA-binding domain. LexA is a general transcriptional regulator of repair functions; LexA orthologs are limited in their distribution to several bacterial lineages. The theme of the association of proteolysis with repair, however, appears to be more general. The bacteria-specific repair ATPase Sms consists of three domains (Fig. 
Coupling of transcription and repair appears to confer a definite selective advantage as it enables the organism to repair functional genes as they are expressed and thus escape the immediate effects of deleterious mutations resulting in non-functional proteins. This coupling seems to have evolved independently in bacteria and in eukaryotes. The bacterial version is dependent on the superfamily II helicase Mfd/TRCF (105,106) that is conserved in several bacterial lineages and contains a second, apparently inactivated helicase domain whose function could be the recruitment of other repair proteins (Fig. Several other repair pathways are restricted to just a few groups of bacteria (Table 1); a thoroughly studied example is the RecBCD helicase-exonuclease complex, which is the central component of RER. In some cases, recruitment of a repair enzyme in a subset of bacteria from rather unexpected sources seems likely. Thus the dcm and dam methylases (107) appear to have been recruited from restriction system methylases of phage origin. Similarly, the MutH endonuclease involved in MMR and so far found only in E.coli and H.influenzae probably has been derived from a restriction endonuclease related to Sau3 (108). The NER system, transcription-repair coupling components and the vast repertoire of regulatory proteins distinguish the eukaryotic repair systems from bacterial ones. While the NER system includes components that individually trace back to the common ancestor of the archaea and eukaryotes, the transcription-repair coupling mechanism and the regulatory apparatus seem to be true eukaryotic inventions that probably have evolved in response to the diversification of the eukaryotic chromatin structure and cell cycle control. Even within the eukaryotes, while the core machinery appears to be conserved throughout, there are several notable, lineage-specific modifications of the regulatory system. 
The understanding of the core eukaryotic repair systems has largely been derived from the RAD complementation groups in yeast (109) and the Xeroderma pigmentosum complementation groups in humans (110) (Table 2). The intersection of the results produced by these principal lines of research delineates the conserved central components of eukaryotic NER. The eukaryotic NER system is built up of a number of distinct helicases and nucleases. The helicases include ERCC2 (Xp-D) (111), ERCC3 (Xp-B) (112) and ERCC6 (Cs-B) (113). The ERCC2 helicase is conserved in all eukaryotes sampled so far and shows a distant but apparently orthologous relationship with the DinG helicase (114) seen in several bacteria and the archaeon M.jannaschii, suggesting an ancient involvement in repair. However, beyond the general helicase role, the members of this family appear to have undergone functional differentiation following independent duplication in different phylogenetic lineages. For example, the eukaryotic CHL1 helicase, a member of the ERCC2 family, has a role in maintaining chromatin integrity (115). The ERCC3 helicase family shows an unusual phyletic distribution: in addition to its conservation in eukaryotes, it is also present in the archaeon A.fulgidus, the bacteria Mycobacterium leprae (116) and Treponema pallidum, African swine fever virus and some bacteriophages, suggesting multiple horizontal gene transfer events. Given the lack of orthologs of other members of the eukaryotic-type NER complex in bacteria and archaea, it is unlikely that these scattered ERCC3 orthologs share functional details with the eukaryotic enzyme. The ERCC6 helicase belongs to the ancient SWI/SNF family that is conserved in bacteria and eukaryotes. In eukaryotes, however, this family has undergone a striking expansion, with 17 paralogous members in yeast (117), many of which are involved in repair. 
Bacterial helicases of the HepA family, which are orthologous to the ERCC6 family (118), may be involved in repair and specifically in the repair-transcription coupling (119), but this family is represented by only one or two members in each bacterial genome when present. Thus it is obvious that the SWI/SNF family has attained its current functional differentiation only after the origin of the eukaryotes. This must have been an early event in eukaryotic evolution since for a number of these helicases, orthologous relationships can be traced in yeast, plants and animals. In some of these orthologous sets, such as RAD5 (120) and RAD16 (121), a unique domain organization, with a RING finger inserted into the helicase domain, between the helicase motifs 5 and 6 (Fig. The nuclease components of the NER system also are highly conserved, and as noted above, ERCC4 is seen in archaea as well, fused to an apparently active N-terminal helicase domain. The other nucleases in this pathway, such as Xp-G, Rad2 and Rad27, are members of the universally-conserved FLAP/FEN family (122). Another NER component is the UV-damaged DNA-binding protein (UV-DDB) which partially complements the XP-E defect (123). UV-DDB is a member of a family that has two additional paralogs conserved in eukaryotes, one of which is a component of the polyA cleavage specificity factor (CPSF-A) (Table 2). In this context it is interesting to note that another repair protein SNM1, which is involved in UV cross-link repair in yeast (124), is homologous to other CPSF subunits that contain a metallo-β-lactamase domain (125). The regulation of repair and its connection with cell cycle checkpoints are the most dramatic distinguishing features of the eukaryotic repair system that have undergone considerable evolution after the divergence of the eukaryotes from the other superkingdoms of life. 
The proteins providing for these features typically have no orthologs in bacteria or archaea, even though some of the adaptor domains are conserved. The understanding of the likely structural basis of the repair-checkpoint coupling has been significantly advanced through the discovery of a single domain, the BRCT domain, which appears to be the most common adaptor in the eukaryotic repair machinery. The yeast genome encodes 10 BRCT-containing proteins (57), and the number of these proteins encoded in the genomes of multicellular eukaryotes is expected to be even greater. As discussed above, certain distinct domain architectures of BRCT-containing proteins are highly conserved in evolution. Generally, however, domain shuffling seems to be the predominant trend in the evolution of the BRCT-containing proteins. Thus, of the 10 yeast BRCT-containing proteins, only three, namely the DNA ligase, DNA polymerase subunit 2 (DPB11) and the REV1 nucleotidyltransferase, are represented by orthologs with a conserved domain arrangement in Caenorhabditis elegans. Conversely, C.elegans encodes a number of BRCT-containing proteins with unique domain architectures. The BRCT domain thus far has not been detected in archaea but is invariably present at the C-terminus of bacterial DNA ligases. This phyletic distribution suggests that, similarly to several other components of the repair system (e.g. MMR components), the BRCT domain most likely invaded the eukaryotic genomes by gene transfer from bacteria and subsequently underwent a dramatic expansion in the eukaryotes. The detection of a BRCT domain protein in trypanosomes indicates that the proposed horizontal gene transfer event dates to a very early stage in the evolution of eukaryotes. There are other proteins with very diverse functions that appear to connect the eukaryotic repair systems with chromatin. 
Typically, such proteins contain eukaryote-specific adaptor domains, such as the RING finger (126) in some of the SWI family helicases and other proteins like RAD18, the WD40 repeats in CS-A (127), and ubiquitin and duplicated ubiquitin hydrolase domains in Xp-C/Rad23 (128). The signal transmission from damaged DNA to the checkpoint machinery relies upon a phosphorylation cascade that includes FHA domain-containing kinases, such as SAD1 (129) and DUN1 (130), and the ATM kinases (131) of the lipid kinase superfamily. Finally, several eukaryotic proteins regulate the repair machinery at the level of transcription; the best characterized representatives of this group are p53 and retinoblastoma (Rb) (132). These regulators appear to have evolved in specific groups of eukaryotes, namely multicellular forms, and represent cases where a distinct β-rich fold has been recruited for DNA binding (p53) (133) or where cell cycle regulatory elements, such as the helical cyclin box domain, have been recruited for protein-protein interactions important in the regulation of repair (Rb) (134). Obviously, the present discussion provides only a rough sketch of the comparative aspects of the eukaryotic repair system and by no means accounts for its entire complexity, particularly with respect to the connections with transcription and the cell cycle. There is no doubt that only some of the components providing these connections have been identified to date and, furthermore, the results of our analysis point out uncertainties with regard to the actual functions of some important eukaryotic repair proteins (Table 2). For example, the product of the yeast RNC1 gene has been reported to be a DNase essential for most recombination events (135,136). 
However, comparative sequence analysis clearly indicates that the RNC1 protein consists of a SAM-dependent methyltransferase domain and an S1-like RNA-binding domain, suggesting an RNA methylase activity and leaving no room for a nuclease domain (Table 2; data not shown). In a similar conundrum, the yeast RAD6-RAD18 heterodimer involved in the post-replicative bypass of UV lesions has been reported to possess not only the ubiquitin-conjugating activity (intrinsic in RAD6) but also a DNA-dependent ATPase activity (137). However, neither of the two proteins involved shows any resemblance to known ATPases, and there seems to be no unaccounted-for globular domain that could accommodate such an activity. Further experimental studies are needed to resolve these contradictions. The evolutionary analysis of the repair machinery reveals some general features that may reflect the selection forces behind the evolution of the repair system. The most striking aspect of the phyletic distribution of repair systems is the near lack of universal components. There seem to be at least three primary evolutionary forces that shape the repair systems. The environment and evolutionary history have profoundly affected the evolution of repair systems. Bacterial pathogens not only have small genomes, which may ease the requirement for sophisticated repair systems, but also thrive in environments where evolvability appears to be advantageous and selected for. More specifically, rapid evolution of variant antigens through replication errors and extensive recombination appears to be critical for the survival of these organisms. In these systems, the selective pressure to evade the host immune system may counterbalance the deleterious effect of weak, error-prone repair. As a consequence, the genomes of Mycoplasma, Helicobacter, Borrelia and Treponema lack many of the repair components present in such free-living bacteria as Synechocystis, E.coli or B.subtilis (Table 1). 
Even among these pathogens, however, there are considerable differences in the repertoires of the repair enzymes as demonstrated by a detailed comparison of the Borrelia and Treponema genomes (G.Subramanian, L.Aravind and E.V.Koonin, unpublished observations). Specifically, Borrelia that shows particularly prominent antigenic variation (138) and therefore could be expected to undergo selection for evolvability seems to have lost several genes coding for enzymes of RER that are seen in Treponema. This illustrates the dramatic effect of the specific lifestyle on the repair systems even among relatively close bacterial species. Conversely, the free-living organisms, for which highly efficient repair is a must, tend to recruit additional repair enzymes. Examples of such recruits include DNA polymerase II in E.coli (139), DNA polymerases of the X-family in some bacteria, as well as a host of novel predicted repair enzymes in the Mycobacteria (116) (Table 1; Fig. The internal environment within the cell is also critical for the evolution of the repair systems as becomes clear from the nature of changes seen in eukaryotes compared to the prokaryotes. Eukaryotes have histones with basic tails complexed with the DNA and a higher order chromatin structure that is significantly more complicated than its prokaryotic counterparts (142). The evolution of these structures placed additional barriers to the repair enzymes interacting with the damaged DNA and led to the concomitant evolution of specific structural elements that provide the connection between the repair machinery and the chromatin, such as the adaptor domains discussed above. Furthermore, the tight coupling of the repair machinery with transcription (7) seen in eukaryotes appears to have co-evolved with the components of eukaryotic chromatin and cell cycle regulation. 
Such central components of this coupling as Rb and the cyclins that as subunits of TFIIH, participate in both repair and transcription could have evolved from TFIIB-like proteins, which also have the cyclin fold (134), and given their conservation in archaea and eukaryotes, should have been already present in their common ancestor. It is further imaginable that the cyclins originally involved in the transcription-repair coupling could have been recruited for their present role in cell cycle control, given the requirement for the recognition of damaged DNA prior to the commencement of the S-phase and the progression of cell division. The rise of multicellularity may have mounted pressure for further developments in the coupling of repair and transcription. The need to have tissue-specific genes transcriptionally activated in the presence of damaged DNA may have provided the selective pressure for the evolution of multiple mechanisms linking the two processes. This could have been the driving force behind the evolution of such proteins as BRCA1, which participates in repair in conjunction with RAD51 (the recA ortholog) (143) and is also a part of the transcriptional machinery through its association with RNA polymerase II (144,145). While BRCA1-like proteins are seen in both plants and animals and thus seem to have an ancient origin, the transcription factor p53 is seen so far only in the coelomate animals. Three paralogs of this family are represented in mammals where there is evidence for a central role of p53 in repair (146). In addition to its function in transcription, p53 also directly associates with repair proteins, such as the recA homologs (147) and the xth-like Ap endonuclease ref-1 (148), and is involved in cell cycle arrest in response to DNA damage (149). 
This is a striking example of an entirely novel protein that may have evolved in only a subset of multicellular organisms, in response to the selective pressures for the coordination of transcription, repair and cell cycle. Another major but hitherto under-appreciated aspect of the evolution of the repair systems seems to be the role of lateral gene transfer and genomic chimerism in the generation of their diversity. As discussed above, many of the eukaryotic repair proteins clearly can be traced to bacterial and archaeal roots. Those shared with the archaea (Table 2) may come directly from the ancestor of the nuclear genome. By contrast, those repair proteins that are shared by eukaryotes and bacteria to the exclusion of the archaea, may have entered the eukaryotic lineage through horizontal transfer from the organellar (mitochondrial or chloroplast) genomes (Tables On many occasions, horizontal gene transfer events are difficult to distinguish from lineage-specific gene loss. In fact, this dilemma arises each time when an episodic distribution of a gene or a whole system is observed. The RecBCD exonuclease is a good example of such a situation (see above). It appears likely that the actual history of any particular repair system should have included both horizontal gene transfer and differential gene loss. The difficulties in deciphering the exact scenario notwithstanding, it is clear that the evolution of repair systems is a dramatic manifestation of the genome plasticity. Conceivably, horizontal gene transfer and lineage-specific gene loss could have been more rampant in the history of repair than in other cases, such as for example the evolution of the translation apparatus (though see 151,152), because while repair as such is essential for any organism, many of the specific repair systems can be inactivated without an immediate lethal effect (1). 
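The reasoning above, which reads a phyletic pattern as a hint about a gene's history, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-level coding (widespread, sporadic, absent) and the function and label names are hypothetical conveniences, the distributions are transcribed loosely from statements made in this section, and the interpretations are the tentative ones suggested in the text, not firm conclusions.

```python
from typing import Dict, List, Tuple

SUPERKINGDOMS = ("bacteria", "archaea", "eukaryotes")

# Hypothetical labels for the tentative interpretations discussed in the text.
ORGANELLAR = "possible transfer from organellar genomes into eukaryotes"
NUCLEAR_ANCESTOR = "possibly inherited from the ancestor of the nuclear genome"
UNRESOLVED = "no simple interpretation from the pattern alone"

# Distributions transcribed from this section. The three-level coding is an
# assumption of this sketch, not a scheme used in the paper.
DISTRIBUTIONS: Dict[str, Dict[str, str]] = {
    "UvrABC": {"bacteria": "widespread", "archaea": "sporadic", "eukaryotes": "absent"},
    "MutS": {"bacteria": "widespread", "archaea": "sporadic", "eukaryotes": "widespread"},
    "RecQ": {"bacteria": "widespread", "archaea": "absent", "eukaryotes": "widespread"},
    "RecJ": {"bacteria": "widespread", "archaea": "widespread", "eukaryotes": "absent"},
    "ERCC3": {"bacteria": "sporadic", "archaea": "sporadic", "eukaryotes": "widespread"},
}


def interpret(dist: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return a tentative interpretation plus the superkingdoms with sporadic
    occurrence, where transfer and lineage-specific loss cannot be told apart
    from the distribution alone."""
    widespread = {sk for sk in SUPERKINGDOMS if dist[sk] == "widespread"}
    sporadic = [sk for sk in SUPERKINGDOMS if dist[sk] == "sporadic"]
    if widespread == {"bacteria", "eukaryotes"}:
        label = ORGANELLAR
    elif widespread == {"archaea", "eukaryotes"}:
        label = NUCLEAR_ANCESTOR
    else:
        label = UNRESOLVED
    return label, sporadic


if __name__ == "__main__":
    for name, dist in DISTRIBUTIONS.items():
        label, sporadic = interpret(dist)
        note = f" (sporadic in: {', '.join(sporadic)})" if sporadic else ""
        print(f"{name}: {label}{note}")
```

A pattern alone never settles the history of a gene. Every sporadic entry marks a case in which additional evidence, such as operon organization or phylogenetic analysis, is needed to choose between horizontal transfer and differential loss.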
Evidently, the present layout of the repair systems in the three superkingdoms of life depends to a considerable extent on what had been inherited by each of them from their last common ancestor (the cenancestor). The comparison between bacteria, archaea and eukaryotes discussed above may help in at least partially defining this common heritage. All interpretations in this area are necessarily speculative. Nevertheless, the most parsimonious solution, considering all the data from complete genomes, is that the cenancestor at least encoded a RecA-like recombinase, a few helicases and nucleases of the conserved superfamilies, and ABC superfamily ATPases of the SbcC/SMC2 family. This leads to a reasonably confident estimate of approximately 10 types of repair protein domains in the cenancestor. The evolution of the conserved repair pathways by vertical descent, however, appears to be largely restricted to each single superkingdom of life. This pattern is reminiscent of the profound differences in the core replicative enzymes, such as the DNA polymerases, ligases and replicative helicases and ATPases, in the archaeal/eukaryotic and bacterial lineages and is in sharp contrast with the universal conservation of the translation machinery. As discussed previously, these observations put together may suggest that the cenancestor had an RNA genome (153). If so, how does one account for the approximately 10 universal families of repair proteins? The general explanation is that they already had functions in an RNA-based ancestral cell: most of these conserved families of nucleases and helicases have members with RNA substrates. It is notable in this regard that the most common nucleic acid-binding module in repair proteins, HhH, is represented by both RNA-binding and DNA-binding versions. It is of further interest that the version found in eukaryotic and archaeal orthologs of RecA shows the closest similarity to the RNA-binding version in the NusA protein (see above). 
This raises the possibility of direct recruitment of RNA interacting proteins for roles in DNA replication and repair. This might have happened on multiple occasions in evolution, as, for example, in the Werner's syndrome protein that contains a RecQ helicase inserted into an RNase D-like domain (Fig. The diversity of the repair systems in different lineages indicates that they have been undergoing continuous evolution up until the terminal branches of the phylogenetic radiation. The helicase-nuclease fusions that are seen on multiple occasions in different lineages and apparently have evolved independently are a good case in point. One example is the human WRND protein, in which the helicase-nuclease fusion is not detectable in yeasts or in other animal lines, such as C.elegans, suggesting a relatively recent event. Similar fusions of the pol III ε subunit-like nuclease domain with the DinG helicase in B.subtilis (32) and with the Uri nuclease domain and the BRCT domain, respectively, in two mycobacterial proteins also are indicative of continuous generation of novel repair proteins by domain fusion. Another notable feature observed in certain lineages is the disruption of the catalytic motifs detected on several independent occasions in ATPases and nucleases (Fig. Comparative analysis of DNA repair systems, made possible by the availability of multiple complete genome sequences, suggests a remarkably complex picture of evolution, contingent on the external and internal environment and replete with domain shuffling, horizontal gene transfer, and lineage-specific gene loss events. Repair systems rely on a limited set of conserved domains but the number of universal repair proteins with domain architectures that are at least partially conserved across the three domains of life is very small, and there is no orthology at the level of systems and pathways. 
By contrast, a much greater level of conservation is observed within each of the three superkingdoms of life. The dramatic complexity of the eukaryotic repair system in terms of the number of components can be traced to the intimate connections with chromatin dynamics and cell cycle control. The repair mechanisms in archaea have not been characterized in detail. Comparative analysis readily identifies a number of candidate repair proteins but is inadequate in terms of reconstructing entire pathways. While it seems fairly safe to infer the layout of the repair systems of poorly characterized bacteria on the basis of orthologous relationships between their genes and those from well-characterized model organisms (primarily E.coli), understanding the archaeal systems still requires the critical body of experimental data. Similarly, a lot remains to be learnt about the details of the relationships between repair, chromatin and cell cycle in eukaryotes. It is our hope that the present analysis of the relationships between repair domains and proteins, particularly the description of previously undetected domains, will help in the rational design of experiments to further our understanding of this essential cellular function.