| Nucleic Acids Research | Pages |
Alignment of whole genomes
Introduction
The Challenge Of Whole Genome Alignment
Algorithmic Methods
Maximal unique matching subsequence decomposition
Sorting the MUMs
Closing the gaps
Results And Discussion
Comparing two strains of tuberculosis
Comparing two Mycoplasma genomes
Comparing human and mouse
Summary
Acknowledgements
References
Alignment of whole genomes
ABSTRACT
INTRODUCTION
Since the first successful whole-genome shotgun sequence of Haemophilus influenzae (1), the number of organisms whose genomes have been completely sequenced has been increasing rapidly each year. As the number and variety of these genomes increase, it is becoming more common for a project to sequence the genome of an organism that is very closely related to another completed genome. For example, the genomes of Mycoplasma genitalium (2) and Mycoplasma pneumoniae (3), the third and fifth prokaryotic organisms to be completely sequenced, respectively, are very closely related and share sequence homology across large fractions of their genomes. More recently, there has been tremendous scientific interest in sequencing different strains of the same bacteria. Two strains of Mycobacterium tuberculosis, H37Rv (4) and CDC1551 (R.D.Fleischmann et al., manuscript in preparation), and two strains of Chlamydia trachomatis, serovar D (5) and mouse pneumonitis (Fraser et al., manuscript in preparation), will be completely sequenced in the near future; in each case one of the two strains is complete and the other is nearly so. It is clear that the future will see an increasing number of sequencing projects whose target is a strain or species that is closely related to an already-sequenced organism.
When the genome sequence of two closely related organisms becomes available, one of the first questions researchers want to ask is how the two genomes align. There is a large body of research, including many sophisticated algorithms, for aligning two sequences. This vast literature cannot be cited here, but important early work includes Needleman and Wunsch (6) and Smith and Waterman (7) (for recent reviews see 8,9). The focus of most prior research has been on comparing single proteins or genomic DNA sequences containing a single gene. The existing algorithms work extremely well on this task, but in most cases are ineffective in aligning entire genomes. The problem is really one of size: single gene sequences may be as long as tens of thousands of nucleotides, but whole genomes are usually millions of nucleotides or larger. When comparing a 4 Mb sequence such as M.tuberculosis to another 4 Mb sequence, many algorithms either run out of memory or take unacceptably long to complete. In addition, previous algorithms were designed primarily to discover insertions, deletions and point mutations, but not to look for the kinds of large-scale changes that can be discovered in whole-genome comparisons, such as differences in tandem repeats and large scale reversals.
In this paper we describe a system for pairwise alignment and comparison of very large scale DNA sequences. The algorithm assumes the sequences are closely related, and using this assumption can quickly compare sequences that are millions of nucleotides in length. It will also be able to compare entire chromosomes as large as human chromosome 1 (i.e., several hundred million basepairs), once such sequences are available, and in the process identify all differences between two different individuals.
The system is specifically designed to perform high resolution comparison of genome-length sequences. It outputs a base-to-base alignment of the input sequences, highlighting the exact differences in the genomes. It will locate all single nucleotide polymorphisms (SNPs), large inserts, significant repeats, tandem repeats and reversals, in addition to identifying the exact matches between the genomes.
We have applied this system to the CDC1551 and H37Rv strains of M.tuberculosis, to the two completed Mycoplasma genomes, and to two relatively long (225 kb) syntenic sequences from the human and mouse genomes. In the case of tuberculosis, the strains are very closely related, and the system was very useful at pinpointing the SNPs and the relatively small number of significant insertions between these two genomes. (For the context of this discussion, the term SNP is used to mean a sequence that appears in both genomes with a difference of just one base between the two copies. Such polymorphisms may or may not represent mutations that occur in a significant percentage of the population.) In the second case, where the organisms are much less closely related (differing by hundreds of thousands of nucleotides), the system is nonetheless able to align the genomes precisely. We also tested our system on even more distantly related sequences by comparing a syntenic region from the mouse and human genomes. The results of these comparisons are described in the Results and Discussion.
In addition to allowing for comparison between different organisms, the system described here can also be put to a different use. At different stages of any large genomic sequencing project, the assembled sequence will change as gaps are closed, sequencing errors are corrected and additional sequences are completed. Because the finishing stage involves many individuals, it can be difficult for a project leader (or any one person) to get a picture of what has changed each time a genome is reassembled. The program described here can compare two different versions of a genome at different stages of sequencing and highlight precisely what has changed.
The output of the system gives a clear picture at the sequence level of all the differences between two genomes. (The code is freely available; contact the authors by email for details.) To present a more global picture, we have also developed a graphical interface that allows a researcher to scroll along the two genomes being compared and zoom in on areas of interest. (See Figure
THE CHALLENGE OF WHOLE GENOME ALIGNMENT
The standard algorithms for sequence alignment rely on either dynamic programming (7,10) or hashing techniques (8,11). Naïve versions of dynamic programming use O(n2) space and time (where n is the length of the shorter of the two sequences being compared), which makes computation simply unfeasible for sequences of size [ge]4 Mb (such as the two M.tuberculosis genomes). [For an input with size n, a function X is O(n2) if, for sufficiently large n and for some constant C independent of n, X [le] C·n2. Informally stated, the O(n2) notation means that the amount of space and time required for the computation is no more than Cn2.] Hashing techniques operate faster on average, but they involve a `match and extend' strategy, where the `extend' part also takes O(n2) time. For dynamic programming, it is possible to reduce the required space to O(n) by taking more time; this solves the memory problem but still leaves one with an unacceptably slow algorithm. Faster algorithms can be developed for specialized purposes, such as a recent system for finding tandem repeats (12). This repeat finder uses a k-tuple hashing algorithm and couples it with a stochastic pattern matching strategy.
More complex dynamic programming methods can be used for alignment when the alignment error is expected to be low. For example, one can align two similar sequences with at most E differences (or errors) in time proportional to E times the length of the longer sequence. The sim3 program (13) uses a linear time algorithm that works well when the input sequences are highly similar; it runs very fast even on very long sequences. Unfortunately, this class of algorithms does not always work for whole genome alignments, since the `errors' may include multiple large inserts on the order of 104 or 105 nucleotides. As we demonstrate below, the number of differences may be greater than 100 000 despite the fact that the genomes (in this case M.genitalium and M.pneumoniae) are closely related and can in fact be aligned with one another.
Another system developed to align long sequences is sim2 (14). This system uses a BLAST-like hashing scheme to identify exact k-mer matches, which are extended to maximal-length matches. These maximal matches are then combined into local alignment chains by a dynamic programming step. In contrast, our suffix-tree approach directly finds maximal matches that are unique. These matches can then be easily ordered to form the basis of an alignment that can span even very long mismatch regions between the two input genomes.
The system described here was developed in response to our own efforts as part of sequencing strain CDC1551 of M.tuberculosis; we realized it was essential to describe all the differences between CDC1551 and the recently completed H37Rv strain (4). The well-known and widely used BLAST (15,16) and FASTA (8,11) systems are not designed to perform large scale alignment of genomes, and our attempts to use these did not produce all the information we needed. It is possible, of course, to align two genomes gene-by-gene, or to align shorter pieces and concatenate all the results. By assuming that the two input sequences are closely related, our algorithm can perform large scale alignments quickly and precisely; the result is a very detailed and inclusive base-to-base mapping between the two sequences.
ALGORITHMIC METHODS
The basis of the algorithm is a data structure known as a suffix tree, which allows one to find, extremely efficiently, all distinct subsequences in a given sequence. The first efficient algorithms to construct suffix trees were given by Weiner (17) and McCreight (18), and this data structure has been studied extensively for more than two decades (9). For the task of comparing two DNA sequences, suffix trees allow us to quickly find all subsequences shared by the two inputs. The alignment is then built upon this information.
Our system uses a combination of three ideas: suffix trees, the longest increasing subsequence (LIS) and Smith-Waterman alignment (7). The novelty of the system derives from the integration of these ideas into a coherent system for large-scale genome alignment. We focus here on the high-level design of the system and exclude some of the low-level algorithmic details; those details can be found in the references.
The inputs to the system are two sequences, which for convenience we refer to as Genome A and Genome B. Note that any sequences can be provided as input (in fact, we have a modified version of the system that handles protein sequences), but we will use DNA for the purposes of discussion. We assume the sequences to be compared are closely homologous. In particular, we assume that there is a mapping between large subsequences of the two inputs, presumably because they evolved from a common ancestor. The main biological features that the system identifies are as follows. (i) SNPs, defined here as a single mutation `surrounded' by two matching regions on both sides of the mutation. (ii) Regions of DNA where the two input sequences have diverged by more than an SNP. (iii) Large regions of DNA that have been inserted into one of the genomes, for example by transposition, sequence reversal or lateral transfer from another organism. (iv) Repeats, usually in the form of a duplication that has occurred in one genome but not the other. The repeated regions can appear in widely scattered locations in the input sequences. (v) Tandem repeats, regions of repeated DNA that might occur in immediate succession, but with different copy numbers in the two genomes. The copy numbers do not have to be integers; e.g., a repeat could occur 2.5 times in one genome and 4.2 times in the other.
Figure 1. A maximal unique matching subsequence (MUM) of 39 nt (shown in uppercase) shared by Genome A and Genome B. Any extension of the MUM will result in a mismatch. By definition, an MUM does not occur anywhere else in either genome. The alignment process consists of the following steps, which are described in more detail in subsequent sections. (i) Perform a maximal unique match (MUM) decomposition of the two genomes. This decomposition identifies all maximal, unique matching subsequences in both genomes. An MUM is a subsequence that occurs exactly once in Genome A and once in Genome B, and is not contained in any longer such sequence. Thus, the two character positions bounding an MUM must be mismatches, as shown in Figure MUMs on both DNA strands are identified; this allows the system to identify sequences from one genome that appear reversed in the other genome. (ii) Sort the matches found in the MUM alignment, and extract the longest possible set of matches that occur in the same order in both genomes. This is done using a variation of the well-known algorithm to find the LIS of a sequence of integers. Thus, we compute an ordered MUM alignment that provides an easy and natural way to scan the alignment from left to right. (iii) Close the gaps in the alignment by performing local identification of large inserts, repeats, small mutated regions, tandem repeats and SNPs. (iv) Output the alignment, including all the matches in the MUM alignment as well as the detailed alignments of regions that do not match exactly. The system, which is called MUMmer, is packaged as three independent modules: suffix tree construction, sorting and extraction of the LIS, and generation of Smith-Waterman alignments for all the regions between the MUMs. The last step can easily be replaced with another alignment program if a user wishes. In the ensuing sections we elaborate further on each of these steps. As mentioned above, identification of MUMs is the key step in the alignment. By identifying the sequences that occur only once in each genome we can complete the alignment by closing the gaps between the aligned MUMs. The problem of finding a set of maximal unique matching strings (subsequences) in two very long sequences is by no means computationally trivial. The naïve algorithm for this problem will imply matching every subsequence in Genome A with Genome B. There are O(n2) such subsequences (where n is the sum of the lengths of the two genomes), and each match requires approximately O(n) time using standard pattern matching methods. Fortunately, we can employ an ingenious computational data structure introduced by Weiner (17) called a suffix tree. An example of a suffix tree for the string gaaccgacct is shown in Figure Figure 2. Suffix tree for the sequence gaaccgacct. Square nodes are leaves and represent complete suffixes. They are labeled by the starting position of the suffix. Circular nodes represent repeated sequences and are labeled by the length of that sequence. In this example the longest repeated sequence is acc occurring at positions 3 and 7. As the name implies, a suffix tree is a compact representation that stores all possible suffixes of an input sequence S. A suffix is simply a subsequence that begins at any position in the sequence and extends to the end of the sequence. Each suffix in S can be located by traversing a unique path in the tree from the root node to a leaf node. In other words, each leaf node represents a unique suffix. A sequence of length N has N suffixes, one starting at each sequence position, so the tree must have N leaves, and therefore at most N-1 internal nodes since each internal node has at least two child nodes. Note that each internal node in a tree corresponds to a repeated sequence in the original genome, where the repeat number equals the number of leaf nodes underneath that node in the tree. [Recently suffix trees have also been used to help discover regulatory elements in genomic yeast sequences (19). For other applications of suffix trees to sequence analysis, see Gusfield (9).] The simple, brute-force algorithm to construct suffix trees runs in quadratic time; this is no faster than dynamic programming and, as explained above, is impractical for comparing whole genomes. However, it is possible to build a suffix tree in linear time by clever use of sets of pointers (17,18,20,21); our system uses McCreight's (18) algorithm. The total size of the tree is also linear in the sum of the lengths of the genomes in it, since there is exactly one leaf and at most one internal node for each base, and the sizes of these nodes are fixed. Note that the sequence label on each edge can be represented by two integers (its length and starting position in the genome), no matter how long it is. Our particular implementation uses 12 bytes per leaf node, 24 bytes per internal node, plus 1 byte for each base in the genome. More compact representations are possible (22). Because suffix-tree construction and all subsequent steps require no more than linear time and space, the overall running time (and space) required by the system is also linear. As a very generous upper bound, the system as implemented requires no more than 37 bytes per base of the input sequences; thus a comparison of two 100 Mb chromosomes would require <8 gigabytes of memory (and probably far less than that). MUMmer begins by constructing a suffix tree T for genome A, and then adding the suffixes for genome B to T. Adding suffixes from an additional string to a suffix tree is a trivial modification of the construction algorithm for a single string, since the construction is accomplished by adding one suffix at a time to the portion of the tree that has already been constructed. Alternately, we can achieve the same effect by concatenating the two genomes (separated by a dummy character that does not occur in either genome) and constructing a suffix tree from that single concatenated string. Each leaf node in T is labeled to indicate which suffix it represents in which genome, A or B. The system needs to identify the nodes in the tree that correspond to MUMs. It is not hard to see that every unique matching sequence is represented by an internal node with exactly two child nodes, such that the child nodes are leaf nodes from different genomes. The unique matches that are maximal can be identified by mismatches at their ends. (MUMmer as actually implemented determines whether a match is maximal based on pointers used to construct the suffix tree.) Thus, in a single scan of the suffix tree, all MUMs can be identified. The main input parameter to the system, besides the genomes themselves, is the length of the shortest MUM that the system will identify. We typically do not want to report short MUMs that are likely to be random matches. For highly similar genomes (as with the two tuberculosis strains), we set this parameter to 50 bp. However, for more distantly related genomes, fewer MUMs of 50 bp might exist, and therefore this parameter can be adjusted. For aligning the two Mycoplasma species, we used a minimum MUM length of just 20 bp. After finding all the MUMs, we sort them according to their position in Genome A. Now we consider the order of their matching positions in Genome B. In some cases, e.g. a transposition or reversal between the genomes, the B positions are not in ascending order. See Figure Figure 3. Aligning Genome A and Genome B after locating the MUMs. Each MUM is here indicated only by a number, regardless of its length. The top alignment shows all the MUMs. The shift of MUM 5 in Genome B indicates a transposition. The shift of MUM 3 could be simply a random match or part of an inexact repeat sequence. The bottom alignment shows just the LIS of MUMs in Genome B. We now employ a variation of the LIS algorithm (9) to find the longest set of MUMs whose sequences occur in ascending order in both Genome A and Genome B. Essentially, we want the LIS contained in the sequence of B-position integers. For instance, if the order of B positions is given by the sequence <1, 2, 10, 4, 5, 8, 6, 7, 9, 3>, the LIS is <1, 2, 4, 5, 6, 7, 9>. The LIS technique allows us to browse the alignment from left to right, as well as `close the gaps' in the alignment consistently. MUMmer implements a variation of this algorithm that takes into account the lengths of the sequences represented by the MUMs and the fact that they can overlap. An example of an initial MUM-alignment and the ordered MUM-alignment that results from applying the LIS algorithm is shown in Figure Once a global MUM-alignment is found, we deploy several algorithms for closing the local gaps and completing the alignment. A gap is defined as an interruption in the MUM-alignment which falls into one of four classes: (i) an SNP interruption, (ii) an insertion, (iii) a highly polymorphic region or (iv) a repeat. These classes are depicted in Figure Figure 4. The four types of gaps in MUM alignment. These examples are drawn from the alignment of the two M.tuberculosis genomes. SNP processing. The identification of SNPs is becoming an increasingly important task in DNA sequence analysis, especially as the number of sequences from closely related organisms increases (23,24). SNPs in human DNA appear to be associated with many important health issues, including genetic illnesses and disease susceptibility. SNPs manifest themselves in two ways in the MUM alignment. In the simpler case, the SNP is surrounded by maximal unique matching subsequences. In this case our program easily identifies the SNP, which manifests itself as a simple gap of one base between the MUMs. In some cases, however, an SNP is adjacent to sequences that appear elsewhere in one of the genomes; i.e., the adjacent sequences are not unique. In this case the adjacent sequence plus the SNP is captured and processed by the repeat processing procedure described below. Inserts can be divided into two classes. Transpositions are subsequences that have been deleted from one location and inserted elsewhere (with respect to one of the genomes, of course). These are detected during a post-processing step; they appear in the MUM alignment out of sequence. Simple insertions are subsequences that appear in only one of the genomes; these may be the result of lateral transfer, simple deletions or other evolutionary processes. Regardless of the cause, these simple insertions can be identified as such because they do not appear in the MUM alignment.Polymorphic regions. Gaps in the MUM alignment can be also caused by sequences that have large numbers of differences, but that still should be aligned in the whole genome alignment. Because the number of differences is high, it is less meaningful to define these regions in terms of the SNPs; for example, if the sequence identity is 25%, the sequences might be considered highly homologous although the number of SNPs is triple the number of conserved positions. If these regions are sufficiently small, we align them with a standard dynamic programming algorithm, essentially equivalent to Smith-Waterman (7). This produces an optimal alignment with respect to pre-specified insertion and mutation costs. For very large polymorphic regions, we can apply our entire matching procedure recursively using a reduced minimum MUM length, if desired.
Maximal unique matching subsequence decomposition
Sorting the MUMs
Closing the gaps
Figure
Figure 5. Repeat sequences surrounded by unique sequences. For the purposes of illustration, other characters besides the four DNA nucleotides are used. Figure 6. Alignment of M.tuberculosis strain CDC1551 (top) and H37Rv (bottom). This alignment was extracted from the graphical display tool developed to scan and zoom in on the output of the genome alignment program. In the view shown, single green lines in the center connect single-base differences between the genomes. Blue v-shaped lines indicate insertions. The first two v-shaped insertions are large insertions in the H37Rv strain, and the third insertion is a very small insertion in CDC1551. This last insertion appears as a line rather than a v-shape due to the resolution of the displayed region. The genes from both genomes are displayed as arrows, with color-coding to indicate the role assigned to each gene. Role assignments and gene IDs are taken from annotation of the TIGR and Sanger genome centers, respectively. Note that both of the large insertions shown here contain genes. White lines (gaps) appearing in the middle of some arrows indicate silent mutations in those genes. Point mutations that change an amino acid are displayed differently; none occur in this region. Either genome can be scrolled independently, and the scale can be adjusted up or down, from viewing of individual bases all the way out to viewing the entire genome on the screen. We used MUMmer to perform a comparison of two strains of tuberculosis that have recently been sequenced, H37Rv (4) and CDC1551 (R.D.Fleischmann et al., manuscript in preparation). H37Rv is a laboratory strain that has been in continuous culture for >90 years, while CDC1551 is a recent clinical isolate that has demonstrated itself to be highly virulent (25). Tuberculosis is known to mutate relatively slowly (26), so despite the length of time that these strains have had to diverge, their genomes are still >99% identical (not counting several large repeat sequences that appear in different copy numbers). Understanding the differences is critical to understanding the different biological behavior of the two strains. Running the two genomes through MUMmer produced a whole-genome alignment that mapped every base of one genome onto the other. Thus we were able to catalog all SNPs, all insertions of every length, all tandem repeats with different copy numbers and other miscellaneous differences. A detailed description of the biological consequences of this comparison will appear elsewhere (R.D.Fleischmann et al., manuscript in preparation). Our alignment revealed thousands of individual differences between the two genomes, most of which were single base changes. There were several dozen large insertions unique to each genome, many of which contained genes or partial genes. An example is shown in Figure The time required to generate the alignments was 55 s on a DEC Alpha 4100, broken down as follows: 5 s for suffix tree construction, 45 s for sorting the MUMs and finding the longest increasing sequence and 5 s for generating the Smith-Waterman alignments of the gaps. (Because the sequences are so close to identical, the Smith-Waterman step needs to do very little work.)
The H37Rv and CDC1551 strains of M.tuberculosis are highly homologous, containing many subsequences thousands of nucleotides long that are perfect matches. To test the limits of our system, we turned to two bacteria that are `cousins' but that are much more distantly related. The genome of M.genitalium is 580 074 nt in length, while M.pneumoniae is 816 394 nt. Clearly there are at least 226 000 nt of additional DNA in M.pneumoniae; however, alignments of proteins indicate that nearly all of M.genitalium is contained in M.pneumoniae (27). The protein alignments further indicate that very large fragments of M.genitalium retain the same order and orientation in M.pneumoniae. Thus, despite the evolutionary distance between these organisms, we believed that it might be possible to align them using MUMmer. The system aligned these two genomes quite easily, as it turned out. This was somewhat surprising given previous difficulties at producing such an alignment using alternative methods. Not only did it work, but it worked very fast: the suffix tree portion of the computation took 6.5 s, while sorting and finding the LIS of MUMs took 0.02 s (on a DEC Alpha 4100). Generating the Smith-Waterman alignments of the gaps between the MUMs took 116 s; not surprisingly, this was much slower than for the tuberculosis genome comparison because of the larger amount of sequence that fell into the gaps. Figure Figure 7. Alignment of M.genitalium and M.pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom). In all three plots, a point indicates a `match' between the genomes. In the FASTA plot a point corresponds to similar genes. In the 25mer plot, each point indicates a 25-base sequence that occurs exactly once in each genome. In the MUM plot, points correspond to MUMs as defined in the main text. The MUM alignment clearly shows five translocations of M.genitalium sequence with respect to M.pneumoniae, in agreement with the analysis of Himmelreich et al. (27). In the FASTA and 25mer alignments, these translocations are either missing or very difficult to identify amidst the noise. In addition to the data shown in Figure To test MUMmer on sequences even more distant than the two Mycoplasmas, we chose a 222 930 bp subsequence of human chromosome 12p13 (accession no. U47924) that is syntenic to a 227 538 bp contiguous subsequence of mouse chromosome 6 (accession no. AC002397). These sequences were the subject of a recent study by Ansari-Lari et al. (28), who used a combination of alignment tools to produce a detailed alignment. These tools included DOTTER (29) for the initial comparison, a modified version of SIM (30) to find good local alignments, percent identity plots (pip) to display the results (31) and a program called CONSERVED (32) to identify segments >50 bp with >60% sequence identity in the SIM alignment. As reported in that study, the nucleotide similarity for the coding portions of the 17 genes in this region ranges from 70 to 92%, while the percent identity for amino acids ranges from 56 to 100%. Although our alignment does not contain all the details generated and displayed by the combination of methods used in Ansari-Lari et al., the overall alignment of the two sequences is easily apparent from the output of our program. The program required 29 s of CPU time to generate the complete alignment, of which 1.6 s was used to build the suffix tree. The alignment using 15mers as the minimum MUM length contained several large gaps corresponding to intergenic regions, as shown in Figure Figure 8. Alignment of a 222 930 bp subsequence of human chromosome 12p13, accession no. U47924, to a 227 538 bp subsequence of mouse chromosome 6, accession no. AC002397. Each point in the plot corresponds to an MUM of [ge]15 bp. The Smith-Waterman alignments of the regions between the MUMs show that the aligned region stretches almost the entire length of both sequences, containing 220 326 bp of the human sequence and 223 751 of mouse. The total number of bases contained in the MUM portion of the alignment is 14 026, or 6.3% of the human sequence. Only 10 MUMs in the alignment are >50 bp, with the longest being 117 bp. This paper describes MUMmer, a new system for high resolution comparison of complete genome sequences. The system was used to perform complete alignments of two pairs of genomes: the first pair were two closely related strains of M.tuberculosis of some 4.4 million nucleotides each, while the second pair were two different Mycoplasma bacteria that differed in length by almost half the length of the shorter genome. We also compared two sequences of ~225 kb each from the genomes of human and mouse. We expect many more closely related genomic sequences to appear in the coming years, and the system described here should prove useful in aligning those as well. We see no technical problems to using MUMmer for comparing even the longest genomic sequences, as long as sufficient computer memory is available. Thanks to Edward Arnold for graphical assistance. S.L.S. is supported in part by the National Human Genome Research Institute at NIH under Grant no. K01-HG00022-1. S.L.S. and A.L.D. are supported in part by the National Science Foundation under Grant no. IRI-9530462 and S.K. is supported by NSF IRI-9529227. R.D.F. is supported in part by National Institute of Allergy and Infectious Diseases at NIH under Grant no. R01-AI40125-01.
RESULTS AND DISCUSSION
Comparing two strains of tuberculosis
Comparing two Mycoplasma genomes
Comparing human and mouse
SUMMARY
ACKNOWLEDGEMENTS
REFERENCES
This article has been cited by other articles:
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: jnl.info{at}oup.co.uk
Last modification: 14 May 1999
Copyright©Oxford University Press, 1999.
![]()
CiteULike
Connotea
Del.icio.us What's this?
![]()
![]()

![]()
![]()
![]()
T. Rausch, A.-K. Emde, D. Weese, A. Doring, C. Notredame, and K. Reinert
Segment-based multiple sequence alignment
Bioinformatics,
August 15, 2008;
24(16):
i187 - i192.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
N. R. Thomson, M. T.G. Holden, C. Carder, N. Lennard, S. J. Lockey, P. Marsh, P. Skipp, C. D. O'Connor, I. Goodhead, H. Norbertzcak, et al.
Chlamydia trachomatis: Genome sequence analysis of lymphogranuloma venereum isolates
Genome Res.,
January 1, 2008;
18(1):
161 - 171.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. V. Zimin, D. R. Smith, G. Sutton, and J. A. Yorke
Assembly reconciliation
Bioinformatics,
January 1, 2008;
24(1):
42 - 45.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
G. A. Pandya, M. H. Holmes, S. Sunkara, A. Sparks, Y. Bai, K. Verratti, K. Saeed, P. Venepally, B. Jarrahi, R. D. Fleischmann, et al.
A bioinformatic filter for improved base-call accuracy and polymorphism detection using the Affymetrix GeneChip(R) whole-genome resequencing platform
Nucleic Acids Res.,
December 18, 2007;
35(21):
e148 - e148.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
N. L. Hiller, B. Janto, J. S. Hogg, R. Boissy, S. Yu, E. Powell, R. Keefe, N. E. Ehrlich, K. Shen, J. Hayes, et al.
Comparative Genomic Analyses of Seventeen Streptococcus pneumoniae Strains: Insights into the Pneumococcal Supragenome
J. Bacteriol.,
November 15, 2007;
189(22):
8186 - 8195.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
C. M. Bergman and H. Quesneville
Discovering and detecting transposable elements in genome sequences
Brief Bioinform,
November 1, 2007;
8(6):
382 - 392.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. Vishnoi, R. Roy, and A. Bhattacharya
Comparative analysis of bacterial genomes: identification of divergent regions in mycobacterial strains using an anchor-based approach
Nucleic Acids Res.,
June 28, 2007;
35(11):
3654 - 3667.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
H. Yukawa, C. A. Omumasaba, H. Nonaka, P. Kos, N. Okai, N. Suzuki, M. Suda, Y. Tsuge, J. Watanabe, Y. Ikeda, et al.
Comparative analysis of the Corynebacterium glutamicum group and complete genome sequence of strain R
Microbiology,
April 1, 2007;
153(4):
1042 - 1058.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
M. Choudhary, X. Zanhua, Y. X. Fu, and S. Kaplan
Genome Analyses of Three Strains of Rhodobacter sphaeroides: Evidence of Rapid Evolution of Chromosome II
J. Bacteriol.,
March 1, 2007;
189(5):
1914 - 1921.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
Y. Wan, S. L. Broschat, and D. R. Call
Validation of Mixed-Genome Microarrays as a Method for Genetic Discrimination
Appl. Envir. Microbiol.,
March 1, 2007;
73(5):
1425 - 1432.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
I. M. Sulaiman, K. Tang, J. Osborne, S. Sammons, and R. M. Wohlhueter
GeneChip Resequencing of the Smallpox Virus Genome Can Identify Novel Strains: a Biodefense Application
J. Clin. Microbiol.,
February 1, 2007;
45(2):
358 - 363.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. A. Rasko, M. J. Rosovitz, O. A. Okstad, D. E. Fouts, L. Jiang, R. Z. Cer, A.-B. Kolsto, S. R. Gill, and J. Ravel
Complete Sequence Analysis of Novel Plasmids from Emetic and Periodontal Bacillus cereus Isolates Reveals a Common Evolutionary History among the B. cereus-Group Plasmids, Including Bacillus anthracis pXO1
J. Bacteriol.,
January 1, 2007;
189(1):
52 - 64.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. E. Fouts
Phage_Finder: Automated identification and classification of prophage regions in complete bacterial genome sequences
Nucleic Acids Res.,
November 6, 2006;
34(20):
5839 - 5851.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
J.-I. Won, S. Park, J.-H. Yoon, and S.-W. Kim
An efficient approach for sequence matching in large DNA databases
Journal of Information Science,
February 1, 2006;
32(1):
88 - 104.
[Abstract]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. Caspi and L. Pachter
Identification of transposable elements using multiple alignments of related genomes
Genome Res.,
February 1, 2006;
16(2):
260 - 270.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. Shapira and J. A. Storer
In Place Differential File Compression
The Computer Journal,
November 1, 2005;
48(6):
677 - 691.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
C. L. Lu, T. C. Wang, Y. C. Lin, and C. Y. Tang
ROBIN: a tool for genome rearrangement of block-interchanges
Bioinformatics,
June 1, 2005;
21(11):
2780 - 2782.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
H. L. Chan, T. W. Lam, W. K. Sung, P. W. H. Wong, S. M. Yiu, and X. Fan
The mutated subsequence problem and locating conserved genes
Bioinformatics,
May 15, 2005;
21(10):
2271 - 2278.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
T. D. Wu and C. K. Watanabe
GMAP: a genomic mapping and alignment program for mRNA and EST sequences
Bioinformatics,
May 1, 2005;
21(9):
1859 - 1875.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. L. Bishop, S. Baker, S. Jenks, M. Fookes, P. O Gaora, D. Pickard, M. Anjum, J. Farrar, T. T. Hien, A. Ivens, et al.
Analysis of the Hypervariable Region of the Salmonella enterica Genome Associated with tRNAleuX
J. Bacteriol.,
April 1, 2005;
187(7):
2469 - 2482.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
M. Itoh, S. Goto, T. Akutsu, and M. Kanehisa
Fast and accurate database homology search using upper bounds of local alignment scores
Bioinformatics,
April 1, 2005;
21(7):
912 - 921.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
X. H. Zheng, F. Lu, Z.-Y. Wang, F. Zhong, J. Hoover, and R. Mural
Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs
Bioinformatics,
March 15, 2005;
21(6):
703 - 710.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
P. A. Pevzner, H. Tang, and G. Tesler
De Novo Repeat Classification and Fragment Assembly
Genome Res.,
September 1, 2004;
14(9):
1786 - 1796.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. W. Ussery, M. S. Jensen, T. R. Poulsen, and P. F. Hallin
Genome Update: alignment of bacterial chromosomes
Microbiology,
August 1, 2004;
150(8):
2491 - 2493.
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. C.E. Darling, B. Mau, F. R. Blattner, and N. T. Perna
Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements
Genome Res.,
July 1, 2004;
14(7):
1394 - 1403.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
M. Choudhary, Y.-X. Fu, C. Mackenzie, and S. Kaplan
DNA Sequence Duplication in Rhodobacter sphaeroides 2.4.1: Evidence of an Ancient Partnership between Chromosomes I and II
J. Bacteriol.,
April 1, 2004;
186(7):
2019 - 2027.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()