Skip Navigation

This Article
Right arrow Abstract Freely available
Right arrow Print PDF (67K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (41)
Right arrowRequest Permissions
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Bahr, A.
Right arrow Articles by Poch, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bahr, A.
Right arrow Articles by Poch, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Nucleic Acids Research, 2001, Vol. 29, No. 1 323-326
© 2001 Oxford University Press

BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations

Anne Bahr, Julie D. Thompson, J.-C. Thierry and Olivier Poch*

Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex, France

Received August 22, 2000; Accepted October 4, 2000.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 CONTENTS OF BALIBASE 2.0
 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
BAliBASE is specifically designed to serve as an evaluation resource to address all the problems encountered when aligning complete sequences. The database contains high quality, manually constructed multiple sequence alignments together with detailed annotations. The alignments are all based on three-dimensional structural superpositions, with the exception of the transmembrane sequences. The first release provided sets of reference alignments dealing with the problems of high variability, unequal repartition and large N/C-terminal extensions and internal insertions. Here we describe version 2.0 of the database, which incorporates three new reference sets of alignments containing structural repeats, trans­membrane sequences and circular permutations to evaluate the accuracy of detection/prediction and alignment of these complex sequences. BAliBASE can be viewed at the web site http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/index.html or can be downloaded from ftp://ftp-igbmc.u-strasbg.fr/pub/BAliBASE2/.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 CONTENTS OF BALIBASE 2.0
 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
The alignment of protein sequences is an essential tool in molecular biology and genome analysis. A good alignment is crucial to reliable phylogenetic studies and in the identification of characteristic motifs and conserved residues in protein families. However, the quality of an alignment is strongly dependent on the domain organization and the residue composition of the sequences to be aligned. In previous studies, we addressed the problems of aligning sequences with low similarity, and sequences containing large internal insertions or N/C-terminal extensions (1,2). The remaining problems encountered by most alignment programs concern mainly protein families containing either repeats, low complexity regions such as transmembrane helices, inverted domains or circular permutations. Obviously, proteins containing repeated domains/motifs or circular permutations pose particular problems for global alignment programs. In the case of transmembrane proteins, alignment difficulties are related to the important bias in residue composition in the transmembrane regions and the frequent insertion of large domains inbetween. Effectively, these proteins contain a large proportion of fairly conserved aliphatic amino acids distributed in multiple transmembrane helices and the optimal alignment is often uncertain.

A number of programs have been developed to detect repeats in protein sequences, either by comparison with an existing database of known repeats (3,4), or ab initio (5,6). Much more work has been done on transmembrane prediction (712). The problem is in some ways similar to repeat detection since transmembrane regions may be considered as ‘repeats with low sequence complexity’. Although this bias in residue composition allows easier detection of transmembrane helices, the alignment of such sequences is more problematic. Concerning the detection of circular permutations, very few algorithms have been developed so far (13). However, few of these methods consider the global alignment of such proteins. In order to construct a multiple alignment, three steps are necessary: (i) detection of the repeat, transmembrane helix or permutated domains, (ii) local alignment of these elements and (iii) integration of the local alignment into a global family alignment. Comprehensive evaluation of the programs specifically dedicated to repeats, transmembrane or circular permutation requires a large number of accurate reference alignments that can be used as test cases. For this reason, we have developed BAliBASE (2.0), a new version of a manually refined multiple sequence alignment database.


    CONTENTS OF BALIBASE 2.0
 TOP
 ABSTRACT
 INTRODUCTION
 CONTENTS OF BALIBASE 2.0
 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
The first version of BAliBASE (1) was dedicated to the evaluation of multiple alignment programs and was divided into five hierarchical reference sets of (i) equidistant sequences with various levels of conservation, (ii) families aligned with a highly divergent ‘orphan’ sequence, (iii) subgroups with <25% residue identity between groups, (iv) sequences with N/C-terminal extensions and (v) internal insertions. For release 2.0 of BAliBASE, these alignments have been manually verified and corrected by superposition of all known three-dimensional structures, using the lsqman program (14).

BAliBASE 2.0 includes three new alignment references sets (references 6–8) specifically dedicated to repeats, circular permutations and transmembrane proteins. The protein families included in the new reference sets were selected from the Pfam database (15) or from the literature and new family members were detected by searching the sequence databases with BLAST (16). The alignments were confirmed by super­position of the three-dimensional structures where possible. As very few three-dimensional structures of transmembrane proteins have been resolved, a Gribskov profile (17) was constructed of the known family members and new sequences were included in the alignment based on the score obtained in the profile search. BAliBASE 2.0 now consists of 167 reference alignments, with more than 2100 sequences. The three new reference sets (references 6–8) contain 26 protein families with 12 distinct repeat types, eight transmembrane families and five families with inverted domains (Table 1), representing more than 1100 sequences. As in references 1–5, core blocks are defined that only include the regions of the sequences that can be reliably aligned.


View this table:
[in this window]
[in a new window]
 
Table 1. New reference sets in BAliBASE version 2.0
 
BAliBASE 2.0 can be viewed on the WWW at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2 or can be downloaded from ftp://ftp-igbmc.u-strasbg.fr/pub/BAliBASE2/. For each new reference set, the alignments are provided in both RSF and MSF formats with an associated annotation file containing a description of the alignment, including the number and position of repeats/transmembrane helices. The unaligned sequences are also available in FastA format.

Reference 6
The difficulties encountered when detecting or aligning proteins containing repeats are strongly related to the residue similarity of the repeated regions. Therefore, for each of the 12 reference families, a multiple alignment has been constructed by fragmenting the sequences in order to align all the repeated regions. The repeats were then classified into a number of subtypes according to their residue similarity. The number of repeats and the presence of additional domains also affect the ability of a given program to construct an accurate alignment. To address these questions, subsets of each reference alignment are proposed, selected according to repeat subtype and presence of additional domains (Table 2). These 83 subsets present a number of aligned sequences verifying specific criteria:


View this table:
[in this window]
[in a new window]
 
Table 2. Reference 6, subsets of sequences containing repeats showing the number of alignments in each subset
 
• C1(a) The same number of repeats of a unique subtype• C1(b) A variable number of repeats of a unique subtype• C2(a) The same number of repeats with different subtypes in the same order• C2(b) The same number of repeats with different subtypes, but in a different order• C2(c) A variable number of different repeat subtypes• C3 The presence of an additional non-repeated conserved domain• C4 The presence of several different repeat types

By combining these alignment subsets, a flexible benchmark test system can be designed to evaluate and compare the various algorithms currently available for the detection and alignment of repeats in protein sequences.

Reference 7
In the case of transmembrane proteins, the problems are similar to those defined previously for repeats, i.e. detection, local and global alignment. Reference 7, presented in Table 1B, consists of eight families of transmembrane proteins containing approximately 400 aligned protein sequences. A global multiple alignment of each family is available, in which the known/predicted transmembrane helices are identified. Unfortunately, due to the variability observed in the automatic prediction of transmembrane helices and the lack of known three-dimensional structures, the exact locations of the transmembrane helices cannot be reliably determined. We have therefore defined two types of ‘blocks’ in reference 7. The core blocks contain those regions in which all the sequences can be reliably aligned, as before. In addition, we define ‘fuzzy’ blocks in which predicted transmembrane helices overlap, but where the optimal multiple alignment may vary by several residues. In addition, the sequences in each family alignment are divided into a number of sub-groups using the Secator program (N.Wicker, unpublished results). Within each sub-group, the sequences are highly related and may constitute tests for the accuracy and reliability of the prediction for transmembrane prediction programs.

Reference 8
Reference 8 concerns two different, but related, problems. Inversions in proteins can result from various phenomena, such as the insertion of a complete domain at different sites in a protein, or the transfer of part of the C-terminal of the protein to its N-terminal, thus causing a discontinuity of the terminal domain. Examples of both of these cases are presented in Table 1C. Reference 8 consists of five protein families in which the sequential ordering of the domains is not preserved, corresponding to 118 sequences. For each family in this reference, we propose an independent alignment of each permuted domain.


    CONCLUSIONS AND PERSPECTIVES
 TOP
 ABSTRACT
 INTRODUCTION
 CONTENTS OF BALIBASE 2.0
 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
With the addition of the three new reference sets, BAliBASE (2.0) now provides specific test cases representing most of the problems encountered in sequence detection and alignment. A comparison of the programs currently available (work in progress) for the detection and/or alignment of repeats, transmembrane or inverted regions will enable us to identify the strong and weak points of each. The results of the study should allow us to further develop the multiple alignment program DbClustal (18). The modular design of this program facilitates the incorporation of new modules, allowing the integration of information from different sources into the multiple alignment process. The resulting program should enhance the automatic and reliable multiple sequence alignment of these highly modular proteins.


    ACKNOWLEDGEMENTS
 
We would like to thank D. Moras for his continuous support during this work, and F. Plewniak and R. Ripp for helpful discussions. The work was supported by institute funds from Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Hôpital Universitaire de Strasbourg and the Fond de Recherche Hoechst Marion Roussel.


    FOOTNOTES
 
* To whom correspondence should be addressed. Tel: +33 3 88 65 32 00; Fax: +33 3 88 65 32 01; Email: poch{at}igbmc.u-strasbg.fr Back


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 CONTENTS OF BALIBASE 2.0
 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 

    1 Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.[Abstract/Free Full Text]

    2 Thompson,J.D., Plewniak,F. and Poch,O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.[Abstract/Free Full Text]

    3 Andrade,M.A., Ponting,C.P., Gibson,T.J. and Bork,P. (2000) Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol., 298, 521–537.[ISI][Medline]

    4 Bornberg-Bauer,E., Rivals,E. and Vingron,M. (1998) Computational approaches to identify leucine zippers. Nucleic Acids Res., 26, 2740–2746.[Abstract/Free Full Text]

    5 Heringa,J. and Argos,P. A method to recognize distant repeats in protein sequences. (1993) Proteins, 17, 391–441.[ISI][Medline]

    6 Pellegrini,M., Marcotte,E.M. and Yeates,T.O. (1999) A fast algorithm for genome-wide analysis of proteins with repeated sequences. Proteins, 35, 440–446.[ISI][Medline]

    7 Cserzo,M., Wallin,E., Simon,I., von Heijne,G. and Elofsson,A. (1997) Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng., 10, 673–676.[Abstract/Free Full Text]

    8 Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 14, 378–379.[Abstract/Free Full Text]

    9 Kihara,D., Shimizu,T. and Kanehisa,M. (1998) Prediction of membrane proteins based on classification of transmembrane segments. Protein Eng., 11, 961–970.[Abstract/Free Full Text]

    10 Gromiha,M.M. (1999) A simple method for predicting transmembrane alpha helices with better accuracy. Protein Eng., 12, 557–561. [Abstract/Free Full Text]

    11 Pasquier,C., Promponas,V.J., Palaios,G.A., Hamodrakas,J.S. and Hamodrakas,S.J. (1999) A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Protein Eng., 12, 381–385.[Abstract/Free Full Text]

    12 Lio,P. and Vannucci,M. (2000) Wavelet change-point prediction of transmembrane proteins. Bioinformatics, 16, 376–382.[Abstract/Free Full Text]

    13 Uliel,S., Fliess,A., Amir,A. and Unger,R. (1999) A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15, 930–936.[Abstract/Free Full Text]

    14 Kleywegt,G.J. and Jones,T.A. (1995) Where freedom is given, liberties are taken. Structure, 3, 535–540.[Medline]

    15 Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.[Abstract/Free Full Text]

    16 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

    17 Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.[Abstract/Free Full Text]

    18 Thompson,J.D., Plewniak,F., Thierry,J.-C. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 15, 2919–2926.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
B. G. Hall
How Well Does the HoT Score Reflect Sequence Alignment Accuracy?
Mol. Biol. Evol., August 1, 2008; 25(8): 1576 - 1580.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
B. G. Hall
Simulating DNA Coding Sequence Evolution with EvolveAGene 3
Mol. Biol. Evol., April 1, 2008; 25(4): 688 - 695.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Pirovano, K. A. Feenstra, and J. Heringa
PRALINETM: a strategy for improved multiple alignment of transmembrane proteins
Bioinformatics, February 15, 2008; 24(4): 492 - 497.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
T. J. Wheeler and J. D. Kececioglu
Multiple alignment by aligning alignments
Bioinformatics, July 1, 2007; 23(13): i559 - i568.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. S. Papadopoulos and R. Agarwala
COBALT: constraint-based alignment tool for multiple protein sequences
Bioinformatics, May 1, 2007; 23(9): 1073 - 1079.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
U. Roshan and D. R. Livesay
Probalign: multiple sequence alignment using partition function posterior probabilities
Bioinformatics, November 15, 2006; 22(22): 2715 - 2721.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
T. M. Phuong, C. B. Do, R. C. Edgar, and S. Batzoglou
Multiple alignment of protein sequences with repeats and rearrangements
Nucleic Acids Res., November 6, 2006; 34(20): 5932 - 5942.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
J. Pei and N. V. Grishin
MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information
Nucleic Acids Res., September 11, 2006; 34(16): 4364 - 4374.
[Abstract] [Full Text] [PDF]


Home page
Plant Physiol.Home page
E. Rivals, C. Bruyere, C. Toffano-Nioche, and A. Lecharny
Formation of the Arabidopsis Pentatricopeptide Repeat Family
Plant Physiology, July 1, 2006; 141(3): 825 - 839.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y.-m. Huang and C. Bystroff
Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions
Bioinformatics, February 15, 2006; 22(4): 413 - 422.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Huang, D. M. Umbach, and L. Li
Accurate anchoring alignment of divergent sequences
Bioinformatics, January 1, 2006; 22(1): 29 - 34.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
P. P. Gardner, A. Wilm, and S. Washietl
A benchmark of multiple sequence alignment programs upon structural RNAs
Nucleic Acids Res., April 28, 2005; 33(8): 2433 - 2439.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
I. Van Walle, I. Lasters, and L. Wyns
SABmark--a benchmark for sequence alignment that covers the entire known fold space
Bioinformatics, April 1, 2005; 21(7): 1267 - 1268.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
V. A. Simossis, J. Kleinjung, and J. Heringa
Homology-extended sequence alignment
Nucleic Acids Res., February 7, 2005; 33(3): 816 - 824.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
R. C. Edgar
MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Res., March 19, 2004; 32(5): 1792 - 1797.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Print PDF (67K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (41)
Right arrowRequest Permissions
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Bahr, A.
Right arrow Articles by Poch, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bahr, A.
Right arrow Articles by Poch, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?