
Nucleic Acids Research, 2001, Vol. 29, No. 1 52-54
© 2001 Oxford University Press

iProClass: an integrated, comprehensive and annotated protein classification database

Cathy H. Wu*, Chunlin Xiao, Zhenglin Hou, Hongzhan Huang and Winona C. Barker

Protein Information Resource, National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW Washington, DC 20007-2195, USA

Received September 5, 2000; Revised and Accepted October 27, 2000.


    ABSTRACT
 
The iProClass database is an integrated resource that provides comprehensive family relationships and structural and functional features of proteins, with rich links to various databases. It is extended from ProClass, a protein family database that integrates PIR superfamilies and PROSITE motifs. The iProClass currently consists of more than 200 000 non-redundant PIR and SWISS-PROT proteins organized with more than 28 000 superfamilies, 2600 domains, 1300 motifs, 280 post-translational modification sites and links to more than 30 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein and family summary reports provide rich annotations, including membership information with length, taxonomy and keyword statistics, full family relationships, comprehensive enzyme and PDB cross-references and graphical feature display. The database facilitates classification-driven annotation for protein sequence databases and complete genomes, and supports structural and functional genomic research. The iProClass is implemented in the Oracle 8i object-relational system and is available for sequence search and report retrieval at http://pir.georgetown.edu/iproclass/.


    INTRODUCTION
 
In this post-genomic era, advanced databases are essential to facilitate retrieval of relevant information from the voluminous data and to provide insight into protein structure and function. Protein family classification is now well recognized as an effective approach for large-scale genomic sequence annotation. Moreover, it provides an important mechanism for database organization and integration of protein sequence, structure and function. There has been a proliferation of protein databases and a variety of classification schemes to organize the data. Major protein family organizations include hierarchical families of proteins, such as the superfamilies (1) in the PIR-International Protein Sequence Database (PIR-PSD) (2); families of protein domains, such as those in Pfam (3); sequence motifs or conserved regions, as in PROSITE (4); and structural classes, as in SCOP (5). InterPro (http://www.ebi.ac.uk/interpro/) has taken a further step, integrating PROSITE, PRINTS (6), Pfam and ProDom (7) protein signature databases. MetaFam (8) is more comprehensive, assembling about 10 public domain classification databases into a ‘superset’ using set theory and providing a distinctive graphical interface. Still, none of these databases integrates sequence and family annotations with structure and function classifications, which would be valuable not only for data mining and information retrieval, but also as an integral part of family identification algorithms. Moreover, none of these currently include the PIR superfamily and MIPS family classifications (1), which are unique in being based on end-to-end sequence comparisons and include over 182 000 sequences.

The iProClass is an integrated classification database devised as a central resource of annotated protein family information with comprehensive family relationships and structural and functional features of proteins. Its design is extended from ProClass (9,10), the first integrated protein family database, which organizes proteins based on PIR superfamilies and PROSITE motifs. The objectives of the iProClass database are to support knowledge discovery by easy retrieval of family information, database management by full-scale family assignment and complete database organization, and genomic sequence annotation by classification-driven annotation of protein sequences.


    iPROCLASS OVERVIEW AND CURRENT CONTENTS
 
The vast information in iProClass is organized into multiple data sets (Fig. 1). The ClassSeQuence (CSQ) component describes protein sequence entries; the ClassSuperFamily (CSF), ClassDoMain (CDM) and ClassMoTif (CMT) components define family relationships at the superfamily, domain and motif levels; and the ClassFuNction (CFN) and ClassSTructure (CST) components describe protein functional (activity/enzyme) and structural properties and relationships.
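
The relationship between the six components can be pictured as a sequence record that carries references into each of the other data sets. The following Python sketch is only an illustration of that idea: the class name, field names and identifiers are hypothetical and are not taken from the iProClass schema, which is implemented in Oracle 8i.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceEntry:
    """Hypothetical stand-in for a ClassSeQuence (CSQ) record."""
    protein_id: str
    superfamily_ids: List[str] = field(default_factory=list)  # refs into CSF
    domain_ids: List[str] = field(default_factory=list)       # refs into CDM
    motif_ids: List[str] = field(default_factory=list)        # refs into CMT
    ec_numbers: List[str] = field(default_factory=list)       # refs into CFN
    pdb_ids: List[str] = field(default_factory=list)          # refs into CST


def family_view(entry: SequenceEntry) -> Dict[str, List[str]]:
    """Collect the superfamily, domain and motif assignments of one entry."""
    return {
        "superfamily": list(entry.superfamily_ids),
        "domain": list(entry.domain_ids),
        "motif": list(entry.motif_ids),
    }


# Made-up identifiers, used only to show the shape of the data.
example = SequenceEntry(
    protein_id="EXAMPLE1",
    superfamily_ids=["SF-0001"],
    domain_ids=["DM-0042"],
)
print(family_view(example))
```

Keeping the family levels as separate reference lists mirrors the separation of CSF, CDM and CMT described above, so that one protein can be viewed at the whole-protein, domain and motif levels at once.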



 
Figure 1. iProClass database overview.

 
The database has three major features: integration, comprehensiveness and annotation. It integrates protein sequence families with functional and structural classes. It also integrates family classification at the whole protein, domain and motif/site levels, supported by PIR/MIPS superfamilies/families, PIR homology domains and Pfam domains, ProClass/PROSITE motifs and PIR binding, active and modification sites. Three decades after the concept was introduced, the PIR superfamily classification remains the only inclusive scheme that provides a unique hierarchical ordering of proteins to reflect their evolutionary origins and relationships.

The iProClass is comprehensive, containing data derived from and links to several PIR databases, including PIR-PSD, ProClass, PIR-ALN alignment database (11), RESID database of post-translational modifications (12) and PIR-ASDB of precompiled FASTA similarity results, as well as numerous external databases. The current version (β release, September 30, 2000) is based on PIR-PSD release 66.00 (09/00), ProClass 6.0 (08/00), PIR-ALN 25.03 (08/00), RESID 22.01 (07/00), SWISS-PROT 39.0 (05/00) and TrEMBL 14.0 (06/00) (13), Pfam 5.4 (06/00), BLOCKS 12.0 (06/00) (14), PRINTS 27.0 (04/00) (6), PROSITE 14.0 (07/99), PDB (07/00) (15) and COG (01/00) (16).

The database presently consists of more than 200 000 non-redundant sequences, derived from PIR and SWISS-PROT, and more than 29 000 PIR superfamilies, correlated with 100 000 MIPS families, 380 PIR homology domains, 2200 Pfam domains, 1300 ProClass/PROSITE motifs and 280 post-translational modification sites. Also included are 3400 PDB identity links (100% identity), 40 000 PDB similarity links (30–99% sequence identity) and 30 000 enzyme (EC) links. Complete PIR/MIPS superfamily/family, domain and motif alignments are provided in MIPS-ProtFam, PIR-ALN and ProClass, respectively. The iProClass provides cross-references and links to more than 30 databases of protein sequence (PSD, SWISS-PROT, TrEMBL), family and alignment (PIR-ASDB, ProClass, Pfam, PROSITE, PRINTS, BLOCKS, COG, MetaFam, ProtFam, PIR-ALN), protein enzyme/pathway (KEGG, BRENDA, WIT, EcoCyc), protein structure and structural class (PDB, SCOP, CATH, RESID), gene and genome (GenBank, EMBL, DDBJ, TIGR, UWGP, SGD, Flybase, MGI, GDB, OMIM), literature (MEDLINE) and taxonomy (NCBI Taxonomy).
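
The two kinds of PDB link are distinguished by sequence identity: 100% for identity links and 30–99% for similarity links. A minimal Python sketch of that rule follows. The function name is hypothetical, and treating any value from 30% up to (but not including) 100% as a similarity link is an assumption made here for fractional percentages; the text does not say how such values are handled.

```python
from typing import Optional


def classify_pdb_link(percent_identity: float) -> Optional[str]:
    """Label a protein-to-PDB match by its percent sequence identity.

    Assumed rule: 100% gives an identity link, 30% up to (but not
    including) 100% gives a similarity link, anything lower gives no link.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity == 100.0:
        return "identity"
    if percent_identity >= 30.0:
        return "similarity"
    return None


for value in (100.0, 64.0, 12.5):
    print(value, classify_pdb_link(value))
```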

Annotated summary reports and lists have been compiled for all iProClass protein sequence entries and PIR superfamilies. Homology domain, ProClass motif, protein function and structure reports will be included in future releases, as will sequences unique to TrEMBL. Each sequence summary report has sections on general information, database cross-references, family assignments/relationships, and functional and structural information. Family reports contain additional summaries on length, taxonomy and keyword statistics, and a membership section that lists all sequence entries separated by major kingdoms, and denotes those from model organisms or with validated experimental status. Each protein sequence report also includes a graphical feature display that delineates regions of domains, motifs, sites and known structure chains, whichever is applicable.
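
The length, taxonomy and keyword statistics in a family report are aggregates over the member sequences. The sketch below shows one way such a summary could be computed in Python. It is not the PIR implementation, and the dictionary keys (`length`, `kingdom`, `keywords`) and the sample members are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_family(members: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate length, taxonomy and keyword statistics for one family."""
    if not members:
        raise ValueError("a family summary needs at least one member")
    lengths = [m["length"] for m in members]
    return {
        "count": len(members),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
        # Number of member sequences per major kingdom.
        "taxonomy": Counter(m["kingdom"] for m in members),
        # Number of member sequences annotated with each keyword.
        "keywords": Counter(k for m in members for k in set(m["keywords"])),
    }


# Made-up members, used only to exercise the function.
family = [
    {"length": 100, "kingdom": "Eukaryota", "keywords": ["kinase"]},
    {"length": 200, "kingdom": "Bacteria", "keywords": ["kinase", "ATP-binding"]},
    {"length": 150, "kingdom": "Eukaryota", "keywords": []},
]
print(summarize_family(family))
```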


    DATABASE ACCESS AND USAGE
 
The iProClass is implemented in the Oracle 8i object-relational database management system to support database query and management. It is freely accessible from our web site at http://pir.georgetown.edu/iproclass and searchable using different modes (Table 1). Direct report retrieval is based on report unique identifiers such as PIR protein ID or superfamily number. Matching lists of summary reports are retrievable by sequence search or text search. Sequence search, based on BLAST search (17) of user-supplied query sequence against all iProClass protein sequences, returns lists of best-matched families and all sequences above a given threshold. Text search provides list retrieval by using combinations of text string (and substring) searches, including protein title, superfamily or domain name, EC number, keyword, species and other database unique identifiers (such as GenBank accession and protein ID, MEDLINE ID, PDB, SWISS-PROT, TrEMBL, Pfam, PROSITE and COG).
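
Text search as described combines several substring conditions, and a report is listed only if it satisfies all of them. The Python sketch below illustrates that combination on in-memory records. It is an assumption-laden illustration, not the Oracle-backed search used by the web site: the function name, the field names and the sample records are hypothetical, and case-insensitive matching is assumed.

```python
from typing import Dict, List


def text_search(records: List[Dict[str, str]], **criteria: str) -> List[Dict[str, str]]:
    """Return the records in which every named field contains its substring.

    Matching is case-insensitive; a record lacking a requested field
    never matches that criterion.
    """
    def matches(record: Dict[str, str]) -> bool:
        return all(
            field in record and needle.lower() in record[field].lower()
            for field, needle in criteria.items()
        )

    return [record for record in records if matches(record)]


# Made-up records, used only to exercise the function.
reports = [
    {"title": "cytochrome c", "species": "Homo sapiens", "keyword": "electron transfer"},
    {"title": "cytochrome b5", "species": "Bos taurus", "keyword": "heme"},
    {"title": "hexokinase", "species": "Homo sapiens", "keyword": "glycolysis"},
]
hits = text_search(reports, title="cytochrome", species="homo")
print([hit["title"] for hit in hits])
```

Requiring every criterion to match corresponds to combining the search fields with a logical AND; other combinations would need a different predicate.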


 
Table 1. iProClass database search options with examples
 
The recognition of sequence similarity at the whole protein (superfamily) level, together with a full view of family (superfamily–domain–motif) relationships, is essential for accurate genomic sequence annotation. The integration in iProClass allows it to present relationships not available in individual databases alone and to contain more comprehensive information than any other single information resource. The database supports both sequence-based and annotation-based searches. Comparative studies between or among the various family relationships will be facilitated. Knowledge of these relationships is crucial to our understanding of protein evolution, structure and function, and important for functional and structural genomic research.


    ACKNOWLEDGEMENTS
 
This study is supported in part by grant DBI-9974855 from the National Science Foundation and grant P41 LM05798 from the National Library of Medicine, NIH.


    FOOTNOTES
 
* To whom correspondence should be addressed. Tel: +1 202 687 2121; Fax: +1 202 687 1662; Email: wuc@nbrf.georgetown.edu


    REFERENCES
 

    1 Barker,W.C., Pfeiffer,F. and George,D. (1996) Superfamily classification in PIR-international protein sequence database. Methods Enzymol., 266, 59–71.

    2 Barker,W.C., Garavelli,J.S., Hou,Z., Huang,H., Ledley,R.S., McGarvey,P.B., Mewes,H.-W., Orcutt,B.C., Pfeiffer,F., Tsugita,A., Vinayaka,C.R., Xiao,C., Yeh,L.-S.L. and Wu,C. (2001) Protein Information Resource: a community resource for expert annotation of protein data. Nucleic Acids Res., 29, 29–32.

    3 Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.

    4 Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215–219.

    5 Lo Conte,L., Ailey,B., Hubbard,T.J.P., Brenner,S.E., Murzin,A.G. and Chothia,C. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res., 27, 254–256.

    6 Attwood,T.K., Croning,M.D.R., Flower,D.R., Lewis,A.P., Mabey,J.E., Scordis,P., Selley,J.N. and Wright,W. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res., 28, 225–227.

    7 Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28, 267–269.

    8 Silverstein,K.A.T., Shoop,E., Johnson,J.E., Kilian,A., Freeman,J.L., Kunau,T.M., Awad,I.A., Mayer,M. and Retzel,E.F. (2001) The MetaFam Server: a comprehensive protein family resource. Nucleic Acids Res., 29, 49–51.

    9 Wu,C., Zhao,S. and Chen,H.L. (1996) A protein class database organized with ProSite protein groups and PIR superfamilies. J. Comp. Biol., 3, 547–562.

    10 Huang,H., Xiao,C. and Wu,C.H. (2000) ProClass protein family database. Nucleic Acids Res., 28, 273–276.

    11 Srinivasarao,G.Y., Yeh,L.-S., Marzec,C.R., Orcutt,B.C. and Barker,W.C. (1999) PIR-ALN: a database of protein sequence alignments. Bioinformatics, 15, 382–390.

    12 Garavelli,J.S. (2000) The RESID database of protein structure modifications: 2000 update. Nucleic Acids Res., 28, 209–211. Updated article in this issue: Nucleic Acids Res. (2001), 29, 199–201.

    13 Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

    14 Henikoff,J.G., Greene,E.A., Pietrokovski,S. and Henikoff,S. (2000) Increased coverage of protein families with the Blocks database servers. Nucleic Acids Res., 28, 228–230.

    15 Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242. Updated article in this issue: Nucleic Acids Res. (2001), 29, 214–218.

    16 Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36. Updated article in this issue: Nucleic Acids Res. (2001), 29, 22–28.

    17 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

