| Title: | Epitome: Database of structure-inferred antigenic epitopes |
| Author: | Avner Schlessinger, Yanay Ofran, Guy Yachdav & Burkhard Rost |
| Quote: | Epitome: Database of structure-inferred antigenic epitopes; NAR (Database Issue 2005) in press. |
Epitome: Database of structure-inferred antigenic epitopes
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| & | The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. |
| * | Corresponding author: schles@rostlab.org URL http://www.rostlab.org/ Tel: +1-212-851-4669, fax: +1-212-305-7932 |
Immunoglobulin molecules specifically recognize particular areas on the surface of proteins. These areas are commonly dubbed B-cell epitopes. The identification of epitopes in proteins is important for the design of both experiments and vaccines. Additionally, the interactions between epitopes and antibodies have often served as a model for protein-protein interactions. One of the main obstacles in creating a database of antigen-antibody interactions is the difficulty in distinguishing between antigenic and non-antigenic interactions. Antigenic interactions involve specific recognition sites on the antibody's surface, while non-antigenic interactions are between a protein and any other site on the antibody. To solve this problem, we performed a comparative analysis of all protein-antibody complexes for which structures have been experimentally determined. Additionally, we developed a semi-automated tool that identified the antigenic interactions within the known antigen-antibody complex structures. We compiled those interactions into Epitome, a database of structure-inferred antigenic residues in proteins. Epitome consists of all known antigen/antibody complex structures, a detailed description of the residues that are involved in the interactions, and their sequence/structure environments. Interactions can be visualized using an interface to Jmol. The database is available at http://www.rostlab.org/services/epitome/.
Key words: Antigenic determinants, B-cell epitopes, Antigens, Antibodies, Database.
| 3D | three-dimensional |
| DSSP | automatic assignment of secondary structure and solvent accessibility from 3D coordinates [1] |
| HSSP | database of protein structure-sequence alignments [2, 3] |
| CDR | Complementarity Determining Regions |
Protein-antigen structures. Antigen-antibody complexes have long been used as a model for understanding the general phenomenon of molecular recognition [4, 5, 6, 7, 8]. The number of experimental high-resolution three-dimensional (3D) structures of antibody-antigen complexes in the PDB [9] has increased significantly in recent years. Several groups have used these data to analyze and characterize antigenic interactions, i.e., interactions between the protein (the antigen) and the Complementarity Determining Regions (CDRs) of the antibody [10, 11, 4, 12, 13, 6, 14, 15]. An important first step in studying antigenic interactions is the characterization of CDRs. MacCallum et al. observed that the hypervariable loops of CDRs adopt only a limited number of backbone conformations that are determined by a few key residues [12]. Two recent studies have suggested that the amino acid composition and the length of CDRs determine the type of antigen that can be bound [14, 15]. Several studies have attempted to differentiate the residues on the antigen surface that are involved in the antigenic interaction from all others [10, 11, 4]. The results of these studies were rather inconsistent. Differences in the data sets chosen (some of which were very small) and in the methodologies may explain some of those inconsistencies. Most importantly, however, the definitions of the CDRs often differed greatly, i.e., even if two studies investigate the same PDB complex and use the same methodology, they might disagree on which of the interactions are antigenic [11, 16]. An important ramification of this problem was unveiled by Blythe and Flower [17], who showed that most existing B-cell epitope prediction methods do not work adequately. One explanation for this observation could be that most methods rely on inaccurate identifications of epitopes.
Definition of the CDRs. Antibodies are composed of a skeleton of beta-sheets. Most of the amazing variety of antibodies is realized by differences in six hypervariable loops of the CDRs. Therefore, the CDRs have previously been defined through these six loops. The first definition of CDRs was as regions in the Kabat sequence variability plot [18, 19]. The residues in these regions are identified through an alignment between the query sequence and a consensus motif for antibodies. Although widely used, the Kabat CDR definitions can be problematic because CDRs that are in structural loops often have very unusual sequences that are not captured by regular sequence motifs [16]. In fact, any method based only on sequence information is prone to misaligning and therefore mis-assigning loopy CDRs. Chothia and colleagues therefore based their CDR identification on structural information [20, 21]. Initially, hypervariable loops were defined according to a few structures. Later, the numbering of the residues that was used to locate the CDRs was changed to account for structures that became available subsequently [21, 13]. Studies also differ in their definition of secondary structures, thereby increasing the inconsistency in defining hypervariable loops. Additional disadvantages of both the Kabat and the Chothia et al. methods are described elsewhere (http://www.bioinf.org.uk/abs/).
Here, we address these problems through a comprehensive study of all known antigen-antibody complexes in the PDB. Analyzing the structures, we identified the consensus residues on the antibodies and thereby identified the CDRs on all known protein-antibody complexes (details below). This initial set of CDRs facilitated the automatic generation of a database with all known antigenic residues in the PDB; we also included the sequence environment and a detailed description of the CDR with which they interact. Several databases of antibody-antigen complex structures are available [16, 22, 23]. Some of these databases focus on the structural aspects of the interaction [24, 22]. There are also databases that compile B-cell epitopes without their corresponding antibodies [17, 25]. However, none of these databases explicitly locates the CDRs or identifies the antigenic residues semi-automatically. In this sense, our resource is more comprehensive and easily adjustable to growing data, as more 3D structures of antigen-antibody complexes become available. Thus, the databases mentioned above - particularly the ones that are not structure based - are complementary to Epitome.
Extraction of 3D structures and identification of CDRs. In order to identify all structures in the PDB that contain at least one antibody-antigen complex, we searched the PDB with BLAST [26], using a consensus antibody sequence as the query. The rationale for using BLAST rather than PSI-BLAST was to avoid capturing molecules such as T-cell receptors which, despite their similarity to antibodies, participate in cell-mediated immune response, and therefore represent a different type of antigenic interaction. We then added PDB structures that contain an immunoglobulin fold according to the Structural Classification of Proteins database (SCOP) [27], and PDB entries that are identified as antibody-antigen complexes through keywords (e.g., "antibody" and "antigen"). We discarded all complexes with T-cell receptors or MHC molecules, since these are formed during cell-mediated immune response. We labeled residues as interacting if any of their respective atoms were within a sphere of 6 Å [28]. This resulted in our final list of interactions between antibodies and antigens. Thus, we define antibody-antigen interaction as spatial proximity between a residue within the CDRs and a residue on the surface of the antigenic protein.
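The contact criterion described above can be illustrated with a minimal Python sketch. It assumes that each residue is available as a list of atom coordinates in Å. The function name `interacting_residues`, the residue keys, and the toy coordinates are hypothetical illustrations and are not taken from Epitome. Treating the 6 Å cutoff as inclusive is also an assumption.

```python
from itertools import product
from math import dist

CUTOFF = 6.0  # distance cutoff in Å, as stated in the text


def interacting_residues(antigen, antibody, cutoff=CUTOFF):
    """Return (antigen residue, antibody residue) pairs that are in contact.

    antigen / antibody: dict mapping a residue key to a list of (x, y, z)
    atom coordinates. Two residues are in contact if ANY pair of their
    atoms lies within `cutoff` (inclusive; an assumption of this sketch).
    """
    pairs = []
    for (ag_id, ag_atoms), (ab_id, ab_atoms) in product(antigen.items(), antibody.items()):
        if any(dist(a, b) <= cutoff for a in ag_atoms for b in ab_atoms):
            pairs.append((ag_id, ab_id))
    return pairs


# Toy data (invented coordinates): keys are (chain, position, residue name).
antigen = {
    ("C", 45, "TYR"): [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    ("C", 80, "GLY"): [(30.0, 0.0, 0.0)],
}
antibody = {
    ("L", 32, "SER"): [(5.0, 0.0, 0.0)],
    ("H", 100, "TRP"): [(40.0, 0.0, 0.0)],
}

pairs = interacting_residues(antigen, antibody)
print(pairs)
```

The all-against-all loop is quadratic in the number of atoms. For whole PDB entries, a spatial index such as a k-d tree would be the usual choice, but the criterion itself stays the same.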
We located the CDRs in the known protein-antibody complexes through the following knowledge-based approach. We began by creating multiple structure alignments of antibody structures using SKA [29, 30]. Since the light and heavy chains have different CDRs, two different multiple structure alignments were performed, one for each type of antibody chain. Additionally, because our database included several redundant sequences, we ran the structural alignment program on a sequence-unique subset of all protein-antibody complexes. As antibody sequences are highly similar to each other, the criterion for the redundancy of the complex set was determined by the antigen sequences; sequence redundancy was reduced at HSSP-values of 0 (corresponding to less than 33% pairwise sequence identity for long alignments) [31, 32]. Then, we identified structurally aligned positions that interact with a protein in more than 10% of the complexes of the alignment. We defined the borders of the CDRs through those highly populated positions. Given the CDRs in the aligned antibodies, we transferred their location to the antibody chains of the corresponding sequence-structure family that they represent by structural pairwise alignments using Combinatorial Extension (CE) [33] (Fig. 1). Finally, we defined all the residues on the protein surface that are in contact with the residues on the antibody CDRs as antigenic residues.
Fig. 1: Antigenic residues according to Epitome. Complex structure of quail lysozyme (in blue) and the light chain of an antibody (in green), as taken from PDB id 1bql [34]. The residues that are defined to be in CDR 1 of the light chain according to the Kabat definition [18] are colored in black. Residues in red are all the residues that are involved in the interaction according to Epitome. Note that not all of the residues on the antibody surface that are located in the "Kabat" CDR are involved in the antigenic reaction. Additionally, although the 1bql antibody chains did not participate in the multiple structure alignment, i.e., the information about the location of the CDR was transferred from a homologous structure, the interaction was correctly identified.
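The step of finding highly populated alignment positions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that, for each aligned antibody chain, the set of alignment columns contacting the antigen is already known. The function names, the `max_gap` parameter, and the toy data are hypothetical. How the real procedure turns populated positions into CDR borders is not specified in the text, so grouping contiguous columns is only one plausible reading.

```python
def contact_frequency(contacts_per_chain, n_columns):
    """Fraction of aligned chains that contact the antigen at each column.

    contacts_per_chain: list with one set of alignment column indices per chain.
    """
    n_chains = len(contacts_per_chain)
    return [
        sum(col in contacts for contacts in contacts_per_chain) / n_chains
        for col in range(n_columns)
    ]


def populated_segments(freq, threshold=0.10, max_gap=0):
    """Group columns with frequency > threshold into (start, end) segments.

    max_gap (hypothetical parameter): number of unpopulated columns that
    may be bridged inside one segment; 0 means strictly contiguous.
    """
    segments = []
    for col, f in enumerate(freq):
        if f <= threshold:
            continue
        if segments and col - segments[-1][1] <= max_gap + 1:
            segments[-1][1] = col  # extend the current segment
        else:
            segments.append([col, col])  # start a new segment
    return [tuple(s) for s in segments]


# Toy alignment: 10 chains, 12 columns (invented contact sets).
contacts = [
    {2, 3, 4}, {3, 4, 9}, {3, 5}, {9, 10}, {5, 10},
    set(), set(), set(), set(), {0},
]
freq = contact_frequency(contacts, n_columns=12)
segments = populated_segments(freq)
print(segments)
```

In this toy case, columns contacted by only one of the ten chains sit exactly at 10% and are therefore excluded, since the text requires more than 10%.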
Content statistics. Epitome currently contains 142 antigens from protein-antibody complex structures, with a total of 10,180 antigenic interactions. Of these complexes, 63 contain sequence-unique antigens, i.e., no two of these 63 antigens are similar enough in sequence to enable coarse-grained homology modeling of one from the other.
Input and fields. Epitome users can search for epitopes either by querying the database or by entering a sequence and "BLASTing" for similar sequences that are stored in the database. The fields that can be queried include one or more of the following: PDB identifier (four-character code used by the PDB, e.g., 1pdb); antigen chain ID (PDB identifier for the chain of the antigen, e.g., 1pdb_C); antigen residue type (one-letter code for amino acids, e.g., Y corresponds to tyrosine); antigen residue secondary structure state as defined by DSSP [1] (one-letter code; G, H, and I correspond to helical structures, E and B to strands, and T, S, and L to other); antigen residue solvent accessibility (the input is the accessible surface area in Å² as defined by DSSP [1], and the search returns all residues with accessibility values greater than or equal to the input value); antigen residue position (the residue number as annotated in the PDB file); heavy/light chain (the interaction involves residues that are located on the light chain, the heavy chain, or both chains of the antibody); antibody chain identifier (similar to the antigen chain identifier); antibody residue type (one-letter code for amino acids, e.g., C corresponds to cysteine); antibody residue position in the PDB (the position of the antibody residue that is involved in the interaction as annotated by the PDB); and CDR number (possible values: 1, 2, 3).
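To make the query fields concrete, the following Python sketch models one antigenic interaction as a record and filters a list of records. It is only an illustration of the fields listed above: the class name `Interaction`, the attribute names, the `query` helper, and all example values are hypothetical and do not reflect the actual Epitome schema or its data.

```python
from dataclasses import dataclass

# DSSP one-letter states grouped as described in the text.
DSSP_CLASS = {
    **dict.fromkeys("GHI", "helix"),
    **dict.fromkeys("EB", "strand"),
    **dict.fromkeys("TSL", "other"),
}


@dataclass(frozen=True)
class Interaction:
    pdb_id: str                   # four-character PDB code, e.g. "1pdb"
    antigen_chain: str            # e.g. "C"
    antigen_residue: str          # one-letter amino acid code
    antigen_position: str         # PDB residue number (str allows insertion codes)
    antigen_dssp: str             # one-letter DSSP state
    antigen_accessibility: float  # accessible surface area in Å²
    antibody_chain: str
    antibody_chain_type: str      # "heavy" or "light"
    antibody_residue: str
    antibody_position: str
    cdr: int                      # 1, 2, or 3


def query(records, min_accessibility=None, **exact):
    """Filter records: accessibility >= threshold, other fields by exact match."""
    hits = []
    for r in records:
        if min_accessibility is not None and r.antigen_accessibility < min_accessibility:
            continue
        if all(getattr(r, field) == value for field, value in exact.items()):
            hits.append(r)
    return hits


# Invented example records (placeholder PDB code from the text, made-up values).
records = [
    Interaction("1pdb", "C", "Y", "45", "H", 80.0, "L", "light", "S", "32", 1),
    Interaction("1pdb", "C", "G", "80", "T", 20.0, "H", "heavy", "W", "100", 3),
]

exposed = query(records, min_accessibility=50.0)
cdr3_heavy = query(records, cdr=3, antibody_chain_type="heavy")
print(exposed[0].antigen_residue, DSSP_CLASS[exposed[0].antigen_dssp])
```

The accessibility filter mirrors the "greater than or equal to" semantics described for that field, while all other fields are matched exactly.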
Output. Results for database queries are presented as a table that lists all features of the result sets (Fig. 2). The antigen results include the residues in the environment of the antigen (highlighted in red). If a user performs a BLAST sequence search against the Epitome database to find PDB structures containing antigens with similar sequences, the output will be all complex structures consisting of proteins with a high degree of similarity to the input sequence, the corresponding E-value, and the BLAST score of the pairwise sequence alignments. Additionally, each BLAST hit contains a link that can trigger another database query.
Fig. 2: Screenshot of a database entry. Each line of the table represents a different antigenic interaction, i.e., the interaction of a protein surface residue with an antibody surface residue that is located on one of the antibody's six CDRs. Note that the search can be performed using any of the table fields and that there is an additional link to visualize the interaction using Jmol (http://jmol.sourceforge.net/).
Updates. Since most Epitome entries were identified using the SCOP database, Epitome updates will follow updates of SCOP, i.e. Epitome will be updated twice a year as soon as SCOP updates its parseable files. Additionally, all the other programs used to create the database are installed locally and can be run automatically.
Thanks to Jinfeng Liu (Columbia) for computer assistance and to Andrew Kernytsky and Henry Bigelow for helpful comments on the manuscript. This work was supported by the grants RO1-GM64633-01 from the National Institutes of Health (NIH) and RO1-LM07329-01 from the National Library of Medicine (NLM). Last but not least, thanks to Helen Berman (Rutgers), Phil Bourne (UCSD), and their crews for maintaining an excellent PDB, and to all experimentalists who enabled this analysis by making their data publicly available.
| 1. | Kabsch, W., Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 2. | Sander, C., Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68. |
| 3. | Schneider, R., de Daruvar, A. & Sander, C. (1997). The HSSP database of protein structure-sequence alignments. Nucleic Acids Research, 25, 226-230. |
| 4. | Jones, S. & Thornton, J. M. (1996). Principles of protein-protein interactions. Proc Natl Acad Sci U S A, 93, 13-20. |
| 5. | Jones, S., Thornton, J.M. (1997). Prediction of protein-protein interaction sites using patch analysis. J Mol Biol, 272, 133-143. |
| 6. | Jones, S. & Thornton, J. M. (1997). Analysis of protein-protein interaction sites using surface patches. J Mol Biol, 272, 121-32. |
| 7. | Lo Conte, L., Chothia, C., Janin, J. (1999). The atomic structure of protein-protein recognition sites. J Mol Biol, 285, 2177-98. |
| 8. | Chen, R., Mintseris, J., Janin, J., Weng, Z. (2003). A protein-protein docking benchmark. Proteins, 52, 88-91. |
| 9. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 10. | Van Regenmortel, M. H. V. (1992). Structure of antigens. CRC Press. |
| 11. | Davies, D. R., Cohen, G.H. (1996). Interactions of protein antigens with antibodies. Proc Natl Acad Sci U S A, 93, 7-12. |
| 12. | MacCallum, R. M., Martin, A.C., Thornton, J.M. (1996). Antibody-antigen interactions: contact analysis and binding site topography. J Mol Biol, 262, 732-45. |
| 13. | Al-Lazikani, B., Lesk, A. M. & Chothia, C. (1997). Standard conformations for the canonical structures of immunoglobulins. J Mol Biol, 273, 927-48. |
| 14. | Collis, A. V., Brouwer, A. P. & Martin, A. C. (2003). Analysis of the antigen combining site: correlations between length and sequence composition of the hypervariable loops and the nature of the antigen. J Mol Biol, 325, 337-54. |
| 15. | Almagro, J. C. (2004). Identification of differences in the specificity-determining residues of antibodies that recognize antigens of different size: implications for the rational design of antibody repertoires. J Mol Recognit, 17, 132-43. |
| 16. | Allcorn, L. C. & Martin, A. C. (2002). SACS--self-maintaining database of antibody crystal structure information. Bioinformatics, 18, 175-81. |
| 17. | Blythe, M. J., Doytchinova, I. A. & Flower, D. R. (2002). JenPep: a database of quantitative functional peptide data for immunology. Bioinformatics, 18, 434-9. |
| 18. | Wu, T. T., Kabat, E.A. (1970). An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains and their implications for antibody complementarity. J Exp Med, 132, 211-250. |
| 19. | Johnson, G., Wu, T.T. (2000). Kabat Database and its applications: 30 years after the first variability plot. Nucl Acid Res, 28, 214-218. |
| 20. | Chothia, C. & Lesk, A. M. (1987). Canonical structures for the hypervariable regions of immunoglobulins. J. Mol. Biol., 196, 901-917. |
| 21. | Chothia, C., Lesk, A.M., Tramontano, A., Levitt, M., Smith-Gill, S.J., Air, G., Sheriff, S., Padlan, E.A., Davies, D., Tulip, W.R. (1989). Conformations of immunoglobulin hypervariable regions. Nature, 342, 877-83. |
| 22. | Peters, B., Sidney, J., Bourne, P., Bui, H. H., Buus, S. et al. (2005). The design and implementation of the immune epitope database and analysis resource. Immunogenetics, 57, 326-36. |
| 23. | Saha, S., Bhasin, M. & Raghava, G. P. (2005). Bcipep: a database of B-cell epitopes. BMC Genomics, 6, 79. |
| 24. | Kaas, Q., Ruiz, M. & Lefranc, M. P. (2004). IMGT/3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data. Nucleic Acids Res, 32, D208-10. |
| 25. | McSparron, H., Blythe, M. J., Zygouri, C., Doytchinova, I. A. & Flower, D. R. (2003). JenPep: a novel computational information resource for immunobiology and vaccinology. J Chem Inf Comput Sci, 43, 1276-87. |
| 26. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402. |
| 27. | Murzin, A. G., Brenner, S. E., Hubbard, T., Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540. |
| 28. | Ofran, Y., Rost, B. (2003). Analysing Six Types of Protein-Protein Interfaces. J Mol Biol, 325, 377-387. |
| 29. | Petrey, D. & Honig, B. (2003). GRASP2: visualization, surface properties, and electrostatics of macromolecular structures and sequences. Methods Enzymol, 374, 492-509. |
| 30. | Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M. et al. (2003). Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins, 53 Suppl 6, 430-5. |
| 31. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 32. | Mika, S., Rost, B. (2003). UniqueProt: Creating representative protein sequence sets. Nucl Acid Res, 31, 3789-3791. |
| 33. | Shindyalov, I. N., Bourne, P.E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11, 739-747. |
| 34. | Chacko, S., Silverton, E. W., Smith-Gill, S. J., Davies, D. R., Shick, K. A. et al. (1996). Refined structures of bobwhite quail lysozyme uncomplexed and complexed with the HyHEL-5 Fab fragment. Proteins, 26, 55-65. |
| Contact: server@rostlab.org | Version: Sep 28, 2005 |