See the tool's GitHub repository for the most up-to-date information.
goPredSim predicts Gene Ontology (GO) terms for protein sequences through annotation transfer. The approach resembles homology-based inference but uses similarity in SeqVec embedding space instead of sequence similarity. It first runs the language model SeqVec on the query protein sequence to derive a contextualized embedding. This query embedding is compared to a set of precomputed embeddings of previously annotated proteins using pairwise Euclidean distance. For each ontology, the GO annotation of the closest protein in embedding space is transferred to the query protein. The distance between the query protein and the closest annotated protein is converted to a reliability score ranging from very low (0) to very high (100).
What is predicted?
The output of goPredSim is a list of Gene Ontology (GO) terms. GO strives to capture the complexity of protein function and to standardize the vocabulary used to describe it in a human- and machine-readable manner. GO separates different aspects of function into three hierarchies: MFO (Molecular Function Ontology), BPO (Biological Process Ontology), and CCO (Cellular Component Ontology, i.e. the cellular component(s) or subcellular localization(s) in which the protein acts). Each ontology is a rooted graph in which each node represents a GO term and each link a functional relationship. Thus, the prediction of our method can be seen as three subgraphs of the full ontologies. These three subgraphs are displayed below the tabular result. Often, the tabular result contains only very specific functional terms that do not reflect the more general role of the protein, which can be inferred by following the links toward the root of the ontology. The graphical results show these more general terms as well (predicted: yellow boxes, inferred: white boxes).
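To illustrate how more general terms can be inferred from a specific prediction, here is a minimal Python sketch that walks from predicted terms toward the root. The parent map, the term names, and the function with_ancestors are hypothetical and chosen for illustration only; they are not the real GO graph or part of goPredSim.

```python
# Toy parent map (child -> parents); a hypothetical fragment, not the real GO graph.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "catalytic activity": ["root"],
}


def with_ancestors(predicted, parents=PARENTS):
    """Return the predicted terms plus every term on a path to the root."""
    seen = set()
    stack = list(predicted)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # a term can be reached via several parents
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


predicted = {"DNA binding"}            # would be drawn as a yellow box
subgraph = with_ancestors(predicted)   # all nodes of the displayed subgraph
inferred = subgraph - predicted        # would be drawn as white boxes
```

In this toy example, the inferred terms are the three ancestors of "DNA binding", while the unrelated branch "catalytic activity" stays outside the subgraph.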
What can you expect from GO term predictions?
Replicating the conditions of CAFA3, which allows a comparison of our method to other state-of-the-art approaches, showed that our method would have been competitive with the top 10 CAFA3 competitors and clearly outperformed homology-based inference, achieving Fmax(BPO)=37±2%, Fmax(MFO)=50±2%, and Fmax(CCO)=58±2%. Results on a new dataset that was not available during method development, together with preliminary results from CAFA4, support these findings. For each prediction, a reliability score is provided; it is derived from the distance between the query protein and the closest annotated protein in SeqVec embedding space. If this score is >0.5, we expect a precision and recall of ~50% for BPO and MFO and ~60% for CCO.
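For readers unfamiliar with Fmax, the following Python sketch shows how the protein-centric measure is commonly computed in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and Fmax is the largest resulting F1. The function name fmax, the threshold grid, and the toy data are assumptions for illustration; this is not the official CAFA evaluation code.

```python
def fmax(predictions, truth, thresholds=None):
    """predictions: protein -> {term: score}; truth: protein -> non-empty set of terms."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid: 0.01 .. 1.00
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            called = {term for term, score in scored.items() if score >= t}
            hits = len(called & true_terms)
            if called:
                # Precision only counts proteins with a prediction at this threshold.
                precisions.append(hits / len(called))
            # Recall counts every benchmark protein.
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy benchmark with made-up term names and scores.
truth = {"A": {"t1", "t2"}, "B": {"t3"}}
predictions = {"A": {"t1": 0.9, "t4": 0.4}, "B": {"t3": 0.6}}
score = fmax(predictions, truth)
```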
Our method consists of three steps. First, the language model SeqVec is used to represent the query protein as an embedding (a numerical vector). Second, this embedding is used to compute the pairwise Euclidean distance to each embedding in a pre-computed lookup set of annotated proteins. As not all proteins have annotations in all three ontologies, we pick the most similar protein for each of the three ontologies separately. Third, the annotation of the most similar protein for each ontology is transferred to the query protein.
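The lookup and transfer steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the goPredSim implementation: the lookup set, the three-dimensional vectors, the protein IDs, and the function names are hypothetical, and the formula used to turn a distance into a reliability value is an assumption chosen only so that smaller distances give higher scores. The embedding step itself (running SeqVec) is omitted.

```python
import math

# Hypothetical lookup set: protein ID -> (embedding, GO terms per ontology).
# Vectors and annotations are toy values, not real SeqVec output.
LOOKUP = {
    "P1": ([0.0, 0.0, 1.0], {"MFO": {"GO:0003677"}, "BPO": {"GO:0006355"}}),
    "P2": ([1.0, 0.0, 0.0], {"MFO": {"GO:0016301"}, "CCO": {"GO:0005634"}}),
    "P3": ([0.0, 1.0, 0.0], {"CCO": {"GO:0005737"}}),
}

ONTOLOGIES = ("MFO", "BPO", "CCO")


def reliability(distance):
    """Map a distance to (0, 1]; this formula is an illustrative assumption."""
    return 0.5 / (0.5 + distance)


def transfer_annotations(query, lookup=LOOKUP):
    """For each ontology, copy the GO terms of the closest annotated protein."""
    result = {}
    for ontology in ONTOLOGIES:
        # Only proteins annotated in this ontology are candidates.
        candidates = [
            (math.dist(query, embedding), protein_id)
            for protein_id, (embedding, annotations) in lookup.items()
            if ontology in annotations
        ]
        if not candidates:
            continue
        distance, protein_id = min(candidates)
        result[ontology] = {
            "source": protein_id,
            "terms": lookup[protein_id][1][ontology],
            "reliability": reliability(distance),
        }
    return result


prediction = transfer_annotations([0.1, 0.0, 0.9])
```

Because P1 has no CCO annotation in this toy lookup set, the CCO terms come from a different protein (P2) than the MFO and BPO terms, which mirrors the per-ontology selection described above.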
- The program can be accessed online via the PredictProtein service.
For questions, please contact email@example.com