Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets.

Title: Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets.
Publication Type: Journal Article
Year of Publication: 2021
Authors: Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, Rost B
Journal: Curr Protoc
Date Published: 2021 May
Keywords: Artificial Intelligence, Deep Learning, Machine Learning, Natural Language Processing, Proteins

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time required by previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be used as input features for machine learning libraries to develop methods that predict particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been used as proxies for traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remains for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline
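
The idea behind Basic Protocol 4, using fixed-length per-protein embeddings as input features for a classifier, can be illustrated with a minimal sketch. This is not the protocol's own code and does not call bio_embeddings: the vectors below are synthetic stand-ins for real embeddings, the two class labels are made up for illustration, and the nearest-centroid classifier is a deliberately simple substitute for whatever model a real workflow would train. All function names are hypothetical.

```python
import math
import random


def make_toy_embeddings(n_per_class, dim, seed=0):
    """Return shuffled (vector, label) pairs.

    ASSUMPTION: these random vectors merely stand in for per-protein
    embeddings that a protein language model would produce.
    """
    rng = random.Random(seed)
    class_centers = {"soluble": 1.0, "membrane-bound": -1.0}  # illustrative labels
    samples = []
    for label, center in class_centers.items():
        for _ in range(n_per_class):
            vector = [rng.gauss(center, 0.5) for _ in range(dim)]
            samples.append((vector, label))
    rng.shuffle(samples)
    return samples


def fit_centroids(samples):
    """Compute the mean embedding (centroid) of each class."""
    sums, counts = {}, {}
    for vector, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, vector) == label for vector, label in samples)
    return correct / len(samples)


# Hold out part of the data so the classifier is scored on unseen vectors.
data = make_toy_embeddings(n_per_class=50, dim=16, seed=0)
train_set, test_set = data[:80], data[80:]

centroids = fit_centroids(train_set)
test_accuracy = accuracy(centroids, test_set)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

In a real workflow the synthetic vectors would be replaced by embeddings computed from sequences, and the labels by curated annotations such as subcellular localization or solubility; the train/evaluate split shown here would stay the same in spirit.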

Alternate Journal: Curr Protoc
PubMed ID: 33961736
Grant List: RO1320/4-1 / Deutsche Forschungsgemeinschaft