GSoC < BioJS

The BioJS team

The BioJS development team is happy to participate in Google Summer of Code for the first time. We are an international team developing beautiful, interactive and easy-to-share JavaScript applications for visualization of biological data. This web site gives an overview of project ideas we are offering to students who wish to spend their summer working with us. We are looking for 2 to 3 excellent students for collaboration on the following projects.

Students: how to apply?

Start early

Experience from other organisations has shown that students often make mistakes converting the official 19:00 UTC deadline to their own timezone, or run into technical problems submitting in Melange during the final few seconds. We very much want to receive your application and don't want you to be disappointed or miss your chance of a great summer! So please remember to submit early and then edit as often as you want before the deadline (21.03, 19:00 UTC).

Present yourself

As we don't know you, we would like to know more about your awesome personality.
How can we contact you?
How are your programming skills?
- What programs have you coded so far?
- Do you have any code samples?
- Are you familiar with JavaScript/jQuery?
- Is there anybody who can vouch for your experience?
Academic Experience
- At which university/college are you currently enrolled?
- In what year/semester are you? What is your major/degree/focus?
- Are you part of a research group?
- Have you had any lectures about Bioinformatics or Biochemistry?
Have you contributed to any Open Source project before?

Explain your goals

What exactly do you intend to do? Please be as specific as possible
How do you plan to achieve it?
- What parts do you expect to be tricky?
- Which technologies do you plan to use?
What are your milestones for your project?
- A timeline with your milestones is highly recommended
- A task/milestone for each week is generally a good idea
- Other commitments: Do you have any other commitments during this time that could impact your ability to be online, coding and/or communicating with your mentor? Include school, work and family commitments.

Communication

What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project?
What should be our strategy if you vanish during the project? How do you plan to keep yourself on track?

Why BioJS?

Tell us why you want to work with us.
How do you envision your involvement with BioJS after GSoC?

Some tips

Is there any special reason why we should pick you? Name it.
Get in touch with us as soon as possible and post your ideas on our mailing list or IRC channel.
Look at open issues on GitHub
Keep it simple: we don't need a 10-page essay on the project, so it's absolutely OK to be concise. However, you shouldn't miss the main points.

Further resources

Meet the mentors

Alex Kalderimis

Alex Kalderimis is a Software Developer currently working on the Biological Data Warehouse InterMine at University of Cambridge.

Projects to mentor: all projects

Contact: alex (at) flymine.org

Andrew Nightingale

PhD Bioinformatician/Biochemist, expert in protein sequence classification and structure-based drug design. Currently implementing the incorporation of genetic variants into UniProt.

Projects to mentor: Human Genetic Variation Viewer

Contact: anight (at) ebi.ac.uk

Anil Thanki

Anil Thanki is a Scientific Programmer at TGAC.

Projects to mentor: BAM File Viewer

Contact: thanki.anil (at) gmail.com

Fabian Schreiber

Dr. Fabian Schreiber is the Project Leader of the TreeFam project at the Wellcome Trust Sanger Institute and the EMBL-EBI.

Projects to mentor: Phylogenetic Tree Viewer

Contact: fs (at) ebi.ac.uk

Francis Rowland

Francis Rowland is a UX Designer and a Web Developer at EMBL-EBI and an organizer at the Cambridge Usability Group.

Projects to mentor: all projects

Contact: frowland (at) ebi.ac.uk

Gustavo Salazar

Gustavo Salazar is a PhD student at the CBIO group in the University of Cape Town with over 10 years of experience in web development.

Projects to mentor: all projects

Contact: gustavoadolfo.salazar (at) gmail.com

Guy Yachdav

Guy Yachdav is a Co-Founder and a Co-CEO at Biosof LLC.

Projects to mentor: Sequence Logo Viewer

Contact: gyachdav (at) bio-sof.com

Ian Mulvany

Ian Mulvany is Head of Technology for eLife, a new open access publisher. He has responsibility for digital product development, digital strategy and improving the author and reader experience of interacting with a publisher site. In 2013 eLife developed a new open source web-based viewer for scholarly articles, eLife Lens. Before eLife, Ian was Head of Product for Mendeley and a product manager for Nature Publishing Group, working on products such as Connotea and Nature Network.

Projects to mentor: Integrate BioJS into an academic publishing workflow

Contact: i.mulvany (at) elifesciences.org

Ian Sillitoe

Ian Sillitoe is a Senior Research Fellow in the Orengo Group at UCL, UK; Technical Manager of CATH, protein structure classification database; Lead Developer of Genome3D, a UK collaboration to annotate genomes with structures.

Projects to mentor: Multiple Sequence Alignment Viewer

Contact: ian.sillitoe (at) googlemail.com

Leyla Garcia

Leyla Garcia is currently a Web & Data Integration Developer in the UniProt group at EMBL-EBI. She has also participated in projects related to Linked Open Data, and previously worked as a Computer Science lecturer.

Projects to mentor: all projects

Contact: ljgarcia (at) ebi.ac.uk

Miguel Pignatelli

Miguel Pignatelli is a Software Developer at Ensembl (EMBL-EBI).

Projects to mentor: Phylogenetic Tree Viewer

Contact: mp (at) ebi.ac.uk

Rafael Jimenez

Rafael Jimenez is a Biologist and a Computer Scientist specialised in Bioinformatics services. Technical Coordinator of ELIXIR, a European life-sciences Infrastructure for biological Information. Experienced in projects related to data integration, visualisation, best practices and reusability.

Projects to mentor: Integrate BioJS into an academic publishing workflow

Contact: rafael.jimenez (at) elixir-europe.org

Sangya Pundir

Sangya Pundir works at EMBL-EBI as a User Experience Analyst and Designer for the UniProt team. Her work is focused on making scientific resources easier to use through intuitive and innovative design.

Projects to mentor: Taxonomy Viewer

Contact: spundir (at) ebi.ac.uk

Seth Carbon

Seth Carbon is an AmiGO/Gene Ontology Software Developer and a BBOP System Architect/Administrator at the Berkeley Bioinformatics Open-source Project (BBOP).

Projects to mentor: Multiple Sequence Alignment Viewer

Contact: sjcarbon (at) berkeleybop.org

Suzanna Lewis

Dr. Suzanna Lewis is a Principal Investigator (PI) at the Berkeley Bioinformatics Open-source Project (BBOP). She is also a PI and a Co-Founder of the Gene Ontology Consortium, one of the largest and most successful solutions to describe large-scale biological data. Suzanna's work has contributed significantly to the development of open source standards and many state-of-the-art bioinformatics software tools.

Projects to mentor: Multiple Sequence Alignment Viewer

Contact: selewis (at) lbl.gov

Tatyana Goldberg

Tatyana Goldberg is a PhD student in Bioinformatics in the RostLab at Technical University Munich.

Projects to mentor: Sequence Logo Viewer

Contact: goldberg (at) rostlab.org

Project ideas

Multiple Sequence Alignment (MSA) Viewer

Rationale

Multiple sequence alignments (MSAs), i.e. alignments of more than two protein or nucleic acid sequences, are used to investigate the evolutionary relationships between those sequences. Researchers use them to explore numerous questions, such as identifying conserved active sites and structural domains; training HMMs (Hidden Markov Models) to search for other members of the gene family when new sequencing is carried out; locating conserved lengths that may indicate important spatial structural roles; and building phylogenetic trees. The alignment step itself is algorithmic; however, the automatically generated output needs to be checked manually because none of the available computer programs is as good as an experienced scientist. Hence the need for visualization.

Approach

The goal is to implement a client-side MSA viewer accessible from the most common web browsers (Chrome, Firefox). It must take PIR or CLUSTAL alignments as input (optionally MSF and BLC). The display itself should support different coloring schemes (Zappo, Taylor, hydrophobicity, and percent identity threshold) and visualization of particular sequence features (domains, trans-membrane, signal, active sites, etc.). Users will need to be able to choose the coloring scheme and hide/show (expand/collapse) selected rows of the alignment. The API must support integrating this widget with other BioJS components, particularly JS-Paint, which is a phylogenetic tree viewer.
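
To make the input and coloring requirements more concrete, here is a minimal sketch of reading a CLUSTAL alignment and computing the per-column identity that a percent identity threshold scheme could be built on. The function names `parseClustal` and `columnIdentity` are hypothetical and not part of any existing BioJS API; the parser assumes the common layout of a `CLUSTAL` header line followed by blocks of `name sequence` lines, and it ignores optional trailing residue counts.

```javascript
// Hypothetical helper: parse CLUSTAL-formatted text into a Map of name -> aligned sequence.
function parseClustal(text) {
  const sequences = new Map();
  for (const line of text.split(/\r?\n/)) {
    // Skip the header, blank lines, and conservation lines (which start with whitespace).
    if (line.startsWith("CLUSTAL") || line.trim() === "" || /^\s/.test(line)) continue;
    const [name, chunk] = line.trim().split(/\s+/);
    if (!chunk) continue;
    // Alignment blocks are interleaved, so append each chunk to the sequence seen so far.
    sequences.set(name, (sequences.get(name) || "") + chunk);
  }
  return sequences;
}

// Hypothetical helper: for each column, the fraction of sequences that share
// the most common residue. Gap characters ("-") are never counted as a residue.
function columnIdentity(sequences) {
  const rows = [...sequences.values()];
  const length = rows.length ? rows[0].length : 0;
  const identity = [];
  for (let col = 0; col < length; col++) {
    const counts = {};
    for (const row of rows) {
      const residue = row[col];
      if (residue === undefined || residue === "-") continue;
      counts[residue] = (counts[residue] || 0) + 1;
    }
    const best = Math.max(0, ...Object.values(counts));
    identity.push(best / rows.length);
  }
  return identity;
}
```

A viewer could then color only those columns whose identity is at or above a user-chosen threshold, which keeps the scheme independent of the rendering code.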

Challenges

Providing a smooth intuitive experience for the biologists viewing the MSA
Interacting with the server side data repositories for accessing a hybrid of MSA data and descriptive data (e.g. function, active sites, etc.)
Rapid, responsive integration with other BioJS components

Involved toolkits or projects

PANTHER database API
UniProt web services
OWL-API
BioJS

Degree of difficulty and needed skills

Medium difficulty
Knowledge of JavaScript
Familiarity with OWL helpful

Involved developer communities

The developer communities will include the BioJS initiative, the Gene Ontology Consortium, and the Berkeley Bioinformatics Open-source Projects. The final repository of the project would depend in part on design decisions led by the student coder.

Mentors

Suzanna Lewis, Seth Carbon, Ian Sillitoe

Sequence Logo Viewer

Rationale

Conservation patterns in biological sequences (DNA, RNA or proteins) often reveal important functional characteristics such as binding sites, active sites or other regulatory elements. Given a multiple sequence alignment, conservation patterns are commonly presented as a stack of letters drawn for every position in the alignment, highlighting the information content of each letter in the alignment. Sequence logos are widely used in the life sciences, yet there is no software available for reuse by the community.

Approach

Within this project you will implement graphical software for visualizing conservation patterns in a set of sequences. The software will first calculate a multiple sequence alignment and then, based on the distribution of letters in the alignment, generate a highly customizable sequence logo that can be scrolled, zoomed and closely inspected at every position.
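
As an illustration of how letter heights in a logo can follow from the distribution of letters, the sketch below computes the information content of a single alignment column using the usual Shannon-entropy formulation (stack height in bits = log2 of the alphabet size minus the column entropy, with each letter scaled by its frequency). The function name `letterHeights` is hypothetical. The sketch assumes a uniform background distribution and omits the small-sample correction, so it does not address the background-frequency challenge listed below.

```javascript
// Hypothetical helper: letter heights (in bits) for one alignment column.
// `column` is a string with one character per sequence; `alphabetSize` is 4 for
// DNA/RNA or 20 for proteins. Assumes a uniform background distribution.
function letterHeights(column, alphabetSize) {
  const residues = [...column].filter((c) => c !== "-"); // ignore gaps
  const counts = {};
  for (const r of residues) counts[r] = (counts[r] || 0) + 1;

  // Shannon entropy of the observed residue frequencies.
  let entropy = 0;
  for (const n of Object.values(counts)) {
    const p = n / residues.length;
    entropy -= p * Math.log2(p);
  }

  // Total stack height is the information content of the column.
  const information = Math.log2(alphabetSize) - entropy;

  // Each letter takes a share of the stack proportional to its frequency.
  const heights = {};
  for (const [r, n] of Object.entries(counts)) {
    heights[r] = (n / residues.length) * information;
  }
  return heights;
}
```

A fully conserved DNA column gives a single letter of height 2 bits, while a column with all four bases equally represented gives letters of height 0.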

Challenges

Understanding of biological data and multiple sequence alignments (MSA)
Construction of the background frequencies from the universe of amino acid and DNA sequences
Rapid, responsive integration with other BioJS components (e.g. MSA Viewer)

Degree of difficulty and needed skills

Medium difficulty
Knowledge of JavaScript
Familiarity with biological sequence databases helpful

More information

Mentors

Tatyana Goldberg, Guy Yachdav

Integrate BioJS into an academic publishing workflow

Rationale & Approach

eLife publishes one of the premier open access journals in the life sciences. We want to bring the great work that BioJS have done into the publishing platform, but we also want to do so in a way that can be of benefit to other publishers too. We propose to work on creating a Drupal module that can integrate seamlessly with BioJS components, and that can be used as a way to install BioJS on publisher sites that use Drupal. We will investigate how to make data in published papers interactive by using BioJS, by understanding the full life cycle of an academic paper, from Word submission, through XML rendering and final display on the web.

Degree of difficulty and needed skills

Medium difficulty
Knowledge of JavaScript
Familiarity with Drupal

Mentors

Ian Mulvany, Rafael Jimenez

Phylogenetic Tree Viewer

Rationale

In bioinformatics and computational biology, the evolution of genes and species is often represented using dendrograms called phylogenetic trees. As the history of a gene family can be complex (including many gene duplications or losses at different moments of its history), its visualisation can become a difficult task. A good, accurate visualisation of phylogenetic trees makes these trees easier to understand and interpret, helping to reveal the mechanisms that shape the evolution of a specific set of genes/species. Current web technologies provide a great framework for this kind of representation, and we envisage a great opportunity for someone interested in coding a visually appealing and useful JavaScript plug-in to represent trees on the web. As a primer, we have started a BioJS component, called treeWidget, to visualise phylogenetic trees on the web. Through its modularity, treeWidget can be easily customized to add sequence information to the displayed tree, e.g. protein domains and alignment conservation patterns. In this project we propose to extend the idea behind treeWidget by adding a range of functionalities that improve the usability of the plug-in across different kinds of tree visualization.

Approach

The goal is to build a generic plugin to view and interact with (phylogenetic) trees on HTML5/CSS3 compliant browsers. The plugin will show the tree layout separate from any annotation (e.g. domain architecture for each leaf, see picture 1 for an example). This makes it possible to work on the tree and the annotation independently. Eventually, the tree plugin could cover many of the use cases for displaying tree-like information on websites and can be a valuable tool in computational biology.

As a generic/reusable plug-in, the component should offer different functionalities exposed via its own API. Some basic functionalities include: deleting/inserting nodes, swapping branches, selecting a subset of leaves/nodes, animating between different tree layouts (e.g. expanding a small tree into a bigger one), etc. In many use cases, the tree serves as an anchor to display different kinds of information associated with its nodes. For this reason the API should also expose methods to locate nodes, extract metadata (e.g. a node's "x" and "y" location in the layout), traverse the tree, etc. This will help integration with different ways of annotating the nodes. The existing BioJS tree component can be used as a basis (Biojs.Tree.html).
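
The kind of API described above could look roughly like the following sketch, which works on a plain nested object of the form `{ name, children }`. All function names here (`traverse`, `findNode`, `leafNames`, `swapBranches`) are hypothetical illustrations and are not taken from the existing BioJS tree component; layout coordinates and animation are left out.

```javascript
// Hypothetical helper: pre-order traversal, calling visit(node, depth) for every node.
function traverse(node, visit, depth = 0) {
  visit(node, depth);
  for (const child of node.children || []) {
    traverse(child, visit, depth + 1);
  }
}

// Hypothetical helper: return the first node (in pre-order) with the given name, or null.
function findNode(root, name) {
  let found = null;
  traverse(root, (node) => {
    if (found === null && node.name === name) found = node;
  });
  return found;
}

// Hypothetical helper: names of all leaves, in display order.
function leafNames(root) {
  const names = [];
  traverse(root, (node) => {
    if (!node.children || node.children.length === 0) names.push(node.name);
  });
  return names;
}

// Hypothetical helper: swap branches by reversing the order of a node's children in place.
function swapBranches(node) {
  if (node.children) node.children.reverse();
}
```

Keeping these operations independent of any drawing code is one way to let annotation components query the tree without knowing how it is rendered.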

Challenges

Tree viewer implementation that is reusable in different scenarios/projects (e.g. TreeFam, Ensembl).
Interactive and dynamic display to allow the user to follow the different tree layouts.
Allow the user to switch between different annotations.

Involved toolkits or projects

D3, JavaScript, HTML5, CSS3
TreeFam tree viewer (BioJS tree: https://www.ebi.ac.uk/Tools/biojs/registry/Biojs.Tree.html )

Degree of difficulty and needed skills

Medium difficulty
Knowledge of JavaScript
Previous knowledge of D3 would be advantageous but not essential

Involved developer communities

The developer communities will include the BioJS initiative
D3

Mentors

Fabian Schreiber, Miguel Pignatelli

BAM File Viewer

Rationale

A BAM file (.bam) is a binary version of a SAM file that contains sequence alignment data. BAM is a key format for dealing with Next Generation Sequencing (NGS) data. Visualization of data in BAM files is crucial for interpretation of read alignments, variants, gene expression, etc. Some web tools have been developed to visualise BAM data on the web, such as GBrowse [1], JBrowse [2], TGAC Browser [3], etc. The problem with these visualisation solutions is that they are not necessarily easily extendable or documented. The BioJS visualisation framework builds on a common minimum standard for development of JavaScript functionality that makes it easy to share, reuse and create web components for visualisation of biological data. We propose to develop a BioJS component that can cope with the visualisation of BAM files.

Figure: A screenshot of the TGAC Browser visualising paired-end read data from a BAM file. The first pair is shown as blue and the second as brown. Non-paired reads are shown in orange. The grey lines in the middle indicate introns.

[1] Donlin MJ: Using the Generic Genome Browser (GBrowse). Curr Protoc Bioinformatics. John Wiley and Sons, Inc., 2009.
[2] Skinner ME, Uzilov AV, Stein LD, et al.: JBrowse: A next-generation genome browser. Genome Res. 2009;19(9): 1630-1638.
[3] Thanki AS, Bian X, Davey RP, Caccamo M: TGAC Browser: visualisation solutions for big data in the genomic era. 2013.

Approach

The goal is to implement a client-side BAM viewer accessible from the most common web browsers (Chrome, Firefox). The component should take as input BAM or SAM files, which can then be read and processed by the server. The server can then send the formatted data to the client web browser. Users will need to be able to move around and zoom in and out to visualise a particular area of a reference genome. An Application Programming Interface (API) should be able to support the integration of this widget with other BioJS components such as wigExplorer [1].
We currently have a working system for visualisation of BAM files with JavaScript, implemented by the TGAC Browser. This system, however, is not independently reusable. We propose to adapt the TGAC Browser's code to create a JavaScript BAM viewer conforming to the BioJS standard. Users should be able to use this newly developed component with little programming know-how.
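
To give a feel for the data such a component would handle, the sketch below parses a single text SAM record and selects the reads overlapping a genomic window, which is the basic query behind moving around and zooming. It is not the TGAC Browser's code, and the names `parseSamLine` and `readsInRegion` are hypothetical. It relies only on the mandatory tab-separated columns of the SAM format (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, ...) and does not read binary BAM, which would need decompression and index handling.

```javascript
// Hypothetical helper: parse one SAM text line into a small read object.
// Returns null for header lines (starting with "@") and malformed records.
function parseSamLine(line) {
  if (line.startsWith("@")) return null;
  const fields = line.split("\t");
  if (fields.length < 11) return null; // SAM has 11 mandatory columns

  const start = Number(fields[3]); // POS: 1-based leftmost mapped position

  // Sum the CIGAR operations that consume reference bases (M, D, N, =, X).
  let referenceLength = 0;
  for (const [, length, op] of fields[5].matchAll(/(\d+)([MIDNSHP=X])/g)) {
    if ("MDN=X".includes(op)) referenceLength += Number(length);
  }

  return {
    name: fields[0],
    flag: Number(fields[1]),
    reference: fields[2],
    start,
    // Reads without a usable CIGAR ("*") are treated as covering one base.
    end: start + Math.max(referenceLength, 1) - 1,
    cigar: fields[5],
  };
}

// Hypothetical helper: reads on `reference` overlapping the inclusive window [from, to].
function readsInRegion(reads, reference, from, to) {
  return reads.filter(
    (read) => read.reference === reference && read.start <= to && read.end >= from
  );
}
```

Because BAM files tend to be large, a real component would most likely ask the server for one window at a time rather than filtering all reads in the browser as this sketch does.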

Figure: IGV visualising a BAM file.

[1] Thanki A, Jimenez RC, Kaithakottil GG, Corpas M, Davey RP. (2014) wigExplorer, a BioJS component to visualise wig data. F1000Research 2014, 3:53 [v1; ref status: awaiting peer review, http://f1000r.es/2vo]

Challenges

BAM files tend to be in the order of GB, so how to handle them will be a challenge
To provide a smooth, intuitive experience for biologists viewing the alignments/reads
Rapid, responsive integration with other BioJS components (e.g. wigExplorer)

Involved toolkits or projects

D3, JavaScript, HTML, CSS
Java, Ajax

Degree of difficulty and needed skills

Medium difficulty
Knowledge of JavaScript, Basic Java and Ajax
Previous experience in web development

Involved developer communities

BioJS
D3

More information

BioJS registry
SAM tools
Java-Genomics-IO
TGAC Browser

Mentors

Anil Thanki, Manuel Corpas

Human Genetic Variation Viewer

Rationale

Have you ever wondered what makes humans unique? The key is in our genetic variation. The study of human genetic variation helps us to look backwards and forwards at human history. On the one hand, it helps us to better understand human evolution, for instance human migrations from one place to another as well as how different groups relate to each other. On the other hand, genetic variation is key to identifying disease-associated genes as well as drug targets; furthermore, understanding variations could help to accurately diagnose a patient and define the procedure to follow, and from one patient we could move to a population affected by a particular disease.

Approach

We envision a graphical hub, bringing together information related to human gene variation from different sources. Such a hub should start at the protein level, showing protein variations amino acid by amino acid on a protein sequence. The first stage consists of displaying the various levels of deleteriousness. At first glance it would look like traffic lights, but zooming in would reveal changes at the amino acid level. Then gene information can be taken into account so that additional information can be incorporated, such as genomic location, population frequency of the variant, etc. The next stage consists of including chemical information so that drug targets relating variants to specific diseases can be identified. As you zoom in, the visualization will change in order to reflect the level you are at: protein, drug targets, diseases, and so on. Variations can also be related to particular organs so that, at any stage, they can be easily located in the human body, giving us an idea of where specific variations have occurred. A partial idea for the first stage is provided here:
Figure: Human genetic viewer

Challenges

Integrating data from multiple sources and combining them in a coherent manner. They come in different sizes and formats. Scalability and maintainability should be considered because variation data is continuously growing at an exponential rate.
Understanding data related to human gene variations. No deep understanding is required, but the student is expected to become familiar with the basics in order to visualize the data in an articulate way.

Degree of difficulty and needed skills

Medium difficulty
JavaScript and JavaScript libraries, particularly for data extraction and visualization purposes
SQL and noSQL databases
Web services

Involved developer communities

JavaScript community (depending on the libraries, e.g., jQuery, D3), biological data visualization (BioJS), most probably Apache projects.

Mentors

Leyla García, Andrew Nightingale

Taxonomy Viewer

Rationale

Proteins are often called the building blocks of our bodies because they carry out many important functions such as growth and repair, cell signaling, catalyzing chemical reactions and so on. Some species have been thoroughly researched over the years, but others are under-represented while being equally important to the wider scientific community. For example, plants have been under-studied but could be instrumental in the future for issues like food security. To help researchers focus on new target species, we want to visualize the existing taxonomic diversity in the world’s largest protein information resource, UniProt.

Approach

We envision a dynamic visualization that is zoomable to show different levels of detail at different resolutions. We would start with a high-level view showing clusters of species domains present in the UniProt Knowledge Base, for example all the Eukaryota grouped together. Zooming into one species domain could show proteomes and reference proteomes visually highlighted. Zooming in further, down to proteome level, could show Swiss-Prot (manually reviewed) and TrEMBL (unreviewed) entries visually highlighted. We would also encourage investigation into 3D graphics libraries to enhance the visualization. A preliminary mockup is provided here (figure: Taxonomy tree).
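
One possible way to back such a zoomable view is to roll entry counts up a nested taxonomy and then pick the nodes to draw for a given zoom depth, as in the sketch below. The input shape (`name`, `reviewed`, `unreviewed`, `children`) and the function names `summarize` and `nodesAtDepth` are assumptions made for illustration; they do not describe the format returned by UniProt web services.

```javascript
// Hypothetical helper: return a copy of the taxonomy in which every node carries
// the total number of reviewed and unreviewed entries in its whole subtree.
function summarize(node) {
  const children = (node.children || []).map(summarize);
  const reviewed =
    (node.reviewed || 0) + children.reduce((sum, child) => sum + child.reviewed, 0);
  const unreviewed =
    (node.unreviewed || 0) + children.reduce((sum, child) => sum + child.unreviewed, 0);
  return { name: node.name, reviewed, unreviewed, children };
}

// Hypothetical helper: the nodes to display at a given zoom depth (0 = root only).
function nodesAtDepth(node, depth) {
  if (depth === 0) return [node];
  return (node.children || []).flatMap((child) => nodesAtDepth(child, depth - 1));
}
```

With the totals precomputed, each zoom level only needs the nodes at one depth, which keeps the amount of data drawn at any time small.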

Challenges

Integrating taxonomic data at different levels.
Customizing the data retrieved and its visualization depending on the zoom level.
Understanding data related to proteins and taxonomy. No deep understanding is required, but the student is expected to become familiar with the basics in order to visualize the data in an articulate way.

Degree of difficulty and needed skills

Medium difficulty
JavaScript and JavaScript libraries, particularly for data extraction and visualization purposes
SQL and noSQL databases
Web services

Involved developer communities

JavaScript community (depending on the libraries, e.g., jQuery, D3), biological data visualization (BioJS), most probably Apache projects.

Mentors

Sangya Pundir, Leyla García

Your idea!

None of our proposed ideas appeals to you? Do you have your own awesome idea? Propose it! Before applying, we recommend that you send a short description of your idea to our mailing list.
