Murder at the airport

Adapted at  from

The case

You have been called to assist in a crime scene investigation: the body of an American tourist was found at the airport. He seems to have suffered from convulsions and internal bleeding. Detectives at the crime scene found a drink carton with some sort of beverage: it still contained some fluid which looks like milk. This may be key evidence.

The fluid was sent to the lab and you receive a list of the components of the beverage. Some small molecules such as sugar were found, and four unidentified proteins were also detected. It is your job to analyse these proteins to see if you can help figure out how the tourist died. Use your computer to search for and study information about these proteins. Some of the powerful tools and databases used in bioinformatics will help you during your investigation.

Identification of the suspicious proteins

A list containing the amino acid sequences of the 4 proteins (called suspect1 through 4) is given here.

Attention: The amino acid sequences are given in the 1-letter code most scientists use. This was discussed in the introduction. If you do not know the code, you can use this list.
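If you prefer to look up the code with a short program, the following Python sketch spells out a sequence in the standard three-letter codes. The function name spell_out is an illustrative choice, not part of any tool used in this exercise.

```python
# Standard one-letter to three-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def spell_out(sequence):
    """Return the three-letter codes of a one-letter sequence, joined by dashes.

    Letters outside the 20 common amino acids are shown as "Xaa" (unknown).
    """
    return "-".join(ONE_TO_THREE.get(residue, "Xaa") for residue in sequence.upper())


print(spell_out("CHI"))  # cysteine, histidine, isoleucine
```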

You now have enough information to start your investigation. For each of the unidentified proteins you must answer these five questions:

1.    Which protein is it?

2.    From which organism does it originate?

3.    What is the function of this protein?

4.    Is this protein "guilty"? Could it be responsible for the death of the tourist? Why (not)?

5.    Does the protein have any other remarkable features?

You can use this form (Word | PDF) to write down your answers.

BLAST is a program used to compare a given protein sequence with all the protein sequences present in the UniProt database. This database aims to collect all protein sequences known to science. We will use BLAST within uniprot.org.

To help you get acquainted with UniProt and BLAST, we will take protein suspect1 and guide you through the process. Afterwards, you can continue your investigation with suspect2, 3, and 4 to find out which of the suspects is guilty.


Exercise 1:

Start a browser and go to uniprot.org. The opening screen looks like this:

 

[Screenshot: opening screen of uniprot.org]

Exercise 2:

Click on the BLAST tab in the top row. Below, you see an example containing the amino acid sequence of the first suspect. The first line should always start with a >, followed by the name of the protein (this is the so-called FASTA format). Here we chose the name suspect1.

[Screenshot: BLAST tab with the example sequence suspect1]
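To make the FASTA layout concrete, here is a minimal Python sketch that reads FASTA text into names and sequences. The record toy_protein is made up for illustration; the real suspect sequences are on the page with the unidentified proteins.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict that maps each name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line: everything after ">" is the name.
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            # A sequence line: append it to the current record.
            records[name] += line
    return records


# Hypothetical toy record, not one of the suspects.
example = """>toy_protein
MKVLAAGI
WYSLAMAA
"""

print(parse_fasta(example))
```

A sequence may be split over several lines; the parser joins them into one string per name.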

Exercise 3:

Go to the webpage that shows the list of unidentified proteins. The amino acid sequences of the 4 candidate proteins are written in the correct format for use in BLAST.

•      Copy the sequence of suspect1 into the BLAST input field.

•      Click on the button next to the input field.

You are now forwarded to the next screen, showing "Job Status: RUNNING" while BLAST is busy.

[Screenshot: BLAST job status page]

If many people are using BLAST simultaneously, this may take some time. Do not forget that you are working with a database containing millions of protein sequences.

When the search is completed, a table showing the results appears, which looks like the one given here (the numbers may vary because of UniProt updates):

 

[Screenshot: BLAST results with graphical overview]

 

This output means that 250 hits have been found in the UniProt database. In other words, at least 250 proteins have an amino acid sequence similar to suspect1.

The graphical overview gives you the so-called hitlist, the list of protein sequences found in the UniProt database that are similar to suspect1. The protein at the top is the most similar to your query sequence, suspect1. The code of the sequence is CASA1_BOVIN.


The green and yellow bars show how much of the query sequence and how much of the match sequence is similar. Here, the complete query sequence (see length of the blue line at the top) is covered by similar matching sequences (see length of green and yellow bars). The colored bar at the top gives you the legend for the level of sequence similarity. How similar is the best hit?

Exercise 4:

Scroll down to see some more detailed information about the individual matches:

[Screenshot: detailed list of the individual matches]

 

The number under E-value indicates how many hits of this quality you would expect to find purely by chance. The smaller this number, the more reliable the result. Here it is extremely small (1.0 × 10^-112), so the best result is reliable.
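As background, BLAST statistics relate the E-value to the so-called bit score S' roughly as E = m × n × 2^(-S'), where m is the length of the query and n the total length of the database. The Python sketch below only illustrates this relationship; the numbers are hypothetical, and real BLAST programs use corrected "effective" lengths, so the values shown on the UniProt site will differ.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2 ** (-bit_score).

    Simplified sketch; real BLAST uses effective (corrected) lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical query of 200 residues against 10 million database residues.
for score in (20, 50, 100):
    print(score, expected_chance_hits(score, 200, 10_000_000))
```

The higher the score, the smaller the E-value; a larger database increases the E-value for the same score.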

From left to right, every line shows the following output for each one of the hits:

•      Alignment: Again the graphical overview of the match.

•      Entry: This is the unique code of the protein sequence from the database.

•      Entry name: This is the more readable name of the protein sequence from the database. The first part of the name (CASA1) is the name of the protein, the second part contains the (shortened) name of the organism, in this case BOVIN (bovine).

•      Protein names: More detailed name of the protein.

•      Organism: Scientific and common name of the organism.

•      Length: Length of the sequence overlap (alignment).

•      Identity: The percentage of residues of the query sequence that are identical to the matched sequence.

•      Score and E-value: Values calculated by BLAST to give you an idea about the reliability of the result. The best hit is always on top.

•      Gene name: Name of the gene (piece of DNA sequence) that encodes this protein.
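The structure of an entry name described in the list above can be taken apart with a few lines of Python. This is a minimal sketch; split_entry_name is an illustrative helper, not a UniProt function.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'CASA1_BOVIN' into (protein, organism) codes."""
    protein, _, organism = entry_name.partition("_")
    if not protein or not organism:
        raise ValueError("expected a name of the form PROTEIN_ORGANISM")
    return protein, organism


print(split_entry_name("CASA1_BOVIN"))
```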

Exercise 5:

Let's have a look at the first hit. Click on the green bar in the line with CASA1_BOVIN. Analyse the results.

[Screenshot: alignment of suspect1 with CASA1_BOVIN]

You now see a so-called alignment of the two amino acid sequences. Query is the sequence that you have entered, in this case suspect1. P02662 is the protein sequence from the database that looks most like suspect1. The line between these two sequences contains the letters of all amino acids that are identical in Query and P02662. If not all amino acids are identical, you will see gaps (you can take a closer look later).
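The middle line of such an alignment can be sketched in Python as follows. The two short sequences are made up for illustration, and the percentage is taken over the aligned columns, which is an assumption about how the value is normalised.

```python
def identity_line(query, match):
    """Return the middle line: the letter where both rows agree, else a space."""
    if len(query) != len(match):
        raise ValueError("aligned sequences must have the same length")
    return "".join(q if q == m and q != "-" else " " for q, m in zip(query, match))


def percent_identity(query, match):
    """Percentage of aligned columns in which both residues are identical."""
    line = identity_line(query, match)
    if not line:
        raise ValueError("empty alignment")
    identical = sum(1 for symbol in line if symbol != " ")
    return 100.0 * identical / len(line)


# Hypothetical toy alignment, not one of the suspects.
query = "MKVLAAGI"
match = "MKILAAGV"
print(query)
print(identity_line(query, match))
print(match)
print(percent_identity(query, match))
```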


As you can see, the sequence of suspect1 is 100% identical to the sequence of CASA1_BOVIN, the Alpha-S1-casein protein (all 199 amino acids are identical). Thus, you have successfully identified one of the four proteins. Click on P02662 to go to the database record and see all known data about this protein, in order to find out if this protein could be involved in the death of the tourist.

[Screenshot: UniProt database record for P02662]

Exercise 6:

The database record for CASA1_BOVIN contains everything you need to know to answer the five questions about your protein:

1.    Which protein is it?

2.    From which organism does it originate?

3.    What is the function of this protein?

4.    Is this protein "guilty"? Could it be responsible for the death of the tourist? Why (not)?

5.    Does the protein have any other remarkable features?

If you do not know where to look, use the hints below:

•      What information is given for Protein name?

•      What information is given for Organism?

•      What information relevant for our investigation is given under Comments, especially Function?

So, what is your conclusion? Is protein suspect1 involved in the death of the tourist, yes or no?

If you want to go back to your result later today, then bookmark the URL of your BLAST result before you run new searches.

Exercise 7: Now analyse the other 3 suspicious proteins by yourself!

Run a BLAST search with the sequences of the other 3 proteins and write down your results.

What is your final conclusion about the murder? How did the victim die?

A closer look...

Well done, you have now established how the victim died; the police will have to figure out the rest. In the process you have learned a lot about a specific part of bioinformatics: analysing protein sequences using bioinformatics tools and databases. This brings us to the end of our murder investigation.

If you have some time left and if you are interested, you can see some more of the many exciting possibilities of bioinformatics. In this additional part, you will find a few exercises that further explore protein sequences, important amino acids, homology between similar organisms, etc.


 

Exercise 8:

You may already know that proteins from different organisms with similar amino acid sequences (called homologues) often have a similar function. We will now look closer into this, using the four proteins that you have already analysed.

Go back to the BLAST results of suspect4 (or rerun the search).

The first hit was LACB_BOVIN. Suspect4 is a β-lactoglobulin from bovine. Now look for β-lactoglobulin of the goat (Latin name: Capra hircus). You can see that the amino acid sequences of the two proteins are 96% identical. Only a few amino acids are different.

Find out how many amino acids are different between the goat and the bovine β-lactoglobulin.

[Screenshot: alignment of bovine and goat β-lactoglobulin]

Once again, you can see that the line between the two sequences contains all amino acids that are identical between the Query and the matched sequence (P02756). Nevertheless, some spots are empty or contain a + sign.
For example, on the first line, you see an L (leucine) in the bovine sequence and an I (isoleucine) in the goat sequence. The amino acids leucine and isoleucine are very much alike, which is why there is a + sign in between. At some places, the middle row is empty, for example right before the end of the sequence. There, the bovine sequence contains an E (glutamate) and the goat sequence contains a G (glycine). Glycine and glutamate are very different, hence the empty spot.
If you use this representation when you compare two sequences, you can easily and quickly find regions where two proteins are much alike, and where they differ a lot from each other.
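The three kinds of symbols in the middle line can also be counted with a short program. The Python sketch below assumes the middle line has been copied with all its spaces intact; the example string is made up for illustration.

```python
def summarise_midline(midline):
    """Count identical, similar and different positions in an alignment middle line.

    A letter means identical, "+" means similar, and a space means different
    (a space is also shown where one of the sequences has a gap).
    """
    counts = {"identical": 0, "similar": 0, "different": 0}
    for symbol in midline:
        if symbol == "+":
            counts["similar"] += 1
        elif symbol == " ":
            counts["different"] += 1
        else:
            counts["identical"] += 1
    return counts


# Hypothetical middle line of six columns.
print(summarise_midline("LI+T M"))
```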

You can also color the alignment by amino acid properties to further highlight similarities and differences. This happens when you click the buttons under "Amino acid properties".

Exercise 9:

Using the BLAST results list, find how many amino acids are identical between the bovine and the dog (Latin: Canis familiaris) β-lactoglobulin.

Exercise 10:

Look at the BLAST output comparing bovine and dog β-lactoglobulin.

1     LIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQK    60 Query

      ++V +TM+ LD+QKVAGTW+S+AMAASDISLLD+++APLRVY++EL+PTP+ +LEI+L+K      

1     IVVPRTMEDLDLQKVAGTWHSMAMAASDISLLDSETAPLRVYIQELRPTPQDNLEIVLRK    60 P33685

 

61    WENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQ   120 Query

      WE+G CA++K++AEKT++PA FKI+ + EN++ +LDTDY  YL FC  N+  P+QSL CQ      

61    WEDGRCAEQKVLAEKTEVPAEFKINYVEENQIFLLDTDYDNYLFFCEMNADAPQQSLMCQ   120 P33685

 

121   CLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI   162 Query

      CL RT EVD+E +EKF++ALK LP+H++L  NPTQ EEQC I      

121   CLARTLEVDNEVMEKFNRALKTLPVHMQL-LNPTQAEEQCLI   161 P33685

All important amino acids of β-lactoglobulin are identical in the two animals. They remained identical (were conserved) during evolution. For example, a disulfide bond (a bond between two sulfur atoms) exists between Cys66 and Cys160 in the bovine protein. These two cysteines are marked in the BLAST output. You can see that in the dog protein sequence, two cysteines are located at exactly the same places in the alignment. Their numbers are now 66 and 159 (instead of 66 and 160). At position 150 of the bovine protein sequence, there is a Ser (S) which is missing in the dog protein. This is called a deletion, caused by 3 nucleotides that have disappeared from the dog gene in the course of evolution. Therefore, the dog protein has 1 amino acid less.
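The shift in numbering caused by the deletion can be followed with a small Python sketch. It uses the last block of the alignment above (both rows start at residue 121); the helper residue_numbers is an illustrative name.

```python
def residue_numbers(aligned, start):
    """Return the residue number for each alignment column, or None at a gap."""
    numbers = []
    position = start
    for symbol in aligned:
        if symbol == "-":
            numbers.append(None)  # a gap does not use up a residue number
        else:
            numbers.append(position)
            position += 1
    return numbers


# Last block of the bovine/dog alignment, copied from the BLAST output above.
query = "CLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI"  # bovine, starts at 121
match = "CLARTLEVDNEVMEKFNRALKTLPVHMQL-LNPTQAEEQCLI"  # dog (P33685), starts at 121

query_numbers = residue_numbers(query, 121)
match_numbers = residue_numbers(match, 121)

# Columns where both sequences have a cysteine, as (bovine, dog) numbers.
conserved_cys = [
    (q, m)
    for a, b, q, m in zip(query, match, query_numbers, match_numbers)
    if a == b == "C"
]

# Bovine residues that have no partner in the dog sequence.
deleted = [q for b, q in zip(match, query_numbers) if b == "-"]

print(conserved_cys)
print(deleted)
```

The second cysteine pair is reported as bovine 160 and dog 159, and the only bovine residue without a partner is number 150, in line with the text above.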

 

About this document

This exercise is based on the work of Bioinformatics@school, but has been modified to use the UniProt site for sequence searches. We thank the Bioinformatics@school project for their great work!

Bioinformatics@school was developed by the Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Nijmegen Medical Centre and the Netherlands Bioinformatics Centre (NBIC).