Murder at the airport
You have
been called to assist in a crime scene investigation: the body of an American tourist
was found at the airport. He seems to have suffered from convulsions and
internal bleeding. Detectives at the crime scene found a drink carton with some
sort of beverage: it still contained some fluid which
looks like milk. This may be key evidence.
The fluid
was sent to the lab and you receive a list of the components of the beverage.
Some small molecules such as sugar were found, but also four unidentified
proteins were detected. It is your job to analyse
these proteins to see if you can help figure out how the tourist died. Use
your computer to search for and study information about these proteins. Some of
the powerful tools and databases used in bioinformatics will help you during
your investigation.
A list
containing the amino acid sequences of the 4 proteins (called suspect1 through
4) is given here.
Attention: The amino acid sequences are given
in the 1-letter code most scientists use. This was discussed in the
introduction. If you do not know the code, you can use this list.
You now
have enough information to start your investigation. For each of the
unidentified proteins you must answer these five questions:
1. Which protein is it?
2. From which organism does it
originate?
3. What is the function of this
protein?
4. Is this protein "guilty"?
Could it be responsible for the death of the tourist? Why (not)?
5. Does the protein have any other
remarkable features?
You can use
this form (Word | PDF) to write down your answers.
BLAST is a
program used to compare a given protein sequence with all the protein sequences
present in the UniProt database. This
database contains all protein sequences known to science. We will use BLAST
within uniprot.org.
To help
you get acquainted with UniProt and BLAST, we will
take protein suspect1 and guide you through the process. Afterwards, you can
continue your investigation with suspect2, 3, and 4 to find out which of the
suspects is guilty.
Start a
browser and go to uniprot.org. The opening screen looks like this:
Click on
the BLAST tab in the top row. Below, you see an example containing the amino
acid sequence of the first suspect. The first line
should always start with a >, followed by the name of the
protein (this is the so-called FastA format). Here we
chose the name suspect1.
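The FASTA layout described above (a header line starting with >, followed by the sequence) is simple enough to read with a few lines of code. The sketch below is a minimal reader, not a complete parser. The function name parse_fasta and the short example record are made up for illustration; the example is not the real suspect1 sequence.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence.

    Minimal sketch: the name is the first word after '>', and all
    following lines up to the next '>' are joined into one sequence.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Hypothetical two-line record, only to show the format.
example = """>demo1
MKLLILTCLV
AVALA
"""

records = parse_fasta(example)
for record_name, sequence in records.items():
    print(record_name, len(sequence), sequence)
```

A sequence may be split over several lines, as in the example; the reader joins them back into one string.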
Go to the
webpage that shows the list of unidentified proteins. The amino acid
sequences of the 4 candidate proteins are written in the correct format for use
in BLAST.
• Copy the sequence of suspect1 into
the BLAST input field.
• Click on the button (next to the input field).
You are now
forwarded to the next screen, showing "Job Status: RUNNING" while BLAST is
busy.
If many people
are using BLAST simultaneously, this may take some time. Do not forget that you
are working with a database containing hundreds of thousands of protein
sequences.
When the search is completed, a table showing the
results appears, which looks like the one given here (the numbers may vary
because of UniProt updates):
This output
means that 250 hits have been found in the UniProt
database. In other words, at least 250 proteins have an amino acid sequence
similar to suspect1.
The
graphical overview gives you the so-called hitlist,
the list of protein sequences found in the UniProt
database that are similar to suspect1. The protein at the top is the most
similar to your query sequence, suspect1. The code of the sequence is CASA1_BOVIN.
The green and yellow bars show how
much of the query sequence and how much of the match sequence is similar. Here,
the complete query sequence (see length of the blue line at the top) is covered
by similar matching sequences (see length of green and yellow bars). The
colored bar at the top gives you the legend for the level of sequence
similarity. How similar is the best hit?
Scroll down
to see some more detailed information about the individual matches:
The number
under E-Value shows how good the best result is: it estimates how many hits with a score this good you would expect to find purely by chance. If this
number is small (which is the case here, since it is 1.0 × 10^-112), the result is reliable.
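To see why a high-scoring hit gets such a tiny E-value, it helps to look at the textbook (Karlin-Altschul) relation used by BLAST: E = m × n × 2^(-S), where m is the query length, n the total database length and S the bit score. The sketch below applies this formula to made-up numbers. The function name and all values are assumptions for illustration, and the real program applies further corrections (for example effective lengths), so the output will not reproduce the UniProt results exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Textbook approximation E = m * n * 2**(-S); a sketch only.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical query of 200 residues against 2e8 database residues.
for score in (20, 50, 100, 400):
    print(score, expected_hits(score, 200, 2e8))
```

Every extra bit of score halves the E-value, which is why a full-length, identical match ends up with an exponent far below zero.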
From left
to right, every line shows the following output for each one of the hits:
• Alignment: Again
the graphical overview of the match.
• Entry: This is the unique code (accession number) of the protein sequence from
the database.
• Entry name: This is
the more readable name of the protein
sequence from the database. The first part of the name (CASA1) is the name
of the protein, the second part contains the (shortened)
name of the organism, in this case BOVIN (bovine).
• Protein names: More detailed name of the protein.
• Organism: Scientific and common name of the
organism.
• Length: Length of
the sequence overlap (alignment).
• Identity: The percentage
of residues of the query sequence that are identical
to the matched sequence.
• Score and E-value: Values
calculated by BLAST to give you an idea about the reliability of the result.
The best hit is always on top.
• Gene name: Name of the gene (piece of DNA
sequence) that encodes this protein.
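The Entry name convention from the list above (protein part, underscore, organism part) can be taken apart mechanically. The following is a small sketch; the function name split_entry_name is made up for this example.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'CASA1_BOVIN' into its two parts."""
    protein, separator, organism = entry_name.partition("_")
    if not separator or not protein or not organism:
        raise ValueError(f"not a PROTEIN_ORGANISM entry name: {entry_name!r}")
    return protein, organism


for name in ("CASA1_BOVIN", "LACB_BOVIN"):
    protein, organism = split_entry_name(name)
    print(f"{name}: protein code {protein}, organism code {organism}")
```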
Let's have
a look at the first hit. Click on the green bar in the line with CASA1_BOVIN. Analyse the results.
You now see
a so-called alignment of the two amino acid sequences. Query is the sequence
that you have entered, in this case suspect1. P02662 is the protein sequence from the database that
looks most like suspect1. The line between these two sequences contains the
letters of all amino acids that are identical in Query and P02662. If not all
amino acids are identical, you will see gaps (you can take a closer look later).
As you can
see, the sequence of suspect1 is 100% identical to the sequence of CASA1_BOVIN, the
Alpha-S1-casein protein (all 199 amino acids are identical). Thus, you have
successfully identified one of the four proteins. Click on P02662 to go to the
database record and see all known data about this protein, in order to find out
if this protein could be involved in the death of the tourist.
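If you prefer to retrieve the sequence of a database record from a script instead of the browser, the accession number is all you need. The sketch below assumes the UniProt REST address pattern rest.uniprot.org/uniprotkb/<accession>.fasta; check the current UniProt help pages if the request fails, because such addresses can change. The function names are made up for this example.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession):
    """Build the download address for one record (assumed URL pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession, timeout=30):
    """Download the record in FASTA format; needs an internet connection."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the address is printed here, so the example also runs offline.
print(uniprot_fasta_url("P02662"))
```

Calling fetch_fasta("P02662") should return the header line and sequence of the best hit for suspect1, provided the address pattern is still valid.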
The
database record for CASA1_BOVIN contains everything you need to know to answer the five questions about
your protein:
1. Which protein is it?
2. From which organism does it
originate?
3. What is the function of this
protein?
4. Is this protein "guilty"?
Could it be responsible for the death of the tourist? Why (not)?
5. Does the protein have any other
remarkable features?
If you do
not know where to look, use the hints below:
• What information is given for Protein name?
• What information is given for Organism?
• What information relevant for our
investigation is given under Comments, especially Function?
So, what is
your conclusion? Is protein suspect1 involved in the death of the tourist, yes
or no?
If you want
to go back to your result later today, then bookmark the URL
of your BLAST result before you run new searches.
Run a BLAST
search with the sequences of the other 3 proteins and write down your results.
What is your final conclusion about the
murder? How did the victim die?
Well done,
you have now established how the victim died; the police will have to figure
out the rest. In the process you have learned a lot about a specific part of
bioinformatics: analysing protein sequences using bioinformatics
tools and databases. This brings us to the end of our murder investigation.
If you have some time left and if you are
interested, you can see some more of the many exciting possibilities of
bioinformatics. In this additional part, you will find a few exercises that
further explore protein sequences, important amino acids, homology
between similar organisms, etc.
You may
already know that proteins from different organisms with similar amino acid
sequences (called homologues) often have a similar function. We will now
look closer into this, using the four proteins that you have already analysed.
Go back to
the BLAST results of suspect4 (or rerun the search).
The first hit was LACB_BOVIN.
Suspect4 is the bovine β-lactoglobulin. Now
look for β-lactoglobulin of the goat (Latin
name: Capra hircus). You can see that the amino acid
sequences of the two proteins are 96% identical. Only a few amino acids are
different.
Find out how many amino acids
are different between the goat and the bovine β-lactoglobulin.
Answer
Once again,
you can see that the line between the two sequences contains all amino acids
that are identical between the Query and the matched sequence (P02756). Nevertheless, some spots are
empty or contain a + sign.
For
example, on the first line, you see an L (leucine) in the bovine sequence and an I (isoleucine) in
the goat sequence. The amino acids leucine and
isoleucine are very much alike, which is why there is a + sign in between.
At some places, the middle row is left blank, for example right
before the end of the sequence. The bovine sequence contains an E (glutamate) and
the goat sequence contains a G (glycine). Glycine and glutamate are very
different, hence the empty spot.
If you use this
representation when you compare two sequences, you can easily and quickly find
regions where two proteins are much alike, and where they differ a lot from
each other.
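The middle row can be rebuilt from the two aligned sequences. BLAST prints the letter for identical residues, a + for different residues whose substitution score is positive, and a blank otherwise. The sketch below assumes the BLOSUM62 scoring matrix (a common default; the matrix actually used by your search may differ) and contains only a handful of its scores, so any pair missing from the excerpt is treated as a blank. The names BLOSUM62_EXCERPT and match_line are made up for this example.

```python
# A few BLOSUM62 scores for non-identical pairs; NOT the full matrix.
BLOSUM62_EXCERPT = {
    frozenset("LI"): 2,
    frozenset("IV"): 3,
    frozenset("LM"): 2,
    frozenset("KR"): 2,
    frozenset("DE"): 2,
    frozenset("EQ"): 2,
    frozenset("EG"): -2,
    frozenset("TP"): -1,
}


def match_line(query, subject):
    """Build the middle row for two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # alignment gap
        elif q == s:
            symbols.append(q)     # identical residue
        elif BLOSUM62_EXCERPT.get(frozenset((q, s)), 0) > 0:
            symbols.append("+")   # similar residue
        else:
            symbols.append(" ")   # dissimilar (or not in the excerpt)
    return "".join(symbols)


print(repr(match_line("ALE", "AIG")))
```

In this toy alignment the A is identical, L and I are similar, and E and G are dissimilar, which mirrors the leucine/isoleucine and glutamate/glycine cases described above.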
You can
also color the alignment by amino acid properties to further highlight
similarities and differences. This happens when you click the buttons under "Amino acid
properties".
Using the BLAST results list,
find how many amino acids are identical between the bovine and the dog (Latin: Canis familiaris) β-lactoglobulin.
Answer
Look at the
BLAST output comparing bovine and dog β-lactoglobulin.
  1 LIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQK  60 Query
    ++V +TM+ LD+QKVAGTW+S+AMAASDISLLD+++APLRVY++EL+PTP+ +LEI+L+K
  1 IVVPRTMEDLDLQKVAGTWHSMAMAASDISLLDSETAPLRVYIQELRPTPQDNLEIVLRK  60 P33685

 61 WENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQ 120 Query
    WE+G CA++K++AEKT++PA FKI+ + EN++ +LDTDY  YL FC  N+  P+QSL CQ
 61 WEDGRCAEQKVLAEKTEVPAEFKINYVEENQIFLLDTDYDNYLFFCEMNADAPQQSLMCQ 120 P33685

121 CLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI 162 Query
    CL RT EVD+E +EKF++ALK LP+H++L  NPTQ EEQC I
121 CLARTLEVDNEVMEKFNRALKTLPVHMQL-LNPTQAEEQCLI 161 P33685
All important amino acids of β-lactoglobulin are
identical in the two animals. They remained identical (were conserved) during
evolution. For example, a disulfide bond (sulfur bridge) exists between Cys66 and Cys160 in the
bovine protein. These two cysteines are marked in the BLAST
output. You can see that in the dog protein sequence, two cysteines
are located at exactly the same places in the alignment. Their numbers in the dog sequence are 66 and 159
(instead of 66 and 160), because at position 150 of the bovine protein sequence there
is a Ser (S) which
is missing in the dog protein. This is called a deletion, caused
by 3 nucleotides (one codon) that have disappeared from the dog gene in the course of
evolution. Therefore, the dog protein has one amino acid fewer.
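The shift in numbering can be checked by walking through the alignment column by column and counting residues separately for each sequence. The sketch below does this for the bovine (Query) and dog (P33685) sequences exactly as printed in the BLAST output above; the function name summarize_alignment is made up for this example. It also counts the identical positions, so you can use it to check your own answer.

```python
# Aligned sequences copied from the BLAST output; '-' marks the gap.
QUERY = (
    "LIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQK"
    "WENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQ"
    "CLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI"
)
SUBJECT = (
    "IVVPRTMEDLDLQKVAGTWHSMAMAASDISLLDSETAPLRVYIQELRPTPQDNLEIVLRK"
    "WEDGRCAEQKVLAEKTEVPAEFKINYVEENQIFLLDTDYDNYLFFCEMNADAPQQSLMCQ"
    "CLARTLEVDNEVMEKFNRALKTLPVHMQL-LNPTQAEEQCLI"
)


def summarize_alignment(query, subject):
    """Return (identical count, deletions in subject, conserved cysteines)."""
    identical = 0
    deletions = []   # (query position, query residue) facing a gap
    cysteines = []   # (query position, subject position) of shared Cys
    q_pos = s_pos = 0
    for q, s in zip(query, subject):
        if q != "-":
            q_pos += 1
        if s != "-":
            s_pos += 1
        if s == "-":
            deletions.append((q_pos, q))
        elif q == s:
            identical += 1
            if q == "C":
                cysteines.append((q_pos, s_pos))
    return identical, deletions, cysteines


identical, deletions, cysteines = summarize_alignment(QUERY, SUBJECT)
print("identical positions:", identical, "of", len(QUERY))
print("missing in dog:", deletions)
print("conserved cysteines (bovine, dog):", cysteines)
```

Every residue after the gap gets a dog number that is one lower than its bovine number, which is why the second cysteine of the bond is number 159 in the dog protein.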
This exercise is based on the work of Bioinformatics@school, but has been modified to use the UniProt
site for sequence searches. We
thank the Bioinformatics@school project for their great work!
Bioinformatics@school was developed by the Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Nijmegen Medical Centre and the Netherlands Bioinformatics Centre (NBIC).