Supporting online material for:
Better prediction of sub-cellular localization by combining evolution and structure

Rajesh Nair & Burkhard Rost


Materials and Methods for 'Supporting online material'

Linear analysis of composition vectors. A principal component analysis (PCA) was performed on the test proteins to determine whether the data set clusters according to sub-cellular localization group. The total composition vector, $c_i$, for a protein i is defined as the row vector $c_i = (c_{i1}, \ldots, c_{i20})$, where j = 1,…,20 indicates the amino acid type. The composition of the jth amino acid, $c_{ij}$, is defined as

$$c_{ij} = \frac{n_{ij}}{\sum_{k=1}^{20} n_{ik}}$$

where $n_{ij}$ is the number of residues of amino acid type j in protein i. The surface composition vectors were calculated similarly, with $n_{ij}$ now representing the number of residues of type j at the surface of the protein. We used these composition vectors to define a sample variance-covariance matrix, S, as follows:

$$S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)(c_{ik} - \bar{c}_k)$$

where

$$\bar{c}_j = \frac{1}{n} \sum_{i=1}^{n} c_{ij}$$

is the average composition of the jth amino acid type over the n proteins in the data set. The principal components of the set of composition vectors are then the eigenvectors of S (e.g. see Anderberg 1973). The composition vector for each protein was then projected onto the plane defined by the first two principal components using the standard inner product. This provides a two-dimensional representation of the clustering of composition vectors, as shown in Fig. 2.
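The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the toy sequences, the n - 1 normalization of the covariance matrix, and the choice to project mean-centered vectors are all assumptions made for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def composition_vector(sequence):
    """Return the 20-dimensional amino acid composition of one protein."""
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return counts / total  # fraction of residues of each type


def pca_project(compositions, n_components=2):
    """Project composition vectors onto the leading principal components."""
    X = np.asarray(compositions, dtype=float)
    centered = X - X.mean(axis=0)  # subtract the average composition
    # Sample variance-covariance matrix (n - 1 normalization is an assumption).
    S = centered.T @ centered / (X.shape[0] - 1)
    # eigh is appropriate for a symmetric matrix; it returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Inner product of each centered vector with the leading eigenvectors.
    projection = centered @ eigenvectors[:, :n_components]
    return projection, eigenvalues


# Toy sequences, invented purely for illustration.
toy_sequences = ["MKKLLAAGG", "DDEEKKRRS", "GGSSTTPPA", "MLLVVIIFF"]
compositions = [composition_vector(s) for s in toy_sequences]
projection, eigenvalues = pca_project(compositions)
```

Each row of `projection` holds the two plotting coordinates of one protein. Centering before projection only shifts the point cloud; it does not change how the points cluster.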

 

Results for 'Supporting online material'

Linear separation by principal component analysis not enough. The overall amino acid composition (Fig. 5A), the surface composition (Fig. 5B), the three-state secondary structure composition (Fig. 5C), and the combined sequence-structure composition (Fig. 5D) all showed some correlation with sub-cellular localization in two dimensions (the first two principal components). However, in contrast to previous studies (Andrade et al. 1998), we could not fully discriminate between the three major classes by a linear separation on any single feature. The full principal component analysis (Methods) revealed that the eigenvalues of the first eight principal components were of similar magnitude for overall amino acid composition. Hence, projecting the composition vectors onto two dimensions resulted in a considerable loss of information. To fully resolve the signal for sub-cellular localization present in the different global composition features, we implemented a neural-network-based machine-learning algorithm.
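The information loss can be made concrete by asking what fraction of the total variance the first two eigenvalues account for. The sketch below uses a hypothetical eigenvalue spectrum (eight comparable leading values followed by small ones); these numbers are invented for illustration and are not the values obtained in this study.

```python
def retained_variance(eigenvalues, k=2):
    """Fraction of the total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return sum(ordered[:k]) / total


# Hypothetical spectrum, NOT taken from the paper: eight leading eigenvalues
# of similar magnitude, followed by twelve much smaller ones.
illustrative = [1.0, 0.95, 0.9, 0.9, 0.85, 0.8, 0.8, 0.75] + [0.05] * 12
fraction = retained_variance(illustrative, k=2)
```

With a spectrum of this shape, the two-dimensional projection keeps only about a quarter of the variance, which is why a flat plot can hide structure that a classifier using all dimensions can still exploit.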


Fig. 5: Maximal linear separation of sub-cellular localization. Given are the projections onto the first two principal components of: (A) the overall amino acid composition vectors, (B) the surface composition vectors (from DSSP), (C) the three-state secondary structure composition vectors (from DSSP) and (D) the product sequence-structure composition vectors (from DSSP) for the proteins in the test set. For all four features the composition vectors have been projected onto the plane defined by the first two principal components, represented by the x- and y-axes, respectively. The axis labels indicate the amino acid/secondary structure types that contribute most significantly to the two principal components. The extra-cellular class (open plusses) is the best resolved and the cytoplasmic class (shaded circles) the worst resolved for all four features.

 


Fig. 6: Better prediction through combining neural networks. Combining the various sources of information (amino acid composition, surface composition and amino acid composition separated into the three secondary structure states) yielded the best results. Prediction accuracy increased by up to three percentage points over simple first-level networks. The different sources of information were combined in two ways: using a statistical jury decision on predictions from the first-level networks (marked Sum in the figure), and using predictions from the first-level networks as input to a second-level neural network (marked Net in the figure). (A) For the single networks (i.e. using only single sequences and no evolutionary information), combining the networks in a simple jury (SumObs and SumPrd) performed as well as the neural network combinations (NetObs and NetPrd). Here, Obs and Prd denote networks based on the observed and predicted surface and secondary structure of the protein, respectively. The standard error in prediction accuracy was approximately 0.25 percentage points. (B) For profile-based networks using evolutionary information, the second-level neural network combinations performed best. For the combination networks, using profiles rather than single sequences as input improved prediction accuracy by up to 2%. Profile-based NetObs networks (the final LOC3DnetObs system) gave the best overall localization prediction (accuracy over 65%).
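One plausible reading of the jury decision (the Sum variants) is that the per-class outputs of the first-level networks are added and the class with the highest total wins. The sketch below illustrates that reading only; the class order, the function name and the output values are hypothetical, and the exact jury rule used in the study may differ.

```python
CLASSES = ["nuc", "ext", "cyt", "mit"]  # assumed output order, for illustration


def jury_decision(network_outputs, classes=CLASSES):
    """Sum per-class scores over several networks and pick the top class."""
    summed = [0.0] * len(classes)
    for outputs in network_outputs:
        if len(outputs) != len(classes):
            raise ValueError("each network must give one score per class")
        for index, score in enumerate(outputs):
            summed[index] += score
    best = max(range(len(classes)), key=lambda index: summed[index])
    return classes[best], summed


# Invented outputs of three first-level networks for a single protein.
first_level = [
    [0.50, 0.10, 0.40, 0.00],  # overall composition network
    [0.30, 0.20, 0.45, 0.05],  # surface composition network
    [0.25, 0.15, 0.50, 0.10],  # secondary structure composition network
]
label, summed = jury_decision(first_level)
```

In this toy case the first network alone would vote for the nucleus, but the summed scores favor the cytoplasm. A second-level network (the Net variants) would instead take the concatenated first-level outputs as its input and learn the combination.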

 

Table 5: Confusion matrix for the LOC3DnetObs system. *

 

prd \ obs    nuc   ext   cyt   mit   lys   ret   vac   gol   oxi   SUMprd
nuc           95    11    21     1     0     1     0     0     1      130
ext            5    78    10     2     2     0     3     0     0      100
cyt           19    10    52    10     4     1     1     1     2      100
mit            5     0    11    10     0     1     0     1     1       29
SUMobs       124    99    95    23     6     3     4     2     4      359

 

* Abbreviations used:

obs: annotated localization;

prd: predicted localization;

Localizations: nuc: nucleus; ext: extracellular space; cyt: cytoplasm; mit: mitochondria; lys: lysosome; ret: endoplasmic reticulum; vac: vacuoles; gol: Golgi apparatus; oxi: peroxisome;

SUMobs: sum over all proteins annotated in a particular compartment;

SUMprd: sum over all proteins predicted in a particular compartment.

Note 1: The numbers give the proteins used in the four-fold cross-validation experiment for which LOC3DnetObs assigned any localization.

Note 2: The diagonal entries are the correctly predicted proteins.
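As a worked check, the overall accuracy can be recomputed from the cell values of Table 5. The sketch below assumes that rows are predicted classes and columns are annotated classes, as the SUMprd and SUMobs labels suggest. The helper names are hypothetical, and totals are recomputed from the cells rather than read from the table margins.

```python
LABELS = ["nuc", "ext", "cyt", "mit", "lys", "ret", "vac", "gol", "oxi"]

# Rows: predicted class; columns: annotated class, in the order of LABELS.
CONFUSION = {
    "nuc": [95, 11, 21, 1, 0, 1, 0, 0, 1],
    "ext": [5, 78, 10, 2, 2, 0, 3, 0, 0],
    "cyt": [19, 10, 52, 10, 4, 1, 1, 1, 2],
    "mit": [5, 0, 11, 10, 0, 1, 0, 1, 1],
}

# Total number of proteins and number on the diagonal (correct predictions).
total = sum(sum(row) for row in CONFUSION.values())
correct = sum(row[LABELS.index(label)] for label, row in CONFUSION.items())
overall_accuracy = correct / total


def fraction_of_predictions_correct(label):
    """Correct predictions for a class divided by all predictions of it."""
    row = CONFUSION[label]
    return row[LABELS.index(label)] / sum(row)


def fraction_of_annotated_recovered(label):
    """Correct predictions for a class divided by all proteins annotated as it."""
    column = LABELS.index(label)
    annotated = sum(row[column] for row in CONFUSION.values())
    return CONFUSION[label][column] / annotated
```

The diagonal sums to 235 of 359 proteins, i.e. an overall accuracy of about 65.5%, in line with the accuracy over 65% quoted for LOC3DnetObs in Fig. 6.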