
Title: Supporting online material
Author: Jinfeng Liu & Burkhard Rost
Quote: Proteins, 2004, vol, pages

Supporting online material
for:
CHOP proteins into structural domain-like fragments

Jinfeng Liu 1,3,4,*, & Burkhard Rost 1,2,3,*

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA
* Corresponding authors: cubic@cubic.bioc.columbia.edu, URL: http://cubic.bioc.columbia.edu/, Tel: +1-212-305-4018, Fax: +1-212-305-7932





Fig. S1
figs1.gif

Fig. S1: CHOP was robust with respect to parameter choices. The two major free parameters for CHOP are (1) the required coverage of a domain annotated by PrISM or Pfam-A (values from 0.7-0.9, i.e. 70-90% of the annotated domain has to be sequence-similar), and (2) the required threshold for sequence similarity (shown: BLAST E-values from 0.1-0.001, i.e. matches are accepted if they are more similar than this threshold). The data shown were compiled for the yeast proteome. The maximum number (14,047) was obtained at a coverage of 80% and an E-value of 0.1 (right dashed line); the minimum was 13,915 (left dashed line). For the parameters we used (coverage of 80% and E-value of 0.01, central solid line), the number was 13,988. Thus, the variations in the CHOP results throughout this extensive parameter range were smaller than 1%.
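The "smaller than 1%" statement can be checked directly from the three counts quoted in the caption. The short sketch below does this arithmetic; normalising the spread by the count obtained with the parameters actually used is an assumption made here for illustration (the caption does not say which denominator was used), and the function name is hypothetical.

```python
# Counts quoted in the Fig. S1 caption (yeast proteome).
max_count = 14_047   # coverage 80%, E-value 0.1
min_count = 13_915   # minimum over the tested parameter range
used_count = 13_988  # coverage 80%, E-value 0.01 (parameters used)


def relative_spread(lowest, highest, reference):
    """Return (highest - lowest) as a fraction of the reference count.

    Assumption: the spread is normalised by the reference count; the
    caption does not state the denominator explicitly.
    """
    return (highest - lowest) / reference


spread = relative_spread(min_count, max_count, used_count)
print(f"Spread over the parameter range: {spread:.2%}")
```

Dividing by the minimum count instead of the reference count also gives a value below 1%, so the conclusion does not depend on this choice.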



Fig. S2
figs2.gif

Fig. S2: PrISM and Pfam-A predicted similar numbers of domains for the same proteins. 6,349 yeast proteins were chopped according to PrISM alone or Pfam-A alone and compared for the number of predicted domains. 1,521 of them could be dissected by either method. In 40% of the cases, the predictions from PrISM and Pfam-A were the same. Cumulatively, the difference was smaller than three domains for 90% of the proteins.
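A comparison of this kind can be sketched as follows. This is a minimal illustration, not the procedure used for the figure: it assumes that each method yields a mapping from protein identifier to number of predicted domains and that only proteins annotated by both sources are compared. The function name, the identifiers and the toy counts are all made up.

```python
from collections import Counter


def compare_domain_counts(counts_a, counts_b):
    """Compare per-protein domain counts from two annotation sources.

    Returns (n_shared, cumulative), where cumulative[d] is the fraction
    of shared proteins whose two counts differ by at most d domains.
    """
    shared = sorted(set(counts_a) & set(counts_b))
    if not shared:
        return 0, {}
    diffs = Counter(abs(counts_a[p] - counts_b[p]) for p in shared)
    cumulative = {}
    running = 0
    for d in range(max(diffs) + 1):
        running += diffs.get(d, 0)
        cumulative[d] = running / len(shared)
    return len(shared), cumulative


# Toy input (hypothetical identifiers and counts, not data from the paper).
prism = {"P1": 2, "P2": 1, "P3": 4, "P4": 3, "P5": 1}
pfam_a = {"P1": 2, "P2": 2, "P3": 1, "P4": 3, "P6": 2}

n_shared, cumulative = compare_domain_counts(prism, pfam_a)
print(n_shared, cumulative)
```

In this layout `cumulative[0]` corresponds to the fraction of identical predictions, and `cumulative[2]` to the fraction of proteins for which the difference is smaller than three domains.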



Fig. S3
figs3.gif

Fig. S3: Percentage of single-domain proteins in proteomes. As expected, eukaryotic proteomes (blue) have more multi-domain proteins than do prokaryotes (green) and archaea (red).



Fig. S4
figs4.gif

Fig. S4: CLUP versus other family-based databases. The distribution of CLUP lengths was most similar to that of the semi-automatic domain parsing method SBase: both slightly over-represented regions ≤50 residues, and over-represented (with respect to SCOP) fragments with over 360 residues. Both also came closest to Pfam-A regions. For all public methods the results are taken from the respective databases.



Table S1

Table S1: Names and numbers of ORFs for 62 proteomes used

Organism                                Number of proteins    Number of CHOP fragments

Archaea
Aeropyrum pernix K1                     1013                  3931
Archaeoglobus fulgidus                  2019                  4240
Halobacterium sp. (strain NRC-1)        1338                  3757
Methanosarcina acetivorans              2744                  8592
Methanococcus jannaschii                1707                  3142
Methanopyrus kandleri                   1091                  2853
Methanobacterium thermoautotrophicum    1335                  3388
Pyrococcus abyssi                       1379                  3273
Pyrococcus furiosus                     1469                  3647
Pyrococcus horikoshii                   1233                  3431
Sulfolobus solfataricus                 586                   1568
Sulfolobus tokodaii                     1590                  4497
Thermoplasma acidophilum                1053                  2633
Thermoplasma volcanium                  1039                  2686

Prokaryotes
Aquifex aeolicus                        1390                  3039
Bacillus subtilis                       3266                  7611
Bifidobacterium longum                  1291                  3676
Borrelia burgdorferi                    646                   1671
Brucella melitensis                     1471                  3980
Campylobacter jejuni                    1198                  3001
Caulobacter crescentus                  2587                  7117
Chlamydia pneumoniae                    692                   1938
Chlorobium tepidum                      1465                  4182
Chlamydia trachomatis                   661                   1765
Clostridium acetobutylicum              2655                  7350
Clostridium perfringens                 1987                  5241
Deinococcus radiodurans                 2058                  5883
Escherichia coli                        4089                  8225
Fusobacterium nucleatum                 1404                  3742
Haemophilus influenzae                  1694                  3320
Helicobacter pylori                     1088                  2803
Lactococcus lactis (subsp. lactis)      1621                  4153
Leptospira interrogans                  2043                  7569
Listeria innocua                        2218                  5454
Listeria monocytogenes                  2211                  5402
Mycoplasma genitalium                   468                   991
Mycobacterium leprae                    1220                  3346
Mycoplasma pneumoniae                   683                   1386
Mycobacterium tuberculosis              2865                  8233
Neisseria meningitidis                  1385                  3688
Oceanobacillus iheyensis                2658                  6540
Pasteurella multocida                   1815                  4057
Pseudomonas aeruginosa                  4337                  11058
Rickettsia conorii                      806                   2283
Rickettsia prowazekii                   781                   1678
Staphylococcus aureus                   1903                  4785
Streptomyces coelicolor                 5102                  15121
Streptococcus pyogenes                  1301                  3395
Synechococcus elongatus                 1767                  4928
Synechocystis PCC6803                   2216                  6356
Thermotoga maritima                     1446                  3608
Treponema pallidum                      955                   2104
Ureaplasma urealyticum                  454                   1129
Vibrio cholerae                         2135                  5521
Xanthomonas campestris (pv. citri)      2988                  8271
Xylella fastidiosa                      1513                  4738

Eukaryotes
Arabidopsis thaliana                    16992                 61241
Caenorhabditis elegans                  12519                 45427
Drosophila melanogaster                 8762                  33601
Saccharomyces cerevisiae                5447                  13334
Homo sapiens                            24383                 93619
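The two numeric columns of Table S1 can be combined into a rough fragments-per-protein ratio for each proteome. The sketch below does so for three rows copied from the table. The ratio is only an illustrative summary computed here; it is not a quantity reported in the supplement, and the function name is hypothetical.

```python
# (number of proteins, number of CHOP fragments) copied from Table S1.
rows = {
    "Methanococcus jannaschii": (1707, 3142),
    "Escherichia coli": (4089, 8225),
    "Homo sapiens": (24383, 93619),
}


def fragments_per_protein(n_proteins, n_fragments):
    """Average number of CHOP fragments per protein in one proteome."""
    return n_fragments / n_proteins


ratios = {name: fragments_per_protein(*counts) for name, counts in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} fragments per protein")
```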

 


Contact: rost@columbia.edu
Version: Dec 2, 2003