CUBIC: NLProt / Data
Explanation
This page contains the data we used to develop NLProt: the text corpora with tagged protein names, and the dictionary files used for the filtering procedure described in our paper. The only tag used in the corpora is <n>, which marks the start of a protein name; </n> is the terminating tag.
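As an illustration of the tag format described above, the following is a minimal sketch of how the tagged names might be read out of a corpus file. The function names and the sample sentence are hypothetical and not part of NLProt; the pattern also tolerates whitespace inside the tag brackets (such as "< n>"), which is an assumption about the files rather than a documented property.

```python
import re

# Matches one tagged protein name. Whitespace inside the brackets is
# tolerated (e.g. "< n>" or "< /n>"); this is an assumption, not a
# documented property of the corpus files.
TAG_RE = re.compile(r"<\s*n\s*>(.*?)<\s*/\s*n\s*>", re.DOTALL)


def extract_protein_names(text):
    """Return the tagged protein names in order of appearance."""
    return [name.strip() for name in TAG_RE.findall(text)]


def strip_tags(text):
    """Return the text with the tags removed and the names kept."""
    return TAG_RE.sub(lambda m: m.group(1), text)


# Hypothetical sample sentence, not taken from any of the corpora.
sample = "Binding of <n>p53</n> to <n>MDM2</n> was reduced."
names = extract_protein_names(sample)
plain = strip_tags(sample)
```

Here `names` holds the two tagged names and `plain` holds the untagged sentence, which is convenient when comparing a tagger's output against the reference tagging.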

Tagged Corpora
Yapex corpus: 200 abstracts; we kept the original tagging by Franzen et al. (Yapex website)
GENIA corpus: 2,000 abstracts, retagged by us by converting the original "protein_molecule" tag to <n> (GENIA website)
BioCreative: the corpus used in the BioCreative competition (7,500 sentences for training; 2,500 for testing)
Please note that none of the corpora above were tagged by us!
Recent166: 166 recent abstracts (Nov/Dec '03, from EMBO J and Cell), automatically tagged by the final version of our program
The Recent166 corpus should not be used for training, since NLProt did not place all tags correctly.

Data used for Filtering
Common Dictionary: derived from the Merriam-Webster (MW) online dictionary (note that this file is not complete, since our algorithm can access MW over the internet and continually adds words to this local version)
Species: List of species names from SWISS-PROT
Tissue: List of tissue names from SWISS-PROT
Minerals: List of mineral/salt formulas and their names
Endings of Chemicals: List of 130 typical endings of chemicals
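To make the role of such lists more concrete, here is a minimal sketch of how a list of typical endings could be used to discard candidate tokens that look like chemicals. This is an assumption for illustration only: the page does not describe the file format or how NLProt actually applies the list, and the endings and candidate tokens below are hypothetical, not taken from the data files.

```python
def load_endings(lines):
    """Normalize one-ending-per-line input (an assumed file layout)."""
    return tuple(line.strip().lower() for line in lines if line.strip())


def looks_like_chemical(token, endings):
    """True if the token ends with any of the listed endings."""
    return token.lower().endswith(endings)


# Hypothetical endings and candidates, for illustration only.
endings = load_endings(["ol\n", "ide\n", "ate\n", "\n"])
candidates = ["ethanol", "chloride", "p53"]
kept = [c for c in candidates if not looks_like_chemical(c, endings)]
```

In this sketch only the token that matches none of the endings remains in `kept`. A real filter would presumably combine several lists (common words, species, tissues, minerals) rather than rely on suffixes alone.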

Filtering rules
Filtering rules: List of all filtering rules used by NLProt to pre-filter input text