NLProt

Intro

NLProt is a tool for finding protein names in natural-language text. It is based on Support Vector Machines (SVMs) trained on contextual features of named entities in scientific language. In addition, simple filtering rules and a protein-name dictionary are used to increase performance. NLProt reached a precision (accuracy) of 70% at a recall (coverage) of 85% on the 166 most recent abstracts of EMBO J and Cell (Nov/Dec 2003). When run from the command line, NLProt takes about 1 second per abstract.

Availability

Web service

We're working to restore the web service for NLProt.

Package

The NLProt package for Linux and Mac OS X systems will be available soon. Requests for the package should be sent to assistant at rostlab dot org.

License

NLProt is licensed to academic users under the GPL. Commercial licenses can be obtained by writing to Biosof LLC.

If you download NLProt 2.0, you are presumed to have read and accepted the NLProt 2.0 Licensing Conditions for either:

  1. a free, perpetual Academic License, which is free to any academic organization wanting to use NLProt 2.0 entirely for academic (teaching and/or research) purposes.
  2. an annual Commercial License, which is available for organizations wishing to use NLProt 2.0 internally or to exploit it commercially.

References

Sven Mika and Burkhard Rost (2004). Protein names peeled precisely off free text. Bioinformatics. 20 (Suppl 1), I241-I247.

Sven Mika and Burkhard Rost (2004). NLProt: extracting protein names and sequences from papers. Nucleic Acids Research. 32 (Supplement 2), W634-W637.

Installation Instructions - NLProt 2.0 command-line version (LINUX)

Files

Files contained in the NLProt-package:

executables

  • nlprot
  • install
  • svm
  • bl

SVM files

  • svm_model_1_j10.txt
  • svm_model_2_j10.txt
  • svm_model_5_j10.txt
  • svm_model_3_j1.txt
  • word_frequencies_names.txt
  • word_frequencies_overlap.txt
  • word_frequencies.txt

dictionary files

  • dictionary.txt
  • protein_dictionary.txt
  • common_words_not_in_sp_or_tr.txt
  • chemical_compounds_endings.txt
  • in_ending_negatives.txt
  • minerals.txt
  • species.txt
  • tissue.txt

other

  • trembl_species_links.txt

service

  • README.txt

System Requirements

Installation

1. Create a directory (e.g. /home/test/nlprot/) on your local machine

2. Download the compressed NLProt_LINUX archive (.tar.gz or .zip) from our server into this directory

3. Change to the new directory

 $ cd /home/test/nlprot/

4. Decompress the downloaded archive

 $ unzip NLProt_LINUX.zip

OR

 $ gunzip NLProt_LINUX.tar.gz
 $ tar -xf NLProt_LINUX.tar

5. Run the installation program

 $ ./install

install sets up your local copy of NLProt and then asks whether to install the name databases on your machine. These databases are needed to assign UniProt IDs to the names that NLProt finds. If you do not need this feature, you can skip this step. It can take up to 20 minutes on a 2.5 GHz machine, but it only has to be done once.

Deinstallation

Delete all contents of the NLProt directory. Since the install program does not copy any files to other directories, this is sufficient to remove the program completely from your machine.

Running NLProt

To use NLProt, run the executable 'nlprot': type "./nlprot" on the command line, followed by the required options.

Options are passed to the program as shown in the following example:

 $ ./nlprot -i /home/test/test_input.txt -o /home/test/test_output.txt
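
The same call can also be assembled from a script. The following is a minimal sketch, not part of the NLProt package: the helper name build_nlprot_command and its parameter names are hypothetical, and it assumes that every option takes its value as a separate argument (as -i and -o do in the example above). Only the options documented under "Options" on this page are used.

```python
import shlex


def build_nlprot_command(input_path, output_path, fmt="html",
                         database="sptr", single_id=True, fasta=False,
                         executable="./nlprot"):
    """Return the argument list for one NLProt run (hypothetical helper)."""
    # Reject values that this page does not document.
    if fmt not in ("html", "txt"):
        raise ValueError("fmt must be 'html' or 'txt'")
    if database not in ("sptr", "sp", "tr"):
        raise ValueError("database must be 'sptr', 'sp' or 'tr'")
    return [
        executable,
        "-i", input_path,                    # mandatory input file
        "-o", output_path,                   # mandatory output file
        "-f", fmt,                           # output format
        "-d", database,                      # sequence database
        "-s", "on" if single_id else "off",  # one ID per name, or all IDs
        "-a", "on" if fasta else "off",      # write a fasta file or not
    ]


cmd = build_nlprot_command("/home/test/test_input.txt",
                           "/home/test/test_output.txt", fmt="txt")
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
# To actually run it (requires an installed NLProt):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.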

Options:

-i  the input file (input format: see below)
    This is a mandatory option for the program!

    input format:
    plain natural-language text (each line = one abstract/paper)
    each line has to start with a number, followed by ">" and then the text
    e.g. 0001>abstract1 abstract1 abstract1 ...
         0002>abstract2 abstract2 abstract2 ...
         0003>abstract3 abstract3 abstract3 ...
         etc.

    NLProt runs most efficiently on one big input file with many abstracts (lines),
    because it then has to read the databases into memory only once.

-o  the output file (output-format: see -f)
    This is a mandatory option for the program!

-f  output format:
    html (default) = html formatted output (font color = red for protein names)
    txt = plain text (tags <n> and </n> for protein names)

-d  sequence database:
    sptr = show both SWISS-PROT and TrEMBL IDs (default)
    sp = show only SWISS-PROT IDs
    tr = show only TrEMBL IDs

-s  on = only provide one database ID for each found name. NLProt will scan the surrounding text for organism names and
         assign the most likely ID to each protein name. (default)
    off = provide all possible IDs (organism unspecific)

-a  on = create a fasta file ([output file-name].fasta) with all sequences for the provided database IDs
    off = do not create fasta output (default)
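
As an illustration of the input format expected by -i, the following sketch turns a list of abstracts into one numbered line per abstract. The function name format_nlprot_input is hypothetical, and the four-digit zero-padded numbering starting at 1 is an assumption taken from the "0001>" example above; this page does not state whether other number formats are accepted.

```python
def format_nlprot_input(abstracts):
    """Return NLProt input text: one abstract per line, prefixed 'NNNN>'."""
    lines = []
    for number, text in enumerate(abstracts, start=1):
        # Collapse line breaks and repeated whitespace so that
        # each abstract stays on a single line.
        one_line = " ".join(text.split())
        # Assumption: four-digit zero-padded counter, as in the example.
        lines.append(f"{number:04d}>{one_line}")
    return "\n".join(lines) + "\n"


abstracts = [
    "First abstract, possibly\nspanning several lines.",
    "Second abstract.",
]
print(format_nlprot_input(abstracts), end="")
```

Writing all abstracts into one file and calling NLProt once follows the efficiency note under -i: the databases are then read into memory only once.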


Temporary Files

NLProt generates a working directory ('tmp') to dump certain temporary files. All temporary files will be deleted after NLProt finishes.

Help

Who should use NLProt

NLProt is intended for researchers who want to build databases fully or partially automatically. NLProt is highly accurate in finding protein names in free text and optimally assigns database IDs (SWISS-PROT, TrEMBL) to the found names.

Example Files

Example Input Files:

Example Output File:

Input

  • Create a simple ASCII-file on your machine containing the text you want to scan for protein names. Copy and paste this file into the text box on the submit-page and press the Submit Text button. Please note that your input text has to consist of full sentences, since the algorithm needs the surrounding context of protein names in order to work properly.
  • Each request only takes a few seconds to finish. After that time, the output will appear on the screen.

Output

The output of the program is either an ASCII or an html file, depending on the user's preferences. It contains the tagged input text (in html format, names are shown in red) followed by a detailed table listing all found (tagged) names. Each found name is listed together with its position, its score and, where available, a database ID (SWISS-PROT, TrEMBL). In ASCII output, the <n> tag indicates the beginning of a protein name and the </n> tag indicates the end of a name. In the table at the end of the output file, TXT-POS is the position of the name in the text, SCORE is the output score of NLProt for this name and METHOD is the method by which the name was found. The following methods can be applied:

  1. SVM: the name was found by the SVM-system
  2. projected: the name was found by the SVM-system, but at a different position in the text (thus the name was 'projected' to the rest of the text).
  3. dictionary: the name is a long name, found in the dictionary (high length of names + name is in dictionary = strong indication for a protein name)
  4. abbr.-ext.: name is the long form of an abbreviation that was found by the SVM-system.

Additionally, NLProt searches the text for tissue types and species names in order to assign the correct UniProt ID (SWISS-PROT and TrEMBL) to each found name. In html output, tissues and species are marked in green and blue, respectively. In ASCII format, they are tagged with <t> or <s>.
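
The tags in the ASCII output can be collected with a few lines of code. The sketch below is only an illustration: the function name extract_tagged and the sample sentence are made up, and the closing tags </t> and </s> are an assumption by analogy with </n>, since only the opening tags for tissues and species are documented here.

```python
import re

# One pattern for all three tag types; \1 requires the matching closing tag.
TAG_PATTERN = re.compile(r"<(n|t|s)>(.*?)</\1>", re.DOTALL)
LABELS = {"n": "protein", "t": "tissue", "s": "species"}


def extract_tagged(text):
    """Return (label, tagged text) pairs in order of appearance."""
    return [(LABELS[m.group(1)], m.group(2).strip())
            for m in TAG_PATTERN.finditer(text)]


# Hypothetical sample, not real NLProt output.
sample = "Expression of <n>p53</n> in <s>human</s> <t>liver</t> was reduced."
for label, value in extract_tagged(sample):
    print(label, value)
```

The table at the end of the output file is not parsed here, because its exact layout is not specified on this page.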

Data

Explanation

This page contains the data that we used for developing NLProt: in particular, the text corpora with the tagged protein names and the dictionary files used for the filtering procedure described in our paper. The only tag we use in the corpora is the <n> tag for a protein name (</n> = terminating tag).

Tagged Corpora

  1. Yapex corpus (we kept the original tagging by Franzen et al.; 200 abstracts; Yapex website)
  2. GENIA corpus as it was retagged by us (original tag "protein_molecule" was transformed to <n>; 2000 abstracts; GENIA website)
  3. BioCreative: a corpus used in the BioCreative competition (7,500 sentences for training; 2,500 for testing)

Please note that none of the corpora above was tagged by ourselves!

  • Recent166: the 166 most recent abstracts (Nov/Dec '03 from EMBO J and Cell), automatically tagged by the final version of our program

The Recent166 corpus should not be used for training, since not all tags were correctly placed by NLProt.
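
Because the corpora mark protein names with <n> and </n>, a tagger's output can be scored against them in terms of precision and recall. The following is a minimal sketch under stated assumptions: a predicted name counts as correct only if its boundaries match a corpus name exactly, which is not necessarily the scoring criterion used in our paper, and the function names and sample sentences are hypothetical.

```python
import re

NAME_TAG = re.compile(r"</?n>")


def name_spans(tagged):
    """Return the set of (start, end) spans of names.

    Offsets refer to the text with all <n> and </n> tags removed,
    so that two differently tagged versions can be compared.
    """
    spans = set()
    offset = 0    # current position in the untagged text
    pos = 0       # current position in the tagged text
    start = None  # untagged start offset of the currently open name
    for m in NAME_TAG.finditer(tagged):
        offset += m.start() - pos
        pos = m.end()
        if m.group() == "<n>":
            start = offset
        elif start is not None:
            spans.add((start, offset))
            start = None
    return spans


def precision_recall(gold_tagged, predicted_tagged):
    """Exact-match precision and recall of predicted names against gold names."""
    gold = name_spans(gold_tagged)
    predicted = name_spans(predicted_tagged)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example sentences.
gold = "The <n>p53</n> protein binds <n>MDM2</n>."
predicted = "The <n>p53</n> <n>protein</n> binds MDM2."
print(precision_recall(gold, predicted))
```

In this example one of the two predicted names is correct and one of the two corpus names is found, so both values are 0.5.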

Data used for Filtering

Common Dictionary: derived from the Merriam-Webster (MW) online dictionary. Note that this file is not complete, since our algorithm can access the MW through the internet and constantly adds words to this local version.

  1. Species: List of species' names from SWISS-PROT
  2. Tissue: List of tissue names from SWISS-PROT
  3. Minerals: List of mineral/salt formulas and their names
  4. Endings of Chemicals: List of 130 typical endings of chemicals

Filtering rules

Filtering rules: list of all filtering rules used by NLProt to pre-filter input text
