NLProt: extracting protein names and sequences from papers
AUTHOR(S)
Mika, Sven
SOURCE
Oxford University Press
ABSTRACT
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
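The abstract's figures use a strict criterion: a partially tagged name counts as an error. The sketch below illustrates how precision and recall could be computed under that criterion. It is not NLProt's evaluation code. The function name, the representation of names as (start, end) character offsets, and the example spans are all assumptions made for illustration.

```python
def strict_precision_recall(predicted, gold):
    """Exact-span precision and recall; partial overlaps count as errors."""
    predicted, gold = set(predicted), set(gold)
    # Only spans whose start and end both match a gold span are correct.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) character offsets of protein names in one abstract.
gold_spans = [(0, 4), (10, 18), (25, 30), (40, 47)]
# (10, 15) only partially covers the gold name (10, 18), so it is an error.
predicted_spans = [(0, 4), (10, 15), (25, 30), (50, 55)]

precision, recall = strict_precision_recall(predicted_spans, gold_spans)
```

Under this criterion a partial match is penalised twice: it lowers precision as a wrong prediction and lowers recall as a missed gold name.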
ARTICLE ACCESS
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=441565
RELATED DOCUMENTS
- Molecular linguistics: Extracting information from gene and protein sequences
- Should reviewers of papers have their names published?
- Extracting protein alignment models from the sequence database.
- Should reviewers of papers have their names published?: Let reviewers own responsibility for the papers they pass
- Should reviewers of papers have their names published?: Go one step further