  • DocumentCode
    3408994
  • Title
    AZuRE, a scalable system for automated term disambiguation of gene and protein names
  • Author
    Podowski, Raf M. ; Cleary, John G. ; Goncharoff, Nicholas T. ; Amoutzias, Gregory ; Hayes, William S.
  • Author_Institution
    Karolinska Inst., Stockholm, Sweden
  • fYear
    2004
  • fDate
    16-19 Aug. 2004
  • Firstpage
    415
  • Lastpage
    424
  • Abstract
    Researchers, hindered by the lack of standard naming conventions for genes and proteins, endure long, sometimes fruitless, literature searches. This paper describes a system that automatically assigns gene names to their LocusLink IDs (LLIDs) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good-quality models (F-measure > 0.7; nearly 60% of these had an F-measure > 0.9) and 13,202 did not, mainly because too few document references were known. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. The authors conclude that high-quality gene disambiguation can be achieved using scalable automated techniques.
  • Keywords
    biology computing; genetics; learning (artificial intelligence); molecular biophysics; physiological models; proteins; AZuRE; LocusLink ID; MEDLINE; SwissProt databases; automated term disambiguation; gene names; human genes; internal accuracy assessment; literature searches; protein names; scalable system; supervised learning; Abstracts; Databases; Humans; Machine learning; Natural language processing; Natural languages; Proteins; Research and development; Supervised learning; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE
  • Print_ISBN
    0-7695-2194-0
  • Type
    conf
  • DOI
    10.1109/CSB.2004.1332454
  • Filename
    1332454