Abstract:
MEDLINE is a collection of more than 12 million references
and abstracts covering recent life science literature.
With its continued growth and cutting-edge terminology,
spell-checking with a traditional lexicon-based
approach requires significant additional manual followup.
In this work, an internal corpus-based context quality
rating, frequency, and simple misspelling transformations
are used to rank words from most likely to be
misspellings to least likely. Eleven-point average precisions
of 0.891 have been achieved within a class of
42,340 all-alphabetic words having a score less than
10. Our models predict that 16,274 or 38% of these
words are misspellings. Based on test data, this result
has a recall of 79% and a precision of 86%. In other
words, spell checking can be done with statistics instead
of a dictionary. As an application, we examine the
time history of low-scoring words in MEDLINE titles and
abstracts.
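
The eleven-point average precision quoted above can be illustrated with a minimal sketch. The abstract does not say exactly how the figure was computed, so this assumes the standard information-retrieval definition: interpolated precision is taken at the recall levels 0.0, 0.1, ..., 1.0 and averaged. The function name and the input format (a ranked list of true/false labels, where true marks a genuine misspelling) are hypothetical and chosen only for illustration.

```python
def eleven_point_average_precision(ranked_labels):
    """Eleven-point interpolated average precision for a ranked list.

    ranked_labels: booleans in ranked order (most likely misspelling first);
    True means the word really is a misspelling. This input format is an
    assumption made for illustration, not the paper's own data layout.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0

    # Record (recall, precision) after each position in the ranking.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a recall level is the best precision
    # observed at any recall greater than or equal to that level.
    interpolated = []
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)

    return sum(interpolated) / 11


if __name__ == "__main__":
    # Toy ranking: misspellings at ranks 1 and 3, a correct word at rank 2.
    print(eleven_point_average_precision([True, False, True]))
```

A ranking that places every misspelling ahead of every correctly spelled word scores 1.0 under this definition, and the score falls as correct words move up the ranking.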