Francine Chen
Dan Bloomberg
Xerox PARC

Extraction of Indicative Summary Sentences from Imaged Documents

Techniques for creating computer-generated summaries of textual documents have been developed by a number of researchers. In contrast to these text-based techniques, we have developed a method for automatically selecting sentences for creating a summary from an imaged document without recognition of the characters in each word. Summary sentence selection is performed using a statistical classifier to determine the likelihood of each sentence in a document being a summary sentence. The sentences most likely to be summary sentences are then selected for extraction. Prior to sentence selection, the imaged document is processed to identify the word locations, the reading order of words, and the locations of sentence and paragraph boundaries in the text. The words are grouped into equivalence classes to mimic the terms in a text document. The system was evaluated against a set of abstracts created by a professional abstracting company, and the results are compared with text-based abstracts.
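
The selection step described above can be illustrated with a minimal sketch. It is not the classifier used in the system: it assumes each sentence has already been reduced to a list of word equivalence-class IDs (integers standing in for terms), and it scores sentences with a simple naive-Bayes-style log-odds over three hypothetical binary features. The feature names, the probability tables, the prior, and the function names are all placeholders chosen for illustration.

```python
import math
from collections import Counter

def term_frequencies(sentences):
    """Count how often each equivalence-class ID occurs in the document."""
    return Counter(cls for sent in sentences for cls in sent["classes"])

def features(sent, top_classes):
    """Binary features for one sentence (hypothetical choices, not from the paper)."""
    return {
        "starts_paragraph": sent["starts_paragraph"],
        "long_enough": len(sent["classes"]) >= 5,
        "has_frequent_class": any(c in top_classes for c in sent["classes"]),
    }

def log_odds(feats, p_summary, p_other, prior=0.2):
    """Naive-Bayes log-odds that a sentence is a summary sentence.

    p_summary[name] / p_other[name] are assumed probabilities that the
    feature is present in summary / non-summary sentences.
    """
    score = math.log(prior / (1 - prior))
    for name, present in feats.items():
        ps, po = p_summary[name], p_other[name]
        if present:
            score += math.log(ps / po)
        else:
            score += math.log((1 - ps) / (1 - po))
    return score

def select_summary(sentences, p_summary, p_other, k=2, n_top=3):
    """Return indices of the k highest-scoring sentences, in reading order."""
    freqs = term_frequencies(sentences)
    top_classes = {cls for cls, _ in freqs.most_common(n_top)}
    scored = [
        (log_odds(features(s, top_classes), p_summary, p_other), i)
        for i, s in enumerate(sentences)
    ]
    best = sorted(scored, reverse=True)[:k]
    return sorted(i for _, i in best)

# Illustrative input: four sentences as equivalence-class IDs (made-up data).
doc = [
    {"classes": [1, 2, 3, 4, 5, 6], "starts_paragraph": True},
    {"classes": [7, 8], "starts_paragraph": False},
    {"classes": [1, 2, 9, 10, 11], "starts_paragraph": False},
    {"classes": [12, 13, 14], "starts_paragraph": True},
]
# Illustrative probability tables; a real system would estimate these from training data.
p_summary = {"starts_paragraph": 0.6, "long_enough": 0.8, "has_frequent_class": 0.7}
p_other = {"starts_paragraph": 0.3, "long_enough": 0.5, "has_frequent_class": 0.4}

print(select_summary(doc, p_summary, p_other, k=2, n_top=2))
```

In this sketch the most frequent equivalence classes stand in for important terms. In practice the most frequent classes would likely be function words, so some filtering of those classes would be needed before such a feature is useful.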
