Gary E. Kopec
Xerox PARC

Enlivening Legacy Documents in the UC Berkeley Digital Library Project

The UC Berkeley Environmental Digital Library Project is one of six university-led projects that were initiated in the fall of 1994 as part of a four-year digital library initiative sponsored by the NSF, NASA and ARPA. The Berkeley project is particularly interesting from a document image analysis perspective because its testbed document collection consists almost entirely of scanned paper materials such as environmental impact reports, county general plans, water usage bulletins and educational pamphlets. One of the major research goals of the project is to develop technology for "enlivening" such legacy collections by providing content-based means of access, display and manipulation. This talk will describe two key elements of the project's approach to enlivening legacy documents: the multivalent document (MVD) model and the use of document-specific decoders. A multivalent document consists of multiple "layers" of distinct but intimately related content. In the case of scanned documents, typical layers include the bitmap image, an OCR text layer, a layer of word image coordinates and a table structure layer. Document-specific decoders are recognizers that are tailored to the fonts, layout, information structure, etc. of specific documents. They are used in the digital library to create accurate, richly structured transcriptions of selected high-value documents for database entry and content repurposing.
