Jonathan J. Hull
Ricoh California Research Center

Content-Based Retrieval From Document Image Databases

A method is proposed for organizing a database of document images using image data extracted from the text portion of each document. This makes it possible to locate a matching document in the database even when a document has been reformatted or re-imaged. Each document image is represented by a number of descriptors that are invariant to the geometric distortions of translation, rotation, and scaling. Individual descriptors capture information about local features, and the overall set of descriptors for a document image provides a redundant description of its content. Given the descriptors extracted from an input document, an equivalent matching document is located by accessing a hash table that lists the documents containing each descriptor. An innovative application of this approach, termed "iconic paper", is described: a paper-based interface to a large document image database in which icons of document images printed on paper are used as keys for retrieval.
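
The hash-table lookup described above can be illustrated with a minimal Python sketch. It is not the author's implementation: descriptors are assumed here to be opaque hashable values (plain strings stand in for whatever is extracted from the text image), the names `build_index` and `best_match` are hypothetical, and picking the document with the most descriptor hits is an assumed voting rule that the abstract does not specify.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Build a hash table mapping each descriptor to the documents containing it.

    `database` maps a document id to an iterable of descriptors.
    Descriptors are assumed to be hashable (an assumption of this sketch).
    """
    index = defaultdict(set)
    for doc_id, descriptors in database.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index


def best_match(index, query_descriptors):
    """Return the document id sharing the most descriptors with the query.

    Counting hits per document is an assumed voting scheme; it relies on
    the redundancy of the descriptor set, so a few missing or spurious
    descriptors need not change the winner. Returns None if nothing matches.
    """
    votes = Counter()
    for descriptor in set(query_descriptors):
        for doc_id in index.get(descriptor, ()):
            votes[doc_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two stored documents and a degraded query.
database = {
    "doc_a": ["d1", "d2", "d3", "d4"],
    "doc_b": ["d3", "d5", "d6"],
}
index = build_index(database)

# The query is missing "d3", contains a spurious "d9", and shares "d5" with doc_b.
query = ["d1", "d2", "d4", "d9", "d5"]
match = best_match(index, query)
```

In this toy example the query shares three descriptors with `doc_a` and one with `doc_b`, so `doc_a` is returned even though the query's descriptor set is neither complete nor clean.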


J. J. Hull, "Document image matching and retrieval with multiple distortion-invariant descriptors," International Association for Pattern Recognition Workshop on Document Analysis Systems (DAS94), Kaiserslautern, Germany, October 18-20, 1994, 383-399.

M. Peairs, "Iconic Paper," Proceedings of the Third International Conference on Document Analysis and Recognition, Montreal, Canada, August 14-16, 1995, 1174-1179.
