OCR-Free Recognition and Labeling of Forms

To connect Gutenberg's world with the electronic medium, we need computational techniques that make original printed information available for further electronic processing. One of the key issues in solving this task is structure. Electronic data is usually available as portions, schemata, patterns, or structured objects. The same holds for printed documents, where the individual units of information are arranged in two dimensions to guide the reader. If such fields, often called regions of interest or logical objects, could be identified automatically, this would be an important step towards an expectation-driven analysis of their contents.

In my talk, I will discuss the fundamental principles of the system FormClas. It builds on the highly structured nature of printed information and serves as a basis for automatic form processing. From detected layout features, using their position and dimension, it establishes a weighted syntactic representation, i.e. a reference pattern. Based on this, it recognizes a form and retrieves information from its reference pattern in order to determine logical objects. In principle, FormClas works independently of any textual information and of any preprinted line structure. Furthermore, it is not restricted to a single type of form but handles various form types at the same time.
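
To make the idea of a weighted reference pattern concrete, the following is a minimal sketch and not the actual FormClas implementation. It assumes that layout features are plain rectangles, that each reference feature carries a hand-chosen weight, and that similarity is a simple tolerance-based comparison of position and dimension. All names (`Box`, `ReferencePattern`, `similarity`, `score`, `classify`), the tolerance, and the threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box:
    """A layout feature described only by position and dimension."""
    x: float
    y: float
    w: float
    h: float


@dataclass
class ReferencePattern:
    """Hypothetical reference pattern for one form type."""
    name: str
    features: list                               # list of (Box, weight) pairs
    fields: dict = field(default_factory=dict)   # logical object label -> Box


def similarity(a, b, tol=20.0):
    """Map the mean absolute difference of x, y, w, h to a value in [0, 1]."""
    d = (abs(a.x - b.x) + abs(a.y - b.y) + abs(a.w - b.w) + abs(a.h - b.h)) / 4.0
    return max(0.0, 1.0 - d / tol)


def score(pattern, detected):
    """Weighted average of the best match found for each reference feature."""
    total = sum(weight for _, weight in pattern.features)
    if total == 0 or not detected:
        return 0.0
    matched = sum(
        weight * max(similarity(ref, box) for box in detected)
        for ref, weight in pattern.features
    )
    return matched / total


def classify(patterns, detected, threshold=0.5):
    """Compare against all form types at once and return the best one.

    Returns (form name, logical objects), or (None, {}) if nothing fits.
    """
    best = max(patterns, key=lambda p: score(p, detected), default=None)
    if best is None or score(best, detected) < threshold:
        return None, {}
    return best.name, best.fields


# Two made-up form types; no text or preprinted lines are used anywhere.
INVOICE = ReferencePattern(
    "invoice",
    [(Box(50, 40, 300, 30), 2.0), (Box(50, 100, 200, 120), 1.0)],
    {"address": Box(50, 100, 200, 120)},
)
ORDER = ReferencePattern(
    "order",
    [(Box(400, 40, 150, 30), 2.0), (Box(50, 300, 500, 200), 1.0)],
    {"items": Box(50, 300, 500, 200)},
)
PATTERNS = [INVOICE, ORDER]

# Layout features as they might be detected on a scanned page.
DETECTED = [Box(52, 41, 298, 31), Box(49, 103, 204, 118)]

form_type, logical_objects = classify(PATTERNS, DETECTED)
print(form_type, sorted(logical_objects))
```

In this sketch the weights let distinctive layout features count more than common ones, and the logical objects are simply looked up from the winning pattern rather than read from the page, which mirrors the text-independent approach described above.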
