Radovan Krtolica
Canon Research America
4009 Miranda Ave., CA 94304

Learning character recognition by localized interpretation of character images


Recognition algorithms encompass segmentation, feature extraction and classification, but these components might be difficult to isolate because of strong interactions between them, and the lack of crisp criteria telling where one stops and where the other begins, [1]. Extreme variability of text images, and of hand written texts in particular, makes it difficult to tune any of those three parts of a recognition algorithm to real data. Automatic parameter tuning (training or learning) requires parametrization of at least a part of the algorithm. As it is more convenient to parametrize classification than the rest of the recognition algorithm, "machine learned recognition" usually means that the recognition classifier has been trained (or tuned) automatically, [2].

Machine training is quite popular in recognition, and it will be difficult to find nowadays a commercial OCR program without some kind of machine trainable recognizer. Still, learning ability of these programs is far from perfect. Indeed, major improvement in any of the three main parts of the recognition program (i.e. segmentation, feature extraction and classification) leads to significant improvement of the overall recognition, because the main difficulty in recognition can be shifted from one of these three parts to another. For instance, good classification can be used in "recognition based" segmentation, [3], and good quality segmentation can solve a significant part of the classification problem. One can say that today both classification and segmentation are equally good (or equally bad) so that there is no significant gain to be expected in shifting the difficulty from one part of the program to another.

Parametrization models for machine learning can be: instance-based (direct), symbolic, numeric or mixed. Learning trees, [4], as the best known symbolic model, are not often used in character recognition because they are brittle when the number of classes is large. Numerical models are best represented by neural networks, [5], although some of them are processing symbolic information as well. The use of neural networks in OCR/ICR seems to be quite popular. However, predicted classes have to be encoded in some way in the network. For instance, if the number of classes lies between 2 to the power of b-1 and 2 to the power of b, (so that b bits are required to identify a class), each of b output units of the neural network can be made to represent one bit of the class code. Alternatively, one output unit can be associated with every class. Learning may become difficult as the size of the problem increases, because of the computational complexity of finding a network that will perform well, [6]. In other words, there is a lack of intuition on what type of neural network is a good model for the OCR/ICR classification process. Instance-based or direct learning models circumvent the need for more sophisticated symbolic or numeric models, but it might be impossible to store all potential instances of the character images due to the limited memory size, and if the memory requirements are lessened by the use of search methods, the classifier can become unwieldy and slow (-the well known relation between space and time constraints). If one attempts to reduce the number of instances of character-images stored, one has to treat several instances as a single entity and devise an appropriate criterion of similarity. In addition, one has to decide which instances are to be stored. In the rest of the paper we show that our Box Connectivity Approach (BCA) to feature extraction, and Localized Interpretation within the classifier, provide solutions to the outlined problems, and allow efficient implementation of direct learning.

Implementation of Direct Learning.

BCA transform partitions the minimal bounding rectangle of a character-image into a predefined number of rectangular partitions (boxes). As the dividing line between boxes may cross one or more pixels, boxes contain in general fractional number of pixels. Then, a binary value is attributed to each of the boxes, depending on the distribution of the pixels inside the boxes. In this way, original character-images are mapped into uniformly sized binary matrices ("character-maps"). It has been shown in previous publications, [7,8], that variability of character-images is significantly reduced by the BCA transform. Instances of BCA character-maps might be stored together with their true interpretations in a look-up table. The look-up table is consulted whenever a particular instance of a character-map needs to be interpreted (recognized).

Underlying the concept of Localized Interpretation is the intuition that two halves of a character-image, even though separated, can still be interpreted as belonging to an image of a particular character. That process can be iterated to a certain limit, after which the interpretation of the character-image is completely lost (the fragments can belong to any particular character of the alphabet). The assignement of one or more characters to a fragment of a character-image is called Localized Interpretation. Usefulness of the Localized Interpretation concept depends on the size of fragments that can be interpreted. A basic fact, established in our recognition experiments, is that true interpretation is localized to a certain extent, i.e. - that the size of the fragments of a character-image still carrying class discriminatory information, is small enough to reduce significantly memory requirements of the look-up table. There is considerable freedom in the way this fact can be exploited, and our implementation scheme is just one of the possible ways of doing it. Note that actual implementation is also an approximation, because the minimal size of fragments that still carry discriminatory information is not well defined.

We have partitioned each character-map in equal number of squares (fragments), all squares having the same number of elements. The number of binary elements encompassed by each of the squares is selected so that all their instantiations can be easily recorded (e.g. for 4x4 squares, there will be 65,536 instantiations). When a particular instance of a character-map is interpreted by visual inspection, all of its square instantiations are interpreted in the same way, that is - a true character associated with the character-map is assigned to each of the squares. Square instantiations, locations, and interpretations are stored in the look-up table. In order to recognize an instance of an overall character-map, the look-up table is consulted to find the interpretations associated with each of its square instantiations. Logical conjunction of these interpretations for all of the square partitions of the character-map yields the interpretation of the overall character-map.

However, when an interpretation of one character-map instance is extended to its square partitions, and the same interpretation of a different instance of the character-map is already present in the look-up table, all possible combinations of recorded instances of square partitions have the same interpretation, i.e. all resulting instances of character-maps are associated with the same interpretation symbol. This may lead to multiple interpretations of the same instance of a character-map. Multiple interpretations can be seen as ambiguities or inconsistencies, depending on whether they are due to an ambiguous visual recognition or to the spurious effect of the approximation used to implement the idea of Localized Interpretation. In any case, multiple interpretations of the same character-map are undesirable, and should not be tolerated in a look-up table. Therefore, after each training session, the look-up table should be checked for consistency, which is a test ensuring that each character-map instance recorded has a single interpretation. (Character-map instances with no interpretation at all are also allowed as long as the training is not complete.) The consistency test for the look-up table is based on the following idea. For each interpretation symbol in the table, a Variation Table is calculated. The Variation Table of a particular symbol indicates all square instantiations associated with that symbol (see Fig. 1).

Square                  SQUARE LOCATION CODE            interpretation
instantiation                                                   symbol
code            0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

0                  x        x              x
1                  x     x     x     x     x             x
..               a     x     a  a     a  a    x     x  a  a  a     A
..                 a  a  a     x  a     x  a a  a  a
..               x                 x                         x
65535           x           x        x       x        x

Fig. 1a Variation tables used to check consistency of the look-up table: recorded square instantiation is indicated by the symbols 'x' or 'a'; same instantiation at the same location in tables '1a' and '1b' is indicated by the symbol 'a', otherwise, an instantiation is indicated by the symbol 'x'.

Square                  SQUARE LOCATION CODE            interpretation
instantiation                                                   symbol
code            0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

0               x                 x     x                x
1                                 x     x                   x
..               a           a  a     a  a  x    x     a  a  a     P
..                 a  a  a  x     a        a a  a  a  x     x
..                                                     x
65535              x  x        x

Fig. 1b Variation tables used to check consistency of the look-up table: recorded square instantiation is indicated by the symbols 'x' or 'a'; same instantiation at the same location in tables '1a' and '1b' is indicated by the symbol 'a', otherwise, an instantiation is indicated by the symbol 'x'.

The look-up table is inconsistent when there are two Variation Tables in which each of the columns of the first table has at least one element (one instantiation) in common with the second table.

Experimental Results.

Coding the recognition program based on the outlined procedure has been completed. In the early training stages, this program is used in combination with a feature-based recognition of BCA character-maps. Preliminary tests on 10,000 hand-printed character-images, segmented from scans of hand-filled forms (transmittal sheets used to automatically store scanned document images into specified data-base folders), have shown good overall recognition. First 5000 character-images were used for training (filling partially the look-up table) and tuning the feature-based recognition algorithm. The combined recognition program was tested on the remaining 5000 character-images. It was shown that accuracy of ca. 80% of the feature-based recognition algorithm was increased to 96% when combined with the learning algorithm, where the feature-based algorithm was used as back-up. More extensive experiments are in progress.


Direct learning has been successfully implemented in a program that recognizes constrained hand-printed character-images, due to:

(1) BCA transforms of character-images, which significantly reduce variability of the original images,

(2) Local Interpretation of fragments of BCA character-maps, which is based on the fact that true interpretation of character-images can be extended to its fragments, when those fragments are large enough,

(3) consistency check of the look-up table after training.


[1] S.Mori,C.Y.Suen,K.Yamamoto, Historical Review of OCR Research and Development, Proc. of the IEEE,V.80,N.7,1992,pp.1029-58.

[2] M.Bokser, Omnidocument Technologies, Proc. of the IEEE,V.80,N.7,1992,pp1066-78.

[3] R.C.Casey and E.Lecolinet, A Survey of Methods and Strategies in Character Segmentation, IEEE Trans. PAMI-18,N.7,1996,pp.690-706.

[4] J.R.Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publ., San Mateo,Ca.,1994.

[5] S.Haykin, Neural Networks: A Comprehensive Foundation, Mac Millan College Publ. Co.,New York, 1994.

[6] W.S.Lee,P.L.Bartlett,R.C.Williamson, Efficient Agnostic Learning of Neural Networks with Bounded Fan-in, IEEE Trans. IT-42,N.6,1966,pp.2118-32.

[7] R.Krtolica,S.Malitsky,Two-Stage Box Connectivity Algorithm for Optical Character Recognition, Proc. of the Second International Conference on Document Analysis and Recognition, Tsukuba, Japan, October 20-22, 1993,pp.179-82.

[8] R.V.Krtolica,S.Malitsky,Multifont Optical Character Recognition Using a Box Connectivity Approach, US Patent No 5,539,840, July 23,1996.

Back to Bay Area OCR home page