Two Architectures for Segmenting Hand-printed Text
George Mills(*)
(*)Research conducted at Apple Computer

In this paper I compare two basic architectures for handwriting recognition. In both, candidate translations of a hand-printed word are generated by piecing together the outputs of a lower-level "recognizer", such as an artificial neural network trained to identify individual characters. The two architectures differ in their approach to character segmentation, that is, in how they resolve competing ways of partitioning the strokes into groups that form characters.

In the "CR" (character recognizer) architecture, all plausible stroke groups are considered. Each group is presented in isolation to a "character recognizer" (a "CR") whose job is to recognize the group as a single character. Because all plausible groups are considered, segmentation is typically "solved" by the technique of "negative training", i.e., by including examples of incorrectly grouped strokes in the training of the CR so that it assigns low probabilities to every translation of an incorrect group.
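The CR scheme can be illustrated with a minimal Python sketch. It rests on assumptions that do not come from the paper: strokes are opaque labels, stroke groups are contiguous runs of at most `max_group` strokes, and a translation's score is the sum, over all segmentations that produce it, of the product of per-group CR probabilities. `cr_translations` and the lookup-table recognizer `toy_cr` are hypothetical stand-ins for a trained network. `toy_cr` imitates negative training by giving the lone fragments of a "T" low probabilities for every character.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

Stroke = str  # toy assumption: a stroke is just an opaque label
# Hypothetical CR interface: stroke group -> {character: probability}
CharRecognizer = Callable[[Tuple[Stroke, ...]], Dict[str, float]]


def cr_translations(strokes: Sequence[Stroke], cr: CharRecognizer,
                    max_group: int = 2) -> Dict[str, float]:
    """Score candidate translations by piecing together CR outputs.

    Every contiguous partition of the strokes into groups of at most
    max_group strokes is tried; each group is shown to the CR in isolation.
    """
    scores: Dict[str, float] = defaultdict(float)

    def extend(pos: int, prefix: str, prob: float) -> None:
        if pos == len(strokes):
            # Sum over all segmentations yielding the same translation.
            scores[prefix] += prob
            return
        for size in range(1, max_group + 1):
            if pos + size > len(strokes):
                break
            group = tuple(strokes[pos:pos + size])
            for char, p in cr(group).items():
                if p > 0:
                    extend(pos + size, prefix + char, prob * p)

    extend(0, "", 1.0)
    return dict(scores)


def toy_cr(group: Tuple[Stroke, ...]) -> Dict[str, float]:
    """Hypothetical lookup-table CR standing in for a trained network."""
    table = {
        ("-", "|"): {"T": 0.9, "+": 0.1},
        # "Negatively trained": fragments of a T score low for every char.
        ("-",): {"-": 0.05},
        ("|",): {"l": 0.05, "1": 0.05},
    }
    return table.get(group, {})


result = cr_translations(["-", "|"], toy_cr)
print(max(result, key=result.get))  # best-scoring translation
```

Here the two-stroke word yields "T" (0.9) and "+" (0.1) from the single two-stroke group, while the split segmentation yields "-l" and "-1" with only 0.0025 each, because the CR was made to distrust isolated fragments. Without that negative training, the split readings could compete with the correct one.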

In the "GCR" (group and character recognizer) architecture, the recognizer itself is called upon to identify both the correct stroke grouping and the character translation of the group. Candidate word translations are pieced together from left to right; at each stage, the GCR is presented with the next several unconsumed strokes in the word. Its job is not only to recognize the next character in the word but also to identify which of the unconsumed strokes belong to that character.
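The left-to-right GCR procedure can be sketched in the same toy setting. This is a minimal illustration under assumptions not taken from the paper: the GCR sees a window of the next `window` unconsumed strokes and returns a distribution over (number of strokes consumed, character) pairs, and a translation's score is the product of the GCR outputs along its path, summed over paths. `gcr_translations` and `toy_gcr` are hypothetical names, and `toy_gcr` is a lookup table standing in for a trained recognizer.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

Stroke = str  # toy assumption: a stroke is just an opaque label
# Hypothetical GCR interface: window of unconsumed strokes ->
#   {(strokes_consumed, character): probability}
GroupCharRecognizer = Callable[[Tuple[Stroke, ...]],
                               Dict[Tuple[int, str], float]]


def gcr_translations(strokes: Sequence[Stroke], gcr: GroupCharRecognizer,
                     window: int = 3) -> Dict[str, float]:
    """Build translations left to right, letting the GCR choose each group."""
    scores: Dict[str, float] = defaultdict(float)

    def extend(pos: int, prefix: str, prob: float) -> None:
        if pos == len(strokes):
            scores[prefix] += prob
            return
        view = tuple(strokes[pos:pos + window])  # next unconsumed strokes
        for (n_used, char), p in gcr(view).items():
            # Ignore outputs that consume nothing or more than was shown.
            if p > 0 and 1 <= n_used <= len(view):
                extend(pos + n_used, prefix + char, prob * p)

    extend(0, "", 1.0)
    return dict(scores)


def toy_gcr(view: Tuple[Stroke, ...]) -> Dict[Tuple[int, str], float]:
    """Hypothetical lookup-table GCR standing in for a trained network."""
    if view[:2] == ("-", "|"):
        # Most likely the two strokes form one "T"; maybe "-" stands alone.
        return {(2, "T"): 0.8, (1, "-"): 0.2}
    if view[:1] == ("|",):
        return {(1, "l"): 1.0}
    return {}


result = gcr_translations(["-", "|"], toy_gcr)
print(result)
```

Unlike the CR sketch, the grouping decision here is made by the recognizer's output rather than by exhaustive enumeration. In this example each `toy_gcr` output is a normalized distribution, so the two translation scores ("T" at 0.8, "-l" at 0.2) sum to 1.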

To compare the effectiveness of these competing architectures, I construct as a test case a highly simplified model of the handwriting recognition task. For this simplified task I derive exact mathematical formulas for how each architecture, if perfectly implemented, would score the candidate translations of any given "written word". This lets me compare the two architectures (in fact, six different variants of each) both formally and empirically.

The formal analysis shows that the formula for the GCR architecture conforms very closely to the Bayes-optimal formula, whereas the formula for the CR architecture deviates from it in ways that could in principle produce significantly worse results. Monte Carlo simulations, however, tell a different story: both architectures recognize artificially generated "written words" at very nearly the optimal Bayesian rate. The study is thus somewhat inconclusive as to which architecture could be expected to perform best on real handwriting, and one is left to puzzle over why the CR architecture, which appears to be formally incorrect, nevertheless seems to perform well in practice. Despite this apparent stalemate, comparing the variants of each architecture yields valuable insights into the correct implementation of each, with particularly important and surprising conclusions concerning the proper role of a linguistic component (language model) in handwriting recognition.
