A non-deterministic tokeniser for finite-state parsing

Masakazu Tateno, Hiroshi Masuichi, and Hiroshi Umemoto (Fuji Xerox)

This paper describes an optimal method for constructing a lexical transducer for Japanese: stems and suffixes are described separately in different lexicons, and an extra transducer level maps between canonical citation forms and stem-suffix style forms. Compared with other methods, this reduces the complexity of the rule descriptions and the computational load of intersection. We built a full-size lexical transducer for Japanese. It has about 60 thousand states and about 300 thousand arcs, and its physical size ranges from 800 KB to 1.5 MB depending on the compaction method.
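The idea of keeping stems and suffixes in separate lexicons, with an additional mapping to citation forms, can be illustrated with a toy sketch. This is not the authors' transducer: it uses plain dictionary lookup instead of finite-state machinery, and the romanized entries, the class labels (`V5`, `V1`), the tag names (`+Nonpast`, `+Neg`, `+Polite`), and the function `analyze` are all hypothetical, chosen only for illustration.

```python
# Toy illustration only: dictionary lookup stands in for finite-state
# transducers. All entries, class labels, and tags are hypothetical.

# Stem lexicon: stem -> (canonical citation form, inflection class).
STEMS = {
    "kak": ("kaku", "V5"),     # consonant-stem verb "write"
    "tabe": ("taberu", "V1"),  # vowel-stem verb "eat"
}

# Suffix lexicon, kept separately: inflection class -> {suffix: tag}.
SUFFIXES = {
    "V5": {"u": "+Nonpast", "anai": "+Neg", "imasu": "+Polite"},
    "V1": {"ru": "+Nonpast", "nai": "+Neg", "masu": "+Polite"},
}


def analyze(surface):
    """Return every 'citation+Tag' analysis of a romanized surface form."""
    results = []
    for stem, (citation, infl_class) in STEMS.items():
        if not surface.startswith(stem):
            continue
        # Step 1: split the surface form into stem and suffix.
        suffix = surface[len(stem):]
        tag = SUFFIXES[infl_class].get(suffix)
        if tag is not None:
            # Step 2: the extra level maps the stem to its citation form.
            results.append(citation + tag)
    return results


if __name__ == "__main__":
    for word in ("kakanai", "tabemasu", "kaku"):
        print(word, analyze(word))
```

Because the suffix tables are indexed by inflection class rather than by individual word, adding a new verb in this sketch only requires one new stem entry.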

PDF version (4 pages, 152k)