Finite state segmentation of discourse into clauses

Eva Ejerhed (University of Umea)

The paper presents the background and motivation for a processing model that segments discourse into simple, non-nested clauses before clause-internal phrasal constituents are recognized, together with experimental results in support of this model.

One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber (1993) concerning the linguistic structure of major prosodic units.

The other set of results is derived from experiments in segmenting part-of-speech annotated Swedish text corpora into clauses, using a new clause segmentation algorithm. The clause-segmented corpus data is taken from two sources: the Stockholm Umea Corpus (SUC), 1 M words of Swedish texts from different genres, part-of-speech annotated by hand, and the Umea corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text, processed fully automatically by tokenization, lexical analysis, and probabilistic POS tagging.

The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text, and 91% correct when applied to probabilistically tagged text.
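
As a rough illustration of what finite state segmentation of part-of-speech tagged text into flat, non-nested clauses can look like, the following minimal Python sketch may help. It is not the algorithm evaluated in the paper, whose rules are not given in this abstract. The tag names (VFIN, SUBORD, CONJ, REL, EOS), the function segment_clauses, and the English example sentence are all hypothetical and chosen only for readability. The sketch assumes that a clause boundary is placed before a clause-introducing word whenever the current clause already contains a finite verb.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag names below are hypothetical

CLAUSE_OPENERS = {"CONJ", "SUBORD", "REL"}  # tags that may introduce a new clause
FINITE_VERB = "VFIN"                        # finite verb
SENT_END = "EOS"                            # sentence-final punctuation


def segment_clauses(tokens: List[Token]) -> List[List[Token]]:
    """Split a tagged token sequence into flat (non-nested) clauses.

    The machine has two states: no finite verb seen yet in the current
    clause (has_verb is False), or a finite verb already seen (True).
    A clause opener starts a new clause only in the second state, so a
    conjunction inside a phrase ("old and new cars") does not split it.
    """
    clauses: List[List[Token]] = []
    current: List[Token] = []
    has_verb = False
    for word, tag in tokens:
        if tag in CLAUSE_OPENERS and has_verb:
            # Close the current clause before the opener.
            clauses.append(current)
            current = []
            has_verb = False
        current.append((word, tag))
        if tag == FINITE_VERB:
            has_verb = True
        elif tag == SENT_END:
            # Sentence-final punctuation always closes the clause.
            clauses.append(current)
            current = []
            has_verb = False
    if current:
        clauses.append(current)
    return clauses


example = [
    ("She", "PRON"), ("said", "VFIN"),
    ("that", "SUBORD"), ("he", "PRON"), ("left", "VFIN"),
    ("and", "CONJ"), ("they", "PRON"), ("stayed", "VFIN"),
    (".", "EOS"),
]
for clause in segment_clauses(example):
    print(" ".join(word for word, _ in clause))
```

On the example sentence the sketch yields three clauses: "She said", "that he left", and "and they stayed .". Because the decision at each token depends only on the current tag and a single bit of state, the procedure runs in one left-to-right pass without any phrase-level parsing, which is the property the processing model above relies on.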

PDF version (10 pages, 160k)