## Language modeling for speech recognition

**Frederick Jelinek** (Johns Hopkins University)

The speech recognition (transcription) problem is formulated as
follows. Given an observed sequence of acoustic data A, find the
word sequence W' that maximizes the product P(A|W)P(W) over all
candidate strings W. P(W) denotes the a priori probability that the
user of the system will wish to utter the word string W. The
mechanism capable of assigning the value P(W) to any string
W = w1 w2 ... wn is called the language model.
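The decision rule above can be sketched in a few lines. The hypotheses and all probability values below are made-up illustrations, not data from the paper:

```python
# Sketch of the Bayes decision rule for transcription:
# pick the word string W' maximizing P(A|W) * P(W).
# All scores here are hypothetical numbers for illustration.

acoustic = {  # P(A|W): how well each hypothesis matches the audio
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
}
language = {  # P(W): prior probability of each word string
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0001,
}

best = max(acoustic, key=lambda w: acoustic[w] * language[w])
print(best)  # recognize speech
```

In this toy example the language model overrides a slightly better acoustic match, which is exactly the role P(W) plays in the product.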

Denoting the "history" of the word wi by hi = w1 w2 ... wi-1, it
follows from the chain rule of probability that
P(W) = P(w1|h1) P(w2|h2) ... P(wn|hn), where h1 is the empty history.
Therefore, language modeling means assigning values to the factors
P(wi|hi). Since that quantity is obviously based on too large an
argument space, we must decide on an equivalence classification of
histories, Φ[hi], and then approximate P(wi|hi) by P(wi|Φ[hi]).
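The most common classification, the trigram, treats two histories as equivalent when they end in the same two words. A minimal sketch of that mapping (the function name `phi` is my label for the classifier, not notation from the paper):

```python
# Sketch: trigram equivalence classification of histories.
# phi maps a full history h_i to its class: the last n-1 words,
# so P(w_i | h_i) is approximated by P(w_i | w_{i-2}, w_{i-1}).

def phi(history, n=3):
    """Return the equivalence class of a history: its last n-1 words."""
    return tuple(history[-(n - 1):])

print(phi(["the", "cat", "sat", "on"]))  # ('sat', 'on')
```

Every history ending in "sat on" falls into the same class, so the model needs parameters only per two-word context rather than per full history.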

The art of language modeling thus consists of finding (a) an
appropriate classification scheme Φ[h], and (b) a method of reliably
estimating P(w|Φ[h]) from training data.

The paper describes three basic methods of equivalence
classification, all finite-state: trigram, decision tree, and minimum
divergence (maximum entropy).
