Language modeling for speech recognition

Frederick Jelinek (Johns Hopkins University)

The speech recognition (transcription) problem is formulated as follows: given an observed sequence of acoustic data A, find the word string W' that maximizes the product P(A|W)P(W) over all candidate strings W. P(W) denotes the a priori probability that the user of the system will wish to utter the word string W. The mechanism capable of assigning the value P(W) to any string W = w1 w2 ... wn is called the language model.
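The decoding rule above can be sketched in a few lines. This is a toy illustration only: the candidate strings and both probability tables are hypothetical values, not quantities from the paper.

```python
# Toy sketch of the decoding rule: W' = argmax over W of P(A|W) * P(W).
# Candidate strings and probabilities below are hypothetical.
candidates = ["recognize speech", "wreck a nice beach"]
acoustic = {"recognize speech": 0.4, "wreck a nice beach": 0.6}      # P(A|W)
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(W)

# Pick the word string maximizing the product of the two model scores.
best = max(candidates, key=lambda w: acoustic[w] * language[w])
print(best)  # "recognize speech": 0.4 * 0.01 beats 0.6 * 0.0001
```

Note how the language model overrides a better acoustic score: the acoustically favored string loses because its prior probability is far smaller.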

Denoting the "history" of the word wi by hi = w1 w2 ... wi-1, it follows from the chain rule of probability that P(W) = P(w1|h1) P(w2|h2) ... P(wn|hn). Language modeling therefore amounts to assigning values to the factors P(wi|hi). Since the history hi ranges over far too large an argument space for direct estimation, we must decide on an equivalence classification of histories, Φ[hi], and then approximate P(wi|hi) by P(wi|Φ[hi]).
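The chain-rule factorization with an equivalence classification can be made concrete. The sketch below assumes the trigram classification, where Φ[hi] keeps only the last two words of the history; the conditional probability table is a hypothetical toy, not data from the paper.

```python
# Trigram equivalence classification: Phi[h_i] = (w_{i-2}, w_{i-1}).
def phi(history):
    """Map an arbitrary history to its class: the last two words, padded."""
    padded = ("<s>", "<s>") + tuple(history)
    return padded[-2:]

# Hypothetical toy table of conditional probabilities P(w | Phi[h]).
cond = {
    (("<s>", "<s>"), "the"): 0.5,
    (("<s>", "the"), "cat"): 0.4,
    (("the", "cat"), "sat"): 0.3,
}

def p_string(words):
    """P(W) = product over i of P(w_i | Phi[h_i]), h_i = w_1 ... w_{i-1}."""
    prob = 1.0
    for i, w in enumerate(words):
        prob *= cond.get((phi(words[:i]), w), 0.0)
    return prob

print(p_string(["the", "cat", "sat"]))  # 0.5 * 0.4 * 0.3 = 0.06
```

Collapsing every history to its last two words is what makes the table finite: the model needs one entry per (word-pair, word) rather than one per full history.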

The art of language modeling thus consists of finding (a) an appropriate classification scheme Φ[h], and (b) a method of reliably estimating P(w|Φ[h]) from training data.
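Point (b) can be illustrated with the simplest estimator: relative frequencies of trigram counts. The corpus below is a hypothetical toy; in practice this maximum-likelihood estimate is unreliable for rare events and must be smoothed, which is exactly the estimation problem the abstract refers to.

```python
from collections import Counter

# Hypothetical toy training corpus (one sentence per list).
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

tri, bi = Counter(), Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent
    for i in range(2, len(padded)):
        tri[tuple(padded[i - 2:i + 1])] += 1  # count c(u, v, w)
        bi[tuple(padded[i - 2:i])] += 1       # count c(u, v)

def p(w, u, v):
    """Maximum-likelihood estimate P(w | u, v) = c(u, v, w) / c(u, v)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(p("cat", "<s>", "the"))  # 2/3: "the cat" twice out of three sentences
```

Any trigram unseen in training receives probability zero here, which is why reliable estimation, via discounting, backoff, or interpolation, is listed as a problem in its own right.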

The paper describes three basic methods of equivalence classification, all finite state: trigram, decision tree, and minimum divergence (maximum entropy).
