README: This file

IO: The full system involved four separate threads running on four different machines. The input thread used a modified version of the pcomm public domain terminal emulator to dial Dow Jones and download the Wall Street Journal, see IO/pcomm and IO/config. The core processing looked at the semaphore files of the input process, see IO/script, IO/interface. Large-scale dictionary and wordlist lookup, called longlex, was separated from immediate "shortlex" lookup as a separate thread. Finally, the results were output in two ways: exported to an Informix DB, and made viewable under a GUI (initially Sunview, later X), see IO/iu*. While _some_ form of IO is necessary for every working system, in a contemporary system we would want these to take a very different form, probably with the input returned from HTTP calls and the output viewable in a browser. The code performing IO was therefore segregated in this directory, which contains several versions of these modules.

cleaner: The first stage of the extraction pipeline removes extraneous null bytes (see cleaner/tool/strip/null_strip), and cleaner then cuts the issue into article-sized files. In ../WSJ/ usually .dat is the raw, .dat.strip is the null-stripped, and -nnn are the article files, with the three-digit article number nnn assigned sequentially by the cleaner. See ../Doc/cl.doc.

tm: The table marker stage marks off tables by the tags. This is a classic case of how feeding order tends to be maximized: it's not that we look for tables early on because we deeply care about tabular material! To the contrary, we know in advance that tables will not contain PCT info. However, tables would mess up rules designed for regular text, just as null bytes mess up most file utilities, so to keep later stages simple we are forced to deal with the special cases first.
In more contemporary MR texts we would need similar special-casing for URLs, e-mail addresses, and explicitly HTML-marked segments -- here the only other special-casing is for ellipses "...", so that subsequent tokenization needs a lookahead of only 3.

sb: The sentence breaker stage (see ../Doc/sb.doc) attempts to locate sentence boundaries. For the logic of an early version of sb see ../Doc/sentence-bound. Since analysis actually happens at the article level, the performance of the sentence breaker was never a critical issue.

wb: The word breaker performs the final tokenization before "shortlex" lookup. It explicitly tags capitalization patterns, and deals with the kind of nitty-gritty only a computational linguist could love. Up to and including this stage, much of what has been done in lex/yacc would probably be done in perl these days, though considerations of speed (and ease of maintenance) could still argue for lex/yacc.

short_lex: This library is the reason why lexical lookup runs like greased lightning. At the heart of the system is short_lex/hash_words a.k.a. the Holy Grail (see also ../Subdicts), which contains the bitfield encoding for all tags, keywords, and frequent lexical items -- when making short_lex.a this gets interpolated in the hashing code. Speed is gained in two ways: first, the shortlex lookup itself is very fast, and second, the separate short lexicon takes the load off the long lexicon, which gets queried much less frequently than it would without a short lexicon.

mark_short: This library created the vectorized data structures in memory.

mr, new_mr: This is the first real extraction module, operating directly on the structures created in memory. new_mr is a library version.

flat_text: This is the context-sensitive part of the system, identifying tokens across sentences in the whole article.

tool: Various tools and utilities used in creating and debugging the system, see tool/README.
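
The two-tier lookup described under short_lex (a small, fast lexicon of bitfield-encoded tags consulted first, with the long lexicon queried only on a miss) can be illustrated with a minimal sketch. This is not the original code, which lives in short_lex/hash_words and the hashing code it is interpolated into; Python is used only for brevity, and every name below (Tag, SHORT_LEX, LongLex, lookup) as well as the particular tag bits and entries are hypothetical.

```python
from enum import IntFlag


class Tag(IntFlag):
    # Hypothetical tag bits; the real bitfield encoding is in short_lex/hash_words.
    KEYWORD = 1
    CAPITALIZED = 2
    FREQUENT = 4


# Hypothetical short lexicon: tags, keywords, and frequent items held in memory.
SHORT_LEX = {
    "percent": Tag.KEYWORD | Tag.FREQUENT,
    "the": Tag.FREQUENT,
}


class LongLex:
    """Stand-in for the large dictionary lookup that ran as a separate thread."""

    def __init__(self, entries):
        self.entries = entries
        self.queries = 0  # counts how often the slow path is actually taken

    def lookup(self, word):
        self.queries += 1
        return self.entries.get(word)


def lookup(word, long_lex):
    """Try the short lexicon first; fall back to the long lexicon on a miss."""
    tags = SHORT_LEX.get(word)
    if tags is not None:
        return tags
    return long_lex.lookup(word)
```

In this sketch the query counter on LongLex makes the second speed gain visible: frequent words never reach the long lexicon, so it is consulted only for the comparatively rare items missing from the short one.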