Overview of /mnt/cdrom/Kornai

The best documentation on the high-level goals, design, architecture, and history of the NewsMonitor system is the paper "Vectorized Finite State Automata", included on this CD in PostScript and PDF and, in a slightly revised version, in the companion volume. The organization of the filesystem follows the Unix tradition (bin, lib, include, src, ...) rather than the organization of the paper, but even readers whose primary interest is in the code should read at least Section 2 of the paper first. The following is a brief overview of the top-level directories. Note that most have their own README files, and further details can often be found in the material now collected (but neither fully organized nor properly updated) in Doc.

The {new,old}{bin,lib}.sunos directories are useful only for old Sparcs. They are included for the sake of completeness, and because "these machines are workhorses that never seem to die" (Steven Baker, Unix Review 1998/1, p. 18).
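
The directory pattern above is written in shell brace notation. As a minimal sketch, assuming all four combinations are present on the CD, the names it stands for can be enumerated as follows (the helper `expand` is hypothetical and is not part of the NewsMonitor sources):

```python
from itertools import product


def expand(prefixes, kinds, suffix):
    """Return every prefix+kind+suffix name, leftmost alternative varying slowest."""
    return [f"{p}{k}{suffix}" for p, k in product(prefixes, kinds)]


# Mirrors the pattern {new,old}{bin,lib}.sunos from the paragraph above.
names = expand(["new", "old"], ["bin", "lib"], ".sunos")
print(names)
# -> ['newbin.sunos', 'newlib.sunos', 'oldbin.sunos', 'oldlib.sunos']
```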

Those interested in hacking the system should of course rebuild these binaries and libraries from the sources; see src/README. Since extracting PCT information from the Dow Jones wire is unlikely to be the next killer app, there is no top-level Makefile, and "straight out of the box" configuring and making of the entire system is not supported. Phil Hopely made a first pass at creating a smoother configure/make/install process and tackled some of the endianness issues; his source tree and related files are in the _phil directories.

The WSJ and Test directories contain several issues of the Wall Street Journal in various stages of processing, together with files of a more technical nature generated by the system both for normal operation and for debugging, testing, development, and performance tuning. With the wide availability of the Linguistic Data Consortium's CDs, which contain an order of magnitude more WSJ data, the importance of these files as a development testbed is somewhat diminished; they are retained here primarily because without them the technical files would be next to impossible to interpret. These files carry their own copyright notice.

The Data directory contains various lexical resources used in the development of the dictionaries and frequency lists that play a key role in NewsMonitor. These are still directly useful, and in many ways different from the LDC materials. The Subdicts directory contains both the final (sub)dictionaries and more technical intermediate files.

All of the material, with the exception of the original WSJ articles, is placed under the Artistic License, with Andras Kornai as "Copyright Holder" of Version 1.0 as presented on this CD. Anyone who develops a substantially more general version may share in the copyright, as long as the new version remains under the Artistic License, the GNU GPL, or any other form of copyright protection designed to preserve free access to the sources. The contributions of the other developers, Bich Nguyen, Darin Okuyama, Josef Schreiner, and now Phil Hopely, are acknowledged here and should be acknowledged in later versions as well.

THIS MATERIAL IS PROVIDED "AS IS" AND WITHOUT WARRANTIES AS TO PERFORMANCE OR MERCHANTABILITY. THIS SOFTWARE IS PROVIDED WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES WHATSOEVER. BECAUSE OF THE DIVERSITY OF CONDITIONS AND HARDWARE UNDER WHICH THIS SOFTWARE MAY BE USED, NO WARRANTY OF FITNESS FOR A PARTICULAR PURPOSE IS OFFERED. THE USER IS ADVISED TO TEST THE PROGRAMS THOROUGHLY BEFORE RELYING ON THEM. THE USER MUST ASSUME THE ENTIRE RISK OF USING THE PROGRAMS OR DATA FILES.

