UNIPARSE RECONSTRUCTION -- reconstruction of a parser developed 1958-1959 as a part of the TRANSFORMATIONS AND DISCOURSE ANALYSIS PROJECT at the University of Pennsylvania.

MAIN README

Aravind Joshi & Philip Hopely
joshi@linc.cis.upenn.edu, phopely@linc.cis.upenn.edu

November 23, 1997
Revised 2.0 - February 24, 1998

SYNOPSIS

A parsing program was designed and implemented at the University of Pennsylvania during the period from June 1958 to July 1959. This program was part of the Transformations and Discourse Analysis Project (TDAP) directed by Zellig S. Harris. The techniques used in this program, besides being influenced by the particular linguistic theory, arose out of the need to deal with the extremely limited computational resources available at that time. The program was essentially a cascade of finite state transducers (FSTs). To the best of our knowledge, this is the first application of FSTs to parsing.

Lila Gleitman, Aravind Joshi, Bruria Kauffman, Naomi Sager, and, a little later, Carol Chomsky were involved in the development and implementation of this program. A brief description of the program appears in (Joshi 1961) and a somewhat generalized description of the grammar appears in (Harris 1962).

Phil Hopely began a faithful reconstruction of this program as his undergraduate research project under the guidance of Aravind Joshi. The first public presentation of this work was made in May 1996.

ACKNOWLEDGEMENTS

A number of people have helped us during the reconstruction of the parser, its testing, and the presentation of the work. In particular, we want to thank Lauri Karttunen, Andras Kornai, Mark Liberman, Mitch Marcus, Tom Morton, Mehryar Mohri, Anoop Sarkar and B. Srinivas.

The following documentation has been prepared by Phil Hopely.

CD CONTENTS

The distribution consists of the tdap parsing system proper, as well as other optional software components.
The directory tdap/ contains the tdap uniparse project reconstruction, along with various utilities and extensions. Supplemental software developed for the XTAG project (not a part of the original parser system) is located in the xtag/ subdirectory. The XTAG syntax and morphological databases were interfaced to the parser to extend its lexicon, which makes the parser somewhat easier to sample, and the nbest trigram tagger was interfaced to see its effect on performance. More detailed information and documentation is available in the doc/ subdirectory.

The software on this cd has been tested under linux and sunos - the code should port to other architectures with some effort. The parsing system proper requires between 10 and 20 megabytes of disk space, and a minimum of 400k of runtime memory. The runtime memory requirement is due to the large amount of trace code, which I estimate consumes at least 60 percent of the space.

See the QUICK START section below for a fast introduction to the use of the parser from a live cd image under intel linux. If you are not running linux or want to install the parser from the cd, refer to the INSTALLATION OVERVIEW for a discussion of the possibilities.

Uniparse/ PACKING LIST:

DIRECTORY - DESCRIPTION

LICENSE - licensing information

Makefile - high level installation control (refer to INSTALLATION OVERVIEW); provides targets:
    install.tdap - install tdap parsing system
    cdinstall - same as install.tdap
    install.xtag - install xtag components, upgrade tdap script pointers
    install.wsj_int - install kornai wall street journal interface
    localinstall - total installation of tdap and xtag - also useful for upgrading scripts w/o reinstalling everything.

README - this file

doc/ - contains a ps version of the long version of the "Parser from Antiquity" paper, as well as a lot of technical documentation, xfigures, etc.

tdap/ - the tdap parsing system.

tdap/x - X11 graphical user interface/controller system for the tdap parsing system.
wsj_int/ - interface scripts and software for Kornai's wall street journal corpus sentences

xtag/db - version 1.79 of the berkeley database package, by Seltzer and Yigit.
    * This may be upgraded by final distribution time

xtag/morph - a distribution of the University of Pennsylvania morphology database package, revision 1.4+.

xtag/syn - a distribution of the University of Pennsylvania syntactic database package, originally designed for use with the XTAG project.

xtag/pos-tagger - a distribution of the University of Pennsylvania part of speech tagger systems, originally designed for use with the XTAG project.

xtag/nbest-tagger - as above, but nbest.

Refer to the documentation associated with these packages for further information.

INSTALLATION OVERVIEW

There are currently three major degrees of installation:

1. A "non-invasive" installation is provided where the software may be run directly from the cd on an intel linux box. This method, if it works with your system, may require no work at all, or only the creation of a few symbolic links on the user's file system (see HACK FIXES). Refer to the QUICK START documentation for more information about this mode.

2. A "semi-invasive" installation is provided where you may install & recompile the tdap uniparser system, but leave the (rather large) XTAG subsystems on the compact disc. This is useful under linux when resources are limited. Refer to the UNIPARSE INSTALLATION NOTES documentation for more information.
    --> [from Uniparse/ use 'make cdinstall' or 'make install.tdap']

3. A "fully-invasive" installation is provided, where all software is installed on the destination system. This installs & recompiles the tdap uniparser system, as well as the complete battery of XTAG project tools, onto the host file system. You may also optionally choose to install the wall street journal corpus interface tools. A fully-invasive install is useful when resources are not a great concern.
Refer to the TOTAL LOCAL INSTALL DOCUMENTATION for more information.
    --> [from Uniparse/ use 'make localinstall']

Finally, to upgrade from a "semi-invasive" to a "fully-invasive" installation, from Uniparse/ issue a 'make install.xtag' command to load the xtag subsystems. The wall street journal interface tools may be installed from Uniparse/ by using a 'make install.wsj_int' command.

QUICK START

This section contains information about how to quickly begin using the uniparse reconstruction. The systems below should work live directly from the mounted compact disc with a minimum of installation under linux version 2.0 and later. If you are not using an intel-based linux version 2.0 or later, skip this section and move down to the UNIPARSE INSTALLATION NOTES and TOTAL LOCAL INSTALL DOCUMENTATION sections for installation procedures.

*Note that all scripts expect "bash" or an equivalent shell to be available.

If you have any problems with using the software as is, check the HACK FIXES section later in this document, or contact us with your troubles and if we can help, we will...

QUICK START - VINTAGE TDAP UNIPARSE COMPUTATION

You can use the uniparse tdap parser such that the computation is reproduced as accurately as possible, relative to the original system description as found in the tdap papers 15-19 [see the long documentation for references]. This script uses a small handbuilt dictionary that has lexical information derived from the original text. As such, the parser is somewhat restricted in the lexicon that it is capable of processing. The default dictionary file should be located within lexclasses/lexclass relative to the base installation directory. To try the parser in this fashion, from the base installation directory, use:

    ./tdap uniparse

... and feed the standard input (i.e.: typing at the keyboard or piping in from some other source) a list of sentences, then send an eof (usually control+d) on the next blank line.
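
The piping route can be sketched as follows. This is only an illustration and not part of the distribution: the helper name feed_sentences is made up here, and the guard assumes the current directory is the base installation directory with the tdap script in it. The two sentences are the examples from the original documentation.

```shell
#!/bin/bash
# Hypothetical helper (not part of the distribution): print one sentence
# per line, the form in which the parser reads its standard input.
feed_sentences() {
    printf '%s\n' "$@"
}

# Only attempt the parse when the tdap script is actually present, i.e.
# when this is run from the base installation directory.
if [ -x ./tdap ]; then
    # The end of the pipe delivers the eof that control+d sends at the
    # keyboard, so nothing further needs to be typed.
    feed_sentences \
        "The examination results in a definite pairing." \
        "The examination results in both cases are positive." \
        | ./tdap uniparse
fi
```

A file holding one sentence per line can be fed in the same way by redirecting it to the standard input of ./tdap uniparse.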
Two example sentences from the original documentation are:

    The examination results in a definite pairing.
    The examination results in both cases are positive.

More sentences from the original documentation may be found in the Uniparse/tdap/ex/ directory.

QUICK START - VINTAGE TDAP UNIPARSE COMPUTATION W/ DB INTERFACE

You can also use the tdap parser with the option that when a lexical item is not found in the handbuilt dictionary, the system can consult the University of Pennsylvania's xtag project morphology and syntactic databases. The default dictionary file should be located within lexclasses/lexclass relative to the base installation directory. To try the parser in this fashion, from the base installation directory, use:

    ./tdap uniparsedb

... and feed the standard input (i.e.: typing at the keyboard or piping in from some other source) a list of sentences, then send an eof (usually control+d) on the next blank line. Two example sentences not from the original documentation are:

    I like to eat cheese.
    I think that it tastes good.

*Thanks to the xtag project linguistic information database maintainers for constructing the databases.

QUICK START - UNIPARSE W/ DB INTERFACE & APEX EXTENSIONS

You can also use the tdap parser with the option that when a lexical item is not found in the handbuilt dictionary, the system can consult the University of Pennsylvania's xtag project morphology and syntactic databases. These packages are accessed from the cd-rom by default, but they can be installed on the host for fast db access times. However, these packages must be recompiled if they are moved from the location at which they were last compiled (such as the default pathname /cdrom/xtag) - refer to the TOTAL LOCAL INSTALL section for further instructions. To start this process, from the base installation directory, use:

    ./tdap dbapex 2> /dev/null

... and feed the standard input an eof-terminated list of sentences.
One example from the original documentation and one that is not are:

    The apparatus plans control operations.
    I think it tastes good.

The script should pause at eof and then return statistics and parser structural results. For the technical user: this script uses a fixed sleep to feign waiting for the so-called "harvest process" to come to a halt. Results are not available until the harvester process exits, so the script ought to wait until the harvest process terminates, but I have not had the time to implement this yet. Also, the mapping from xtag to tdap is difficult in some cases, so some level of error is to be expected. Further, the grammatical idiom handler does not use the extended db information yet.

APEX stands for "alternative path exploration": when faced with an ambiguity, the parser is permitted to clone itself and follow each possible valuation of the ambiguity, one to each clone. For example, this allows rebracketing of first-order chunks, as well as the investigation of possible effects of verb subcategorizations other than the one selected as the 'preferred' subcategorization strategy.

*Thanks again to the xtag project linguistic information database maintainers for constructing this database.

QUICK START - VINTAGE COMPUTATION W/ NBEST POS TRIGRAM TAGGER INTERFACE

You can run the parser with the option to exclusively use the nbest pos trigram tagger (supplemented by the morphology and syntactic databases) instead of normal dictionary access by using:

    ./tdap uniparsenbest 2> /dev/null

... with similar notes to the above. An example from the original documentation that is appreciably aided by nbest tagging is:

    The apparatus plans control operations.

Note that this parse method requires some time to initialize the tagger, so expect some starting delay. This wait should only happen at startup time - once initialized, the code should not need reinitialization.
I could have optimized this code in a few places, but time did not permit this before this release. Also, the mapping is approximate as above, and sometimes the tagger produces unusual results which are probably the result of errors in the training data set.

*Thanks to Richard Pito and Yuji for writing & developing the tagger code that I interfaced to the parser.

QUICK START - UNIPARSE W/ NBEST POS TRIGRAM TAGGER INTERFACE & APEX EXTENSIONS

You can run the parser with the option to exclusively use the nbest pos trigram tagger (supplemented by the morphology and syntactic databases) instead of normal dictionary access by using:

    ./tdap napex 2> /dev/null

... with similar notes to the above. An example from the original documentation whose processing statistics make an interesting contrast with those of the corresponding dbapex script is:

    The apparatus plans control operations.

Note that this parse method requires some time to initialize the tagger, so expect some starting delay. This wait should only happen at startup time - once initialized, the code should not need reinitialization. This method utilizes the aforementioned apex extensions. I could have optimized this code in a few places, but time did not permit this before this release. Also, the mapping is approximate as above, and sometimes the tagger produces unusual results which are probably the result of errors in the training data set.

*Thanks again to Richard Pito and Yuji for writing & developing the tagger code that I interfaced to the parser.

QUICK START - GUI

The tdap x11 graphical user interface may be run from the base installation location via:

    ./tdap t

You can select from the set of canned example sentences to parse using the "Define Input Collection" menu. Click on the "select" button to toggle the function to "remove" and clear off the selected example list, if any exists. Then click on the "remove" button to toggle back to "select" mode.
Click the "load" button and append any of:

    all_ex
    all_wsj
    all_atis
    all_ibm

to the pathname to choose a sentence collection to load. After loading, return to the main menu and select the "parse selected examples" submenu. The parse may then be started by clicking the "begin parse" button. More detailed information is available in the paper directory.

Note that some environment variables related to window geometries might need to be customized for your display dimensions; this may require you to install the tdap parser to your file system.

HACK FIXES

When running the parser live from the CD, a number of problems are common.

Problem: The software is compiled for a cdrom mount point of /mnt/cdrom but my cd mount point is /cdrom (or something else).

Hack: Try something like:

    cd /
    ln -s /cdrom /mnt/cdrom

... to create a symbolic link pathway to the mount point. You will probably have to be root to perform this action. If all else fails, install the tdap uniparse software and recompile it on the destination architecture (see UNIPARSE INSTALLATION NOTES).

Problem: ld is complaining about libdb (or some other library).

Hack: Try using something like:

    cd /usr/lib
    ln -s libdb.so.2 libdb.so.1

... to create a symbolic link to the (db, in this case) package. You will probably have to be root to perform this action. This probably will not work with all installations, particularly those which are using older versions of these libraries. If all else fails, install the tdap uniparse software and recompile it on the destination architecture (see UNIPARSE INSTALLATION NOTES).

UNIPARSE INSTALLATION NOTES

This section describes how to install the tdap uniparser onto your filesystem from the cd. Note that all scripts expect "bash" or an equivalent shell to be available. Though the system has only been tested with linux and SunOS 5.5.1, it should be portable to other unix flavors without too much effort.
The procedure for installing only the tdap parser proper to your filesystem may be started from the Uniparse/ subdirectory using the command:

    make install.tdap

or

    make cdinstall

The system should prompt for the desired destination location and then proceed with the installation. If you want to upgrade and install all the xtag auxiliary packages too, refer to the UPGRADING FROM A SEMI-INVASIVE (CD) TO A FULLY INVASIVE (LOCAL) INSTALLATION section.

FOREIGN ARCHITECTURE PORTING NOTE

If you are attempting to recompile for an architecture other than Linux or SunOS, you will probably want to modify the following scripts:

    batch_tdap/preempt
    batch_tdap/tdap.2
    tdap.2.env.proto

... such that the proper ps command is issued, and in the case of preempt the proper field is extracted (specifically the process id field from ps).

TOTAL LOCAL INSTALL DOCUMENTATION

The uniparse system may be sampled under linux live from the mounted cd, but it may be desirable to install & recompile the entire contents of the cd distribution onto a machine (for example, if you are not running linux). The installation process discussed here covers a total installation, including both the TDAP uniparser and the XTAG auxiliary packages.

TOTAL LOCAL INSTALL

In order to install the tdap uniparse software, the morph & syn databases, and the part of speech tagger software to your local file system, you must recompile their executables at a stable destination location. You need 80-100 megabytes to do a total install. You also need to be running bash. Switch to the Uniparse/ subdirectory of the cdrom, and use:

    make localinstall

A script will prompt you for all necessary information, as it is needed. It will probably take a little while to copy and reinstall; the warning messages are mostly harmless and can probably be ignored under most circumstances.
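
Before running 'make localinstall', it can help to confirm that the destination file system has the 80-100 megabytes a total install needs. The following is a minimal sketch and not part of the distribution: the function name has_space and the variable DEST are made up for this illustration, and DEST stands for whatever destination you intend to give the installer when it prompts.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the distribution): succeed
# when the file system holding directory $1 has at least $2 megabytes free.
has_space() {
    local dir="$1" min_mb="$2" avail_kb
    # df -P prints one POSIX-format line per file system; -k reports sizes
    # in 1024-byte blocks. Field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_mb * 1024)) ]
}

# Assumption: DEST is the directory you plan to install into; it defaults
# to the current directory purely for the sake of the example.
DEST="${DEST:-.}"
if has_space "$DEST" 100; then
    echo "enough room at $DEST for a total install"
else
    echo "less than 100 megabytes free at $DEST" >&2
fi
```

The threshold of 100 is simply the upper end of the range quoted above; the tdap parsing system proper needs far less.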
PORTING NOTE: If you are attempting to recompile on an architecture other than Linux, you will very likely need to set the environment variable XTAG_UNAME to override the probed "PORT" directory value for the db package (for example, if your uname reveals SunOS version 5.5.1, PORT should be set to sunos.5.5.1). Refer to the db package documentation for more information. Hopefully someday the db package will be upgraded to a more recent version which supports autoconf. The snag is that that version is not yet debugged, so this temporary workaround is necessary. The compilation process could be very easily optimized by hand; I use scripts that I developed for the xtag project installation.

UPGRADING FROM A SEMI-INVASIVE (CD) TO A FULLY INVASIVE (LOCAL) INSTALLATION

From Uniparse/, if you issue a "make install.xtag" command, the system should first install the xtag package suite, then prompt you for the location of your tdap/ uniparser base directory. The base directory information is used to automatically update the installed versions of the tdap uniparser scripts so that they can find the xtag package suite on the local file system. If something goes wrong with this, it may be necessary to change the scripts by hand. You must alter the "batch_tdap/tdap.2" script (as well as "tdap.2.env.proto") in the installed tdap/ so that the environment variables TDAP_MORPH, TDAP_SYN, TDAP_TAGGER, and TAGSET_DIRECTORY point to the respective new package locations. Note: do not change the endings of these pathnames; only change the preceding pathway specification to correspond to where you have installed these packages.

SCRIPTS

This section contains some advanced information about available scripts. This information is supplied here for technical users.
DIRECTORY STRUCTURE

Scripts which are very important to the uniparse reconstruction are located in the subdirectory batch_tdap/

Scripts which start the parsing system in particularly interesting operational modes may be found in batch_scratch/

Miscellaneous scripts may be found in batch/

BATCH_TDAP/

findis - script for extracting "is" lexclass entry.

preempt - script for re-initializing the process space for the parser.

tdap.2 - auto-generated script, generated by install.tdap from tdap.2.env.proto

BATCH_SCRATCH/

uniparse* - refer to QUICK START section

dbapex - refer to QUICK START section

napex - refer to QUICK START section (this code could be heavily optimized)

stagapex - supertagger translation interface - refer to tdapsupertagger/README

hozo - present a file named 'feeder' to uniparse w/ apex active.
    Examples:
        ./tdap hozo

hyzy - give a feeder file to uniparse w/ apex, enable verbose tracing; the trace has a directory tree structure isomorphic to the forking of apex - useful for debugging st2 & apex...
    Examples:
        ./tdap hyzy
    NOTE: this will create lots of directories with trace files in each. Remember to clean your system up after using this.

BATCH/

_cat $@ - permits redirecting the setting of "TDAP_SOURCE" (see batch_tdap/tdap.2) so that the command $@ takes its input from an alternative source.

_noah $@ - permits redirecting the dictionary setting TDAP_LEXFILE to an alternative setting for the command $@.

demo <src_label> - where src_label is a selection from "TDAP_SOURCE" (try "ex1", "ex2", etc.) - gives a trace of each stage's processing results, up to & including the vp machine packeting.
    Examples:
        ./tdap demo ex1
        ./tdap _cat demo wsj/wsj10

do_tr <src_label> [npr|npl|ap|vp|st2]+ - where src_label is a selection from "TDAP_SOURCE" - a monitoring system/debugger for the first-order finite state and second-order bracketing machines; permits single stepping through a trace on the specified input source and specified machine; analysis of rule application in an organized multi-windowed environment.
Useful for examining first-order finite state machine behaviour, in particular.
    Examples:
        ./tdap do_tr ex1 npr
        ./tdap _cat do_tr wsj/wsj11 npl

go <src_label> - where src_label is a selection from "TDAP_SOURCE" - output first-order bracketing results in stream format
    Examples:
        ./tdap _cat go wsj/wsj12

xo <src_label> - where src_label is a selection from "TDAP_SOURCE" - output vintage uniparse verbose trace results to stdout.
    Examples:
        ./tdap _cat xo wsj/wsj10

showbracket <src_label> - where src_label is a selection from "TDAP_SOURCE" - output first-order bracketing trace results in a human-readable form for each first-order machine (including a 'selected' interpretation and an ambiguous possibilities statement).
    Examples:
        ./tdap _cat showbracket wsj/wsj10

pk <src_label> - where src_label is a selection from "TDAP_SOURCE" - output first-order bracketing trace results in a human-readable form only after the vp first-order machine (including a 'selected' interpretation and an ambiguous possibilities statement), as well as a vintage st2 parse attempt on the input string.
    Examples:
        ./tdap _cat pk wsj/wsj10

BUGS & PROBLEMS

Below is a discussion of the "major" known problems of the tdap system. These will probably be fixed someday in the future, when I hopefully will have some time to look at this work again.

KNOWN PROBLEMS

0. The apex system gets hung up/freezes.

CD USERS NOTE: When running from the cd, this condition sometimes occurs when results mature before the harvest process has set up the channel to communicate. This usually results in a bad ftok call, which returns an errno of 2, and in failure to perform any action. This error is generally harmless; it appears to show up the first time that the harvest process is invoked from cd. A sleep statement has been added to the scripts to feign waiting for the harvester to awaken, giving it a minor 'head start' before the main parse engine pipeline becomes active. If this is a nuisance, it appears not to occur when the system is installed onto the host's hard drive.
Also, this may occur if, during parallel mode operation, the "harvest" process is unable to terminate normally for some reason. This harvest process state is usually indicated by an ftok error message complaining about the ipcs queue when you attempt to start a new parse process. This, as well as any other unusual states, can usually be cured with:

    ./tdap preempt

... to clean up any processes that might be in an indeterminate state, and refresh the system to start over. I was going to add a timeout system in the harvester to look out for wacky process states, but I haven't had the time yet.

1. The X11 interface does not entirely comply with the Xlib development specifications. This results in window refresh not happening during a parse process. Also, the interface may not work reliably when more than one user is running it on a single host.

2. The system does not use "autoconf" and it should.

3. The TMP directory must be maintained by hand - parse results are automatically stored there by the system. These must be purged manually by the user.

FUTURE WORK

Extend the "decision systems" to recognize additional types of sentence structures (interrogatives, imperatives, etc.).

Add alternative fork model support.

CONTACT INFORMATION

We are available via email, and would be glad to hear your thoughts on this work, complaints, bug reports, etc. Please feel free to contact us at:

    joshi@linc.cis.upenn.edu
    phopely@linc.cis.upenn.edu

We'd love to hear from you.